Creating New Columns in a DataFrame
There are two primary methods of creating new columns in a DataFrame:
Creating a new column calculated from the data you already have (ex: adding a new, calculated value to your DataFrame), or
Creating a new column of new data, directly in Python, that does not come from another dataset and does not already exist elsewhere.
Note: Creating a new column is different from merging two existing DataFrames together. If you're looking for that, see TODO.
The Movie Dataset
Throughout this guide, we will use a small DataFrame with data about movies:
|   | movie | release date | domestic box office | worldwide box office |
|---|---|---|---|---|
| 0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 |
| 1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 |
| 2 | Iron Man | 2008-05-02 | 318604126 | 585171547 |
| 3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 |
| 4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 |
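As a minimal sketch, a DataFrame like this one could be built directly in Python. The variable name `df` and the use of a dictionary of lists are assumptions made for illustration; the guide itself may load the data differently (for example, from a file).

```python
import pandas as pd

# Build the movie DataFrame by hand: one dictionary key per column.
# Blade Runner's worldwide figure is taken as 39535837, the value used
# in the later tables of this guide.
df = pd.DataFrame({
    "movie": [
        "The Truman Show",
        "Rogue One: A Star Wars Story",
        "Iron Man",
        "Blade Runner",
        "Breakfast at Tiffany's",
    ],
    "release date": [
        "1996-06-05", "2016-12-16", "2008-05-02", "1982-06-25", "1961-10-05",
    ],
    "domestic box office": [125618201, 532177324, 318604126, 32656328, 9551904],
    "worldwide box office": [264118201, 1055135598, 585171547, 39535837, 9794721],
})

print(df)
```

Because no index is given, pandas numbers the rows 0 through 4, which is the leftmost column in the table.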
Create a New Column Using a Calculation
We can perform simple mathematical operations on columns and store the resulting numbers in a new column. This includes addition, multiplication, subtraction, and division. We do this so our results are easier to see and available for future analysis.
Let's think about the box office columns in the movie DataFrame. We already have domestic (US) box office and worldwide box office. But what if we wanted to figure out the international box office for each movie? International box office is worldwide minus domestic. We could do that subtraction for every movie by hand, or we could let Python do it for us:
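One way to write this calculation is sketched here. It assumes the DataFrame is named `df` and, to stay self-contained, rebuilds a trimmed-down copy with only the columns the subtraction needs.

```python
import pandas as pd

# Trimmed-down copy of the movie DataFrame (the name `df` is an assumption).
df = pd.DataFrame({
    "movie": [
        "The Truman Show",
        "Rogue One: A Star Wars Story",
        "Iron Man",
        "Blade Runner",
        "Breakfast at Tiffany's",
    ],
    "domestic box office": [125618201, 532177324, 318604126, 32656328, 9551904],
    "worldwide box office": [264118201, 1055135598, 585171547, 39535837, 9794721],
})

# Subtracting one column from another works row by row;
# assigning the result to a new column name creates that column.
df["international box office"] = (
    df["worldwide box office"] - df["domestic box office"]
)

print(df[["movie", "international box office"]])
```

The same pattern works with `+`, `*`, and `/`. The table that follows also shows a personal rating column, which is added in the next section of this guide.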
|   | movie | release date | domestic box office | worldwide box office | personal rating | international box office |
|---|---|---|---|---|---|---|
| 0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 | 138500000 |
| 1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 | 522958274 |
| 2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 | 266567421 |
| 3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 | 6879509 |
| 4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 | 242817 |
The assignment modifies your DataFrame in place, so the new column stays in it and is available for all later analysis.
Create a New Column With New Data
We can add a new column directly from new data organized in a list. The list needs one value per row, in the same order as the rows. One reason we might want to add a column is when we obtain a brand new variable related to the DataFrame. Suppose you wanted to include a new column of your personal rating for each movie:
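A minimal sketch of adding a column from a list follows. The names `df` and `personal_ratings` are assumptions for illustration, and the DataFrame is cut down to the movie column to keep the example short.

```python
import pandas as pd

# Cut-down copy of the movie DataFrame (the name `df` is an assumption).
df = pd.DataFrame({
    "movie": [
        "The Truman Show",
        "Rogue One: A Star Wars Story",
        "Iron Man",
        "Blade Runner",
        "Breakfast at Tiffany's",
    ],
})

# One rating per movie, listed in the same order as the rows.
personal_ratings = [10, 9, 7, 8, 7]

# Assigning a list to a new column name creates the column.
df["personal rating"] = personal_ratings

print(df)
```

If the list has more or fewer items than the DataFrame has rows, pandas raises a `ValueError` instead of creating the column.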
|   | movie | release date | domestic box office | worldwide box office | personal rating |
|---|---|---|---|---|---|
| 0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 |
| 1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 |
| 2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 |
| 3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 |
| 4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 |
Creating a New Column With df.loc
If you have data that exists for only a small number of observations, you can use .loc to modify a DataFrame based on the row index value and the column name. The row index is the leftmost, bold column in a displayed DataFrame; by default, its values are integers starting at 0. When using .loc, choose the row index value(s) that correspond to the rows you have information for.
Continuing our example, say we learned that the critic rating for "Blade Runner" is 8.9. Since "Blade Runner" is in the row with index 3, we add its critic rating with the following code:
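A sketch of what such a .loc assignment could look like is given here. The DataFrame name `df` is an assumption, and only two columns are rebuilt to keep the example self-contained.

```python
import pandas as pd

# Cut-down copy of the movie DataFrame (the name `df` is an assumption).
df = pd.DataFrame({
    "movie": [
        "The Truman Show",
        "Rogue One: A Star Wars Story",
        "Iron Man",
        "Blade Runner",
        "Breakfast at Tiffany's",
    ],
    "personal rating": [10, 9, 7, 8, 7],
})

# .loc takes the row index value first, then the column name.
# "critic rating" does not exist yet, so pandas creates the column.
df.loc[3, "critic rating"] = 8.9

print(df)
```

Only row 3 receives a value. Every other row in the new column is filled with `NaN`, which is how pandas marks missing data.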
|   | movie | release date | domestic box office | worldwide box office | personal rating | critic rating |
|---|---|---|---|---|---|---|
| 0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 | NaN |
| 1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 | NaN |
| 2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 | NaN |
| 3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 | 8.9 |
| 4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 | NaN |