Creating New Columns in a DataFrame


There are two primary methods of creating new columns in a DataFrame:

  1. Calculating a new column from data you already have (e.g., adding a new, calculated value to your DataFrame), or

  2. Adding a column of new data, entered directly in Python, that does not come from another dataset or column.

Note: Creating a new column is different from merging two existing DataFrames. If you're looking for that, see TODO.

The Movie Dataset

Throughout this guide, we will use a small DataFrame with data about movies:

import pandas as pd

# Create a DataFrame with "movie", "release date", "domestic box office", and "worldwide box office" columns
df = pd.DataFrame([
  {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201}, 
  {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598}, 
  {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547}, 
  {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837}, 
  {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721}
  ])
df
   movie                         release date  domestic box office  worldwide box office
0  The Truman Show               1996-06-05              125618201             264118201
1  Rogue One: A Star Wars Story  2016-12-16              532177324            1055135598
2  Iron Man                      2008-05-02              318604126             585171547
3  Blade Runner                  1982-06-25               32656328              39535837
4  Breakfast at Tiffany's        1961-10-05                9551904               9794721
Creating the movie DataFrame

Create a New Column Using a Calculation

We can perform simple mathematical operations on columns and store the resulting numbers in a new column. This includes addition, multiplication, subtraction, and division. We do this so our results are easier to see and available for future analysis.

Let's think about the box office columns in the movie DataFrame. We already have domestic (US) box office and worldwide box office. But what if we wanted to figure out the international box office for each movie? To find this value for each movie, we could subtract domestic from worldwide for every movie by hand — or we could allow Python to do it for us:

df["international box office"] = df["worldwide box office"] - df["domestic box office"]
df
   movie                         release date  domestic box office  worldwide box office  international box office
0  The Truman Show               1996-06-05              125618201             264118201                 138500000
1  Rogue One: A Star Wars Story  2016-12-16              532177324            1055135598                 522958274
2  Iron Man                      2008-05-02              318604126             585171547                 266567421
3  Blade Runner                  1982-06-25               32656328              39535837                   6879509
4  Breakfast at Tiffany's        1961-10-05                9551904               9794721                    242817
Making a new column with operations
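Subtraction is only one of the operations mentioned above. As a minimal sketch, the block below uses division on a two-row subset of the movie data. The column names "domestic share" and "worldwide (millions)" are made up for this illustration and are not part of the guide's dataset.

```python
import pandas as pd

# Two rows of the movie data from this guide, so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "domestic box office": 318604126, "worldwide box office": 585171547},
    {"movie": "Blade Runner", "domestic box office": 32656328, "worldwide box office": 39535837},
])

# Column divided by column: fraction of the worldwide total earned domestically
# ("domestic share" is a hypothetical column name)
df["domestic share"] = df["domestic box office"] / df["worldwide box office"]

# Column divided by a constant: worldwide total expressed in millions
# ("worldwide (millions)" is a hypothetical column name)
df["worldwide (millions)"] = df["worldwide box office"] / 1_000_000

print(df)
```

In both cases pandas applies the operation row by row, so each movie gets its own result.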

The new column is now part of df. The assignment modifies the DataFrame in place, so the column stays available in every later step until you drop it or reassign df.
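If you would rather leave the original DataFrame untouched, one option is pandas' DataFrame.assign, which returns a new DataFrame containing the extra column. The sketch below uses a two-row subset of the movie data; the variable name with_international is chosen here for illustration only.

```python
import pandas as pd

# Two rows of the movie data from this guide, so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "domestic box office": 318604126, "worldwide box office": 585171547},
    {"movie": "Blade Runner", "domestic box office": 32656328, "worldwide box office": 39535837},
])

# assign() returns a copy with the new column; df itself is not changed.
# The column name contains spaces, so it is passed through a dictionary.
with_international = df.assign(
    **{"international box office": df["worldwide box office"] - df["domestic box office"]}
)

print(list(df.columns))                  # original columns only
print(list(with_international.columns))  # original columns plus the new one
```

Bracket assignment is the shorter choice when you want the column to stay on df; assign is handy when you want to keep the original for comparison.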

Create a New Column With New Data

We can add a new column of data directly with new data organized in a list. One reason we might want to add a column is when we obtain a brand new variable related to the DataFrame. Suppose you wanted to include a new column of your personal rating for each movie:

# The list needs exactly one value per row, in the same order as the rows of the DataFrame:
df["personal rating"] = [10, 9, 7, 8, 7]
df
   movie                         release date  domestic box office  worldwide box office  international box office  personal rating
0  The Truman Show               1996-06-05              125618201             264118201                 138500000               10
1  Rogue One: A Star Wars Story  2016-12-16              532177324            1055135598                 522958274                9
2  Iron Man                      2008-05-02              318604126             585171547                 266567421                7
3  Blade Runner                  1982-06-25               32656328              39535837                   6879509                8
4  Breakfast at Tiffany's        1961-10-05                9551904               9794721                    242817                7
Adding a "personal rating" column to the DataFrame
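Because a list is matched to rows purely by position, two things can go wrong: the list can have the wrong length, or its order can drift from the DataFrame's. The sketch below, on a three-row subset of the movie data, shows that pandas rejects a list of the wrong length, and then shows one possible way to avoid depending on order by looking ratings up by movie title. The dictionary name ratings_by_title is made up for this example.

```python
import pandas as pd

# Three movies from this guide, so the example runs on its own
df = pd.DataFrame({"movie": ["The Truman Show", "Iron Man", "Blade Runner"]})

# A list with too few values is rejected with a ValueError
try:
    df["personal rating"] = [10, 7]
    error_message = None
except ValueError as err:
    error_message = str(err)
    print("Could not add the column:", error_message)

# Alternative: key the ratings by title so the order of entry does not matter
# (ratings_by_title is a hypothetical helper, not part of the guide's data)
ratings_by_title = {"Blade Runner": 8, "The Truman Show": 10, "Iron Man": 7}
df["personal rating"] = df["movie"].map(ratings_by_title)

print(df)
```

The lookup approach costs a little more typing but removes the risk of silently attaching a rating to the wrong movie.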

Create a New Column With df.loc

If you have data for only a small number of observations, you can use .loc to modify a DataFrame by row index value and column name. The row index is the leftmost, bold column in a displayed DataFrame; by default it numbers the rows starting at 0. When using .loc, choose the row index value(s) of the rows you have information for. If the column does not exist yet, pandas creates it and fills every other row with NaN (a missing value).

Continuing our example, say we learned that the critics' rating for "Blade Runner" is an 8.9. Since "Blade Runner" is in the row with index 3, we add its critic rating with the following code:

df.loc[3, "critic rating"] = 8.9
df
   movie                         release date  domestic box office  worldwide box office  international box office  personal rating  critic rating
0  The Truman Show               1996-06-05              125618201             264118201                 138500000               10            NaN
1  Rogue One: A Star Wars Story  2016-12-16              532177324            1055135598                 522958274                9            NaN
2  Iron Man                      2008-05-02              318604126             585171547                 266567421                7            NaN
3  Blade Runner                  1982-06-25               32656328              39535837                   6879509                8            8.9
4  Breakfast at Tiffany's        1961-10-05                9551904               9794721                    242817                7            NaN
Adding a "critic rating" column with a value for a single row
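Looking up the index number by eye works for five rows but gets error-prone on larger data. As a minimal sketch, .loc also accepts a condition in place of the index value, so the row can be selected by movie title instead. The example uses a three-row subset of the movie data, and the variable name missing is chosen here for illustration.

```python
import pandas as pd

# Three movies from this guide, so the example runs on its own
df = pd.DataFrame({
    "movie": ["The Truman Show", "Iron Man", "Blade Runner"],
    "personal rating": [10, 7, 8],
})

# Select the row by title rather than by index number, then set its critic rating.
# The "critic rating" column is created here; all other rows are filled with NaN.
df.loc[df["movie"] == "Blade Runner", "critic rating"] = 8.9

# isna() marks the rows that still have no critic rating
missing = df["critic rating"].isna()

print(df)
print(missing)
```

Because NaN is a floating-point value, the new column holds decimal numbers even if you later fill it with whole-number ratings.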