Working With Columns and Series in a DataFrame


Manipulating columns — sometimes called "variables" — is a foundational data science skill.

The Movie Dataset

To explore columns and series, we'll use a DataFrame of five different movies, including information about their release date and how much money they made in US dollars.

import pandas as pd

# Creates a DataFrame with "movie", "release date", "domestic box office", and "worldwide box office" columns
df = pd.DataFrame([ 
  {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201}, 
  {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598}, 
  {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547}, 
  {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837}, 
  {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721}
])
df
                          movie  release date  domestic box office  worldwide box office
0               The Truman Show    1996-06-05            125618201             264118201
1  Rogue One: A Star Wars Story    2016-12-16            532177324            1055135598
2                      Iron Man    2008-05-02            318604126             585171547
3                  Blade Runner    1982-06-25             32656328              39535837
4        Breakfast at Tiffany's    1961-10-05              9551904               9794721
Creating a DataFrame to work with columns

DataFrames vs. Series

Both DataFrames and Series are data structures that pandas uses to store data, but they differ in shape. A Series is one-dimensional, like a labeled list of values. A DataFrame is two-dimensional, like a table with rows and columns.

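To build a DataFrame from data already in memory, use the pandas constructor below. The data parameter accepts table-like structures such as a list of dictionaries, a dictionary of lists, or a two-dimensional NumPy array. Data stored in files is loaded with a reader function instead, such as pd.read_csv for .csv files or pd.read_excel for Excel files. Store the result in a variable so you can apply pandas operations to it later.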

df = pd.DataFrame(data = ...)
Making a DataFrame
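
As a rough sketch of what the data parameter can hold, the snippet below builds the same small table two ways: from a dictionary of lists and from a list of dictionaries. The variable names and the two-movie sample are chosen purely for illustration.

```python
import pandas as pd

# Illustrative sample: two rows borrowed from the movie dataset
from_dict = pd.DataFrame({
    "movie": ["Iron Man", "Blade Runner"],
    "domestic box office": [318604126, 32656328],
})

from_records = pd.DataFrame([
    {"movie": "Iron Man", "domestic box office": 318604126},
    {"movie": "Blade Runner", "domestic box office": 32656328},
])

# Compare the two tables; equals() checks values, dtypes, and labels
print(from_dict.equals(from_records))
```

Which form to use mostly depends on how the data arrives: a dictionary of lists is convenient when you already have whole columns, while a list of dictionaries suits data collected one row at a time.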

To load existing data into a Series, use the pandas function below. Here, the data parameter can be filled with any one-dimensional data structure, including DataFrame columns, lists, and dictionaries. After storing the Series in a variable, you can work with it and apply functions to it just like a DataFrame.

series = pd.Series(data = ...)
Making a Series

Note: If data is your only parameter, you don't have to write data = in the parentheses.
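
To make the one-dimensional case concrete, here is a small sketch that builds a Series from a list and from a dictionary, passing the data without the data = keyword. The variable names are hypothetical; the numbers are worldwide box office values from the movie dataset.

```python
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2)
gross_from_list = pd.Series([264118201, 1055135598, 585171547])

# From a dictionary: the keys become the index labels
gross_from_dict = pd.Series({
    "The Truman Show": 264118201,
    "Iron Man": 585171547,
})

print(gross_from_list.index.tolist())  # [0, 1, 2]
print(gross_from_dict["Iron Man"])     # 585171547
```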

Series within DataFrames

We can think of a DataFrame as a bunch of Series put together to make a table. In this context, a Series is a single column of a DataFrame, but in list form. To pull a Series out of a DataFrame, use a single set of brackets around the column name:

worldboxoffice_series = df["worldwide box office"]
worldboxoffice_series
0     264118201
1    1055135598
2     585171547
3      39535837
4       9794721
Name: worldwide box office, dtype: int64
Pulling a Series from the movie DataFrame
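
Once a column has been pulled out, it behaves like any other Series. The sketch below uses a hypothetical two-row stand-in for the movie DataFrame to check the type of the result and call a couple of Series methods on it.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie DataFrame
df = pd.DataFrame({
    "movie": ["Iron Man", "Blade Runner"],
    "worldwide box office": [585171547, 39535837],
})

# Single brackets return the column as a Series
worldboxoffice_series = df["worldwide box office"]

print(type(worldboxoffice_series).__name__)  # Series
print(worldboxoffice_series.name)            # worldwide box office
print(worldboxoffice_series.max())           # 585171547
```

The Series keeps the column name as its name attribute, which is why the printed output ends with a "Name:" line.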

Creating a DataFrame With a Subset of Columns

Sometimes, we want to select a certain group of columns from a DataFrame instead of looking at the whole thing.

Remember to store the result in a new variable (df2, for example) so you don't overwrite your original DataFrame!

Single Column

The code for creating a new DataFrame with one column involves double square brackets. The two sets of brackets are important, as they keep the data in the two-dimensional table format. With a single set of brackets, we'd get a one-dimensional Series instead.

df_release_date = df[["release date"]]
df_release_date
   release date
0    1996-06-05
1    2016-12-16
2    2008-05-02
3    1982-06-25
4    1961-10-05
Creating a new DataFrame with one column
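
The difference between single and double brackets is easiest to see by inspecting the results. This is a minimal sketch on a hypothetical two-row DataFrame; the names as_series and as_frame are illustrative.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie DataFrame
df = pd.DataFrame({
    "movie": ["Iron Man", "Blade Runner"],
    "release date": ["2008-05-02", "1982-06-25"],
})

as_series = df["release date"]    # single brackets: one-dimensional
as_frame = df[["release date"]]   # double brackets: two-dimensional

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```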

Multiple Columns

Creating a DataFrame with a subset of multiple columns is similar. This time, put all of the column names you want inside the inner brackets, which form a list, and separate them with commas.

df_box_office_comparison = df[["movie", "domestic box office", "worldwide box office"]]
df_box_office_comparison
                          movie  domestic box office  worldwide box office
0               The Truman Show            125618201             264118201
1  Rogue One: A Star Wars Story            532177324            1055135598
2                      Iron Man            318604126             585171547
3                  Blade Runner             32656328              39535837
4        Breakfast at Tiffany's              9551904               9794721
Creating a new DataFrame with a subset of multiple columns
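
Because the inner brackets are just a Python list, the column names can also be kept in a variable first. The sketch below assumes a hypothetical two-row DataFrame and an illustrative columns_to_keep list, and checks that the original DataFrame still has all of its columns afterward.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie DataFrame
df = pd.DataFrame({
    "movie": ["Iron Man", "Blade Runner"],
    "release date": ["2008-05-02", "1982-06-25"],
    "worldwide box office": [585171547, 39535837],
})

# Store the column names in a list, then pass that list to the DataFrame
columns_to_keep = ["movie", "worldwide box office"]
df_subset = df[columns_to_keep]

print(df_subset.columns.tolist())  # ['movie', 'worldwide box office']
print(df.columns.tolist())         # ['movie', 'release date', 'worldwide box office']
```

Keeping the list in a variable can be handy when the same set of columns is reused in several places.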

Renaming Columns

It can be helpful to rename a column. You might want to have a name that's more self-explanatory, easier to remember, or easier to work with in code. This is especially true for datasets created by other people.

For example, you might think that title is a better name for the movie column, just because it makes more sense to you. Remember to assign the result to a variable to save it, since rename returns a new DataFrame rather than changing the original. Here's how to rename a column with pandas:

df_rename = df.rename(columns = {"movie": "title"})
df_rename
                          title  release date  domestic box office  worldwide box office
0               The Truman Show    1996-06-05            125618201             264118201
1  Rogue One: A Star Wars Story    2016-12-16            532177324            1055135598
2                      Iron Man    2008-05-02            318604126             585171547
3                  Blade Runner    1982-06-25             32656328              39535837
4        Breakfast at Tiffany's    1961-10-05              9551904               9794721
Renaming a column in a DataFrame

Take note of the curly braces inside the parentheses! Your new DataFrame will look exactly the same except for the name you changed.
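
The curly braces form a dictionary that maps old names to new ones, so several columns can be renamed in one call. In this sketch the DataFrame is a hypothetical two-row stand-in, and the new name "released" is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie DataFrame
df = pd.DataFrame({
    "movie": ["Iron Man", "Blade Runner"],
    "release date": ["2008-05-02", "1982-06-25"],
})

# Each key is an existing column name; each value is its replacement
df_rename = df.rename(columns={"movie": "title", "release date": "released"})

print(df_rename.columns.tolist())  # ['title', 'released']
print(df.columns.tolist())         # ['movie', 'release date']
```

The second print shows that the original DataFrame keeps its old column names, which is why the result needs its own variable.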