Working With Columns and Series in a DataFrame
Manipulating columns — sometimes called "variables" — is a foundational data science skill.
The Movie Dataset
To explore columns and series, we'll use a DataFrame of five different movies, including information about their release date and how much money they made in US dollars.
import pandas as pd
# Creates a DataFrame with "movie", "release date", "domestic box office", and "worldwide box office" columns
df = pd.DataFrame([
{"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201},
{"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598},
{"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547},
{"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837},
{"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721}
])
df
| | movie | release date | domestic box office | worldwide box office |
---|---|---|---|---|
0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 |
1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 |
2 | Iron Man | 2008-05-02 | 318604126 | 585171547 |
3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 |
4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 |
DataFrames vs. Series
Both DataFrames and Series are data structures for storing data in pandas, but they differ in shape. A Series is one-dimensional, like a labeled list of values. A DataFrame is two-dimensional, like a table with rows and columns.
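To load existing data into a DataFrame, use pd.DataFrame(data = ...). The data parameter accepts two-dimensional data such as a list of dictionaries, a dictionary of lists, or a NumPy array. Data stored in .csv or Excel files is loaded with pd.read_csv() or pd.read_excel() instead, both of which return a DataFrame. Store the result in a variable so you can apply pandas operations to it later.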
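As a minimal sketch of building a DataFrame from data already in memory, the example below passes two common two-dimensional shapes to the constructor. The variable names (rows, df_small, df_small_from_columns) are made up for illustration, and the two-movie sample is a shortened stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical sample data: a list of dictionaries, one dictionary per row
rows = [
    {"movie": "Iron Man", "release date": "2008-05-02"},
    {"movie": "Blade Runner", "release date": "1982-06-25"},
]

# Pass the two-dimensional data to the DataFrame constructor
df_small = pd.DataFrame(data=rows)

# A dictionary of lists (one list per column) builds the same table
df_small_from_columns = pd.DataFrame(data={
    "movie": ["Iron Man", "Blade Runner"],
    "release date": ["2008-05-02", "1982-06-25"],
})

print(df_small.shape)                          # (2, 2): two rows, two columns
print(df_small.equals(df_small_from_columns))  # True
```

Either shape works; which one is more convenient usually depends on whether your data arrives row by row or column by column.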
To load existing data into a Series, use pd.Series(data = ...). Here, the data parameter accepts any one-dimensional data structure, including DataFrame columns, lists, and dictionaries. After storing the Series in a variable, you can work with it and apply functions to it much as you would with a DataFrame.
Note: If data is your only parameter, you don't have to write data = inside the parentheses.
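The sketch below shows one way the Series constructor might be used, both with and without the data keyword. The names titles, series_keyword, series_positional, and gross_by_title are assumptions chosen for this example, and the values are a small sample borrowed from the movie dataset.

```python
import pandas as pd

# Hypothetical one-dimensional data: a plain Python list
titles = ["Iron Man", "Blade Runner", "The Truman Show"]

# With the keyword written out
series_keyword = pd.Series(data=titles)

# With the data passed positionally; the result is the same
series_positional = pd.Series(titles)

# A dictionary also works: its keys become the index labels
gross_by_title = pd.Series({"Iron Man": 318604126, "Blade Runner": 32656328})

print(series_keyword.equals(series_positional))  # True
print(gross_by_title["Iron Man"])                # 318604126
```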
Series within DataFrames
We can think of a DataFrame as a bunch of Series put together to make a table. In this context, a Series is a single column of a DataFrame, pulled out as a one-dimensional, list-like object. To pull a Series out of a DataFrame, use a single set of brackets around the column name, as in df["movie"].
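Here is a small sketch of single-bracket selection. So that it runs on its own, it rebuilds a shortened two-movie copy of the DataFrame; movie_series is just an illustrative variable name.

```python
import pandas as pd

# Shortened copy of the movie DataFrame so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126},
    {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328},
])

# Single brackets return one column as a Series
movie_series = df["movie"]

print(isinstance(movie_series, pd.Series))  # True
print(movie_series.name)                    # movie
print(movie_series.tolist())                # ['Iron Man', 'Blade Runner']
```

The Series keeps the column name as its name attribute and shares the DataFrame's row index.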
Creating a DataFrame With a Subset of Columns
Sometimes, we want to select a certain group of columns from a DataFrame instead of looking at the whole thing.
Don't forget to store the result in a new variable (df2, for example) so you don't overwrite your original DataFrame!
Single Column
To create a new DataFrame with one column, use double square brackets. The two sets of brackets matter because they keep the data in two-dimensional table form: the inner pair makes a list of column names, and the outer pair selects those columns. With a single pair, we'd get a one-dimensional Series instead.
df_release_date = df[["release date"]]
df_release_date
| | release date |
---|---|
0 | 1996-06-05 |
1 | 2016-12-16 |
2 | 2008-05-02 |
3 | 1982-06-25 |
4 | 1961-10-05 |
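To see the difference between the two bracket styles, it can help to compare the shape of each result. This is a sketch on a shortened two-movie copy of the data; one_column_df and one_column_series are hypothetical names.

```python
import pandas as pd

# Shortened copy of the movie DataFrame so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "release date": "2008-05-02"},
    {"movie": "Blade Runner", "release date": "1982-06-25"},
])

# Double brackets: the result is still a two-dimensional DataFrame
one_column_df = df[["release date"]]

# Single brackets: the result is a one-dimensional Series
one_column_series = df["release date"]

print(one_column_df.shape)      # (2, 1): rows and columns
print(one_column_series.shape)  # (2,): rows only
```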
Multiple Columns
Creating a DataFrame with a subset of multiple columns is similar. This time, list every column you want inside the inner brackets, separating the column names with commas.
df_box_office_comparison = df[["movie", "domestic box office", "worldwide box office"]]
df_box_office_comparison
| | movie | domestic box office | worldwide box office |
---|---|---|---|
0 | The Truman Show | 125618201 | 264118201 |
1 | Rogue One: A Star Wars Story | 532177324 | 1055135598 |
2 | Iron Man | 318604126 | 585171547 |
3 | Blade Runner | 32656328 | 39535837 |
4 | Breakfast at Tiffany's | 9551904 | 9794721 |
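Because the inner brackets are an ordinary Python list, the list can also be stored in a variable first, and the new DataFrame follows the order of that list. The sketch below assumes a shortened two-movie copy of the data; columns_to_keep and df_subset are illustrative names.

```python
import pandas as pd

# Shortened copy of the movie DataFrame so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "domestic box office": 318604126, "worldwide box office": 585171547},
    {"movie": "Blade Runner", "domestic box office": 32656328, "worldwide box office": 39535837},
])

# The list of column names can live in its own variable
columns_to_keep = ["worldwide box office", "movie"]

# Columns appear in the order given in the list
df_subset = df[columns_to_keep]

print(list(df_subset.columns))  # ['worldwide box office', 'movie']
print(list(df.columns))         # the original DataFrame still has all three columns
```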
Renaming Columns
It can be helpful to rename a column. You might want to have a name that's more self-explanatory, easier to remember, or easier to work with in code. This is especially true for datasets created by other people.
For example, you might think that title is a better name for the movie column, just because it makes more sense to you. Remember to assign the result to a variable to save it. Here's how to rename a column with pandas:
df_rename = df.rename(columns = {"movie": "title"})
df_rename
| | title | release date | domestic box office | worldwide box office |
---|---|---|---|---|
0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 |
1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 |
2 | Iron Man | 2008-05-02 | 318604126 | 585171547 |
3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 |
4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 |
Take note of the curly braces inside the parentheses! They hold a dictionary that maps each old column name to its new one. Your new DataFrame will look exactly the same except for the name you changed.
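Since the curly braces hold a dictionary, several columns can be renamed in one call by adding more key and value pairs. The sketch below uses a shortened two-movie copy of the data, and the new name release_date is only an example of a name that is easier to type in code.

```python
import pandas as pd

# Shortened copy of the movie DataFrame so the example runs on its own
df = pd.DataFrame([
    {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126},
    {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328},
])

# Each key is an old column name; each value is its replacement
df_rename = df.rename(columns={"movie": "title", "release date": "release_date"})

print(list(df_rename.columns))  # ['title', 'release_date', 'domestic box office']
print(list(df.columns))         # ['movie', 'release date', 'domestic box office']
```

Columns left out of the dictionary keep their names, and the original df is unchanged because rename returns a new DataFrame by default.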