Working with Dates and Times in a DataFrame
Dates and times can be tricky to deal with in a DataFrame. Even though we recognize these values as numbers, date and time data are often stored as strings. To properly analyze these values, we must first convert them to float or int dtypes using a function called pd.DatetimeIndex
.
The Movie Dataset
To explore this function, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.
import pandas as pd
#Creates a DataFrame of "movie", "release date", "domestic gross", "worldwide gross", "personal rating", and "international box office" columns
df = pd.DataFrame([
{"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},
{"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},
{"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},
{"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},
{"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
])
df
movie | release date | domestic box office | worldwide box office | personal rating | international box office | |
---|---|---|---|---|---|---|
0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 | 138500000 |
1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 | 522958274 |
2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 | 266567421 |
3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 | 6879509 |
4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 | 242817 |
The datetime dtype
On its own, pd.DatetimeIndex
recognizes data that represents a date and converts it into a new dtype called datetime. Usually this step can be written in the same line of code used to extract numbers from the date, but it is crucial to understanding the purpose of the function.
Currently, the "release date"
column stores its data as strings. To convert it to the datetime dtype, input the column name as a data
parameter:
As you can see, the new dtype is datetime64[ns]
, which is easier to manipulate for our purposes.
If we're just concerned with converting the data to the datetime dtype, we can also use the function pd.to_datetime
. The output of this function is a bit cleaner than that of pd.DatetimeIndex
:
However, pd.DatetimeIndex
is more versatile and allows us to work with separate components of the date, so that's what we'll work with moving forward.
With either function, you can store the converted data as a new column in your DataFrame. Notice that even though the data looks exactly the same as the "release date"
column, pandas recognizes it as a different dtype.
df["release date datetime"] = pd.DatetimeIndex(data = df["release date"])
df
movie | release date | domestic box office | worldwide box office | personal rating | international box office | release date datetime | |
---|---|---|---|---|---|---|---|
0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 | 138500000 | 1996-06-05 |
1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 | 522958274 | 2016-12-16 |
2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 | 266567421 | 2008-05-02 |
3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 | 6879509 | 1982-06-25 |
4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 | 242817 | 1961-10-05 |
Extracting Date Information
Now that we understand what pd.DatetimeIndex
does, we can use it to extract date information.
Each part of the date — day, month, and year — has its own attribute that allows us to isolate it from the rest of the entry. Some of the most common and most useful attributes are:
.year
.month
.day
.dayofyear
- (Between 1 and 365)
.weekday
- (Between 0 (Monday) and 6 (Sunday)
and more!
For example, we can create a new column with just the release year of each movie by attaching .year
to the end of the function:
df["release year"] = pd.DatetimeIndex(df["release date"]).year
df
movie | release date | domestic box office | worldwide box office | personal rating | international box office | release year | |
---|---|---|---|---|---|---|---|
0 | The Truman Show | 1996-06-05 | 125618201 | 264118201 | 10 | 138500000 | 1996 |
1 | Rogue One: A Star Wars Story | 2016-12-16 | 532177324 | 1055135598 | 9 | 522958274 | 2016 |
2 | Iron Man | 2008-05-02 | 318604126 | 585171547 | 7 | 266567421 | 2008 |
3 | Blade Runner | 1982-06-25 | 32656328 | 39535837 | 8 | 6879509 | 1982 |
4 | Breakfast at Tiffany's | 1961-10-05 | 9551904 | 9794721 | 7 | 242817 | 1961 |
The "release year"
column will automatically be stored as an int dtype, allowing us to perform numerical analysis — we can calculate the average release year, earliest and latest release years, and more for the movies in this dataset. Before applying pd.DatetimeIndex
, this was impossible because the "release date"
entries were stored as strings!
Extracting Time Information
pd.DatetimeIndex
can be applied similarly to time data. We'll use a very simple DataFrame to demonstrate this example:
df2 = pd.DataFrame([
{"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}
])
df2
time | |
---|---|
0 | 1:01 |
1 | 2:30 |
2 | 4:18 |
3 | 11:11 |
4 | 17:58 |
5 | 23:46 |
Time information can be expressed with the datetime dtype as well; pd.DatetimeIndex
allows Pandas to recognize components of time measurement. By default, Pandas will assign the current date to each entry, but this usually doesn't affect our analysis.
The time-related attributes of pd.DatetimeIndex
include:
.hour
.minute
.second
.microsecond
.nanosecond
For example, we can extract the minute component of every entry in "time"
:
df2["minutes"] = pd.DatetimeIndex(df2["time"]).minute
df2
time | minutes | |
---|---|---|
0 | 1:01 | 1 |
1 | 2:30 | 30 |
2 | 4:18 | 18 |
3 | 11:11 | 11 |
4 | 17:58 | 58 |
5 | 23:46 | 46 |
Again, the entries in "minutes"
have the integer dtype, which enables numerical analysis. We can apply any of the usual Pandas functions to this new column!