Working with Dates and Times in a DataFrame

Dates and times can be tricky to deal with in a DataFrame. Even though we recognize these values as numbers, date and time data are often stored as strings. To properly analyze these values, we must first convert them to float or int dtypes using a function called pd.DatetimeIndex.

The Movie Dataset

To explore this function, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.

import pandas as pd

#Creates a DataFrame of "movie", "release date", "domestic gross", "worldwide gross", "personal rating", and "international box office" columns
df = pd.DataFrame([
  {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000}, 
  {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274}, 
  {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421}, 
  {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509}, 
  {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
  ])
df

	movie	release date	domestic box office	worldwide box office	personal rating	international box office
0	The Truman Show	1996-06-05	125618201	264118201	10	138500000
1	Rogue One: A Star Wars Story	2016-12-16	532177324	1055135598	9	522958274
2	Iron Man	2008-05-02	318604126	585171547	7	266567421
3	Blade Runner	1982-06-25	32656328	39535837	8	6879509
4	Breakfast at Tiffany's	1961-10-05	9551904	9794721	7	242817

Creating the movie DataFrame

The datetime dtype

On its own, pd.DatetimeIndex recognizes data that represents a date and converts it into a new dtype called datetime. Usually this step can be written in the same line of code used to extract numbers from the date, but it is crucial to understanding the purpose of the function.

Currently, the "release date" column stores its data as strings. To convert it to the datetime dtype, input the column name as a data parameter:

#Usually data = is unnecessary here, as it is the default parameter.
pd.DatetimeIndex(data = df["release date"])

DatetimeIndex(['1996-06-05', '2016-12-16', '2008-05-02', '1982-06-25',
               '1961-10-05'],
              dtype='datetime64[ns]', name='release date', freq=None)

Converting release date to datetime

As you can see, the new dtype is datetime64[ns], which is easier to manipulate for our purposes.

If we're just concerned with converting the data to the datetime dtype, we can also use the function pd.to_datetime. The output of this function is a bit cleaner than that of pd.DatetimeIndex:

pd.to_datetime(df["release date"])

0   1996-06-05
1   2016-12-16
2   2008-05-02
3   1982-06-25
4   1961-10-05
Name: release date, dtype: datetime64[ns]

Demonstrating another function

However, pd.DatetimeIndex is more versatile and allows us to work with separate components of the date, so that's what we'll work with moving forward.

With either function, you can store the converted data as a new column in your DataFrame. Notice that even though the data looks exactly the same as the "release date" column, pandas recognizes it as a different dtype.

df["release date datetime"] = pd.DatetimeIndex(data = df["release date"])
df

	movie	release date	domestic box office	worldwide box office	personal rating	international box office	release date datetime
0	The Truman Show	1996-06-05	125618201	264118201	10	138500000	1996-06-05
1	Rogue One: A Star Wars Story	2016-12-16	532177324	1055135598	9	522958274	2016-12-16
2	Iron Man	2008-05-02	318604126	585171547	7	266567421	2008-05-02
3	Blade Runner	1982-06-25	32656328	39535837	8	6879509	1982-06-25
4	Breakfast at Tiffany's	1961-10-05	9551904	9794721	7	242817	1961-10-05

Storing datetime data in a new column

df.dtypes

movie                               object
release date                        object
domestic box office                  int64
worldwide box office                 int64
personal rating                      int64
international box office             int64
release date datetime       datetime64[ns]
dtype: object

dtypes of all columns

Extracting Date Information

Now that we understand what pd.DatetimeIndex does, we can use it to extract date information.

Each part of the date — day, month, and year — has its own attribute that allows us to isolate it from the rest of the entry. Some of the most common and most useful attributes are:

.year
.month
.day
.dayofyear
- (Between 1 and 365)
.weekday
- (Between 0 (Monday) and 6 (Sunday)

and more!

For example, we can create a new column with just the release year of each movie by attaching .year to the end of the function:

df["release year"] = pd.DatetimeIndex(df["release date"]).year
df

	movie	release date	domestic box office	worldwide box office	personal rating	international box office	release year
0	The Truman Show	1996-06-05	125618201	264118201	10	138500000	1996
1	Rogue One: A Star Wars Story	2016-12-16	532177324	1055135598	9	522958274	2016
2	Iron Man	2008-05-02	318604126	585171547	7	266567421	2008
3	Blade Runner	1982-06-25	32656328	39535837	8	6879509	1982
4	Breakfast at Tiffany's	1961-10-05	9551904	9794721	7	242817	1961

Extracting year from the release date column

The "release year" column will automatically be stored as an int dtype, allowing us to perform numerical analysis — we can calculate the average release year, earliest and latest release years, and more for the movies in this dataset. Before applying pd.DatetimeIndex, this was impossible because the "release date" entries were stored as strings!

yr_mean = df["release year"].mean()
yr_min = df["release year"].min()
yr_max = df["release year"].max()
print(yr_mean, yr_min, yr_max)

1992.6 1961 2016

Analyzing release year of movies

Extracting Time Information

pd.DatetimeIndex can be applied similarly to time data. We'll use a very simple DataFrame to demonstrate this example:

df2 = pd.DataFrame([
  {"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}
  ])
df2

	time
0	1:01
1	2:30
2	4:18
3	11:11
4	17:58
5	23:46

Creating a very simple time DataFrame

Time information can be expressed with the datetime dtype as well; pd.DatetimeIndex allows Pandas to recognize components of time measurement. By default, Pandas will assign the current date to each entry, but this usually doesn't affect our analysis.

pd.DatetimeIndex(df2["time"])

DatetimeIndex(['2022-08-03 01:01:00', '2022-08-03 02:30:00',
               '2022-08-03 04:18:00', '2022-08-03 11:11:00',
               '2022-08-03 17:58:00', '2022-08-03 23:46:00'],
              dtype='datetime64[ns]', name='time', freq=None)

Converting to datetime dtype

The time-related attributes of pd.DatetimeIndex include:

.hour
.minute
.second
.microsecond
.nanosecond

For example, we can extract the minute component of every entry in "time":

df2["minutes"] = pd.DatetimeIndex(df2["time"]).minute
df2

	time	minutes
0	1:01	1
1	2:30	30
2	4:18	18
3	11:11	11
4	17:58	58
5	23:46	46

Extracting minutes from time column

Again, the entries in "minutes" have the integer dtype, which enables numerical analysis. We can apply any of the usual Pandas functions to this new column!