Working with Dates and Times in a DataFrame

Dates and times can be tricky to deal with in a DataFrame. Even though we recognize these values as numbers, date and time data are often stored as strings. To properly analyze these values, we must first convert them to float or int dtypes using a function called pd.DatetimeIndex.

The Movie Dataset

To explore this function, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.

import pandas as pd

#Creates a DataFrame of "movie", "release date", "domestic gross", "worldwide gross", "personal rating", and "international box office" columns
df = pd.DataFrame([
  {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000}, 
  {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274}, 
  {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421}, 
  {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509}, 
  {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}
movierelease datedomestic box officeworldwide box officepersonal ratinginternational box office
0The Truman Show1996-06-0512561820126411820110138500000
1Rogue One: A Star Wars Story2016-12-1653217732410551355989522958274
2Iron Man2008-05-023186041265851715477266567421
3Blade Runner1982-06-25326563283953583786879509
4Breakfast at Tiffany's1961-10-05955190497947217242817
Creating the movie DataFrame

The datetime dtype

On its own, pd.DatetimeIndex recognizes data that represents a date and converts it into a new dtype called datetime. Usually this step can be written in the same line of code used to extract numbers from the date, but it is crucial to understanding the purpose of the function.

Currently, the "release date" column stores its data as strings. To convert it to the datetime dtype, input the column name as a data parameter:

#Usually data = is unnecessary here, as it is the default parameter.
pd.DatetimeIndex(data = df["release date"])
DatetimeIndex(['1996-06-05', '2016-12-16', '2008-05-02', '1982-06-25',
              dtype='datetime64[ns]', name='release date', freq=None)
Converting release date to datetime

As you can see, the new dtype is datetime64[ns], which is easier to manipulate for our purposes.

If we're just concerned with converting the data to the datetime dtype, we can also use the function pd.to_datetime. The output of this function is a bit cleaner than that of pd.DatetimeIndex:

pd.to_datetime(df["release date"])
0   1996-06-05
1   2016-12-16
2   2008-05-02
3   1982-06-25
4   1961-10-05
Name: release date, dtype: datetime64[ns]
Demonstrating another function

However, pd.DatetimeIndex is more versatile and allows us to work with separate components of the date, so that's what we'll work with moving forward.

With either function, you can store the converted data as a new column in your DataFrame. Notice that even though the data looks exactly the same as the "release date" column, pandas recognizes it as a different dtype.

df["release date datetime"] = pd.DatetimeIndex(data = df["release date"])
movierelease datedomestic box officeworldwide box officepersonal ratinginternational box officerelease date datetime
0The Truman Show1996-06-05125618201264118201101385000001996-06-05
1Rogue One: A Star Wars Story2016-12-16532177324105513559895229582742016-12-16
2Iron Man2008-05-0231860412658517154772665674212008-05-02
3Blade Runner1982-06-253265632839535837868795091982-06-25
4Breakfast at Tiffany's1961-10-059551904979472172428171961-10-05
Storing datetime data in a new column
movie                               object
release date                        object
domestic box office                  int64
worldwide box office                 int64
personal rating                      int64
international box office             int64
release date datetime       datetime64[ns]
dtype: object
dtypes of all columns

Extracting Date Information

Now that we understand what pd.DatetimeIndex does, we can use it to extract date information.

Each part of the date — day, month, and year — has its own attribute that allows us to isolate it from the rest of the entry. Some of the most common and most useful attributes are:

  • .year
  • .month
  • .day
  • .dayofyear
    • (Between 1 and 365)
  • .weekday
    • (Between 0 (Monday) and 6 (Sunday)

and more!

For example, we can create a new column with just the release year of each movie by attaching .year to the end of the function:

df["release year"] = pd.DatetimeIndex(df["release date"]).year
movierelease datedomestic box officeworldwide box officepersonal ratinginternational box officerelease year
0The Truman Show1996-06-05125618201264118201101385000001996
1Rogue One: A Star Wars Story2016-12-16532177324105513559895229582742016
2Iron Man2008-05-0231860412658517154772665674212008
3Blade Runner1982-06-253265632839535837868795091982
4Breakfast at Tiffany's1961-10-059551904979472172428171961
Extracting year from the release date column

The "release year" column will automatically be stored as an int dtype, allowing us to perform numerical analysis — we can calculate the average release year, earliest and latest release years, and more for the movies in this dataset. Before applying pd.DatetimeIndex, this was impossible because the "release date" entries were stored as strings!

yr_mean = df["release year"].mean()
yr_min = df["release year"].min()
yr_max = df["release year"].max()
print(yr_mean, yr_min, yr_max)
1992.6 1961 2016
Analyzing release year of movies

Extracting Time Information

pd.DatetimeIndex can be applied similarly to time data. We'll use a very simple DataFrame to demonstrate this example:

df2 = pd.DataFrame([
  {"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}
Creating a very simple time DataFrame

Time information can be expressed with the datetime dtype as well; pd.DatetimeIndex allows Pandas to recognize components of time measurement. By default, Pandas will assign the current date to each entry, but this usually doesn't affect our analysis.

DatetimeIndex(['2022-08-03 01:01:00', '2022-08-03 02:30:00',
               '2022-08-03 04:18:00', '2022-08-03 11:11:00',
               '2022-08-03 17:58:00', '2022-08-03 23:46:00'],
              dtype='datetime64[ns]', name='time', freq=None)
Converting to datetime dtype

The time-related attributes of pd.DatetimeIndex include:

  • .hour
  • .minute
  • .second
  • .microsecond
  • .nanosecond

For example, we can extract the minute component of every entry in "time":

df2["minutes"] = pd.DatetimeIndex(df2["time"]).minute
Extracting minutes from time column

Again, the entries in "minutes" have the integer dtype, which enables numerical analysis. We can apply any of the usual Pandas functions to this new column!