Working with Dates and Times in a DataFrame


Dates and times can be tricky to deal with in a DataFrame. Even though we recognize these values as numbers, date and time data are often stored as strings. To properly analyze these values, we must first convert them to float or int dtypes using a function called pd.DatetimeIndex.

The Movie Dataset

To explore this function, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.

Reset Code Python Output:
movie release date domestic box office worldwide box office personal rating international box office
0 The Truman Show 1996-06-05 125618201 264118201 10 138500000
1 Rogue One: A Star Wars Story 2016-12-16 532177324 1055135598 9 522958274
2 Iron Man 2008-05-02 318604126 585171547 7 266567421
3 Blade Runner 1982-06-25 32656328 39535837 8 6879509
4 Breakfast at Tiffany's 1961-10-05 9551904 9794721 7 242817

The datetime dtype

On its own, pd.DatetimeIndex recognizes data that represents a date and converts it into a new dtype called datetime. Usually this step can be written in the same line of code used to extract numbers from the date, but it is crucial to understanding the purpose of the function.

Currently, the "release date" column stores its data as strings. To convert it to the datetime dtype, input the column name as a data parameter:

Reset Code Python Output:
DatetimeIndex(['1996-06-05', '2016-12-16', '2008-05-02', '1982-06-25',
               '1961-10-05'],
              dtype='datetime64[ns]', name='release date', freq=None)

As you can see, the new dtype is datetime64[ns], which is easier to manipulate for our purposes.

If we're just concerned with converting the data to the datetime dtype, we can also use the function pd.to_datetime. The output of this function is a bit cleaner than that of pd.DatetimeIndex:

Reset Code Python Output:
0   1996-06-05
1   2016-12-16
2   2008-05-02
3   1982-06-25
4   1961-10-05
Name: release date, dtype: datetime64[ns]

However, pd.DatetimeIndex is more versatile and allows us to work with separate components of the date, so that's what we'll work with moving forward.

With either function, you can store the converted data as a new column in your DataFrame. Notice that even though the data looks exactly the same as the "release date" column, pandas recognizes it as a different dtype.

Reset Code Python Output:
movie release date domestic box office worldwide box office personal rating international box office release date datetime
0 The Truman Show 1996-06-05 125618201 264118201 10 138500000 1996-06-05
1 Rogue One: A Star Wars Story 2016-12-16 532177324 1055135598 9 522958274 2016-12-16
2 Iron Man 2008-05-02 318604126 585171547 7 266567421 2008-05-02
3 Blade Runner 1982-06-25 32656328 39535837 8 6879509 1982-06-25
4 Breakfast at Tiffany's 1961-10-05 9551904 9794721 7 242817 1961-10-05
Reset Code Python Output:
movie                               object
release date                        object
domestic box office                  int64
worldwide box office                 int64
personal rating                      int64
international box office             int64
release date datetime       datetime64[ns]
dtype: object

Extracting Date Information

Now that we understand what pd.DatetimeIndex does, we can use it to extract date information.

Each part of the date — day, month, and year — has its own attribute that allows us to isolate it from the rest of the entry. Some of the most common and most useful attributes are:

  • .year
  • .month
  • .day
  • .dayofyear
    • (Between 1 and 365)
  • .weekday
    • (Between 0 (Monday) and 6 (Sunday)

and more!

For example, we can create a new column with just the release year of each movie by attaching .year to the end of the function:

Reset Code Python Output:
movie release date domestic box office worldwide box office personal rating international box office release year
0 The Truman Show 1996-06-05 125618201 264118201 10 138500000 1996
1 Rogue One: A Star Wars Story 2016-12-16 532177324 1055135598 9 522958274 2016
2 Iron Man 2008-05-02 318604126 585171547 7 266567421 2008
3 Blade Runner 1982-06-25 32656328 39535837 8 6879509 1982
4 Breakfast at Tiffany's 1961-10-05 9551904 9794721 7 242817 1961

The "release year" column will automatically be stored as an int dtype, allowing us to perform numerical analysis — we can calculate the average release year, earliest and latest release years, and more for the movies in this dataset. Before applying pd.DatetimeIndex, this was impossible because the "release date" entries were stored as strings!

Reset Code Python Output:
1992.6 1961 2016

Extracting Time Information

pd.DatetimeIndex can be applied similarly to time data. We'll use a very simple DataFrame to demonstrate this example:

Reset Code Python Output:
time
0 1:01
1 2:30
2 4:18
3 11:11
4 17:58
5 23:46

Time information can be expressed with the datetime dtype as well; pd.DatetimeIndex allows Pandas to recognize components of time measurement. By default, Pandas will assign the current date to each entry, but this usually doesn't affect our analysis.

Reset Code Python Output:
DatetimeIndex(['2022-08-03 01:01:00', '2022-08-03 02:30:00',
               '2022-08-03 04:18:00', '2022-08-03 11:11:00',
               '2022-08-03 17:58:00', '2022-08-03 23:46:00'],
              dtype='datetime64[ns]', name='time', freq=None)

The time-related attributes of pd.DatetimeIndex include:

  • .hour
  • .minute
  • .second
  • .microsecond
  • .nanosecond

For example, we can extract the minute component of every entry in "time":

Reset Code Python Output:
time minutes
0 1:01 1
1 2:30 30
2 4:18 18
3 11:11 11
4 17:58 58
5 23:46 46

Again, the entries in "minutes" have the integer dtype, which enables numerical analysis. We can apply any of the usual Pandas functions to this new column!