🏠 Data Science Guides Working with Dates and Times in a DataFrame Dates and times can be tricky to deal with in a DataFrame. Even though we recognize these values as numbers, date and time data are often stored as strings. To properly analyze these values, we must first convert them to float or int dtypes using a function called pd.DatetimeIndex
.
The Movie Dataset To explore this function, we'll use a DataFrame of five different movies, including information about their release date, how much money they made in US dollars, and a personal rating out of 10.
import pandas as pd\n \n#Creates a DataFrame of "movie", "release date", "domestic gross", "worldwide gross", "personal rating", and "international box office" columns\ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\ndf Run Code
Reset Code Python Output:
movie
release date
domestic box office
worldwide box office
personal rating
international box office
0
The Truman Show
1996-06-05
125618201
264118201
10
138500000
1
Rogue One: A Star Wars Story
2016-12-16
532177324
1055135598
9
522958274
2
Iron Man
2008-05-02
318604126
585171547
7
266567421
3
Blade Runner
1982-06-25
32656328
39535837
8
6879509
4
Breakfast at Tiffany's
1961-10-05
9551904
9794721
7
242817
The datetime dtype On its own, pd.DatetimeIndex
recognizes data that represents a date and converts it into a new dtype called datetime. Usually this step can be written in the same line of code used to extract numbers from the date, but it is crucial to understanding the purpose of the function.
Currently, the "release date"
column stores its data as strings. To convert it to the datetime dtype, input the column name as a data
parameter:
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\n#Usually data = is unnecessary here, as it is the default parameter.\npd.DatetimeIndex(data = df["release date"]) Run Code
Reset Code Python Output:
DatetimeIndex(['1996-06-05', '2016-12-16', '2008-05-02', '1982-06-25',
'1961-10-05'],
dtype='datetime64[ns]', name='release date', freq=None)
As you can see, the new dtype is datetime64[ns]
, which is easier to manipulate for our purposes.
If we're just concerned with converting the data to the datetime dtype, we can also use the function pd.to_datetime
. The output of this function is a bit cleaner than that of pd.DatetimeIndex
:
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\npd.to_datetime(df["release date"]) Run Code
Reset Code Python Output:
0 1996-06-05
1 2016-12-16
2 2008-05-02
3 1982-06-25
4 1961-10-05
Name: release date, dtype: datetime64[ns]
However, pd.DatetimeIndex
is more versatile and allows us to work with separate components of the date, so that's what we'll work with moving forward.
With either function, you can store the converted data as a new column in your DataFrame. Notice that even though the data looks exactly the same as the "release date"
column, pandas recognizes it as a different dtype.
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\ndf["release date datetime"] = pd.DatetimeIndex(data = df["release date"])\ndf Run Code
Reset Code Python Output:
movie
release date
domestic box office
worldwide box office
personal rating
international box office
release date datetime
0
The Truman Show
1996-06-05
125618201
264118201
10
138500000
1996-06-05
1
Rogue One: A Star Wars Story
2016-12-16
532177324
1055135598
9
522958274
2016-12-16
2
Iron Man
2008-05-02
318604126
585171547
7
266567421
2008-05-02
3
Blade Runner
1982-06-25
32656328
39535837
8
6879509
1982-06-25
4
Breakfast at Tiffany's
1961-10-05
9551904
9794721
7
242817
1961-10-05
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\ndf.dtypes Run Code
Reset Code Python Output:
movie object
release date object
domestic box office int64
worldwide box office int64
personal rating int64
international box office int64
release date datetime datetime64[ns]
dtype: object
Now that we understand what pd.DatetimeIndex
does, we can use it to extract date information.
Each part of the date — day, month, and year — has its own attribute that allows us to isolate it from the rest of the entry. Some of the most common and most useful attributes are:
.year
.month
.day
.dayofyear
.weekday
(Between 0 (Monday) and 6 (Sunday) and more!
For example, we can create a new column with just the release year of each movie by attaching .year
to the end of the function:
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\ndf["release year"] = pd.DatetimeIndex(df["release date"]).year\ndf Run Code
Reset Code Python Output:
movie
release date
domestic box office
worldwide box office
personal rating
international box office
release year
0
The Truman Show
1996-06-05
125618201
264118201
10
138500000
1996
1
Rogue One: A Star Wars Story
2016-12-16
532177324
1055135598
9
522958274
2016
2
Iron Man
2008-05-02
318604126
585171547
7
266567421
2008
3
Blade Runner
1982-06-25
32656328
39535837
8
6879509
1982
4
Breakfast at Tiffany's
1961-10-05
9551904
9794721
7
242817
1961
The "release year"
column will automatically be stored as an int dtype, allowing us to perform numerical analysis — we can calculate the average release year, earliest and latest release years, and more for the movies in this dataset. Before applying pd.DatetimeIndex
, this was impossible because the "release date"
entries were stored as strings!
import pandas as pd\n \ndf = pd.DataFrame([\n {"movie": "The Truman Show", "release date": "1996-06-05", "domestic box office": 125618201, "worldwide box office": 264118201, "personal rating": 10, "international box office": 138500000},\n {"movie": "Rogue One: A Star Wars Story", "release date": "2016-12-16", "domestic box office": 532177324, "worldwide box office": 1055135598, "personal rating": 9, "international box office": 522958274},\n {"movie": "Iron Man", "release date": "2008-05-02", "domestic box office": 318604126, "worldwide box office": 585171547, "personal rating": 7, "international box office": 266567421},\n {"movie": "Blade Runner", "release date": "1982-06-25", "domestic box office": 32656328, "worldwide box office": 39535837, "personal rating": 8, "international box office": 6879509},\n {"movie": "Breakfast at Tiffany's", "release date": "1961-10-05", "domestic box office": 9551904, "worldwide box office": 9794721, "personal rating": 7, "international box office": 242817}\n ])\ndf["release year"] = pd.DatetimeIndex(df["release date"]).year\nyr_mean = df["release year"].mean()\nyr_min = df["release year"].min()\nyr_max = df["release year"].max()\n(yr_mean, yr_min, yr_max) Run Code
Reset Code Python Output:
pd.DatetimeIndex
can be applied similarly to time data. We'll use a very simple DataFrame to demonstrate this example:
import pandas as pd\n \ndf2 = pd.DataFrame([\n {"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}\n ])\ndf2 Run Code
Reset Code Python Output:
time
0
1:01
1
2:30
2
4:18
3
11:11
4
17:58
5
23:46
Time information can be expressed with the datetime dtype as well; pd.DatetimeIndex
allows Pandas to recognize components of time measurement. By default, Pandas will assign the current date to each entry, but this usually doesn't affect our analysis.
import pandas as pd\n \ndf2 = pd.DataFrame([\n {"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}\n ])\npd.DatetimeIndex(df2["time"]) Run Code
Reset Code Python Output:
DatetimeIndex(['2022-08-03 01:01:00', '2022-08-03 02:30:00',
'2022-08-03 04:18:00', '2022-08-03 11:11:00',
'2022-08-03 17:58:00', '2022-08-03 23:46:00'],
dtype='datetime64[ns]', name='time', freq=None)
The time-related attributes of pd.DatetimeIndex
include:
.hour
.minute
.second
.microsecond
.nanosecond
For example, we can extract the minute component of every entry in "time"
:
import pandas as pd\n \ndf2 = pd.DataFrame([\n {"time": "1:01"}, {"time": "2:30"}, {"time": "4:18"}, {"time": "11:11"}, {"time": "17:58"}, {"time": "23:46"}\n ])\ndf2["minutes"] = pd.DatetimeIndex(df2["time"]).minute\ndf2 Run Code
Reset Code Python Output:
time
minutes
0
1:01
1
1
2:30
30
2
4:18
18
3
11:11
11
4
17:58
58
5
23:46
46
Again, the entries in "minutes"
have the integer dtype, which enables numerical analysis. We can apply any of the usual Pandas functions to this new column!