🏠 Data Science Guides Using Previous Observations when Computation Values in a DataFrame When you're analyzing data reported on a regular basis (ex: daily cases, monthly reports, etc), it is common to need to use the values from the previous one or more observations in your calculation. The df.column.shift(1) observation reports the value for a column from one observation earlier.
Below is a small subset of Johns Hopkins University's COVID-19 dataset. The JHU COVID-19 dataset contains the total cumulative number of COVID-19 cases in a given location:
import pandas as pd\ndf = pd.DataFrame([\n {"Date": "2020-05-01", "Province_State": "Illinois", "Confirmed": 56055 },\n {"Date": "2020-05-02", "Province_State": "Illinois", "Confirmed": 58505 },\n {"Date": "2020-05-03", "Province_State": "Illinois", "Confirmed": 61499 },\n {"Date": "2020-05-04", "Province_State": "Illinois", "Confirmed": 63777 },\n {"Date": "2020-05-05", "Province_State": "Illinois", "Confirmed": 65889 },\n {"Date": "2020-05-06", "Province_State": "Illinois", "Confirmed": 68232 },\n {"Date": "2020-05-07", "Province_State": "Illinois", "Confirmed": 70871 },\n {"Date": "2020-05-08", "Province_State": "Illinois", "Confirmed": 73760 },\n {"Date": "2020-05-09", "Province_State": "Illinois", "Confirmed": 76085 },\n {"Date": "2020-05-10", "Province_State": "Illinois", "Confirmed": 77741 },\n {"Date": "2020-05-11", "Province_State": "Illinois", "Confirmed": 79007 },\n {"Date": "2020-05-12", "Province_State": "Illinois", "Confirmed": 83021 },\n {"Date": "2020-05-13", "Province_State": "Illinois", "Confirmed": 84694 },\n {"Date": "2020-05-14", "Province_State": "Illinois", "Confirmed": 87937 },\n])\ndf Run Code
Reset Code Run All to Here Python Output:
(Run your code to see your code result's here.)
Adding a New Column Shifted by One The most direct use of the DataFrame.shift function is to add a new column based on shifting the data by a set amount. In our example data, using df.Confirmed.shift(1) creates a DataFrame that contains the confirmed cases from the pervious day:
import pandas as pd\ndf = pd.DataFrame([\n {"Date": "2020-05-01", "Province_State": "Illinois", "Confirmed": 56055 },\n {"Date": "2020-05-02", "Province_State": "Illinois", "Confirmed": 58505 },\n {"Date": "2020-05-03", "Province_State": "Illinois", "Confirmed": 61499 },\n {"Date": "2020-05-04", "Province_State": "Illinois", "Confirmed": 63777 },\n {"Date": "2020-05-05", "Province_State": "Illinois", "Confirmed": 65889 },\n {"Date": "2020-05-06", "Province_State": "Illinois", "Confirmed": 68232 },\n {"Date": "2020-05-07", "Province_State": "Illinois", "Confirmed": 70871 },\n {"Date": "2020-05-08", "Province_State": "Illinois", "Confirmed": 73760 },\n {"Date": "2020-05-09", "Province_State": "Illinois", "Confirmed": 76085 },\n {"Date": "2020-05-10", "Province_State": "Illinois", "Confirmed": 77741 },\n {"Date": "2020-05-11", "Province_State": "Illinois", "Confirmed": 79007 },\n {"Date": "2020-05-12", "Province_State": "Illinois", "Confirmed": 83021 },\n {"Date": "2020-05-13", "Province_State": "Illinois", "Confirmed": 84694 },\n {"Date": "2020-05-14", "Province_State": "Illinois", "Confirmed": 87937 },\n])\ndf["ConfirmedYesterday"] = df.Confirmed.shift(1)\ndf Run Code
Reset Code Run All to Here Python Output:
Date
Province_State
Confirmed
ConfirmedYesterday
0
2020-05-01
Illinois
56055
NaN
1
2020-05-02
Illinois
58505
56055.0
2
2020-05-03
Illinois
61499
58505.0
3
2020-05-04
Illinois
63777
61499.0
4
2020-05-05
Illinois
65889
63777.0
5
2020-05-06
Illinois
68232
65889.0
6
2020-05-07
Illinois
70871
68232.0
7
2020-05-08
Illinois
73760
70871.0
8
2020-05-09
Illinois
76085
73760.0
9
2020-05-10
Illinois
77741
76085.0
10
2020-05-11
Illinois
79007
77741.0
11
2020-05-12
Illinois
83021
79007.0
12
2020-05-13
Illinois
84694
83021.0
13
2020-05-14
Illinois
87937
84694.0
Notice that the new column, ConfirmedYesterday, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.
Calculating the Change from the Previous Observation When a cumulative total is provided in sorted order, the difference between the observations indicates the change in the data. In the case of our example data, the difference between any two observations is the daily change in COVID-19 cases in Illinois.
The shift(1) function will shift the contents of the DataFrame, allowing us to subtract the current day's confirmed case count from the previous day's confirmed case count. The code below create a new column DailyConfirmed:
import pandas as pd\ndf = pd.DataFrame([\n {"Date": "2020-05-01", "Province_State": "Illinois", "Confirmed": 56055 },\n {"Date": "2020-05-02", "Province_State": "Illinois", "Confirmed": 58505 },\n {"Date": "2020-05-03", "Province_State": "Illinois", "Confirmed": 61499 },\n {"Date": "2020-05-04", "Province_State": "Illinois", "Confirmed": 63777 },\n {"Date": "2020-05-05", "Province_State": "Illinois", "Confirmed": 65889 },\n {"Date": "2020-05-06", "Province_State": "Illinois", "Confirmed": 68232 },\n {"Date": "2020-05-07", "Province_State": "Illinois", "Confirmed": 70871 },\n {"Date": "2020-05-08", "Province_State": "Illinois", "Confirmed": 73760 },\n {"Date": "2020-05-09", "Province_State": "Illinois", "Confirmed": 76085 },\n {"Date": "2020-05-10", "Province_State": "Illinois", "Confirmed": 77741 },\n {"Date": "2020-05-11", "Province_State": "Illinois", "Confirmed": 79007 },\n {"Date": "2020-05-12", "Province_State": "Illinois", "Confirmed": 83021 },\n {"Date": "2020-05-13", "Province_State": "Illinois", "Confirmed": 84694 },\n {"Date": "2020-05-14", "Province_State": "Illinois", "Confirmed": 87937 },\n])\ndf["DailyConfirmed"] = df.Confirmed - df.Confirmed.shift(1)\ndf Run Code
Reset Code Run All to Here Python Output:
Date
Province_State
Confirmed
DailyConfirmed
0
2020-05-01
Illinois
56055
NaN
1
2020-05-02
Illinois
58505
2450.0
2
2020-05-03
Illinois
61499
2994.0
3
2020-05-04
Illinois
63777
2278.0
4
2020-05-05
Illinois
65889
2112.0
5
2020-05-06
Illinois
68232
2343.0
6
2020-05-07
Illinois
70871
2639.0
7
2020-05-08
Illinois
73760
2889.0
8
2020-05-09
Illinois
76085
2325.0
9
2020-05-10
Illinois
77741
1656.0
10
2020-05-11
Illinois
79007
1266.0
11
2020-05-12
Illinois
83021
4014.0
12
2020-05-13
Illinois
84694
1673.0
13
2020-05-14
Illinois
87937
3243.0
Notice that the new column, DailyConfirmed, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.
Calculating a Running Average A "running average" is a term used to calculate an average based off a fixed previous number of observations, which is constantly updated each day. In the case of our example data, a "7-day running average" of cases provides us with the average number of new cases over the past week by calculating the average new of daily new cases over the most recent seven observations.
The shift function can be used to shift the DataFrame by any number of rows. To calculate the weekly running average, we need the current daily cases AND data from the previous six days to get the full 7-day running average:
import pandas as pd\ndf = pd.DataFrame([\n {"Date": "2020-05-01", "Province_State": "Illinois", "Confirmed": 56055 },\n {"Date": "2020-05-02", "Province_State": "Illinois", "Confirmed": 58505 },\n {"Date": "2020-05-03", "Province_State": "Illinois", "Confirmed": 61499 },\n {"Date": "2020-05-04", "Province_State": "Illinois", "Confirmed": 63777 },\n {"Date": "2020-05-05", "Province_State": "Illinois", "Confirmed": 65889 },\n {"Date": "2020-05-06", "Province_State": "Illinois", "Confirmed": 68232 },\n {"Date": "2020-05-07", "Province_State": "Illinois", "Confirmed": 70871 },\n {"Date": "2020-05-08", "Province_State": "Illinois", "Confirmed": 73760 },\n {"Date": "2020-05-09", "Province_State": "Illinois", "Confirmed": 76085 },\n {"Date": "2020-05-10", "Province_State": "Illinois", "Confirmed": 77741 },\n {"Date": "2020-05-11", "Province_State": "Illinois", "Confirmed": 79007 },\n {"Date": "2020-05-12", "Province_State": "Illinois", "Confirmed": 83021 },\n {"Date": "2020-05-13", "Province_State": "Illinois", "Confirmed": 84694 },\n {"Date": "2020-05-14", "Province_State": "Illinois", "Confirmed": 87937 },\n])\ndf["DailyConfirmed"] = df.Confirmed - df.Confirmed.shift(1)\ndf["RunningAverage_7day"] = (df.DailyConfirmed + df.DailyConfirmed.shift(1) + df.DailyConfirmed.shift(2) + df.DailyConfirmed.shift(3) + df.DailyConfirmed.shift(4) + df.DailyConfirmed.shift(5) + df.DailyConfirmed.shift(6)) / 7\ndf Run Code
Reset Code Run All to Here Python Output:
Date
Province_State
Confirmed
DailyConfirmed
RunningAverage_7day
0
2020-05-01
Illinois
56055
NaN
NaN
1
2020-05-02
Illinois
58505
2450.0
NaN
2
2020-05-03
Illinois
61499
2994.0
NaN
3
2020-05-04
Illinois
63777
2278.0
NaN
4
2020-05-05
Illinois
65889
2112.0
NaN
5
2020-05-06
Illinois
68232
2343.0
NaN
6
2020-05-07
Illinois
70871
2639.0
NaN
7
2020-05-08
Illinois
73760
2889.0
2529.285714
8
2020-05-09
Illinois
76085
2325.0
2511.428571
9
2020-05-10
Illinois
77741
1656.0
2320.285714
10
2020-05-11
Illinois
79007
1266.0
2175.714286
11
2020-05-12
Illinois
83021
4014.0
2447.428571
12
2020-05-13
Illinois
84694
1673.0
2351.714286
13
2020-05-14
Illinois
87937
3243.0
2438.000000
Pandas Documentation See the panda's developer documentation on pandas.DataFrame.shift