Using Previous Observations when Computation Values in a DataFrame


When you're analyzing data reported on a regular basis (ex: daily cases, monthly reports, etc), it is common to need to use the values from the previous one or more observations in your calculation. The df.column.shift(1) observation reports the value for a column from one observation earlier.

Below is a small subset of Johns Hopkins University's COVID-19 dataset. The JHU COVID-19 dataset contains the total cumulative number of COVID-19 cases in a given location:

Reset Code Python Output:
(Run your code to see your code result's here.)

Adding a New Column Shifted by One

The most direct use of the DataFrame.shift function is to add a new column based on shifting the data by a set amount. In our example data, using df.Confirmed.shift(1) creates a DataFrame that contains the confirmed cases from the pervious day:

Reset Code Python Output:
Date Province_State Confirmed ConfirmedYesterday
0 2020-05-01 Illinois 56055 NaN
1 2020-05-02 Illinois 58505 56055.0
2 2020-05-03 Illinois 61499 58505.0
3 2020-05-04 Illinois 63777 61499.0
4 2020-05-05 Illinois 65889 63777.0
5 2020-05-06 Illinois 68232 65889.0
6 2020-05-07 Illinois 70871 68232.0
7 2020-05-08 Illinois 73760 70871.0
8 2020-05-09 Illinois 76085 73760.0
9 2020-05-10 Illinois 77741 76085.0
10 2020-05-11 Illinois 79007 77741.0
11 2020-05-12 Illinois 83021 79007.0
12 2020-05-13 Illinois 84694 83021.0
13 2020-05-14 Illinois 87937 84694.0

Notice that the new column, ConfirmedYesterday, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.

Calculating the Change from the Previous Observation

When a cumulative total is provided in sorted order, the difference between the observations indicates the change in the data. In the case of our example data, the difference between any two observations is the daily change in COVID-19 cases in Illinois.

The shift(1) function will shift the contents of the DataFrame, allowing us to subtract the current day's confirmed case count from the previous day's confirmed case count. The code below create a new column DailyConfirmed:

Reset Code Python Output:
Date Province_State Confirmed DailyConfirmed
0 2020-05-01 Illinois 56055 NaN
1 2020-05-02 Illinois 58505 2450.0
2 2020-05-03 Illinois 61499 2994.0
3 2020-05-04 Illinois 63777 2278.0
4 2020-05-05 Illinois 65889 2112.0
5 2020-05-06 Illinois 68232 2343.0
6 2020-05-07 Illinois 70871 2639.0
7 2020-05-08 Illinois 73760 2889.0
8 2020-05-09 Illinois 76085 2325.0
9 2020-05-10 Illinois 77741 1656.0
10 2020-05-11 Illinois 79007 1266.0
11 2020-05-12 Illinois 83021 4014.0
12 2020-05-13 Illinois 84694 1673.0
13 2020-05-14 Illinois 87937 3243.0

Notice that the new column, DailyConfirmed, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.

Calculating a Running Average

A "running average" is a term used to calculate an average based off a fixed previous number of observations, which is constantly updated each day. In the case of our example data, a "7-day running average" of cases provides us with the average number of new cases over the past week by calculating the average new of daily new cases over the most recent seven observations.

The shift function can be used to shift the DataFrame by any number of rows. To calculate the weekly running average, we need the current daily cases AND data from the previous six days to get the full 7-day running average:

Reset Code Python Output:
Date Province_State Confirmed DailyConfirmed RunningAverage_7day
0 2020-05-01 Illinois 56055 NaN NaN
1 2020-05-02 Illinois 58505 2450.0 NaN
2 2020-05-03 Illinois 61499 2994.0 NaN
3 2020-05-04 Illinois 63777 2278.0 NaN
4 2020-05-05 Illinois 65889 2112.0 NaN
5 2020-05-06 Illinois 68232 2343.0 NaN
6 2020-05-07 Illinois 70871 2639.0 NaN
7 2020-05-08 Illinois 73760 2889.0 2529.285714
8 2020-05-09 Illinois 76085 2325.0 2511.428571
9 2020-05-10 Illinois 77741 1656.0 2320.285714
10 2020-05-11 Illinois 79007 1266.0 2175.714286
11 2020-05-12 Illinois 83021 4014.0 2447.428571
12 2020-05-13 Illinois 84694 1673.0 2351.714286
13 2020-05-14 Illinois 87937 3243.0 2438.000000

Pandas Documentation

See the panda's developer documentation on pandas.DataFrame.shift