Using Previous Observations when Computation Values in a DataFrame


When you're analyzing data reported on a regular basis (ex: daily cases, monthly reports, etc), it is common to need to use the values from the previous one or more observations in your calculation. The df.column.shift(1) observation reports the value for a column from one observation earlier.

Below is a small subset of Johns Hopkins University's COVID-19 dataset. The JHU COVID-19 dataset contains the total cumulative number of COVID-19 cases in a given location:

import pandas as pd
df = pd.DataFrame([
    {"Date": "2020-05-01", "Province_State": "Illinois", "Confirmed": 56055 },
    {"Date": "2020-05-02", "Province_State": "Illinois", "Confirmed": 58505 },
    {"Date": "2020-05-03", "Province_State": "Illinois", "Confirmed": 61499 },
    {"Date": "2020-05-04", "Province_State": "Illinois", "Confirmed": 63777 },
    {"Date": "2020-05-05", "Province_State": "Illinois", "Confirmed": 65889 },
    {"Date": "2020-05-06", "Province_State": "Illinois", "Confirmed": 68232 },
    {"Date": "2020-05-07", "Province_State": "Illinois", "Confirmed": 70871 },
    {"Date": "2020-05-08", "Province_State": "Illinois", "Confirmed": 73760 },
    {"Date": "2020-05-09", "Province_State": "Illinois", "Confirmed": 76085 },
    {"Date": "2020-05-10", "Province_State": "Illinois", "Confirmed": 77741 },
    {"Date": "2020-05-11", "Province_State": "Illinois", "Confirmed": 79007 },
    {"Date": "2020-05-12", "Province_State": "Illinois", "Confirmed": 83021 },
    {"Date": "2020-05-13", "Province_State": "Illinois", "Confirmed": 84694 },
    {"Date": "2020-05-14", "Province_State": "Illinois", "Confirmed": 87937 },
])
Two weeks of COVID-19 confirmed cases during May 2020 in Illinois, from the Johns Hopkins University's COVID-19 dataset

Adding a New Column Shifted by One

The most direct use of the DataFrame.shift function is to add a new column based on shifting the data by a set amount. In our example data, using df.Confirmed.shift(1) creates a DataFrame that contains the confirmed cases from the pervious day:

df["ConfirmedYesterday"]  = df.Confirmed.shift(1)
df
DateProvince_StateConfirmedConfirmedYesterday
02020-05-01Illinois56055NaN
12020-05-02Illinois5850556055.0
22020-05-03Illinois6149958505.0
32020-05-04Illinois6377761499.0
42020-05-05Illinois6588963777.0
52020-05-06Illinois6823265889.0
62020-05-07Illinois7087168232.0
72020-05-08Illinois7376070871.0
82020-05-09Illinois7608573760.0
92020-05-10Illinois7774176085.0
102020-05-11Illinois7900777741.0
112020-05-12Illinois8302179007.0
122020-05-13Illinois8469483021.0
132020-05-14Illinois8793784694.0
Using the shift function to use the previous observation's data in a new column ConfirmedYesterday.

Notice that the new column, ConfirmedYesterday, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.

Calculating the Change from the Previous Observation

When a cumulative total is provided in sorted order, the difference between the observations indicates the change in the data. In the case of our example data, the difference between any two observations is the daily change in COVID-19 cases in Illinois.

The shift(1) function will shift the contents of the DataFrame, allowing us to subtract the current day's confirmed case count from the previous day's confirmed case count. The code below create a new column DailyConfirmed:

df["DailyConfirmed"]  = df.Confirmed - df.Confirmed.shift(1)
df
DateProvince_StateConfirmedDailyConfirmed
02020-05-01Illinois56055NaN
12020-05-02Illinois585052450.0
22020-05-03Illinois614992994.0
32020-05-04Illinois637772278.0
42020-05-05Illinois658892112.0
52020-05-06Illinois682322343.0
62020-05-07Illinois708712639.0
72020-05-08Illinois737602889.0
82020-05-09Illinois760852325.0
92020-05-10Illinois777411656.0
102020-05-11Illinois790071266.0
112020-05-12Illinois830214014.0
122020-05-13Illinois846941673.0
132020-05-14Illinois879373243.0
Using the shift function to use the previous observation's data in a calculation.

Notice that the new column, DailyConfirmed, contains the new daily cases each day. On 2020-05-01, the column records a missing value (NaN) since there was no previous row to use in the calculation.

Calculating a Running Average

A "running average" is a term used to calculate an average based off a fixed previous number of observations, which is constantly updated each day. In the case of our example data, a "7-day running average" of cases provides us with the average number of new cases over the past week by calculating the average new of daily new cases over the most recent seven observations.

The shift function can be used to shift the DataFrame by any number of rows. To calculate the weekly running average, we need the current daily cases AND data from the previous six days to get the full 7-day running average:

df["RunningAverage_7day"]  = (df.DailyConfirmed + df.DailyConfirmed.shift(1) + df.DailyConfirmed.shift(2) + df.DailyConfirmed.shift(3) + df.DailyConfirmed.shift(4) + df.DailyConfirmed.shift(5) + df.DailyConfirmed.shift(6)) / 7
df
DateProvince_StateConfirmedDailyConfirmedRunningAverage_7day
02020-05-01Illinois56055NaNNaN
12020-05-02Illinois585052450.0NaN
22020-05-03Illinois614992994.0NaN
32020-05-04Illinois637772278.0NaN
42020-05-05Illinois658892112.0NaN
52020-05-06Illinois682322343.0NaN
62020-05-07Illinois708712639.0NaN
72020-05-08Illinois737602889.02529.285714
82020-05-09Illinois760852325.02511.428571
92020-05-10Illinois777411656.02320.285714
102020-05-11Illinois790071266.02175.714286
112020-05-12Illinois830214014.02447.428571
122020-05-13Illinois846941673.02351.714286
132020-05-14Illinois879373243.02438.000000
Using the shift function to calculate a running average.

Pandas Documentation

See the panda's developer documentation on pandas.DataFrame.shift