Using Previous Observations when Computation Values in a DataFrame
When you're analyzing data reported on a regular basis (ex: daily cases, monthly reports, etc), it is common to need to use the values from the previous one or more observations in your calculation. The df.column.shift(1)
observation reports the value for a column
from one observation earlier.
Below is a small subset of Johns Hopkins University's COVID-19 dataset. The JHU COVID-19 dataset contains the total cumulative number of COVID-19 cases in a given location:
Adding a New Column Shifted by One
The most direct use of the DataFrame.shift
function is to add a new column based on shifting the data by a set amount. In our example data, using df.Confirmed.shift(1)
creates a DataFrame that contains the confirmed cases from the pervious day:
df["ConfirmedYesterday"] = df.Confirmed.shift(1)
df
Date | Province_State | Confirmed | ConfirmedYesterday | |
---|---|---|---|---|
0 | 2020-05-01 | Illinois | 56055 | NaN |
1 | 2020-05-02 | Illinois | 58505 | 56055.0 |
2 | 2020-05-03 | Illinois | 61499 | 58505.0 |
3 | 2020-05-04 | Illinois | 63777 | 61499.0 |
4 | 2020-05-05 | Illinois | 65889 | 63777.0 |
5 | 2020-05-06 | Illinois | 68232 | 65889.0 |
6 | 2020-05-07 | Illinois | 70871 | 68232.0 |
7 | 2020-05-08 | Illinois | 73760 | 70871.0 |
8 | 2020-05-09 | Illinois | 76085 | 73760.0 |
9 | 2020-05-10 | Illinois | 77741 | 76085.0 |
10 | 2020-05-11 | Illinois | 79007 | 77741.0 |
11 | 2020-05-12 | Illinois | 83021 | 79007.0 |
12 | 2020-05-13 | Illinois | 84694 | 83021.0 |
13 | 2020-05-14 | Illinois | 87937 | 84694.0 |
shift
function to use the previous observation's data in a new column ConfirmedYesterday
.Notice that the new column, ConfirmedYesterday
, contains the new daily cases each day. On 2020-05-01
, the column records a missing value (NaN
) since there was no previous row to use in the calculation.
Calculating the Change from the Previous Observation
When a cumulative total is provided in sorted order, the difference between the observations indicates the change in the data. In the case of our example data, the difference between any two observations is the daily change in COVID-19 cases in Illinois.
The shift(1)
function will shift the contents of the DataFrame, allowing us to subtract the current day's confirmed case count from the previous day's confirmed case count. The code below create a new column DailyConfirmed
:
df["DailyConfirmed"] = df.Confirmed - df.Confirmed.shift(1)
df
Date | Province_State | Confirmed | DailyConfirmed | |
---|---|---|---|---|
0 | 2020-05-01 | Illinois | 56055 | NaN |
1 | 2020-05-02 | Illinois | 58505 | 2450.0 |
2 | 2020-05-03 | Illinois | 61499 | 2994.0 |
3 | 2020-05-04 | Illinois | 63777 | 2278.0 |
4 | 2020-05-05 | Illinois | 65889 | 2112.0 |
5 | 2020-05-06 | Illinois | 68232 | 2343.0 |
6 | 2020-05-07 | Illinois | 70871 | 2639.0 |
7 | 2020-05-08 | Illinois | 73760 | 2889.0 |
8 | 2020-05-09 | Illinois | 76085 | 2325.0 |
9 | 2020-05-10 | Illinois | 77741 | 1656.0 |
10 | 2020-05-11 | Illinois | 79007 | 1266.0 |
11 | 2020-05-12 | Illinois | 83021 | 4014.0 |
12 | 2020-05-13 | Illinois | 84694 | 1673.0 |
13 | 2020-05-14 | Illinois | 87937 | 3243.0 |
shift
function to use the previous observation's data in a calculation.Notice that the new column, DailyConfirmed
, contains the new daily cases each day. On 2020-05-01
, the column records a missing value (NaN
) since there was no previous row to use in the calculation.
Calculating a Running Average
A "running average" is a term used to calculate an average based off a fixed previous number of observations, which is constantly updated each day. In the case of our example data, a "7-day running average" of cases provides us with the average number of new cases over the past week by calculating the average new of daily new cases over the most recent seven observations.
The shift
function can be used to shift the DataFrame by any number of rows. To calculate the weekly running average, we need the current daily cases AND data from the previous six days to get the full 7-day running average:
df["RunningAverage_7day"] = (df.DailyConfirmed + df.DailyConfirmed.shift(1) + df.DailyConfirmed.shift(2) + df.DailyConfirmed.shift(3) + df.DailyConfirmed.shift(4) + df.DailyConfirmed.shift(5) + df.DailyConfirmed.shift(6)) / 7
df
Date | Province_State | Confirmed | DailyConfirmed | RunningAverage_7day | |
---|---|---|---|---|---|
0 | 2020-05-01 | Illinois | 56055 | NaN | NaN |
1 | 2020-05-02 | Illinois | 58505 | 2450.0 | NaN |
2 | 2020-05-03 | Illinois | 61499 | 2994.0 | NaN |
3 | 2020-05-04 | Illinois | 63777 | 2278.0 | NaN |
4 | 2020-05-05 | Illinois | 65889 | 2112.0 | NaN |
5 | 2020-05-06 | Illinois | 68232 | 2343.0 | NaN |
6 | 2020-05-07 | Illinois | 70871 | 2639.0 | NaN |
7 | 2020-05-08 | Illinois | 73760 | 2889.0 | 2529.285714 |
8 | 2020-05-09 | Illinois | 76085 | 2325.0 | 2511.428571 |
9 | 2020-05-10 | Illinois | 77741 | 1656.0 | 2320.285714 |
10 | 2020-05-11 | Illinois | 79007 | 1266.0 | 2175.714286 |
11 | 2020-05-12 | Illinois | 83021 | 4014.0 | 2447.428571 |
12 | 2020-05-13 | Illinois | 84694 | 1673.0 | 2351.714286 |
13 | 2020-05-14 | Illinois | 87937 | 3243.0 | 2438.000000 |
shift
function to calculate a running average.Pandas Documentation
See the panda's developer documentation on pandas.DataFrame.shift