Grouping Data in Python


DataFrames are great tools for exploring data about individual observations, or rows, of data in a dataset. However, we often want to know about statistics on an entire category of data!

For example, the Illini Football Dataset contains all football games played by Illinois since 1892. When sorted by the Opponent you can see that we have played many teams more than once (in the full dataset, we've played some teams over 70 times!):

df = pd.read_csv("https://waf.cs.illinois.edu/discovery/football.csv")
df.sort_values("Opponent").head(20)
SeasonDateLocationOpponentOpponentRankResultIlliniScoreOpponentScoreNote
48319799/22/1979@Air ForceNaNW27.019.0NaN
47319809/27/1980vs.Air ForceNaNT20.020.0NaN
920198/31/2019vs.AkronNaNW42.03.0NaN
28819969/21/1996vs.AkronNaNW38.07.0NaN
458198212/29/1982Memphis, TNAlabamaNaNL15.021.0Liberty Bowl
114519039/30/1903vs.American OsteopathNaNW36.00.0NaN
1159190210/1/1902vs.American OsteopathNaNW22.00.0NaN
28719969/14/1996@ArizonaNaNL0.041.0NaN
35519909/8/1990@ArizonaNaNL16.028.0NaN
32019939/18/1993vs.ArizonaNaNL14.016.0NaN
29819959/16/1995vs.ArizonaNaNW9.07.0NaN
39119879/12/1987vs.Arizona StateNaNL7.021.0NaN
9620129/8/2012@Arizona StateNaNL14.045.0NaN
37919889/10/1988@Arizona StateNaNL16.021.0NaN
10920119/17/2011vs.Arizona StateNaNW17.014.0NaN
10720119/3/2011vs.Arkansas StateNaNW33.015.0NaN
21820029/14/2002vs.Arkansas StateNaNW59.07.0NaN
25119999/4/1999vs.Arkansas StateNaNW41.03.0NaN
423198512/31/1985Atlanta, GAArmyNaNL29.031.0Peach Bowl
913193411/3/1934vs.ArmyNaNW7.00.0NaN

The first twenty rows of the "Illini Football Dataset" when sorted by Opponent.

When we group data in a DataFrame, we are exploring data about all rows with similar data in one or more columns. Several questions we can ask about the Illini Football dataset includes:

  • What was the average number of points scored by the Illini score against each opponent they played? Ex: How many points did the Illini score verses Purdue? ...verses Arizona? ...verses Nebraska? (We are grouping on different values in the Opponent column.)
  • What is the standard deviation of points scored by the Illini in home games vs. away games? (The grouping on different values in the Location column, where vs. are home games and @ are away games.)

Using df.groupby to Create Groups

The pandas cheat sheet provides us an overview of the "Group Data" in Python:

Pandas Cheat Sheet, Page 2 with the "Group Data" section highlighted
"Group Data" section from Page 2 of the Pandas Cheat Sheet

The syntax of groupby requires us to provide one or more columns to create groups of data. For example, if we group by only the Opponent column, the following command creates groups based on the unique values in the Opponent column:

df.groupby(by="Opponent")

Commonly, the by= argument name is excluded since it is not required for simple groups:

df.groupby("Opponent")   # Identical to by="Opponent" above

Internally pandas is creating many groups, where each group contains ALL rows that have the same value in Opponent. Looking at the twenty rows from the Illini Football Dataset again, I can highlight the different groups to show what pandas is doing:

SeasonDateLocationOpponentOpponentRankResultIlliniScoreOpponentScoreNote
Group where Opponent == "Air Force" (2 rows)
48319799/22/1979@Air ForceNaNW27.019.0NaN
47319809/27/1980vs.Air ForceNaNT20.020.0NaN
Group where Opponent == "Akron" (2 rows)
920198/31/2019vs.AkronNaNW42.03.0NaN
28819969/21/1996vs.AkronNaNW38.07.0NaN
Group where Opponent == "Alabama" (1 row)
458198212/29/1982Memphis, TNAlabamaNaNL15.021.0Liberty Bowl
Group where Opponent == "American Osteopath" (2 rows)
114519039/30/1903vs.American OsteopathNaNW36.00.0NaN
1159190210/1/1902vs.American OsteopathNaNW22.00.0NaN
Group where Opponent == "Arizona" (4 rows)
28719969/14/1996@ArizonaNaNL0.041.0NaN
35519909/8/1990@ArizonaNaNL16.028.0NaN
32019939/18/1993vs.ArizonaNaNL14.016.0NaN
29819959/16/1995vs.ArizonaNaNW9.07.0NaN
Group where Opponent == "Arizona State" (4 rows)
39119879/12/1987vs.Arizona StateNaNL7.021.0NaN
9620129/8/2012@Arizona StateNaNL14.045.0NaN
37919889/10/1988@Arizona StateNaNL16.021.0NaN
10920119/17/2011vs.Arizona StateNaNW17.014.0NaN
Group where Opponent == "Arkansas State" (3 rows)
10720119/3/2011vs.Arkansas StateNaNW33.015.0NaN
21820029/14/2002vs.Arkansas StateNaNW59.07.0NaN
25119999/4/1999vs.Arkansas StateNaNW41.03.0NaN
Group where Opponent == "Army" (2 rows)
423198512/31/1985Atlanta, GAArmyNaNL29.031.0Peach Bowl
913193411/3/1934vs.ArmyNaNW7.00.0NaN

The first twenty rows of the "Illini Football Dataset" when sorted by Opponent.

Working with Group Data

Unfortunately, a group is complex and hard to visually represent so pandas will not provide any visual output when you create a group. Instead, we will either:

  1. Use the group to explore summary statistics about each group (group.describe()), or
  2. Aggregate all of the groups using a aggravation function (group.agg(...)).

We'll explore them both!

Viewing Summary Statistics of Group Data

When we create a group data in Python, we will commonly refer to the variable that contains the grouped data as group:

group = df.groupby("Opponent")
# group now contains our group data

On a group, group.describe() provides summary statistics of all numeric columns of data for each group. Using the same DataFrame with the Illini Football Data grouped by Opponent, we see the summary statistic of every group:

group = df.groupby("Opponent")
group.describe()
SeasonOpponentRankIlliniScoreOpponentScore
countmeanstdmin25%50%75%maxcountmeanstdmin25%50%75%maxcountmeanstdmin25%50%75%maxcountmeanstdmin25%50%75%max
Opponent
Air Force2.01979.5000000.7071071979.01979.251979.51979.751980.00.0NaNNaNNaNNaNNaNNaNNaN2.023.5000004.94974720.021.7523.525.2527.02.019.5000000.70710719.019.2519.519.7520.0
Akron2.02007.50000016.2634561996.02001.752007.52013.252019.00.0NaNNaNNaNNaNNaNNaNNaN2.040.0000002.82842738.039.0040.041.0042.02.05.0000002.8284273.04.005.06.007.0
Alabama1.01982.000000NaN1982.01982.001982.01982.001982.00.0NaNNaNNaNNaNNaNNaNNaN1.015.000000NaN15.015.0015.015.0015.01.021.000000NaN21.021.0021.021.0021.0
American Osteopath2.01902.5000000.7071071902.01902.251902.51902.751903.00.0NaNNaNNaNNaNNaNNaNNaN2.029.0000009.89949522.025.5029.032.5036.02.00.0000000.0000000.00.000.00.000.0
Arizona4.01993.5000002.6457511990.01992.251994.01995.251996.00.0NaNNaNNaNNaNNaNNaNNaN4.09.7500007.1355920.06.7511.514.5016.04.023.00000014.7648237.013.7522.031.2541.0
...................................................................................................
Western Illinois3.02013.3333335.6862412007.02011.002015.02016.502018.00.0NaNNaNNaNNaNNaNNaNNaN3.033.00000011.53256321.027.5034.039.0044.03.04.6666678.0829040.00.000.07.0014.0
Western Kentucky2.02015.5000002.1213202014.02014.752015.52016.252017.00.0NaNNaNNaNNaNNaNNaNNaN2.031.00000015.55634920.025.5031.036.5042.02.020.50000019.0918837.013.7520.527.2534.0
Western Michigan6.01999.66666726.1125771947.02005.002009.52011.752016.00.0NaNNaNNaNNaNNaNNaNNaN6.027.33333317.38581810.018.5023.528.5060.06.020.8333339.5376457.015.5021.526.0034.0
Wisconsin87.01966.75862134.1647741895.01944.001969.01994.502020.01.014.0NaN14.014.014.014.014.087.016.56321811.4075670.07.0014.024.0051.087.019.54023014.1238740.07.0020.027.5056.0
Youngstown State1.02014.000000NaN2014.02014.002014.02014.002014.00.0NaNNaNNaNNaNNaNNaNNaN1.028.000000NaN28.028.0028.028.0028.01.017.000000NaN17.017.0017.017.0017.0

The output of df.groupby("Opponent").describe().

A few observations:

  • The rows are now labeled with the Opponent. The rows will always be based on what we grouped our data by in the groupby command.
  • The columns are all of the numeric categories in the dataset (Season, OpponentRank, IlliniScore, and OpponentScore).
  • The sub-columns are various summary statistics (count is the number of total rows included in the group; mean is the average of the values of those rows; min is the minimum value, 50% is the median value, etc).

Aggregating Groups in Grouped Data

In many cases, we want to continue the analysis with a summary statistic of group data. On a group, group.agg(...) provides the ability to aggregate all rows within a group into a single value. There are hundreds of available functions, but a few are very common:

  • df_count = group.agg("count") will count the number of rows in each group. This does not look at the data within the rows at all, it only counts how many rows are each group! For example, the Illini has played Akron only twice so the count is 2.

  • df_mean = group.agg("mean") will find the mean (average) of each numeric column for each group. For example, the two Akron games the Illini scored 42 and 38. Therefore, the mean is 40.

  • df_sum = group.agg("sum") will find the sum of each numeric column for each group. The sum of all the points the Illini scored against Akron is 80.

  • Other strings that can be used include: "std" for each group's standard deviation, "median" for each group's median value, "var" for variance, and many more.

To make working with the DataFrame as simple as possible, we will ALWAYS use .reset_index() after .agg(...).

Example: Average Points Scored in Home vs. Away Games

I have always heard that a sports team has a "home team advantage", where they will preform better at home than away. Let's see if the Illini, on average, has scored more points at home or away?

Exploring the dataset, the Location column indicates if the game is played at home (vs.), at the opponent's stadium (@), or at a neutral site (ex: Memphis, TN):

SeasonDateLocationOpponentOpponentRankResultIlliniScoreOpponentScoreNote
48319799/22/1979@Air ForceNaNW27.019.0NaN
47319809/27/1980vs.Air ForceNaNT20.020.0NaN
920198/31/2019vs.AkronNaNW42.03.0NaN
28819969/21/1996vs.AkronNaNW38.07.0NaN
458198212/29/1982Memphis, TNAlabamaNaNL15.021.0Liberty Bowl
114519039/30/1903vs.American OsteopathNaNW36.00.0NaN
1159190210/1/1902vs.American OsteopathNaNW22.00.0NaN
28719969/14/1996@ArizonaNaNL0.041.0NaN
35519909/8/1990@ArizonaNaNL16.028.0NaN
32019939/18/1993vs.ArizonaNaNL14.016.0NaN
29819959/16/1995vs.ArizonaNaNW9.07.0NaN
39119879/12/1987vs.Arizona StateNaNL7.021.0NaN
9620129/8/2012@Arizona StateNaNL14.045.0NaN
37919889/10/1988@Arizona StateNaNL16.021.0NaN
10920119/17/2011vs.Arizona StateNaNW17.014.0NaN
10720119/3/2011vs.Arkansas StateNaNW33.015.0NaN
21820029/14/2002vs.Arkansas StateNaNW59.07.0NaN
25119999/4/1999vs.Arkansas StateNaNW41.03.0NaN
423198512/31/1985Atlanta, GAArmyNaNL29.031.0Peach Bowl
913193411/3/1934vs.ArmyNaNW7.00.0NaN

The first twenty rows of the "Illini Football Dataset" when sorted by Opponent.

Creating group data based rows that have the same value for Location, we use groupby:

group = df.groupby("Location")

Since we know we're interested in the average points, we can aggregate our group data together using "mean". The full Python code includes a groupby, an agg, and a reset_index:

group = df.groupby("Location")
df_mean = group.agg("mean").reset_index()
df_mean
LocationSeasonOpponentRankIlliniScoreOpponentScore
0@1963.70097114.515.90291320.147573
1Atlanta, GA1985.000000NaN29.00000031.000000
2Birmingham, AL1988.000000NaN10.00000014.000000
3Bronx, NY1938.500000NaN0.0000006.500000
4Chicago, IL1963.111111NaN16.55555618.444444
5Cleveland, OH1944.500000NaN10.25000024.000000
6Dallas, TX2014.000000NaN18.00000035.000000
7Detroit, MI1953.000000NaN11.00000017.500000
8El Paso, TX1991.000000NaN3.0000006.000000
9Houston, TX2010.000000NaN38.00000014.000000
10Indianapolis, IN1928.333333NaN17.3333332.333333
11Memphis, TN1988.000000NaN22.50000010.500000
12Miami, FL1999.000000NaN63.00000021.000000
13Milwaukee, WI1899.000000NaN0.00000023.000000
14New Orleans, LA2001.000000NaN34.00000047.000000
15Omaha, NE1892.000000NaN20.0000000.000000
16Orlando, FL1989.000000NaN31.00000021.000000
17Pasadena, CA1970.000000NaN25.60000024.400000
18Peoria, IL1897.000000NaN6.0000000.000000
19Rock Island, IL1899.000000NaN0.00000058.000000
20San Diego, CA1992.000000NaN17.00000027.000000
21San Francisco, CA2011.000000NaN20.00000014.000000
22Santa Clara, CA2019.000000NaN20.00000035.000000
23St. Louis, MO1969.100000NaN17.70000026.000000
24Tampa, FL1990.000000NaN0.00000030.000000
25vs.1959.52677313.521.11594216.021739

The average group value for all rows in the dataset, where each group is based on the value of each row's Location. (26 rows)

There are quite a few neutral sites, so let's write a conditional on df_mean to display only the home and away games:

df_mean[ (df_mean.Location == "vs.") | (df_mean.Location == "@") ]
LocationSeasonOpponentRankIlliniScoreOpponentScore
0@1963.70097114.515.90291320.147573
25vs.1959.52677313.521.11594216.021739

Average group value for home and away games. (2 rows)

This DataFrame provides our answer! Specifically:

  • When the Illini is playing games at home (when Location == "vs."), the Illini score an average of 21.1 points.
  • When the Illini is playing games at the opponent's stadium (when Location == "@"), the Illini score an average of only 15.9 points.
  • This is a very large difference (5.2 points) and suggests that the "home team advantage" may be a real thing!

Example Walk-Throughs with Worksheets

Video 1: GPA Dataset and Operations on Groups

Follow along with the worksheet to work through the problem:

Video 2: Aggregation Functions for groupby Groups

Follow along with the worksheet to work through the problem:

Video 3: Examples Aggregating groupby Groups

Follow along with the worksheet to work through the problem:

Practice Questions

Q1: Which two commands will make equivalent groups in Python?
Q2: You have a DataFrame `df` holding STAT 100 midterm scores for each student and their major. How would you find the average midterm score for each major in Python?
Q3: How would you fix the following line of code that is attempting to find how many students per major are in STAT 107?
Q4: You already created grouped data in Python using the .groupby() function. What function would you use to display summary statistics for all numeric columns for each group?
Q5: In order to simplify the process of working with DataFrames in Python, what function should always be used after using .agg()