# Grouping Data in Python

DataFrames are great tools for exploring data about individual observations, or rows, of data in a dataset. However, we often want to know about statistics on an entire category of data!

For example, the Illini Football Dataset contains all football games played by Illinois since 1892. When sorted by the `Opponent`

you can see that we have played many teams more than once (in the full dataset, we've played some teams over 70 times!):

Season | Date | Location | Opponent | OpponentRank | Result | IlliniScore | OpponentScore | Note | |
---|---|---|---|---|---|---|---|---|---|

483 | 1979 | 9/22/1979 | @ | Air Force | NaN | W | 27.0 | 19.0 | NaN |

473 | 1980 | 9/27/1980 | vs. | Air Force | NaN | T | 20.0 | 20.0 | NaN |

9 | 2019 | 8/31/2019 | vs. | Akron | NaN | W | 42.0 | 3.0 | NaN |

288 | 1996 | 9/21/1996 | vs. | Akron | NaN | W | 38.0 | 7.0 | NaN |

458 | 1982 | 12/29/1982 | Memphis, TN | Alabama | NaN | L | 15.0 | 21.0 | Liberty Bowl |

1145 | 1903 | 9/30/1903 | vs. | American Osteopath | NaN | W | 36.0 | 0.0 | NaN |

1159 | 1902 | 10/1/1902 | vs. | American Osteopath | NaN | W | 22.0 | 0.0 | NaN |

287 | 1996 | 9/14/1996 | @ | Arizona | NaN | L | 0.0 | 41.0 | NaN |

355 | 1990 | 9/8/1990 | @ | Arizona | NaN | L | 16.0 | 28.0 | NaN |

320 | 1993 | 9/18/1993 | vs. | Arizona | NaN | L | 14.0 | 16.0 | NaN |

298 | 1995 | 9/16/1995 | vs. | Arizona | NaN | W | 9.0 | 7.0 | NaN |

391 | 1987 | 9/12/1987 | vs. | Arizona State | NaN | L | 7.0 | 21.0 | NaN |

96 | 2012 | 9/8/2012 | @ | Arizona State | NaN | L | 14.0 | 45.0 | NaN |

379 | 1988 | 9/10/1988 | @ | Arizona State | NaN | L | 16.0 | 21.0 | NaN |

109 | 2011 | 9/17/2011 | vs. | Arizona State | NaN | W | 17.0 | 14.0 | NaN |

107 | 2011 | 9/3/2011 | vs. | Arkansas State | NaN | W | 33.0 | 15.0 | NaN |

218 | 2002 | 9/14/2002 | vs. | Arkansas State | NaN | W | 59.0 | 7.0 | NaN |

251 | 1999 | 9/4/1999 | vs. | Arkansas State | NaN | W | 41.0 | 3.0 | NaN |

423 | 1985 | 12/31/1985 | Atlanta, GA | Army | NaN | L | 29.0 | 31.0 | Peach Bowl |

913 | 1934 | 11/3/1934 | vs. | Army | NaN | W | 7.0 | 0.0 | NaN |

When we group data in a DataFrame, we are exploring **data about all rows with similar data in one or more columns**. Several questions we can ask about the Illini Football dataset includes:

- What was the
**average number of points scored by the Illini score against each opponent they played**? Ex: How many points did the Illini score verses Purdue? ...verses Arizona? ...verses Nebraska? (We are grouping on different values in the`Opponent`

column.) - What is the
**standard deviation of points scored by the Illini in home games vs. away games**? (The grouping on different values in the`Location`

column, where`vs.`

are home games and`@`

are away games.)

## Using `df.groupby`

to Create Groups

The pandas cheat sheet provides us an overview of the "Group Data" in Python:

The syntax of `groupby`

requires us to provide one or more columns to create groups of data. For example, if we group by only the `Opponent`

column, the following command creates groups based on the unique values in the `Opponent`

column:

`df.groupby(by="Opponent")`

Commonly, the `by=`

argument name is excluded since it is not required for simple groups:

`df.groupby("Opponent") # Identical to by="Opponent" above`

Internally pandas is creating many groups, where each group contains **ALL** rows that have the same value in `Opponent`

. Looking at the twenty rows from the Illini Football Dataset again, I can highlight the different groups to show what pandas is doing:

Season | Date | Location | Opponent | OpponentRank | Result | IlliniScore | OpponentScore | Note | |
---|---|---|---|---|---|---|---|---|---|

Group where `Opponent == "Air Force"` (2 rows) | |||||||||

483 | 1979 | 9/22/1979 | @ | Air Force | NaN | W | 27.0 | 19.0 | NaN |

473 | 1980 | 9/27/1980 | vs. | Air Force | NaN | T | 20.0 | 20.0 | NaN |

Group where `Opponent == "Akron"` (2 rows) | |||||||||

9 | 2019 | 8/31/2019 | vs. | Akron | NaN | W | 42.0 | 3.0 | NaN |

288 | 1996 | 9/21/1996 | vs. | Akron | NaN | W | 38.0 | 7.0 | NaN |

Group where `Opponent == "Alabama"` (1 row) | |||||||||

458 | 1982 | 12/29/1982 | Memphis, TN | Alabama | NaN | L | 15.0 | 21.0 | Liberty Bowl |

Group where `Opponent == "American Osteopath"` (2 rows) | |||||||||

1145 | 1903 | 9/30/1903 | vs. | American Osteopath | NaN | W | 36.0 | 0.0 | NaN |

1159 | 1902 | 10/1/1902 | vs. | American Osteopath | NaN | W | 22.0 | 0.0 | NaN |

Group where `Opponent == "Arizona"` (4 rows) | |||||||||

287 | 1996 | 9/14/1996 | @ | Arizona | NaN | L | 0.0 | 41.0 | NaN |

355 | 1990 | 9/8/1990 | @ | Arizona | NaN | L | 16.0 | 28.0 | NaN |

320 | 1993 | 9/18/1993 | vs. | Arizona | NaN | L | 14.0 | 16.0 | NaN |

298 | 1995 | 9/16/1995 | vs. | Arizona | NaN | W | 9.0 | 7.0 | NaN |

Group where `Opponent == "Arizona State"` (4 rows) | |||||||||

391 | 1987 | 9/12/1987 | vs. | Arizona State | NaN | L | 7.0 | 21.0 | NaN |

96 | 2012 | 9/8/2012 | @ | Arizona State | NaN | L | 14.0 | 45.0 | NaN |

379 | 1988 | 9/10/1988 | @ | Arizona State | NaN | L | 16.0 | 21.0 | NaN |

109 | 2011 | 9/17/2011 | vs. | Arizona State | NaN | W | 17.0 | 14.0 | NaN |

Group where `Opponent == "Arkansas State"` (3 rows) | |||||||||

107 | 2011 | 9/3/2011 | vs. | Arkansas State | NaN | W | 33.0 | 15.0 | NaN |

218 | 2002 | 9/14/2002 | vs. | Arkansas State | NaN | W | 59.0 | 7.0 | NaN |

251 | 1999 | 9/4/1999 | vs. | Arkansas State | NaN | W | 41.0 | 3.0 | NaN |

Group where `Opponent == "Army"` (2 rows) | |||||||||

423 | 1985 | 12/31/1985 | Atlanta, GA | Army | NaN | L | 29.0 | 31.0 | Peach Bowl |

913 | 1934 | 11/3/1934 | vs. | Army | NaN | W | 7.0 | 0.0 | NaN |

## Working with Group Data

Unfortunately, a group is **complex and hard to visually represent** so pandas will not provide any visual output when you create a group. Instead, we will either:

- Use the group to explore summary statistics about each group (
`group.describe()`

), or - Aggregate all of the groups using a aggravation function (
`group.agg(...)`

).

We'll explore them both!

### Viewing Summary Statistics of Group Data

When we create a group data in Python, we will commonly refer to the variable that contains the grouped data as `group`

:

`group = df.groupby("Opponent")`

# group now contains our group data

On a group, `group.describe()`

provides summary statistics of all **numeric columns of data** for each group. Using the same DataFrame with the Illini Football Data grouped by `Opponent`

, we see the summary statistic of every group:

Season | OpponentRank | IlliniScore | OpponentScore | |||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |

Opponent | ||||||||||||||||||||||||||||||||

Air Force | 2.0 | 1979.500000 | 0.707107 | 1979.0 | 1979.25 | 1979.5 | 1979.75 | 1980.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 23.500000 | 4.949747 | 20.0 | 21.75 | 23.5 | 25.25 | 27.0 | 2.0 | 19.500000 | 0.707107 | 19.0 | 19.25 | 19.5 | 19.75 | 20.0 |

Akron | 2.0 | 2007.500000 | 16.263456 | 1996.0 | 2001.75 | 2007.5 | 2013.25 | 2019.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 40.000000 | 2.828427 | 38.0 | 39.00 | 40.0 | 41.00 | 42.0 | 2.0 | 5.000000 | 2.828427 | 3.0 | 4.00 | 5.0 | 6.00 | 7.0 |

Alabama | 1.0 | 1982.000000 | NaN | 1982.0 | 1982.00 | 1982.0 | 1982.00 | 1982.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 15.000000 | NaN | 15.0 | 15.00 | 15.0 | 15.00 | 15.0 | 1.0 | 21.000000 | NaN | 21.0 | 21.00 | 21.0 | 21.00 | 21.0 |

American Osteopath | 2.0 | 1902.500000 | 0.707107 | 1902.0 | 1902.25 | 1902.5 | 1902.75 | 1903.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 29.000000 | 9.899495 | 22.0 | 25.50 | 29.0 | 32.50 | 36.0 | 2.0 | 0.000000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |

Arizona | 4.0 | 1993.500000 | 2.645751 | 1990.0 | 1992.25 | 1994.0 | 1995.25 | 1996.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 9.750000 | 7.135592 | 0.0 | 6.75 | 11.5 | 14.50 | 16.0 | 4.0 | 23.000000 | 14.764823 | 7.0 | 13.75 | 22.0 | 31.25 | 41.0 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Western Illinois | 3.0 | 2013.333333 | 5.686241 | 2007.0 | 2011.00 | 2015.0 | 2016.50 | 2018.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 33.000000 | 11.532563 | 21.0 | 27.50 | 34.0 | 39.00 | 44.0 | 3.0 | 4.666667 | 8.082904 | 0.0 | 0.00 | 0.0 | 7.00 | 14.0 |

Western Kentucky | 2.0 | 2015.500000 | 2.121320 | 2014.0 | 2014.75 | 2015.5 | 2016.25 | 2017.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 31.000000 | 15.556349 | 20.0 | 25.50 | 31.0 | 36.50 | 42.0 | 2.0 | 20.500000 | 19.091883 | 7.0 | 13.75 | 20.5 | 27.25 | 34.0 |

Western Michigan | 6.0 | 1999.666667 | 26.112577 | 1947.0 | 2005.00 | 2009.5 | 2011.75 | 2016.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 | 27.333333 | 17.385818 | 10.0 | 18.50 | 23.5 | 28.50 | 60.0 | 6.0 | 20.833333 | 9.537645 | 7.0 | 15.50 | 21.5 | 26.00 | 34.0 |

Wisconsin | 87.0 | 1966.758621 | 34.164774 | 1895.0 | 1944.00 | 1969.0 | 1994.50 | 2020.0 | 1.0 | 14.0 | NaN | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 | 87.0 | 16.563218 | 11.407567 | 0.0 | 7.00 | 14.0 | 24.00 | 51.0 | 87.0 | 19.540230 | 14.123874 | 0.0 | 7.00 | 20.0 | 27.50 | 56.0 |

Youngstown State | 1.0 | 2014.000000 | NaN | 2014.0 | 2014.00 | 2014.0 | 2014.00 | 2014.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 28.000000 | NaN | 28.0 | 28.00 | 28.0 | 28.00 | 28.0 | 1.0 | 17.000000 | NaN | 17.0 | 17.00 | 17.0 | 17.00 | 17.0 |

A few observations:

- The
**rows**are now labeled with the**Opponent**. The rows will always be based on what we grouped our data by in the`groupby`

command. - The columns are all of the numeric categories in the dataset (
`Season`

,`OpponentRank`

,`IlliniScore`

, and`OpponentScore`

). - The sub-columns are various summary statistics (
`count`

is the number of total rows included in the group;`mean`

is the average of the values of those rows;`min`

is the minimum value,`50%`

is the median value, etc).

### Aggregating Groups in Grouped Data

In many cases, we want to continue the analysis with a summary statistic of group data. On a group, `group.agg(...)`

provides the ability to aggregate all rows within a group into a single value. There are hundreds of available functions, but a few are very common:

`df_count = group.agg("count")`

will**count**the number of rows in each group. This**does not look at the data within the rows**at all, it only counts how many rows are each group! For example, the Illini has played Akron only twice so the**count**is 2.`df_mean = group.agg("mean")`

will find the**mean**(average) of each numeric column for each group. For example, the two Akron games the Illini scored 42 and 38. Therefore, the**mean**is 40.`df_sum = group.agg("sum")`

will find the**sum**of each numeric column for each group. The**sum**of all the points the Illini scored against Akron is 80.Other strings that can be used include:

`"std"`

for each group's standard deviation,`"median"`

for each group's median value,`"var"`

for variance, and many more.

To make working with the DataFrame as simple as possible, we will **ALWAYS** use `.reset_index()`

after `.agg(...)`

.

#### Example: Average Points Scored in Home vs. Away Games

I have always heard that a sports team has a "home team advantage", where they will preform better at home than away. Let's see if the Illini, on average, has scored more points at home or away?

Exploring the dataset, the `Location`

column indicates if the game is played at home (`vs.`

), at the opponent's stadium (`@`

), or at a neutral site (ex: `Memphis, TN`

):

Season | Date | Location | Opponent | OpponentRank | Result | IlliniScore | OpponentScore | Note | |
---|---|---|---|---|---|---|---|---|---|

483 | 1979 | 9/22/1979 | @ | Air Force | NaN | W | 27.0 | 19.0 | NaN |

473 | 1980 | 9/27/1980 | vs. | Air Force | NaN | T | 20.0 | 20.0 | NaN |

9 | 2019 | 8/31/2019 | vs. | Akron | NaN | W | 42.0 | 3.0 | NaN |

288 | 1996 | 9/21/1996 | vs. | Akron | NaN | W | 38.0 | 7.0 | NaN |

458 | 1982 | 12/29/1982 | Memphis, TN | Alabama | NaN | L | 15.0 | 21.0 | Liberty Bowl |

1145 | 1903 | 9/30/1903 | vs. | American Osteopath | NaN | W | 36.0 | 0.0 | NaN |

1159 | 1902 | 10/1/1902 | vs. | American Osteopath | NaN | W | 22.0 | 0.0 | NaN |

287 | 1996 | 9/14/1996 | @ | Arizona | NaN | L | 0.0 | 41.0 | NaN |

355 | 1990 | 9/8/1990 | @ | Arizona | NaN | L | 16.0 | 28.0 | NaN |

320 | 1993 | 9/18/1993 | vs. | Arizona | NaN | L | 14.0 | 16.0 | NaN |

298 | 1995 | 9/16/1995 | vs. | Arizona | NaN | W | 9.0 | 7.0 | NaN |

391 | 1987 | 9/12/1987 | vs. | Arizona State | NaN | L | 7.0 | 21.0 | NaN |

96 | 2012 | 9/8/2012 | @ | Arizona State | NaN | L | 14.0 | 45.0 | NaN |

379 | 1988 | 9/10/1988 | @ | Arizona State | NaN | L | 16.0 | 21.0 | NaN |

109 | 2011 | 9/17/2011 | vs. | Arizona State | NaN | W | 17.0 | 14.0 | NaN |

107 | 2011 | 9/3/2011 | vs. | Arkansas State | NaN | W | 33.0 | 15.0 | NaN |

218 | 2002 | 9/14/2002 | vs. | Arkansas State | NaN | W | 59.0 | 7.0 | NaN |

251 | 1999 | 9/4/1999 | vs. | Arkansas State | NaN | W | 41.0 | 3.0 | NaN |

423 | 1985 | 12/31/1985 | Atlanta, GA | Army | NaN | L | 29.0 | 31.0 | Peach Bowl |

913 | 1934 | 11/3/1934 | vs. | Army | NaN | W | 7.0 | 0.0 | NaN |

Creating group data based rows that have the same value for `Location`

, we use `groupby`

:

`group = df.groupby("Location")`

Since we know we're interested in the **average** points, we can aggregate our group data together using `"mean"`

. The full Python code includes a `groupby`

, an `agg`

, and a `reset_index`

:

Location | Season | OpponentRank | IlliniScore | OpponentScore | |
---|---|---|---|---|---|

0 | @ | 1963.700971 | 14.5 | 15.902913 | 20.147573 |

1 | Atlanta, GA | 1985.000000 | NaN | 29.000000 | 31.000000 |

2 | Birmingham, AL | 1988.000000 | NaN | 10.000000 | 14.000000 |

3 | Bronx, NY | 1938.500000 | NaN | 0.000000 | 6.500000 |

4 | Chicago, IL | 1963.111111 | NaN | 16.555556 | 18.444444 |

5 | Cleveland, OH | 1944.500000 | NaN | 10.250000 | 24.000000 |

6 | Dallas, TX | 2014.000000 | NaN | 18.000000 | 35.000000 |

7 | Detroit, MI | 1953.000000 | NaN | 11.000000 | 17.500000 |

8 | El Paso, TX | 1991.000000 | NaN | 3.000000 | 6.000000 |

9 | Houston, TX | 2010.000000 | NaN | 38.000000 | 14.000000 |

10 | Indianapolis, IN | 1928.333333 | NaN | 17.333333 | 2.333333 |

11 | Memphis, TN | 1988.000000 | NaN | 22.500000 | 10.500000 |

12 | Miami, FL | 1999.000000 | NaN | 63.000000 | 21.000000 |

13 | Milwaukee, WI | 1899.000000 | NaN | 0.000000 | 23.000000 |

14 | New Orleans, LA | 2001.000000 | NaN | 34.000000 | 47.000000 |

15 | Omaha, NE | 1892.000000 | NaN | 20.000000 | 0.000000 |

16 | Orlando, FL | 1989.000000 | NaN | 31.000000 | 21.000000 |

17 | Pasadena, CA | 1970.000000 | NaN | 25.600000 | 24.400000 |

18 | Peoria, IL | 1897.000000 | NaN | 6.000000 | 0.000000 |

19 | Rock Island, IL | 1899.000000 | NaN | 0.000000 | 58.000000 |

20 | San Diego, CA | 1992.000000 | NaN | 17.000000 | 27.000000 |

21 | San Francisco, CA | 2011.000000 | NaN | 20.000000 | 14.000000 |

22 | Santa Clara, CA | 2019.000000 | NaN | 20.000000 | 35.000000 |

23 | St. Louis, MO | 1969.100000 | NaN | 17.700000 | 26.000000 |

24 | Tampa, FL | 1990.000000 | NaN | 0.000000 | 30.000000 |

25 | vs. | 1959.526773 | 13.5 | 21.115942 | 16.021739 |

There are quite a few neutral sites, so let's write a conditional on `df_mean`

to display **only** the home and away games:

Location | Season | OpponentRank | IlliniScore | OpponentScore | |
---|---|---|---|---|---|

0 | @ | 1963.700971 | 14.5 | 15.902913 | 20.147573 |

25 | vs. | 1959.526773 | 13.5 | 21.115942 | 16.021739 |

This DataFrame provides our answer! Specifically:

- When the Illini is playing games at home (when
`Location == "vs."`

), the Illini score an average of**21.1 points**. - When the Illini is playing games at the opponent's stadium (when
`Location == "@"`

), the Illini score an average of only**15.9 points**. - This is a
**very large difference**(5.2 points) and suggests that the "home team advantage" may be a real thing!

# Example Walk-Throughs with Worksheets

### Video 1: GPA Dataset and Operations on Groups

### Video 2: Aggregation Functions for groupby Groups

### Video 3: Examples Aggregating groupby Groups

# Practice Questions

**Q1**: Which two commands will make equivalent groups in Python?

**Q2**: You have a DataFrame `df` holding STAT 100 midterm scores for each student and their major. How would you find the average midterm score for each major in Python?

**Q3**: How would you fix the following line of code that is attempting to find how many students per major are in STAT 107?

**Q4**: You already created grouped data in Python using the .groupby() function. What function would you use to display summary statistics for all numeric columns for each group?

**Q5**: In order to simplify the process of working with DataFrames in Python, what function should always be used after using .agg()

# Mastery-Based Assessment

A mastery-based assessment is available for Grouping Data in Python:- Access PrairieLearn (prairielearn.org)
- Complete the

Grouping Data in Python mastery assessment on PrairieLearn**m2-03** - Continue to master material and earn 100% mastery on all assessments in the "Exploratory Data Analysis" section to earn the Exploratory Data Analysis Mastery Badge!