Grouping Data by column in a DataFrame
The groupby
function is primarily used to combine duplicate rows of a given column of a pandas DataFrame.
To explore the groupby
function we will use a DataFrame of the St. Louis Cardinals starting lineups in a 4 game series against the Washington Nationals:
import pandas as pd
df = pd.DataFrame([
#9/8/2022 WSH 11 - 6 STL
{'name' : 'Brendan Donovan', 'position' : '2B', 'AB' : 3, 'R' : 2, 'H' : 3, 'RBI' : 0, 'BB' : 2, 'SO' : 0, 'PA' : 5 },
{'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 4, 'R' : 0, 'H' : 2, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 5 },
{'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 1, 'PA' : 4 },
{'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 5, 'R' : 0, 'H' : 2, 'RBI' : 2, 'BB' : 0, 'SO' : 1, 'PA' : 5 },
{'name' : 'Albert Pujols', 'position' : 'DH', 'AB' : 5, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 5 },
{'name' : 'Alec Burleson', 'position' : 'RF', 'AB' : 4, 'R' : 1, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 0, 'PA' : 5 },
{'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 4, 'R' : 2, 'H' : 2, 'RBI' : 3, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Ben DeLuzio', 'position' : 'CF', 'AB' : 2, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 2 },
#9/7/2022 WSH 5 - 6 STL
{'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 4 },
{'name' : 'Brendan Donovan', 'position' : 'DH', 'AB' : 4, 'R' : 1, 'H' : 2, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 3, 'R' : 2, 'H' : 1, 'RBI' : 1, 'BB' : 1, 'SO' : 0, 'PA' : 4 },
{'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 4, 'R' : 1, 'H' : 2, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 3, 'R' : 1, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 3, 'PA' : 4 },
{'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 3, 'PA' : 4 },
{'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 4, 'R' : 0, 'H' : 2, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 2, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
#9/6/2022 WSH 1 - 4 STL
{'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 4 },
{'name' : 'Brendan Donovan', 'position' : '3B', 'AB' : 3, 'R' : 1, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
{'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 4 },
{'name' : 'Nolan Arenado', 'position' : 'DH', 'AB' : 4, 'R' : 1, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 3, 'R' : 1, 'H' : 2, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
{'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 2, 'R' : 0, 'H' : 0, 'RBI' : 1, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
{'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 3, 'R' : 1, 'H' : 2, 'RBI' : 2, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
{'name' : 'Andrew Knizner', 'position' : 'C', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
{'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
#9/5/2022 WSH 6 - 0 STL
{'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 4, 'PA' : 4 },
{'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
{'name' : 'Paul Goldschmidt', 'position' : 'DH', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 0, 'PA' : 4 },
{'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 4 },
{'name' : 'Albert Pujols', 'position' : '1B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 1, 'PA' : 4 },
{'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
{'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 3 },
{'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
{'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 }
])
df
name | position | AB | R | H | RBI | BB | SO | PA | |
---|---|---|---|---|---|---|---|---|---|
0 | Brendan Donovan | 2B | 3 | 2 | 3 | 0 | 2 | 0 | 5 |
1 | Tommy Edman | SS | 4 | 0 | 2 | 0 | 2 | 0 | 5 |
2 | Paul Goldschmidt | 1B | 4 | 0 | 1 | 1 | 0 | 1 | 4 |
3 | Nolan Arenado | 3B | 4 | 0 | 0 | 0 | 0 | 0 | 4 |
4 | Corey Dickerson | LF | 5 | 0 | 2 | 2 | 0 | 1 | 5 |
5 | Albert Pujols | DH | 5 | 0 | 0 | 0 | 0 | 0 | 5 |
6 | Alec Burleson | RF | 4 | 1 | 0 | 0 | 1 | 0 | 5 |
7 | Yadier Molina | C | 4 | 2 | 2 | 3 | 0 | 0 | 4 |
8 | Ben DeLuzio | CF | 2 | 0 | 1 | 0 | 0 | 0 | 2 |
9 | Lars Nootbaar | RF | 4 | 0 | 0 | 0 | 0 | 2 | 4 |
10 | Brendan Donovan | DH | 4 | 1 | 2 | 0 | 0 | 0 | 4 |
11 | Paul Goldschmidt | 1B | 3 | 2 | 1 | 1 | 1 | 0 | 4 |
12 | Nolan Arenado | 3B | 4 | 1 | 2 | 1 | 0 | 0 | 4 |
13 | Corey Dickerson | LF | 4 | 0 | 1 | 1 | 0 | 0 | 4 |
14 | Tyler O'Neill | CF | 3 | 1 | 0 | 0 | 1 | 3 | 4 |
15 | Nolan Gorman | 2B | 4 | 0 | 0 | 0 | 0 | 3 | 4 |
16 | Yadier Molina | C | 4 | 0 | 2 | 1 | 0 | 0 | 4 |
17 | Tommy Edman | SS | 4 | 0 | 2 | 1 | 0 | 0 | 4 |
18 | Lars Nootbaar | RF | 4 | 0 | 0 | 0 | 0 | 2 | 4 |
19 | Brendan Donovan | 3B | 3 | 1 | 1 | 1 | 0 | 0 | 3 |
20 | Paul Goldschmidt | 1B | 3 | 0 | 1 | 0 | 1 | 1 | 4 |
21 | Nolan Arenado | DH | 4 | 1 | 1 | 0 | 0 | 0 | 4 |
22 | Corey Dickerson | LF | 3 | 1 | 2 | 0 | 0 | 0 | 3 |
23 | Tyler O'Neill | CF | 2 | 0 | 0 | 1 | 0 | 1 | 3 |
24 | Nolan Gorman | 2B | 3 | 1 | 2 | 2 | 0 | 1 | 3 |
25 | Andrew Knizner | C | 3 | 0 | 0 | 0 | 0 | 1 | 3 |
26 | Tommy Edman | SS | 3 | 0 | 1 | 0 | 0 | 0 | 3 |
27 | Lars Nootbaar | RF | 4 | 0 | 0 | 0 | 0 | 4 | 4 |
28 | Tyler O'Neill | CF | 4 | 0 | 1 | 0 | 0 | 0 | 4 |
29 | Paul Goldschmidt | DH | 3 | 0 | 0 | 0 | 1 | 0 | 4 |
30 | Nolan Arenado | 3B | 3 | 0 | 0 | 0 | 1 | 1 | 4 |
31 | Albert Pujols | 1B | 4 | 0 | 0 | 0 | 0 | 1 | 4 |
32 | Corey Dickerson | LF | 3 | 0 | 0 | 0 | 0 | 0 | 3 |
33 | Nolan Gorman | 2B | 3 | 0 | 0 | 0 | 0 | 2 | 3 |
34 | Yadier Molina | C | 3 | 0 | 1 | 0 | 0 | 0 | 3 |
35 | Tommy Edman | SS | 3 | 0 | 1 | 0 | 0 | 0 | 3 |
Groupby Syntax
When using the groupby
function to group data by column, you pass one parameter into the function. The parameter is the string version of the column name. So to group by the "name" column, we will pass the string "name"
as a parameter to the function. The next thing you use is the agg
function, that specifies how the grouped rows should be aggregated together. After aggregating the data, reset_index
will bring it back to a simple, index-based DataFrame.
Example of the Groupby Syntax
df2 = df.groupby("name").agg("count").reset_index()
df2
name | position | AB | R | H | RBI | BB | SO | PA | |
---|---|---|---|---|---|---|---|---|---|
0 | Albert Pujols | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
1 | Alec Burleson | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | Andrew Knizner | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
3 | Ben DeLuzio | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | Brendan Donovan | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
5 | Corey Dickerson | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
6 | Lars Nootbaar | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
7 | Nolan Arenado | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
8 | Nolan Gorman | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
9 | Paul Goldschmidt | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
10 | Tommy Edman | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
11 | Tyler O'Neill | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
12 | Yadier Molina | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
The line of code output a new DataFrame, but what does that DataFrame mean? In this case, we aggregated the "name" column by count. In this example, count found all of the duplicates in the "name" column, and summed them up. So, in the case of Yadier Molina, there were 3 duplicates found. This means that Yadier Molina started 3 out of the 4 possible games in the series versus Washington.
Different Types of Aggregation
count: counts the number of duplicates in the column passed into the groupby
function. This type of aggregation would be useful when trying to find the total number of games played by a specific player.
sum: adds up every non-string column but the column passed into the groupby
function. This type of aggregation would be useful if we tried to find the player with the most hits across the 4 game series.
df3 = df.groupby("name").agg("sum").reset_index()
df3.nlargest(1, "H")
name | AB | R | H | RBI | BB | SO | PA | |
---|---|---|---|---|---|---|---|---|
4 | Brendan Donovan | 10 | 4 | 6 | 1 | 2 | 0 | 12 |
max: Finds the maximum value in all of the columns except for the column passed into the groupby
function. This could be useful for finding the player with the most hits in one game.
min: Finds the maximum value in all of the columns except for the column passed into the groupby
function. This could be useful for finding the player with the least hits in one game.
mean: Finds the mean value of all of the non-string columns, except for the column passed into the groupby
function.
median: Finds the median value of all of the non-string columns, except for the column passed into the groupby
function.
std: Finds the standard deviation of all of the non-string columns, except for the column passed into the groupby
function.
var: Finds the variance of all of the non-string columns, except for the column passed into the groupby
function.
mad: Finds the mean absolute deviation of all of the non-string columns, except for the column passed into the groupby
function.
prod: finds the product of every non-string column, except for the column passed into the groupby
function.
Passing Multiple Arguments to the Aggregation Function
If you wanted to aggregate by multiple things, you pass a list of the kinds of aggregation you want to use into the agg
function.
df4 = df.groupby("Name").agg(["mean", "median", "max"]).reset_index()
df4
name | AB | R | H | RBI | BB | SO | PA | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | median | max | mean | median | max | mean | median | max | mean | median | max | mean | median | max | mean | median | max | mean | median | max | ||
0 | Albert Pujols | 4.000000 | 4.0 | 4 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 1.000000 | 1.0 | 1 | 4.000000 | 4.0 | 4 |
1 | Albert Pujols | 5.000000 | 5.0 | 5 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 5.000000 | 5.0 | 5 |
2 | Alec Burleson | 4.000000 | 4.0 | 4 | 1.000000 | 1.0 | 1 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 1.000000 | 1.0 | 1 | 0.000000 | 0.0 | 0 | 5.000000 | 5.0 | 5 |
3 | Andrew Knizner | 3.000000 | 3.0 | 3 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 1.000000 | 1.0 | 1 | 3.000000 | 3.0 | 3 |
4 | Ben DeLuzio | 2.000000 | 2.0 | 2 | 0.000000 | 0.0 | 0 | 1.000000 | 1.0 | 1 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 2.000000 | 2.0 | 2 |
5 | Brendan Donovan | 3.333333 | 3.0 | 4 | 1.333333 | 1.0 | 2 | 2.000000 | 2.0 | 3 | 0.333333 | 0.0 | 1 | 0.666667 | 0.0 | 2 | 0.000000 | 0.0 | 0 | 4.000000 | 4.0 | 5 |
6 | Corey Dickerson | 3.750000 | 3.5 | 5 | 0.250000 | 0.0 | 1 | 1.250000 | 1.5 | 2 | 0.750000 | 0.5 | 2 | 0.000000 | 0.0 | 0 | 0.250000 | 0.0 | 1 | 3.750000 | 3.5 | 5 |
7 | Lars Nootbaar | 4.000000 | 4.0 | 4 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 2.666667 | 2.0 | 4 | 4.000000 | 4.0 | 4 |
8 | Nolan Arenado | 3.750000 | 4.0 | 4 | 0.500000 | 0.5 | 1 | 0.750000 | 0.5 | 2 | 0.250000 | 0.0 | 1 | 0.250000 | 0.0 | 1 | 0.250000 | 0.0 | 1 | 4.000000 | 4.0 | 4 |
9 | Nolan Gorman | 3.333333 | 3.0 | 4 | 0.333333 | 0.0 | 1 | 0.666667 | 0.0 | 2 | 0.666667 | 0.0 | 2 | 0.000000 | 0.0 | 0 | 2.000000 | 2.0 | 3 | 3.333333 | 3.0 | 4 |
10 | Paul Goldschmidt | 3.250000 | 3.0 | 4 | 0.500000 | 0.0 | 2 | 0.750000 | 1.0 | 1 | 0.500000 | 0.5 | 1 | 0.750000 | 1.0 | 1 | 0.500000 | 0.5 | 1 | 4.000000 | 4.0 | 4 |
11 | Tommy Edman | 3.500000 | 3.5 | 4 | 0.000000 | 0.0 | 0 | 1.250000 | 1.0 | 2 | 0.500000 | 0.0 | 2 | 0.250000 | 0.0 | 1 | 0.250000 | 0.0 | 1 | 3.750000 | 3.5 | 5 |
12 | Tyler O'Neill | 3.000000 | 3.0 | 4 | 0.333333 | 0.0 | 1 | 0.333333 | 0.0 | 1 | 0.333333 | 0.0 | 1 | 0.333333 | 0.0 | 1 | 1.333333 | 1.0 | 3 | 3.666667 | 4.0 | 4 |
13 | Yadier Molina | 3.666667 | 4.0 | 4 | 0.666667 | 0.0 | 2 | 1.666667 | 2.0 | 2 | 1.333333 | 1.0 | 3 | 0.000000 | 0.0 | 0 | 0.000000 | 0.0 | 0 | 3.666667 | 4.0 | 4 |
Selecting Certain Columns to Groupby
Sometimes when you use groupby, you only care about the result of one or two columns. As we saw in the sum example, it shows the results of the groupby for every column, even though we only really cared about the hits. So how would we only select the columns we want to see? Lets find out who had the most hits in one start for the Cardinals:
df5 = df[["name", "H"]].groupby("name").agg("count").reset_index()
df5.nlargest(1, "H")
name | H | |
---|---|---|
4 | Brendan Donovan | 3 |
By passing a list of the columns I wanted to look at, and making sure to include the column passed into the groupby
function, we guarantee that only the columns I want to see appear in the result. Lets take a look at what happens if I don't pass in a list of columns, and just take the max of every column:
df6 = df.groupby("name").agg("max").reset_index()
df6.nlargest(1, "H")
name | position | AB | R | H | RBI | BB | SO | PA | |
---|---|---|---|---|---|---|---|---|---|
4 | Brendan Donovan | DH | 4 | 2 | 3 | 1 | 2 | 0 | 5 |
As you can see the result is so much more cluttered and gives more information than necessary. That makes it a little harder to interpret. Sometimes it is nice to view every column's result, but it can be useful to only output the necessary columns.
Grouping by Multiple Columns
Sometimes it can be useful to sort the data by multiple columns. In this DataFrame, an example of when it would be useful is by looking at what position a player is more productive as. Someone like Brendan Donovan played 3 different positions, but Yadier Molina only played catcher. If you don't group by multiple columns, you won't be able to see how well a player hits at different positions. So in this example we will be grouping by the "name" and "position" columns:
df7 = df.groupby(["name", "position"]).agg("sum").reset_index()
df7
name | position | AB | R | H | RBI | BB | SO | PA | |
---|---|---|---|---|---|---|---|---|---|
0 | Albert Pujols | 1B | 4 | 0 | 0 | 0 | 0 | 1 | 4 |
1 | Albert Pujols | DH | 5 | 0 | 0 | 0 | 0 | 0 | 5 |
2 | Alec Burleson | RF | 4 | 1 | 0 | 0 | 1 | 0 | 5 |
3 | Andrew Knizner | C | 3 | 0 | 0 | 0 | 0 | 1 | 3 |
4 | Ben DeLuzio | CF | 2 | 0 | 1 | 0 | 0 | 0 | 2 |
5 | Brendan Donovan | 2B | 3 | 2 | 3 | 0 | 2 | 0 | 5 |
6 | Brendan Donovan | 3B | 3 | 1 | 1 | 1 | 0 | 0 | 3 |
7 | Brendan Donovan | DH | 4 | 1 | 2 | 0 | 0 | 0 | 4 |
8 | Corey Dickerson | LF | 15 | 1 | 5 | 3 | 0 | 1 | 15 |
9 | Lars Nootbaar | RF | 12 | 0 | 0 | 0 | 0 | 8 | 12 |
10 | Nolan Arenado | 3B | 11 | 1 | 2 | 1 | 1 | 1 | 12 |
11 | Nolan Arenado | DH | 4 | 1 | 1 | 0 | 0 | 0 | 4 |
12 | Nolan Gorman | 2B | 10 | 1 | 2 | 2 | 0 | 6 | 10 |
13 | Paul Goldschmidt | 1B | 10 | 2 | 3 | 2 | 2 | 2 | 12 |
14 | Paul Goldschmidt | DH | 3 | 0 | 0 | 0 | 1 | 0 | 4 |
15 | Tommy Edman | SS | 14 | 0 | 5 | 2 | 1 | 1 | 15 |
16 | Tyler O'Neill | CF | 9 | 1 | 1 | 1 | 1 | 4 | 11 |
17 | Yadier Molina | C | 11 | 2 | 5 | 4 | 0 | 0 | 11 |
While usually grouping by one column gives the intended result, sometimes grouping by multiple columns can give you a better sense of what the data is trying to tell you.
pandas Documentation
Click Here for the full pandas documentation for the groupby
function