Grouping Data by column in a DataFrame


The groupby function is primarily used to combine duplicate rows of a given column of a pandas DataFrame.

To explore the groupby function we will use a DataFrame of the St. Louis Cardinals starting lineups in a 4 game series against the Washington Nationals:

import pandas as pd

df =  pd.DataFrame([
    #9/8/2022 WSH 11 - 6 STL
    {'name' : 'Brendan Donovan', 'position' : '2B', 'AB' : 3, 'R' : 2, 'H' : 3, 'RBI' : 0, 'BB' : 2, 'SO' : 0, 'PA' : 5 },
    {'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 4, 'R' : 0, 'H' : 2, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 5 },
    {'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 1, 'PA' : 4 },
    {'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 5, 'R' : 0, 'H' : 2, 'RBI' : 2, 'BB' : 0, 'SO' : 1, 'PA' : 5 },
    {'name' : 'Albert Pujols', 'position' : 'DH', 'AB' : 5, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 5 },
    {'name' : 'Alec Burleson', 'position' : 'RF', 'AB' : 4, 'R' : 1, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 0, 'PA' : 5 },
    {'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 4, 'R' : 2, 'H' : 2, 'RBI' : 3, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Ben DeLuzio', 'position' : 'CF', 'AB' : 2, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 2 },

    #9/7/2022 WSH 5 - 6 STL
    {'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 4 },
    {'name' : 'Brendan Donovan', 'position' : 'DH', 'AB' : 4, 'R' : 1, 'H' : 2, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 3, 'R' : 2, 'H' : 1, 'RBI' : 1, 'BB' : 1, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 4, 'R' : 1, 'H' : 2, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 3, 'R' : 1, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 3, 'PA' : 4 },
    {'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 3, 'PA' : 4 },
    {'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 4, 'R' : 0, 'H' : 2, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 2, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    
    #9/6/2022 WSH 1 - 4 STL
    {'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 4 },
    {'name' : 'Brendan Donovan', 'position' : '3B', 'AB' : 3, 'R' : 1, 'H' : 1, 'RBI' : 1, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
    {'name' : 'Paul Goldschmidt', 'position' : '1B', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 4 },
    {'name' : 'Nolan Arenado', 'position' : 'DH', 'AB' : 4, 'R' : 1, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 3, 'R' : 1, 'H' : 2, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
    {'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 2, 'R' : 0, 'H' : 0, 'RBI' : 1, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
    {'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 3, 'R' : 1, 'H' : 2, 'RBI' : 2, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
    {'name' : 'Andrew Knizner', 'position' : 'C', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 1, 'PA' : 3 },
    {'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },

    #9/5/2022 WSH 6 - 0 STL
    {'name' : 'Lars Nootbaar', 'position' : 'RF', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 4, 'PA' : 4 },
    {'name' : 'Tyler O\'Neill', 'position' : 'CF', 'AB' : 4, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Paul Goldschmidt', 'position' : 'DH', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 0, 'PA' : 4 },
    {'name' : 'Nolan Arenado', 'position' : '3B', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 1, 'SO' : 1, 'PA' : 4 },
    {'name' : 'Albert Pujols', 'position' : '1B', 'AB' : 4, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 1, 'PA' : 4 },
    {'name' : 'Corey Dickerson', 'position' : 'LF', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
    {'name' : 'Nolan Gorman', 'position' : '2B', 'AB' : 3, 'R' : 0, 'H' : 0, 'RBI' : 0, 'BB' : 0, 'SO' : 2, 'PA' : 3 },
    {'name' : 'Yadier Molina', 'position' : 'C', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 },
    {'name' : 'Tommy Edman', 'position' : 'SS', 'AB' : 3, 'R' : 0, 'H' : 1, 'RBI' : 0, 'BB' : 0, 'SO' : 0, 'PA' : 3 }
])
df
namepositionABRHRBIBBSOPA
0Brendan Donovan2B3230205
1Tommy EdmanSS4020205
2Paul Goldschmidt1B4011014
3Nolan Arenado3B4000004
4Corey DickersonLF5022015
5Albert PujolsDH5000005
6Alec BurlesonRF4100105
7Yadier MolinaC4223004
8Ben DeLuzioCF2010002
9Lars NootbaarRF4000024
10Brendan DonovanDH4120004
11Paul Goldschmidt1B3211104
12Nolan Arenado3B4121004
13Corey DickersonLF4011004
14Tyler O'NeillCF3100134
15Nolan Gorman2B4000034
16Yadier MolinaC4021004
17Tommy EdmanSS4021004
18Lars NootbaarRF4000024
19Brendan Donovan3B3111003
20Paul Goldschmidt1B3010114
21Nolan ArenadoDH4110004
22Corey DickersonLF3120003
23Tyler O'NeillCF2001013
24Nolan Gorman2B3122013
25Andrew KniznerC3000013
26Tommy EdmanSS3010003
27Lars NootbaarRF4000044
28Tyler O'NeillCF4010004
29Paul GoldschmidtDH3000104
30Nolan Arenado3B3000114
31Albert Pujols1B4000014
32Corey DickersonLF3000003
33Nolan Gorman2B3000023
34Yadier MolinaC3010003
35Tommy EdmanSS3010003
Create our initial DataFrame of the 4 game series

Groupby Syntax

When using the groupby function to group data by column, you pass one parameter into the function. The parameter is the string version of the column name. So to group by the "name" column, we will pass the string "name" as a parameter to the function. The next thing you use is the agg function, that specifies how the grouped rows should be aggregated together. After aggregating the data, reset_index will bring it back to a simple, index-based DataFrame.

Example of the Groupby Syntax

df2 = df.groupby("name").agg("count").reset_index()
df2
namepositionABRHRBIBBSOPA
0Albert Pujols22222222
1Alec Burleson11111111
2Andrew Knizner11111111
3Ben DeLuzio11111111
4Brendan Donovan33333333
5Corey Dickerson44444444
6Lars Nootbaar33333333
7Nolan Arenado44444444
8Nolan Gorman33333333
9Paul Goldschmidt44444444
10Tommy Edman44444444
11Tyler O'Neill33333333
12Yadier Molina33333333
An example of groupby syntax in which we aggregate by count

The line of code output a new DataFrame, but what does that DataFrame mean? In this case, we aggregated the "name" column by count. In this example, count found all of the duplicates in the "name" column, and summed them up. So, in the case of Yadier Molina, there were 3 duplicates found. This means that Yadier Molina started 3 out of the 4 possible games in the series versus Washington.

Different Types of Aggregation

count: counts the number of duplicates in the column passed into the groupby function. This type of aggregation would be useful when trying to find the total number of games played by a specific player.

sum: adds up every non-string column but the column passed into the groupby function. This type of aggregation would be useful if we tried to find the player with the most hits across the 4 game series.

df3 = df.groupby("name").agg("sum").reset_index()
df3.nlargest(1, "H")
nameABRHRBIBBSOPA
4Brendan Donovan104612012
An example of Aggregation by sum

max: Finds the maximum value in all of the columns except for the column passed into the groupby function. This could be useful for finding the player with the most hits in one game.

min: Finds the maximum value in all of the columns except for the column passed into the groupby function. This could be useful for finding the player with the least hits in one game.

mean: Finds the mean value of all of the non-string columns, except for the column passed into the groupby function.

median: Finds the median value of all of the non-string columns, except for the column passed into the groupby function.

std: Finds the standard deviation of all of the non-string columns, except for the column passed into the groupby function.

var: Finds the variance of all of the non-string columns, except for the column passed into the groupby function.

mad: Finds the mean absolute deviation of all of the non-string columns, except for the column passed into the groupby function.

prod: finds the product of every non-string column, except for the column passed into the groupby function.

Passing Multiple Arguments to the Aggregation Function

If you wanted to aggregate by multiple things, you pass a list of the kinds of aggregation you want to use into the agg function.

df4 = df.groupby("Name").agg(["mean", "median", "max"]).reset_index()
df4
nameABRHRBIBBSOPA
meanmedianmaxmeanmedianmaxmeanmedianmaxmeanmedianmaxmeanmedianmaxmeanmedianmaxmeanmedianmax
0Albert Pujols4.0000004.040.0000000.000.0000000.000.0000000.000.0000000.001.0000001.014.0000004.04
1Albert Pujols5.0000005.050.0000000.000.0000000.000.0000000.000.0000000.000.0000000.005.0000005.05
2Alec Burleson4.0000004.041.0000001.010.0000000.000.0000000.001.0000001.010.0000000.005.0000005.05
3Andrew Knizner3.0000003.030.0000000.000.0000000.000.0000000.000.0000000.001.0000001.013.0000003.03
4Ben DeLuzio2.0000002.020.0000000.001.0000001.010.0000000.000.0000000.000.0000000.002.0000002.02
5Brendan Donovan3.3333333.041.3333331.022.0000002.030.3333330.010.6666670.020.0000000.004.0000004.05
6Corey Dickerson3.7500003.550.2500000.011.2500001.520.7500000.520.0000000.000.2500000.013.7500003.55
7Lars Nootbaar4.0000004.040.0000000.000.0000000.000.0000000.000.0000000.002.6666672.044.0000004.04
8Nolan Arenado3.7500004.040.5000000.510.7500000.520.2500000.010.2500000.010.2500000.014.0000004.04
9Nolan Gorman3.3333333.040.3333330.010.6666670.020.6666670.020.0000000.002.0000002.033.3333333.04
10Paul Goldschmidt3.2500003.040.5000000.020.7500001.010.5000000.510.7500001.010.5000000.514.0000004.04
11Tommy Edman3.5000003.540.0000000.001.2500001.020.5000000.020.2500000.010.2500000.013.7500003.55
12Tyler O'Neill3.0000003.040.3333330.010.3333330.010.3333330.010.3333330.011.3333331.033.6666674.04
13Yadier Molina3.6666674.040.6666670.021.6666672.021.3333331.030.0000000.000.0000000.003.6666674.04
An example of multiple aggregation arguments

Selecting Certain Columns to Groupby

Sometimes when you use groupby, you only care about the result of one or two columns. As we saw in the sum example, it shows the results of the groupby for every column, even though we only really cared about the hits. So how would we only select the columns we want to see? Lets find out who had the most hits in one start for the Cardinals:

df5 = df[["name", "H"]].groupby("name").agg("count").reset_index()
df5.nlargest(1, "H")
nameH
4Brendan Donovan3
Grouping by name and only looking at the result of the name and H columns

By passing a list of the columns I wanted to look at, and making sure to include the column passed into the groupby function, we guarantee that only the columns I want to see appear in the result. Lets take a look at what happens if I don't pass in a list of columns, and just take the max of every column:

df6 = df.groupby("name").agg("max").reset_index()
df6.nlargest(1, "H")
namepositionABRHRBIBBSOPA
4Brendan DonovanDH4231205
Grouping by name and aggregating by the max of every column

As you can see the result is so much more cluttered and gives more information than necessary. That makes it a little harder to interpret. Sometimes it is nice to view every column's result, but it can be useful to only output the necessary columns.

Grouping by Multiple Columns

Sometimes it can be useful to sort the data by multiple columns. In this DataFrame, an example of when it would be useful is by looking at what position a player is more productive as. Someone like Brendan Donovan played 3 different positions, but Yadier Molina only played catcher. If you don't group by multiple columns, you won't be able to see how well a player hits at different positions. So in this example we will be grouping by the "name" and "position" columns:

df7 = df.groupby(["name", "position"]).agg("sum").reset_index()
df7
namepositionABRHRBIBBSOPA
0Albert Pujols1B4000014
1Albert PujolsDH5000005
2Alec BurlesonRF4100105
3Andrew KniznerC3000013
4Ben DeLuzioCF2010002
5Brendan Donovan2B3230205
6Brendan Donovan3B3111003
7Brendan DonovanDH4120004
8Corey DickersonLF151530115
9Lars NootbaarRF120000812
10Nolan Arenado3B111211112
11Nolan ArenadoDH4110004
12Nolan Gorman2B101220610
13Paul Goldschmidt1B102322212
14Paul GoldschmidtDH3000104
15Tommy EdmanSS140521115
16Tyler O'NeillCF91111411
17Yadier MolinaC112540011
Groping by the name and position column

While usually grouping by one column gives the intended result, sometimes grouping by multiple columns can give you a better sense of what the data is trying to tell you.

pandas Documentation

Click Here for the full pandas documentation for the groupby function