Grouping Data by column in a DataFrame


The groupby function is primarily used to combine duplicate rows of a given column of a pandas DataFrame.

To explore the groupby function we will use a DataFrame of the St. Louis Cardinals starting lineups in a 4 game series against the Washington Nationals:

Reset Code Python Output:
name position AB R H RBI BB SO PA
0 Brendan Donovan 2B 3 2 3 0 2 0 5
1 Tommy Edman SS 4 0 2 0 2 0 5
2 Paul Goldschmidt 1B 4 0 1 1 0 1 4
3 Nolan Arenado 3B 4 0 0 0 0 0 4
4 Corey Dickerson LF 5 0 2 2 0 1 5
5 Albert Pujols DH 5 0 0 0 0 0 5
6 Alec Burleson RF 4 1 0 0 1 0 5
7 Yadier Molina C 4 2 2 3 0 0 4
8 Ben DeLuzio CF 2 0 1 0 0 0 2
9 Lars Nootbaar RF 4 0 0 0 0 2 4
10 Brendan Donovan DH 4 1 2 0 0 0 4
11 Paul Goldschmidt 1B 3 2 1 1 1 0 4
12 Nolan Arenado 3B 4 1 2 1 0 0 4
13 Corey Dickerson LF 4 0 1 1 0 0 4
14 Tyler O'Neill CF 3 1 0 0 1 3 4
15 Nolan Gorman 2B 4 0 0 0 0 3 4
16 Yadier Molina C 4 0 2 1 0 0 4
17 Tommy Edman SS 4 0 2 1 0 0 4
18 Lars Nootbaar RF 4 0 0 0 0 2 4
19 Brendan Donovan 3B 3 1 1 1 0 0 3
20 Paul Goldschmidt 1B 3 0 1 0 1 1 4
21 Nolan Arenado DH 4 1 1 0 0 0 4
22 Corey Dickerson LF 3 1 2 0 0 0 3
23 Tyler O'Neill CF 2 0 0 1 0 1 3
24 Nolan Gorman 2B 3 1 2 2 0 1 3
25 Andrew Knizner C 3 0 0 0 0 1 3
26 Tommy Edman SS 3 0 1 0 0 0 3
27 Lars Nootbaar RF 4 0 0 0 0 4 4
28 Tyler O'Neill CF 4 0 1 0 0 0 4
29 Paul Goldschmidt DH 3 0 0 0 1 0 4
30 Nolan Arenado 3B 3 0 0 0 1 1 4
31 Albert Pujols 1B 4 0 0 0 0 1 4
32 Corey Dickerson LF 3 0 0 0 0 0 3
33 Nolan Gorman 2B 3 0 0 0 0 2 3
34 Yadier Molina C 3 0 1 0 0 0 3
35 Tommy Edman SS 3 0 1 0 0 0 3

Groupby Syntax

When using the groupby function to group data by column, you pass one parameter into the function. The parameter is the string version of the column name. So to group by the "name" column, we will pass the string "name" as a parameter to the function. The next thing you use is the agg function, that specifies how the grouped rows should be aggregated together. After aggregating the data, reset_index will bring it back to a simple, index-based DataFrame.

Example of the Groupby Syntax

Reset Code Python Output:
name position AB R H RBI BB SO PA
0 Albert Pujols 2 2 2 2 2 2 2 2
1 Alec Burleson 1 1 1 1 1 1 1 1
2 Andrew Knizner 1 1 1 1 1 1 1 1
3 Ben DeLuzio 1 1 1 1 1 1 1 1
4 Brendan Donovan 3 3 3 3 3 3 3 3
5 Corey Dickerson 4 4 4 4 4 4 4 4
6 Lars Nootbaar 3 3 3 3 3 3 3 3
7 Nolan Arenado 4 4 4 4 4 4 4 4
8 Nolan Gorman 3 3 3 3 3 3 3 3
9 Paul Goldschmidt 4 4 4 4 4 4 4 4
10 Tommy Edman 4 4 4 4 4 4 4 4
11 Tyler O'Neill 3 3 3 3 3 3 3 3
12 Yadier Molina 3 3 3 3 3 3 3 3

The line of code output a new DataFrame, but what does that DataFrame mean? In this case, we aggregated the "name" column by count. In this example, count found all of the duplicates in the "name" column, and summed them up. So, in the case of Yadier Molina, there were 3 duplicates found. This means that Yadier Molina started 3 out of the 4 possible games in the series versus Washington.

Different Types of Aggregation

count: counts the number of duplicates in the column passed into the groupby function. This type of aggregation would be useful when trying to find the total number of games played by a specific player.

sum: adds up every non-string column but the column passed into the groupby function. This type of aggregation would be useful if we tried to find the player with the most hits across the 4 game series.

Reset Code Python Output:
name AB R H RBI BB SO PA
4 Brendan Donovan 10 4 6 1 2 0 12

max: Finds the maximum value in all of the columns except for the column passed into the groupby function. This could be useful for finding the player with the most hits in one game.

min: Finds the maximum value in all of the columns except for the column passed into the groupby function. This could be useful for finding the player with the least hits in one game.

mean: Finds the mean value of all of the non-string columns, except for the column passed into the groupby function.

median: Finds the median value of all of the non-string columns, except for the column passed into the groupby function.

std: Finds the standard deviation of all of the non-string columns, except for the column passed into the groupby function.

var: Finds the variance of all of the non-string columns, except for the column passed into the groupby function.

mad: Finds the mean absolute deviation of all of the non-string columns, except for the column passed into the groupby function.

prod: finds the product of every non-string column, except for the column passed into the groupby function.

Passing Multiple Arguments to the Aggregation Function

If you wanted to aggregate by multiple things, you pass a list of the kinds of aggregation you want to use into the agg function.

Reset Code Python Output:
name AB R H RBI BB SO PA
mean median max mean median max mean median max mean median max mean median max mean median max mean median max
0 Albert Pujols 4.000000 4.0 4 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 1.000000 1.0 1 4.000000 4.0 4
1 Albert Pujols 5.000000 5.0 5 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 5.000000 5.0 5
2 Alec Burleson 4.000000 4.0 4 1.000000 1.0 1 0.000000 0.0 0 0.000000 0.0 0 1.000000 1.0 1 0.000000 0.0 0 5.000000 5.0 5
3 Andrew Knizner 3.000000 3.0 3 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 1.000000 1.0 1 3.000000 3.0 3
4 Ben DeLuzio 2.000000 2.0 2 0.000000 0.0 0 1.000000 1.0 1 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 2.000000 2.0 2
5 Brendan Donovan 3.333333 3.0 4 1.333333 1.0 2 2.000000 2.0 3 0.333333 0.0 1 0.666667 0.0 2 0.000000 0.0 0 4.000000 4.0 5
6 Corey Dickerson 3.750000 3.5 5 0.250000 0.0 1 1.250000 1.5 2 0.750000 0.5 2 0.000000 0.0 0 0.250000 0.0 1 3.750000 3.5 5
7 Lars Nootbaar 4.000000 4.0 4 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 0.000000 0.0 0 2.666667 2.0 4 4.000000 4.0 4
8 Nolan Arenado 3.750000 4.0 4 0.500000 0.5 1 0.750000 0.5 2 0.250000 0.0 1 0.250000 0.0 1 0.250000 0.0 1 4.000000 4.0 4
9 Nolan Gorman 3.333333 3.0 4 0.333333 0.0 1 0.666667 0.0 2 0.666667 0.0 2 0.000000 0.0 0 2.000000 2.0 3 3.333333 3.0 4
10 Paul Goldschmidt 3.250000 3.0 4 0.500000 0.0 2 0.750000 1.0 1 0.500000 0.5 1 0.750000 1.0 1 0.500000 0.5 1 4.000000 4.0 4
11 Tommy Edman 3.500000 3.5 4 0.000000 0.0 0 1.250000 1.0 2 0.500000 0.0 2 0.250000 0.0 1 0.250000 0.0 1 3.750000 3.5 5
12 Tyler O'Neill 3.000000 3.0 4 0.333333 0.0 1 0.333333 0.0 1 0.333333 0.0 1 0.333333 0.0 1 1.333333 1.0 3 3.666667 4.0 4
13 Yadier Molina 3.666667 4.0 4 0.666667 0.0 2 1.666667 2.0 2 1.333333 1.0 3 0.000000 0.0 0 0.000000 0.0 0 3.666667 4.0 4

Selecting Certain Columns to Groupby

Sometimes when you use groupby, you only care about the result of one or two columns. As we saw in the sum example, it shows the results of the groupby for every column, even though we only really cared about the hits. So how would we only select the columns we want to see? Lets find out who had the most hits in one start for the Cardinals:

Reset Code Python Output:
name H
5 Corey Dickerson 4

By passing a list of the columns I wanted to look at, and making sure to include the column passed into the groupby function, we guarantee that only the columns I want to see appear in the result. Lets take a look at what happens if I don't pass in a list of columns, and just take the max of every column:

Reset Code Python Output:
name position AB R H RBI BB SO PA
4 Brendan Donovan DH 4 2 3 1 2 0 5

As you can see the result is so much more cluttered and gives more information than necessary. That makes it a little harder to interpret. Sometimes it is nice to view every column's result, but it can be useful to only output the necessary columns.

Grouping by Multiple Columns

Sometimes it can be useful to sort the data by multiple columns. In this DataFrame, an example of when it would be useful is by looking at what position a player is more productive as. Someone like Brendan Donovan played 3 different positions, but Yadier Molina only played catcher. If you don't group by multiple columns, you won't be able to see how well a player hits at different positions. So in this example we will be grouping by the "name" and "position" columns:

Reset Code Python Output:
name position AB R H RBI BB SO PA
0 Albert Pujols 1B 4 0 0 0 0 1 4
1 Albert Pujols DH 5 0 0 0 0 0 5
2 Alec Burleson RF 4 1 0 0 1 0 5
3 Andrew Knizner C 3 0 0 0 0 1 3
4 Ben DeLuzio CF 2 0 1 0 0 0 2
5 Brendan Donovan 2B 3 2 3 0 2 0 5
6 Brendan Donovan 3B 3 1 1 1 0 0 3
7 Brendan Donovan DH 4 1 2 0 0 0 4
8 Corey Dickerson LF 15 1 5 3 0 1 15
9 Lars Nootbaar RF 12 0 0 0 0 8 12
10 Nolan Arenado 3B 11 1 2 1 1 1 12
11 Nolan Arenado DH 4 1 1 0 0 0 4
12 Nolan Gorman 2B 10 1 2 2 0 6 10
13 Paul Goldschmidt 1B 10 2 3 2 2 2 12
14 Paul Goldschmidt DH 3 0 0 0 1 0 4
15 Tommy Edman SS 14 0 5 2 1 1 15
16 Tyler O'Neill CF 9 1 1 1 1 4 11
17 Yadier Molina C 11 2 5 4 0 0 11

While usually grouping by one column gives the intended result, sometimes grouping by multiple columns can give you a better sense of what the data is trying to tell you.

pandas Documentation

Click Here for the full pandas documentation for the groupby function