Defining Your Own Aggregation Function in Pandas
Introduction
The pandas df.groupby().agg() method is an extremely useful tool that gives us an aggregate representation of each group in our data. Sometimes, however, the built-in aggregations are not enough, and we have to define a custom function ourselves to meet our needs.
Built-in Function df.groupby().agg() in Pandas
Let us take a look at the marks of two high-school students in various subjects:
import pandas as pd
df = pd.DataFrame([
{"Student Name": "Student1", "Subject": "Math", "Marks": 51},
{"Student Name": "Student1", "Subject": "Science", "Marks": 82},
{"Student Name": "Student2", "Subject": "Math", "Marks": 60},
{"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
{"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
{"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
{"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])
df
Student Name | Subject | Marks |
---|---|---|
Student1 | Math | 51 |
Student1 | Science | 82 |
Student2 | Math | 60 |
Student2 | English Literature | 45 |
Student1 | Social Sciences | 75 |
Student2 | Social Sciences | 73 |
Student1 | Economics | 35 |
In the guide Grouping Data by column in a DataFrame, we have already learned how to use groupby.agg() to perform some standard operations, such as sum() and mean(). The syntax for that is simple:
df.groupby("Student Name")["Marks"].agg("sum").reset_index()
Student Name | Marks |
---|---|
Student1 | 243 |
Student2 | 178 |
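Several built-in aggregations can also be requested in a single call by passing a list of their names. The following is a short sketch on the same data; the variable name summary is only illustrative:

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

# Passing a list of names produces one output column per aggregation
summary = df.groupby("Student Name")["Marks"].agg(["sum", "mean"]).reset_index()
print(summary)
```

With a list of names, each result column is labelled after the aggregation that produced it (here sum and mean).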
Self-Defined Aggregation Function
But what if we wanted an aggregation specific to our problem, something that can't be done using the default agg() functions? For such cases, Pandas allows users to define their own custom aggregation functions.
Let's say we want to check whether a student is eligible for a scholarship. To be eligible, a student should score above 50 marks in at least 3 different subjects. As you might realize, none of the default aggregation functions covers this case.
For this, we can write a custom aggregation function which checks for the scholarship eligibility:
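A minimal sketch of such a function, assuming the rule stated above (more than 50 marks in at least 3 subjects). The names check_scholarship_eligibility and list_of_marks are the ones used in this guide; the body is one possible implementation rather than a definitive one:

```python
def check_scholarship_eligibility(list_of_marks):
    # Count the subjects in which the student scored above 50 marks
    subjects_above_50 = sum(1 for marks in list_of_marks if marks > 50)
    # Assumed rule from the text: eligible with at least 3 such subjects
    if subjects_above_50 >= 3:
        return "Eligible"
    return "Not Eligible"
```

Because the function only iterates over its argument, it works equally well on a plain list and on the Series that pandas passes in for each group.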
Pandas always passes all the values for a group to the custom aggregation function as a Series object, which can be iterated over just like a list. In this case, it means that for Student1 the value of the list_of_marks parameter would be [51, 82, 75, 35].
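One way to see what the function receives is to aggregate with a small helper that reports the type and contents of its argument. This is only an illustrative sketch; describe_group is a hypothetical name, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def describe_group(list_of_marks):
    # Hypothetical helper: report the type pandas passes in and the values it holds
    return f"{type(list_of_marks).__name__}: {list_of_marks.tolist()}"

# One string per student, describing the object the helper was called with
received = df.groupby("Student Name")["Marks"].agg(describe_group)
print(received)
```

The values in each group appear in the same order as the corresponding rows of the original DataFrame.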
Using your custom aggregation function is straightforward: just pass the function itself (without calling it) inside the agg() parentheses:
df_eligibility = df.groupby("Student Name")["Marks"].agg(check_scholarship_eligibility).reset_index()
# rename() returns a new DataFrame, so assign the result to keep the more informative column name
df_eligibility = df_eligibility.rename(columns={"Marks": "Scholarship"})
df_eligibility
Student Name | Scholarship |
---|---|
Student1 | Eligible |
Student2 | Not Eligible |
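As an alternative to renaming afterwards, pandas' named aggregation syntax can set the output column name directly. The sketch below assumes a pandas version that supports named aggregation (0.25 or later) and uses an assumed implementation of check_scholarship_eligibility based on the rule described earlier:

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def check_scholarship_eligibility(list_of_marks):
    # Assumed rule: above 50 marks in at least 3 subjects
    subjects_above_50 = sum(1 for marks in list_of_marks if marks > 50)
    return "Eligible" if subjects_above_50 >= 3 else "Not Eligible"

# Named aggregation: output_column=(input_column, aggregation_function)
df_named = (
    df.groupby("Student Name")
    .agg(Scholarship=("Marks", check_scholarship_eligibility))
    .reset_index()
)
print(df_named)
```

This yields the same two columns, Student Name and Scholarship, without a separate rename step.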