Defining Your Own Aggregation Function in Pandas


Introduction

The pandas method df.groupby().agg() is extremely useful: it lets us compute an aggregate value for each group in our data. Sometimes, however, the built-in aggregations are not enough, and we have to define a custom function to meet our needs.

Built-in Function df.groupby().agg() in Pandas

Let us take a look at the marks of two high-school students in various subjects:

Student Name  Subject             Marks
Student1      Math                51
Student1      Science             82
Student2      Math                60
Student2      English Literature  45
Student1      Social Sciences     75
Student2      Social Sciences     73
Student1      Economics           35
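
The table above can be rebuilt as a DataFrame so that the rest of the guide is reproducible. This is a minimal sketch: the variable name df is an assumption, and the column names are taken from the table header.

```python
import pandas as pd

# Marks table from the text; `df` is an assumed variable name.
df = pd.DataFrame({
    "Student Name": ["Student1", "Student1", "Student2", "Student2",
                     "Student1", "Student2", "Student1"],
    "Subject": ["Math", "Science", "Math", "English Literature",
                "Social Sciences", "Social Sciences", "Economics"],
    "Marks": [51, 82, 60, 45, 75, 73, 35],
})

print(df)
```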

In the guide Grouping Data by column in a DataFrame, we have already learned how to use groupby.agg() to perform some standard operations, such as sum() and mean(). The syntax for that is simple:

Student Name  sum
Student1      243
Student2      178
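
The totals above could be produced by a call along the following lines. This is a sketch, assuming the marks are held in a DataFrame named df (an assumed name) with the columns shown earlier; selecting the Marks column before calling agg() keeps the text columns out of the sum.

```python
import pandas as pd

# Same marks table as in the text; `df` is an assumed variable name.
df = pd.DataFrame({
    "Student Name": ["Student1", "Student1", "Student2", "Student2",
                     "Student1", "Student2", "Student1"],
    "Subject": ["Math", "Science", "Math", "English Literature",
                "Social Sciences", "Social Sciences", "Economics"],
    "Marks": [51, 82, 60, 45, 75, 73, 35],
})

# Group the rows by student, pick the numeric column, and apply the
# built-in "sum" aggregation. Passing a list gives one column per aggregation.
totals = df.groupby("Student Name")["Marks"].agg(["sum"])

print(totals)
```

Adding further names to the list, for example ["sum", "mean"], would compute several built-in aggregations in a single call.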

Self-Defined Aggregation Function

But what if we want an aggregation that is specific to our problem, one that none of the built-in aggregations accepted by agg() can express? For such cases, pandas lets us define our own aggregation functions.

Let's say we want to check whether a student is eligible for a scholarship. To be eligible, a student must score above 50 marks in at least 3 different subjects. No single built-in aggregation such as sum() or mean() expresses this rule directly.

For this, we can write a custom aggregation function that checks scholarship eligibility:

def check_scholarship_eligibility(list_of_marks):
    # Number of subjects in which the student scored more than 50 marks
    greater_than_fifty = 0
    for score in list_of_marks:  # iterate over the marks for all subjects
        if score > 50:
            greater_than_fifty += 1
    # Eligible only if at least 3 subjects have more than 50 marks
    if greater_than_fifty >= 3:
        return "Eligible"
    else:
        return "Not Eligible"

[Figure: groupby custom aggregation]

When agg() is applied to a single selected column, pandas passes the values of each group to the custom aggregation function as a Series object, which can be iterated over just like a list. In this case, for Student1 the list_of_marks parameter would be a Series holding the values 51, 82, 75, and 35.
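
To see what the aggregation function actually receives, one option is a small probe function that reports the type and contents of its argument. This is an illustrative sketch only: describe_group and df are hypothetical names introduced here, not part of the original guide.

```python
import pandas as pd

# Same marks table as in the text; `df` is an assumed variable name.
df = pd.DataFrame({
    "Student Name": ["Student1", "Student1", "Student2", "Student2",
                     "Student1", "Student2", "Student1"],
    "Subject": ["Math", "Science", "Math", "English Literature",
                "Social Sciences", "Social Sciences", "Economics"],
    "Marks": [51, 82, 60, 45, 75, 73, 35],
})

def describe_group(values):
    # Hypothetical helper: report the type name and the values it was given.
    return f"{type(values).__name__}: {values.tolist()}"

# Each group's Marks are handed to describe_group in one call.
received = df.groupby("Student Name")["Marks"].agg(describe_group)

print(received["Student1"])  # Series: [51, 82, 75, 35]
print(received["Student2"])  # Series: [60, 45, 73]
```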

Using your custom aggregation function is straightforward: pass the function itself, without calling it, as the argument to agg():

Student Name  Scholarship
Student1      Eligible
Student2      Not Eligible
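
One way the result above could be obtained is sketched below. So that it runs on its own, the block repeats the marks table and a condensed form of check_scholarship_eligibility. The variable names df and result are assumptions, and the named-aggregation syntax used to label the output column Scholarship is one possible choice, not necessarily the call the original guide used.

```python
import pandas as pd

# Same marks table as in the text; `df` is an assumed variable name.
df = pd.DataFrame({
    "Student Name": ["Student1", "Student1", "Student2", "Student2",
                     "Student1", "Student2", "Student1"],
    "Subject": ["Math", "Science", "Math", "English Literature",
                "Social Sciences", "Social Sciences", "Economics"],
    "Marks": [51, 82, 60, 45, 75, 73, 35],
})

def check_scholarship_eligibility(list_of_marks):
    # Condensed form of the function above: count the subjects with more
    # than 50 marks, then require at least 3 of them.
    greater_than_fifty = sum(1 for score in list_of_marks if score > 50)
    return "Eligible" if greater_than_fifty >= 3 else "Not Eligible"

# Named aggregation: output column "Scholarship" is computed from the
# "Marks" column of each group using the custom function.
result = df.groupby("Student Name").agg(
    Scholarship=("Marks", check_scholarship_eligibility)
)

print(result)
```

Note that the function is passed by name, without parentheses. pandas calls it once per group and collects the returned strings into the result.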