Defining Your Own Aggregation Function in Pandas


Introduction

The pandas df.groupby().agg() method is extremely useful: it lets us compute an aggregate value for each group in our data. Sometimes, however, the built-in aggregations are not enough, and we have to define a custom function to meet our needs.

Built-in Function df.groupby().agg() in Pandas

Let us take a look at the marks of two high-school students in various subjects:

import pandas as pd

df = pd.DataFrame([
  {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
  {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
  {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
  {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
  {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
  {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
  {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

df
Student Name  Subject             Marks
Student1      Math                51
Student1      Science             82
Student2      Math                60
Student2      English Literature  45
Student1      Social Sciences     75
Student2      Social Sciences     73
Student1      Economics           35
Student Data

In the guide Grouping Data by column in a DataFrame, we have already learned how to use groupby.agg() to perform some standard operations, such as sum() and mean(). The syntax for that is simple:

df.groupby("Student Name")["Marks"].agg("sum").reset_index()
Student Name  Marks
Student1      243
Student2      178
groupby sum
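
Since sum() and mean() are both mentioned above, here is a minimal sketch of requesting the two together. It assumes the same student data as before and passes a list of aggregation names to agg(); the variable name summary is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

# Passing a list of names applies each built-in aggregation to every group
# and produces one output column per aggregation ("sum" and "mean" here).
summary = df.groupby("Student Name")["Marks"].agg(["sum", "mean"]).reset_index()
print(summary)
```

With a list of names, the result columns are named after the aggregations rather than after the original Marks column.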

Self-Defined Aggregation Function

But what if we want an aggregation that is specific to our problem, something that cannot be done with the built-in agg() functions? For such cases, Pandas allows users to define their own custom aggregation functions.

Let's say we want to check whether a student is eligible for a scholarship. To be eligible, a student must score above 50 marks in at least 3 different subjects. As you might realize, none of the built-in aggregation functions expresses this rule directly.

For this, we can write a custom aggregation function which checks for the scholarship eligibility:

def check_scholarship_eligibility(list_of_marks):
    greater_than_fifty = 0  # a variable which stores the total number of subjects with over 50 marks
    for score in list_of_marks:  # iterate over marks for all the subjects
        if score > 50:
            greater_than_fifty += 1  # count the number of subjects with more than 50 marks
    if greater_than_fifty >= 3:  # check if number of subjects with more than 50 marks is at least 3
        return "Eligible"
    else:
        return "Not Eligible"

groupby custom aggregation

Pandas passes all the values of a group to the custom aggregation function as a Series object, which can be iterated over just like a list. In this case, for Student1 the list_of_marks parameter would be a Series holding the values [51, 82, 75, 35].
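
To see these per-group values for yourself, one option is to iterate over the grouped object. The sketch below assumes the same student data as above; the names grouped and group_values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

grouped = df.groupby("Student Name")["Marks"]

# Iterating over the grouped object yields (group key, Series) pairs.
# Each Series holds the marks of one student, in their original row order.
group_values = {}
for student, marks in grouped:
    group_values[student] = marks.tolist()
    print(student, type(marks).__name__, marks.tolist())
```

Each Series printed here contains the values that a custom aggregation function would receive for that student.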

Using your custom aggregation function is straightforward: pass the function itself (without calling it) to agg():

df_eligibility = df.groupby("Student Name")["Marks"].agg(check_scholarship_eligibility).reset_index()
df_eligibility = df_eligibility.rename(columns={"Marks": "Scholarship"}) # rename() returns a new DataFrame, so assign it back
df_eligibility
Student Name  Scholarship
Student1      Eligible
Student2      Not Eligible
Using our custom aggregation function
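
Because each group arrives as a Series, the counting loop can also be written with a boolean comparison. The following is a sketch of one possible alternative, not the version used in this guide: the function name check_eligibility_vectorized is hypothetical, and the output column is named directly with pandas named aggregation instead of a separate rename() step.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def check_eligibility_vectorized(marks):
    # Hypothetical helper: (marks > 50) is a boolean Series, and summing it
    # counts how many subjects have more than 50 marks.
    passed = (marks > 50).sum()
    return "Eligible" if passed >= 3 else "Not Eligible"

# Named aggregation: output_column=(input_column, function)
result = (
    df.groupby("Student Name")
      .agg(Scholarship=("Marks", check_eligibility_vectorized))
      .reset_index()
)
print(result)
```

Both approaches apply the same rule; the loop version spells out each step, while this one leans on Series operations.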