Defining Your Own Aggregation Function in Pandas

Introduction

The pandas df.groupby().agg() method computes an aggregate value, such as a sum or a mean, for each group in our data. Sometimes, however, the built-in aggregations are not enough, and we have to define a custom function ourselves.

Built-in Function `df.groupby().agg()` in Pandas

Let us take a look at the marks of two high-school students in various subjects:

import pandas as pd

df = pd.DataFrame([
  {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
  {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
  {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
  {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
  {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
  {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
  {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

df

Output:

Student Name	Subject	Marks
Student1	Math	51
Student1	Science	82
Student2	Math	60
Student2	English Literature	45
Student1	Social Sciences	75
Student2	Social Sciences	73
Student1	Economics	35

In the guide Grouping Data by column in a DataFrame, we have already learned how to use groupby.agg() to perform some standard operations, such as sum() and mean(). The syntax for that is simple:

import pandas as pd

df = pd.DataFrame([
  {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
  {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
  {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
  {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
  {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
  {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
  {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

# Total marks per student
df.groupby("Student Name")["Marks"].agg("sum").reset_index()

Output:

Student Name	Marks
Student1	243
Student2	178
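As a brief illustration of the built-in aggregations, the sketch below passes a list of function names to agg() so that several summaries are computed in one call. The variable name summary is only illustrative and does not come from the guide.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

# Passing a list of names yields one output column per aggregation
summary = df.groupby("Student Name")["Marks"].agg(["sum", "mean", "max"])
print(summary)
```

With a list, each resulting column is named after the aggregation that produced it, and the student names form the index.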

Self-Defined Aggregation Function

But what if we need an aggregation that is specific to our problem and cannot be expressed with the built-in agg() functions? For such cases, pandas lets users define their own custom aggregation functions.

Let's say we want to check whether a student is eligible for a scholarship. To be eligible, a student must score above 50 marks in at least 3 different subjects. None of the built-in aggregation functions can express this rule directly.

Instead, we can write a custom aggregation function that checks for scholarship eligibility:

def check_scholarship_eligibility(list_of_marks):
    greater_than_fifty = 0  # number of subjects with more than 50 marks
    for score in list_of_marks:  # iterate over the marks for all subjects
        if score > 50:
            greater_than_fifty += 1
    # eligible only if at least 3 subjects have more than 50 marks
    if greater_than_fifty >= 3:
        return "Eligible"
    else:
        return "Not Eligible"

For each group, pandas passes the group's values to the custom aggregation function as a Series object, which can be iterated over just like a list. Here, for Student1, the list_of_marks parameter holds the values [51, 82, 75, 35].
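To make this concrete, here is a small sketch that reports what the function receives for each group. The helper describe_group is a hypothetical name introduced only for this illustration; it returns the type of its argument together with the values it contains.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def describe_group(values):
    # Hypothetical helper: report the argument's type and its contents
    return f"{type(values).__name__}: {values.tolist()}"

described = df.groupby("Student Name")["Marks"].agg(describe_group)
print(described)
```

Each entry of described is a string naming the type followed by that student's marks in their original row order.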

Using the custom aggregation function is straightforward: pass the function itself, without calling it, to agg():

import pandas as pd

df = pd.DataFrame([
  {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
  {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
  {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
  {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
  {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
  {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
  {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def check_scholarship_eligibility(list_of_marks):
    greater_than_fifty = 0  # number of subjects with more than 50 marks
    for score in list_of_marks:  # iterate over the marks for all subjects
        if score > 50:
            greater_than_fifty += 1
    # eligible only if at least 3 subjects have more than 50 marks
    if greater_than_fifty >= 3:
        return "Eligible"
    else:
        return "Not Eligible"

df_eligibility = df.groupby("Student Name")["Marks"].agg(check_scholarship_eligibility).reset_index()
df_eligibility.rename(columns={"Marks": "Scholarship"})  # rename the column to something informative

Output:

Student Name	Scholarship
Student1	Eligible
Student2	Not Eligible
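As a possible variation, and not the approach used in this guide, the same check can be written without an explicit loop by comparing the whole Series at once, and the output column can be named directly through named aggregation instead of a separate rename(). The function name is_scholarship_eligible and its pass_mark and min_subjects parameters are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame([
    {"Student Name": "Student1", "Subject": "Math", "Marks": 51},
    {"Student Name": "Student1", "Subject": "Science", "Marks": 82},
    {"Student Name": "Student2", "Subject": "Math", "Marks": 60},
    {"Student Name": "Student2", "Subject": "English Literature", "Marks": 45},
    {"Student Name": "Student1", "Subject": "Social Sciences", "Marks": 75},
    {"Student Name": "Student2", "Subject": "Social Sciences", "Marks": 73},
    {"Student Name": "Student1", "Subject": "Economics", "Marks": 35},
])

def is_scholarship_eligible(marks, pass_mark=50, min_subjects=3):
    # Hypothetical variant: marks > pass_mark is a boolean Series,
    # and sum() counts how many of its entries are True
    passed = (marks > pass_mark).sum()
    return "Eligible" if passed >= min_subjects else "Not Eligible"

# Named aggregation: output column name = (input column, function)
result = (
    df.groupby("Student Name")
      .agg(Scholarship=("Marks", is_scholarship_eligible))
      .reset_index()
)
print(result)
```

The explicit loop in the original function is easier to follow for newcomers; the vectorized comparison is shorter and keeps the threshold values configurable.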