Removing Duplicates from a DataFrame using pandas


In a DataFrame, duplicate rows are rows whose values match those of another row in every column.

To explore the concept of duplicate rows, let's consider a DataFrame describing 7 different drinks, each characterized by several attributes and nutrition facts. The DataFrame has 8 rows because one drink, coffee, is entered twice.

import pandas as pd

# Create a DataFrame with "drink", "carbonated?", "temperature", "sugar(tsp)", and "calories" columns:
df = pd.DataFrame([
  {'drink': 'soda', 'carbonated?': True, 'temperature': 'cold', 'sugar(tsp)': 10.5, 'calories': 150},
  {'drink': 'coffee', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 3, 'calories': 31},
  {'drink': 'coffee', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 3, 'calories': 31}, 
  {'drink': 'smoothie', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 6, 'calories': 85}, 
  {'drink': 'water', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 0, 'calories': 0},
  {'drink': 'tea', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 2, 'calories': 43}, 
  {'drink': 'lemonade', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 9.5, 'calories': 125},
  {'drink': 'slushy', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 8, 'calories': 99}])
df
      drink  carbonated?  temperature  sugar(tsp)  calories
0      soda         True         cold        10.5       150
1    coffee        False          hot         3.0        31
2    coffee        False          hot         3.0        31
3  smoothie        False         cold         6.0        85
4     water        False         cold         0.0         0
5       tea        False          hot         2.0        43
6  lemonade        False         cold         9.5       125
7    slushy        False         cold         8.0        99
Creating a DataFrame with Duplicate Rows

Discovering Duplicates

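To find duplicates in a DataFrame, we can use the duplicated method. It returns a boolean Series with one entry per row: True marks a row that repeats an earlier row, while the first occurrence of that row stays False. Here, only row 2 (the second coffee entry) is flagged.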

df.duplicated()
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool
Discovering Duplicate Rows
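
The boolean Series returned by duplicated can also be used as a mask to view the duplicate rows themselves. The following is a minimal sketch, assuming a smaller stand-in DataFrame named drinks (a name chosen here for illustration, with values copied from the example above). It also shows the keep parameter, which controls which copies are flagged.

```python
import pandas as pd

# Smaller stand-in for the drinks DataFrame (hypothetical name "drinks").
drinks = pd.DataFrame([
    {'drink': 'soda', 'temperature': 'cold', 'calories': 150},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'tea', 'temperature': 'hot', 'calories': 43},
])

# Default keep='first': only the later copy of a repeated row is flagged.
repeats = drinks[drinks.duplicated()]

# keep=False: every copy of a repeated row is flagged, including the first.
all_copies = drinks[drinks.duplicated(keep=False)]

# Summing the boolean Series counts the flagged rows.
n_repeats = int(drinks.duplicated().sum())

print(repeats)
print(all_copies)
print(n_repeats)
```

Viewing the flagged rows before removing anything is a cheap way to confirm that they really are unwanted repeats.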

Removing Duplicates

Once you've found your duplicate rows, you may want to remove them. Below, we will explore two options for removal:

Option #1: drop_duplicates

The example below uses drop_duplicates to return a new DataFrame, df1, with the duplicate rows removed. The original DataFrame, df, is left unchanged.

df1 = df.drop_duplicates()
Removing Duplicate Rows
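
drop_duplicates keeps the original index labels of the surviving rows, so the result can have gaps in its index. The sketch below, again using a hypothetical stand-in DataFrame named drinks, shows this along with three optional parameters of drop_duplicates: ignore_index (available in pandas 1.0 and later), subset, and keep. Treat it as an illustration under those assumptions rather than a recommended recipe.

```python
import pandas as pd

# Smaller stand-in for the drinks DataFrame (hypothetical name "drinks").
drinks = pd.DataFrame([
    {'drink': 'soda', 'temperature': 'cold', 'calories': 150},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'tea', 'temperature': 'hot', 'calories': 43},
])

# Default behavior: row 2 is dropped and the index keeps its gap (0, 1, 3).
deduped = drinks.drop_duplicates()

# ignore_index=True renumbers the result from 0 (requires pandas >= 1.0).
renumbered = drinks.drop_duplicates(ignore_index=True)

# subset limits the comparison to the listed columns;
# keep='last' retains the last row of each group instead of the first.
one_per_temperature = drinks.drop_duplicates(subset=['temperature'], keep='last')

print(deduped)
print(renumbered)
print(one_per_temperature)
```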

Option #2: drop_duplicates(inplace = True)

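The example below removes duplicates from df in place using the inplace parameter of drop_duplicates. With inplace=True, the method modifies df directly and returns None, so its result should not be assigned to a variable.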

df.drop_duplicates(inplace = True)
Removing Duplicate Rows In-Place
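
A common mistake with the in-place form is assigning its return value. The following minimal sketch, using a hypothetical stand-in DataFrame named drinks, shows that the call returns None while the DataFrame itself is changed.

```python
import pandas as pd

# Smaller stand-in for the drinks DataFrame (hypothetical name "drinks").
drinks = pd.DataFrame([
    {'drink': 'soda', 'temperature': 'cold', 'calories': 150},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'coffee', 'temperature': 'hot', 'calories': 31},
    {'drink': 'tea', 'temperature': 'hot', 'calories': 43},
])

# With inplace=True the method returns None and modifies drinks directly.
result = drinks.drop_duplicates(inplace=True)

print(result)       # the return value of the in-place call
print(len(drinks))  # number of rows left after removing the repeat
```

Writing drinks = drinks.drop_duplicates(inplace=True) would therefore replace the DataFrame with None; either reassign without inplace, or use inplace without reassigning.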

Documentation

To learn more, see the pandas documentation for DataFrame.duplicated and DataFrame.drop_duplicates.