Removing Duplicates from a DataFrame using pandas
In a DataFrame, rows that appear multiple times are referred to as duplicate rows.
To explore duplicate rows, let's consider a DataFrame with 8 rows describing 7 different drinks, each characterized by its attributes and nutrition facts. The coffee row appears twice.
import pandas as pd
# Create a DataFrame with "drink", "carbonated?", "temperature", "sugar(tsp)", and "calories" columns:
df = pd.DataFrame([
{'drink': 'soda', 'carbonated?': True, 'temperature': 'cold', 'sugar(tsp)': 10.5, 'calories': 150},
{'drink': 'coffee', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 3, 'calories': 31},
{'drink': 'coffee', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 3, 'calories': 31},
{'drink': 'smoothie', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 6, 'calories': 85},
{'drink': 'water', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 0, 'calories': 0},
{'drink': 'tea', 'carbonated?': False, 'temperature': 'hot', 'sugar(tsp)': 2, 'calories': 43},
{'drink': 'lemonade', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 9.5, 'calories': 125},
{'drink': 'slushy', 'carbonated?': False, 'temperature': 'cold', 'sugar(tsp)': 8, 'calories': 99}])
df
| | drink | carbonated? | temperature | sugar(tsp) | calories |
---|---|---|---|---|---|
0 | soda | True | cold | 10.5 | 150 |
1 | coffee | False | hot | 3.0 | 31 |
2 | coffee | False | hot | 3.0 | 31 |
3 | smoothie | False | cold | 6.0 | 85 |
4 | water | False | cold | 0.0 | 0 |
5 | tea | False | hot | 2.0 | 43 |
6 | lemonade | False | cold | 9.5 | 125 |
7 | slushy | False | cold | 8.0 | 99 |
Discovering Duplicates
To find duplicates in a DataFrame, we can use the duplicated method. It returns a boolean Series that is True for every row that repeats an earlier row; by default, the first occurrence is marked False.
df.duplicated()
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool
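The default behavior flags only the repeats that come after a first occurrence. As a minimal sketch, assuming a trimmed-down copy of the drinks data (the name `drinks` and the mask variables are illustrative, not part of the example above), the `keep` and `subset` parameters of `duplicated` change which rows get flagged:

```python
import pandas as pd

# A trimmed-down, illustrative version of the drinks table
drinks = pd.DataFrame({
    'drink': ['soda', 'coffee', 'coffee', 'tea'],
    'temperature': ['cold', 'hot', 'hot', 'hot'],
    'calories': [150, 31, 31, 43],
})

# Default (keep='first'): only the second coffee row is flagged
first_mask = drinks.duplicated()

# keep=False: every row that has a twin is flagged, including the first one
all_mask = drinks.duplicated(keep=False)

# subset: compare only the listed columns when looking for repeats
temp_mask = drinks.duplicated(subset=['temperature'])

# Boolean indexing with a mask selects the duplicated rows themselves
dupes = drinks[all_mask]
print(dupes)
```

Filtering with `keep=False` can be a convenient way to inspect all copies of a repeated row side by side before deciding which one to drop.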
Removing Duplicates
Once you've found your duplicate rows, you may want to remove them. Below, we will explore two options for removal:
Option #1: drop_duplicates
The example below removes duplicates from the DataFrame df using drop_duplicates and saves the result to a new DataFrame, df1. The original df is left unchanged.
df1 = df.drop_duplicates()
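Note that `drop_duplicates` keeps the original index labels, so the result can have gaps in its index. The sketch below, which assumes a small illustrative DataFrame named `drinks` and pandas 1.0 or later (for `ignore_index`), shows a few of the method's optional parameters:

```python
import pandas as pd

# Small illustrative table; the second coffee row repeats the first
drinks = pd.DataFrame({
    'drink': ['soda', 'coffee', 'coffee', 'tea'],
    'calories': [150, 31, 31, 43],
})

# Default: keep the first occurrence; index labels are preserved (0, 1, 3)
deduped = drinks.drop_duplicates()

# ignore_index=True renumbers the result from 0 (assumes pandas >= 1.0)
renumbered = drinks.drop_duplicates(ignore_index=True)

# keep='last' retains the final occurrence instead (index labels 0, 2, 3)
last_kept = drinks.drop_duplicates(keep='last')

# subset restricts the comparison to the listed columns
by_drink = drinks.drop_duplicates(subset=['drink'])

print(deduped)
```

In each case `drinks` itself still has all four rows, because the method returns a new DataFrame unless `inplace=True` is passed.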
Option #2: drop_duplicates(inplace=True)
The example below removes duplicates from the DataFrame df in place using the inplace parameter of drop_duplicates. No new DataFrame is created; df itself is modified.
df.drop_duplicates(inplace=True)
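One detail worth keeping in mind is that the in-place call returns `None`, so assigning its result back to the variable would replace the DataFrame with `None`. A minimal sketch, using an illustrative DataFrame named `drinks`:

```python
import pandas as pd

# Small illustrative table; the second coffee row repeats the first
drinks = pd.DataFrame({
    'drink': ['soda', 'coffee', 'coffee', 'tea'],
    'calories': [150, 31, 31, 43],
})

# With inplace=True the method modifies drinks and returns None
result = drinks.drop_duplicates(inplace=True)

print(result)       # None
print(len(drinks))  # 3
```

If you prefer to work with return values, the assignment form `drinks = drinks.drop_duplicates()` gives the same final contents.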
Documentation
To learn more, check out the pandas documentation for DataFrame.duplicated and DataFrame.drop_duplicates.