Removing Duplicates from a DataFrame using pandas


In a DataFrame, rows that appear multiple times are referred to as duplicate rows.

To explore duplicate rows, let's consider a DataFrame of 8 rows covering 7 different drinks, each described by a few attributes and nutrition facts. The coffee row appears twice.
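A minimal sketch of how such a DataFrame could be built, assuming the column names and values shown in the output for this section. The variable name df follows the text later on; the dictionary layout is just one convenient way to write it.

```python
import pandas as pd

# Values copied from the table in this section; row 2 repeats row 1 (coffee).
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

print(df)
```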

Output:
      drink  carbonated?  temperature  sugar(tsp)  calories
0      soda         True         cold        10.5       150
1    coffee        False          hot         3.0        31
2    coffee        False          hot         3.0        31
3  smoothie        False         cold         6.0        85
4     water        False         cold         0.0         0
5       tea        False          hot         2.0        43
6  lemonade        False         cold         9.5       125
7    slushy        False         cold         8.0        99

Discovering Duplicates

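To find duplicates in a DataFrame, we can use the duplicated method. It returns a boolean Series with one entry per row: True if the row repeats an earlier row, False otherwise. By default, the first occurrence of a repeated row is marked False and only the later copies are marked True.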
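A minimal sketch of this step, assuming df is the drinks DataFrame from above (rebuilt here so the snippet runs on its own). The name dupes is only illustrative.

```python
import pandas as pd

# Same drinks data as above; row 2 is an exact copy of row 1.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# One boolean per row: True where the row repeats an earlier row.
dupes = df.duplicated()
print(dupes)

# Boolean indexing with the mask shows the duplicate rows themselves.
print(df[dupes])
```

Using the mask to index df is a quick way to inspect the duplicates before deciding whether to remove them.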

Output:
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
dtype: bool

Removing Duplicates

Once you've found your duplicate rows, you may want to remove them. Below, we will explore two options for removal:

Option #1: drop_duplicates

The example below removes duplicates from the DataFrame df using drop_duplicates and saves the result to a new DataFrame, df1. The original df is left unchanged.
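A minimal sketch of Option #1, assuming df is the drinks DataFrame from above (rebuilt here so the snippet runs on its own).

```python
import pandas as pd

# Same drinks data as above; row 2 is an exact copy of row 1.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# drop_duplicates returns a new DataFrame; df itself is not modified.
df1 = df.drop_duplicates()

print(df1)
print(len(df), len(df1))  # 8 7
```

Note that df1 keeps the original index labels, so label 2 is simply absent from its index.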


Option #2: drop_duplicates(inplace=True)

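The example below removes duplicates from the DataFrame df in place using the inplace parameter of drop_duplicates. With inplace=True, the method modifies df directly and returns None, so there is no new DataFrame to assign.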
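A minimal sketch of Option #2, again assuming df is the drinks DataFrame from above (rebuilt here so the snippet runs on its own). The name result is only there to show the return value.

```python
import pandas as pd

# Same drinks data as above; row 2 is an exact copy of row 1.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# With inplace=True the call changes df itself and returns None.
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 7
```

Because the call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None; use either the assignment form from Option #1 or the in-place form, not both.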


Documentation

To learn more, see the pandas documentation for DataFrame.duplicated and DataFrame.drop_duplicates.