Removing Duplicates from a DataFrame using pandas
In a DataFrame, rows that appear multiple times are referred to as duplicate rows.
To explore the concept of duplicate rows, let's consider a DataFrame with 8 rows describing 7 different drinks (coffee appears twice), each characterized by a few attributes and nutrition facts.
| | drink | carbonated? | temperature | sugar(tsp) | calories |
|---|---|---|---|---|---|
| 0 | soda | True | cold | 10.5 | 150 |
| 1 | coffee | False | hot | 3.0 | 31 |
| 2 | coffee | False | hot | 3.0 | 31 |
| 3 | smoothie | False | cold | 6.0 | 85 |
| 4 | water | False | cold | 0.0 | 0 |
| 5 | tea | False | hot | 2.0 | 43 |
| 6 | lemonade | False | cold | 9.5 | 125 |
| 7 | slushy | False | cold | 8.0 | 99 |
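For readers who want to follow along, the table above can be rebuilt with a sketch like the following. The values and column names are copied from the table; the variable name df matches the one used later in this article, and building it from a dictionary is only one convenient option, not necessarily how the original data was loaded.

```python
import pandas as pd

# Values copied from the table above; row 2 repeats row 1 exactly.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

print(df.shape)  # (8, 5): eight rows, five columns
```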
Discovering Duplicates
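To find duplicates in a DataFrame, we can use the duplicated method. It returns a boolean Series with one entry per row: True marks a row that repeats an earlier row, and False marks everything else. By default the first occurrence is not flagged, so only the second coffee row (index 2) is True.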
| | |
|---|---|
| 0 | False |
| 1 | False |
| 2 | True |
| 3 | False |
| 4 | False |
| 5 | False |
| 6 | False |
| 7 | False |
| dtype: | bool |
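The output above could be produced with something like the sketch below, which assumes the drinks data is held in a DataFrame named df. The names mask and all_copies are illustrative only. The keep=False variant is an extra that the article does not cover: it flags every member of a duplicate group rather than only the later copies.

```python
import pandas as pd

# Drinks data copied from the table earlier in the article.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# Boolean Series: True where a row repeats an earlier row.
mask = df.duplicated()
print(mask)

# Use the mask to look at the flagged rows themselves.
print(df[mask])

# keep=False marks both the original and its copies.
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```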
Removing Duplicates
Once you've found your duplicate rows, you may want to remove them. Below, we will explore two options for removal:
Option #1: drop_duplicates
Calling drop_duplicates on the DataFrame df returns a new DataFrame with the duplicate rows removed, which we save as df1. The original df is left unchanged.
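A minimal sketch of this option, assuming the drinks data from earlier is stored in df:

```python
import pandas as pd

# Drinks data copied from the table earlier in the article.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# Returns a new DataFrame; the first copy of each duplicated row is kept.
df1 = df.drop_duplicates()

print(len(df))   # 8: the original still contains the duplicate
print(len(df1))  # 7: the second coffee row is gone
```

Note that df1 keeps the original index labels, so label 2 is simply missing. If a fresh 0-based index is preferred, df1.reset_index(drop=True) is one way to get it.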
Option #2: drop_duplicates(inplace = True)
Passing inplace=True to drop_duplicates removes the duplicates from the DataFrame df itself instead of returning a new DataFrame. In this mode the call returns None, so its result should not be assigned back to df.
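A minimal sketch of the in-place variant, again assuming the drinks data is stored in df. The variable result is only there to illustrate the return value.

```python
import pandas as pd

# Drinks data copied from the table earlier in the article.
df = pd.DataFrame({
    "drink": ["soda", "coffee", "coffee", "smoothie",
              "water", "tea", "lemonade", "slushy"],
    "carbonated?": [True, False, False, False, False, False, False, False],
    "temperature": ["cold", "hot", "hot", "cold", "cold", "hot", "cold", "cold"],
    "sugar(tsp)": [10.5, 3.0, 3.0, 6.0, 0.0, 2.0, 9.5, 8.0],
    "calories": [150, 31, 31, 85, 0, 43, 125, 99],
})

# Modifies df directly and returns None.
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 7: df itself no longer contains the duplicate
```

Both options leave the same rows behind. The first keeps the original data around for comparison, while the second avoids holding a second variable.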
Documentation
To learn more, check out the pandas documentation for DataFrame.duplicated and DataFrame.drop_duplicates.