Handling Missing Data in Pandas

While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.

Say we were to have a DataFrame like this one, where some rows are missing data:

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics
    
    
      1
      Jessica
      2023.0
      NaN
    
    
      2
      Alex
      NaN
      Statistics

	Name	Year	Major
0	John	2025.0	Economics
1	Jessica	2023.0	NaN
2	Alex	NaN	Statistics

Default Behavior of `NaN`

By default Pandas will fill in missing values with NaN, which is a special value that conveys that their is no valid value for that cell (although it looks like a string, NaN is a "number" with the special meaning "Not a Number"). By default Pandas will skip over NaN values when you do most operations on a DataFrame.

This call to the mean function on the year column will return 2024, because it will skip over the NaN value in the Year column (only averaging the values of 2025 and 2023).

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf['Year'].mean()

Reset Code Python Output:

But other times we might want to deal with these NaN values, which we can do in several ways.

Removing `NaN` Values from a DataFrame

The first, and often simplest approach, is to completely eliminate data (either by row, or by column) that contains any NaN values. Alteratively, later we talk about replacing the NaN values instead.

Removing All Rows with Any `NaN` Values

If we simply want to remove rows with NaN values, we can use the dropna() method.

In our example this will remove the 2nd and 3rd row, since they each have a NaN value in one one of their columns, but will keep the 1st row since it was valid data at every column.

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf.dropna()

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics

	Name	Year	Major
0	John	2025.0	Economics

Removing Rows with `NaN` Values in Specific Columns

We might also want to require that certain columns have no NaN values (and maybe we don't care as much if other columns have NaN values). We can specify what columns we might want to drop along by specifying a subset=['col'] parameter to target a specific column.

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf.dropna(subset=['Major'])

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics
    
    
      2
      Alex
      NaN
      Statistics

	Name	Year	Major
0	John	2025.0	Economics
2	Alex	NaN	Statistics

Here we can see that the row for "Jessica" was removed because it contained a NaN value in the Major column, while Alex (and John) were kept because they didn't have a NaN value in the Major column.

Re-generating Index Row Labels After Removing Rows

When we drop certain rows from the our DataFrame, the index label of the previous may not reflect the actual position of our label (you can see how in the example above the row indexes jump form 0 to 2, skipping 1 since that row was dropped). If we want to make sure that the indexes are still in the correct order, we can add the .reset_index(drop=True) method to relabel the indexes (the drop parameter makes sure we overwrite the old indexes with the correct ones).

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf.dropna(subset=['Major']).reset_index(drop=True)

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics
    
    
      1
      Alex
      NaN
      Statistics

	Name	Year	Major
0	John	2025.0	Economics
1	Alex	NaN	Statistics

Removing Columns with `NaN` Values

If we wanted to remove any columns with NaN values, we can use the dropna(axis=1) method. This may be helpful if we don't want to consider attributes for each entry if they may be missing for one of more of the entires.

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf.dropna(axis=1)

Reset Code Python Output:

	Name
0	John
1	Jessica
2	Alex

Since all the columns except for the Name column have at least one NaN value in them, only the Name column will remain in the DataFrame.

Replacing `NaN` Values with Data

If we want to replace NaN with data, we can use the fillna() method. This method will fill in the NaN values with a specified value.

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\ndf.fillna(':(')

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics
    
    
      1
      Jessica
      2023.0
      :(
    
    
      2
      Alex
      :(
      Statistics

	Name	Year	Major
0	John	2025.0	Economics
1	Jessica	2023.0	:(
2	Alex	:(	Statistics

Realistically, we would want to fill in the NaN values with a value that makes sense. This might include replacing NaN values in the Year column with the average Year among valid entries, and "Not Known" for NaN values in the Major column.

import pandas as pd\ndf = pd.DataFrame([\n  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},\n  {'Name': 'Jessica', 'Year': 2023},\n  {'Name': 'Alex', 'Major': 'Statistics'}\n])\n# Replace NaN values in the "Year" column with the mean "Year" of the DataFrame when NaN\ndf['Year'] = df['Year'].fillna( df['Year'].mean() )\n&nbsp;\n# Replace NaN values in the "Major" column with "Not Known" when NaN\ndf['Major'] = df['Major'].fillna('Not Known')\n&nbsp;\n# Display the new DataFrame\ndf

Reset Code Python Output:


  
    
      
      Name
      Year
      Major
    
  
  
    
      0
      John
      2025.0
      Economics
    
    
      1
      Jessica
      2023.0
      Not Known
    
    
      2
      Alex
      2024.0
      Statistics