Handling Missing Data in Pandas


While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.

Say we were to have a DataFrame like this one, where some rows are missing data:

df = pd.DataFrame([
  {'Name': 'John', 'Year': 2025, 'Major': 'Economics'},
  {'Name': 'Jessica', 'Year': 2023},
  {'Name': 'Alex', 'Major': 'Statistics'}
])
df
NameYearMajor
0John2025.0Economics
1Jessica2023.0NaN
2AlexNaNStatistics
A DataFrame with some missing values

Default Behavior of NaN

By default Pandas will fill in missing values with NaN, which is a special value that conveys that their is no valid value for that cell (although it looks like a string, NaN is a "number" with the special meaning "Not a Number"). By default Pandas will skip over NaN values when you do most operations on a DataFrame.

This call to the mean function on the year column will return 2024, because it will skip over the NaN value in the Year column (only averaging the values of 2025 and 2023).

df['Year'].mean()
2024
The result of the mean of [2023, 2025, and NaN] is 2024, since NaN values are skipped

But other times we might want to deal with these NaN values, which we can do in several ways.

Removing NaN Values from a DataFrame

The first, and often simplest approach, is to completely eliminate data (either by row, or by column) that contains any NaN values. Alteratively, later we talk about replacing the NaN values instead.

Removing All Rows with Any NaN Values

If we simply want to remove rows with NaN values, we can use the dropna() method.

In our example this will remove the 2nd and 3rd row, since they each have a NaN value in one one of their columns, but will keep the 1st row since it was valid data at every column.

df.dropna()
NameYearMajor
0John2025.0Economics
Removing all rows with any NaN values

Removing Rows with NaN Values in Specific Columns

We might also want to require that certain columns have no NaN values (and maybe we don't care as much if other columns have NaN values). We can specify what columns we might want to drop along by specifying a subset=['col'] parameter to target a specific column.

df.dropna(subset=['Major'])
NameYearMajor
0John2025.0Economics
2AlexNaNStatistics
Removing rows with NaN values in the "Major" column, but leaving other NaN values

Here we can see that the row for "Jessica" was removed because it contained a NaN value in the Major column, while Alex (and John) were kept because they didn't have a NaN value in the Major column.

Re-generating Index Row Labels After Removing Rows

When we drop certain rows from the our DataFrame, the index label of the previous may not reflect the actual position of our label (you can see how in the example above the row indexes jump form 0 to 2, skipping 1 since that row was dropped). If we want to make sure that the indexes are still in the correct order, we can add the .reset_index(drop=True) method to relabel the indexes (the drop parameter makes sure we overwrite the old indexes with the correct ones).

df.dropna(subset=['Major']).reset_index(drop=True)
NameYearMajor
0John2025.0Economics
1AlexNaNStatistics
Removing rows with NaN values in Major

Removing Columns with NaN Values

If we wanted to remove any columns with NaN values, we can use the dropna(axis=1) method. This may be helpful if we don't want to consider attributes for each entry if they may be missing for one of more of the entires.

df.dropna(axis=1)
Name
0John
1Jessica
2Alex
Removing columns with NaN values

Since all the columns except for the Name column have at least one NaN value in them, only the Name column will remain in the DataFrame.

Replacing NaN Values with Data

If we want to replace NaN with data, we can use the fillna() method. This method will fill in the NaN values with a specified value.

df.fillna(':(')
NameYearMajor
0John2025.0Economics
1Jessica2023.0:(
2Alex:(Statistics
Replacing NaN values with a frowny face

Realistically, we would want to fill in the NaN values with a value that makes sense. This might include replacing NaN values in the Year column with the average Year among valid entries, and "Not Known" for NaN values in the Major column.

# Replace NaN values in the "Year" column with the mean "Year" of the DataFrame when NaN
df['Year'] = df['Year'].fillna( df['Year'].mean() )

# Replace NaN values in the "Major" column with "Not Known" when NaN
df['Major'] = df['Major'].fillna('Not Known')

# Display the new DataFrame
df
NameYearMajor
0John2025.0Economics
1Jessica2023.0Not Known
2Alex2024.0Statistics
Replacing NaN values with more reasonable data