Handling Missing Data in Pandas
While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.
Say we were to have a DataFrame like this one, where some rows are missing data:
df = pd.DataFrame([
{'Name': 'John', 'Year': 2025, 'Major': 'Economics'},
{'Name': 'Jessica', 'Year': 2023},
{'Name': 'Alex', 'Major': 'Statistics'}
])
df
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
1 | Jessica | 2023.0 | NaN |
2 | Alex | NaN | Statistics |
Default Behavior of NaN
By default Pandas will fill in missing values with NaN
, which is a special value that conveys that their is no valid value for that cell (although it looks like a string, NaN
is a "number" with the special meaning "Not a Number"). By default Pandas will skip over NaN
values when you do most operations on a DataFrame.
This call to the mean function on the year column will return 2024, because it will skip over the NaN
value in the Year
column (only averaging the values of 2025 and 2023).
But other times we might want to deal with these NaN
values, which we can do in several ways.
Removing NaN
Values from a DataFrame
The first, and often simplest approach, is to completely eliminate data (either by row, or by column) that contains any NaN
values. Alteratively, later we talk about replacing the NaN
values instead.
Removing All Rows with Any NaN
Values
If we simply want to remove rows with NaN
values, we can use the dropna()
method.
In our example this will remove the 2nd and 3rd row, since they each have a NaN
value in one one of their columns, but will keep the 1st row since it was valid data at every column.
df.dropna()
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
Removing Rows with NaN
Values in Specific Columns
We might also want to require that certain columns have no NaN
values (and maybe we don't care as much if other columns have NaN
values). We can specify what columns we might want to drop along by specifying a subset=['col']
parameter to target a specific column.
df.dropna(subset=['Major'])
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
2 | Alex | NaN | Statistics |
Here we can see that the row for "Jessica" was removed because it contained a NaN
value in the Major
column, while Alex (and John) were kept because they didn't have a NaN
value in the Major
column.
Re-generating Index Row Labels After Removing Rows
When we drop certain rows from the our DataFrame, the index label of the previous may not reflect the actual position of our label (you can see how in the example above the row indexes jump form 0 to 2, skipping 1 since that row was dropped). If we want to make sure that the indexes are still in the correct order, we can add the .reset_index(drop=True)
method to relabel the indexes (the drop parameter makes sure we overwrite the old indexes with the correct ones).
df.dropna(subset=['Major']).reset_index(drop=True)
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
1 | Alex | NaN | Statistics |
Removing Columns with NaN
Values
If we wanted to remove any columns with NaN
values, we can use the dropna(axis=1)
method. This may be helpful if we don't want to consider attributes for each entry if they may be missing for one of more of the entires.
df.dropna(axis=1)
Name | |
---|---|
0 | John |
1 | Jessica |
2 | Alex |
Since all the columns except for the Name column have at least one NaN
value in them, only the Name column will remain in the DataFrame.
Replacing NaN
Values with Data
If we want to replace NaN
with data, we can use the fillna()
method. This method will fill in the NaN
values with a specified value.
df.fillna(':(')
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
1 | Jessica | 2023.0 | :( |
2 | Alex | :( | Statistics |
Realistically, we would want to fill in the NaN
values with a value that makes sense. This might include replacing NaN
values in the Year column with the average Year among valid entries, and "Not Known" for NaN
values in the Major column.
# Replace NaN values in the "Year" column with the mean "Year" of the DataFrame when NaN
df['Year'] = df['Year'].fillna( df['Year'].mean() )
# Replace NaN values in the "Major" column with "Not Known" when NaN
df['Major'] = df['Major'].fillna('Not Known')
# Display the new DataFrame
df
Name | Year | Major | |
---|---|---|---|
0 | John | 2025.0 | Economics |
1 | Jessica | 2023.0 | Not Known |
2 | Alex | 2024.0 | Statistics |