Handling Missing Data in Pandas
While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.
Say we were to have a DataFrame like this one, where some rows are missing data:
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
| 1 | Jessica | 2023.0 | NaN |
| 2 | Alex | NaN | Statistics |
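A DataFrame like this could be built as follows (the variable name `df` is our choice for the examples; `np.nan` marks the missing cells):

```python
import numpy as np
import pandas as pd

# Sample DataFrame with two missing cells, marked with np.nan
df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})
print(df)
```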
Default Behavior of NaN
By default Pandas will fill in missing values with NaN, a special value that conveys that there is no valid value for that cell (although it looks like a string, NaN is a "number" with the special meaning "Not a Number"). Pandas will also skip over NaN values by default in most operations on a DataFrame.
For example, calling the mean function on the Year column will return 2024, because Pandas skips over the NaN value in that column (averaging only 2025 and 2023).
2024
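A minimal sketch of that call, assuming the DataFrame above is stored in a variable named `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# mean() skips NaN by default (skipna=True), so only 2025 and 2023 are averaged
print(df['Year'].mean())  # 2024.0
```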
But other times we might want to deal with these NaN values, which we can do in several ways.
Removing NaN Values from a DataFrame
The first, and often simplest, approach is to completely eliminate data (either by row or by column) that contains any NaN values. Alternatively, we can replace the NaN values instead, which we discuss later.
Removing All Rows with Any NaN Values
If we simply want to remove rows with NaN values, we can use the dropna() method.
In our example this will remove the 2nd and 3rd rows, since they each have a NaN value in one of their columns, but will keep the 1st row since it has valid data in every column.
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
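Assuming the DataFrame is stored in `df`, that could look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# dropna() returns a new DataFrame with every row containing a NaN removed
cleaned = df.dropna()
print(cleaned)
```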
Removing Rows with NaN Values in Specific Columns
We might also want to require that certain columns have no NaN values (while not caring as much whether other columns have NaN values). We can target specific columns by passing a subset=['col'] parameter to dropna().
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
| 2 | Alex | NaN | Statistics |
Here we can see that the row for "Jessica" was removed because it contained a NaN value in the Major column, while the rows for Alex and John were kept because they had valid values in the Major column.
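A sketch of that call, again assuming the DataFrame is stored in `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# Only drop rows where the Major column is NaN; NaN in other columns is fine
cleaned = df.dropna(subset=['Major'])
print(cleaned)
```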
Re-generating Index Row Labels After Removing Rows
When we drop certain rows from our DataFrame, the index labels of the remaining rows may no longer reflect their actual positions (you can see how, in the example above, the row indexes jump from 0 to 2, skipping 1 since that row was dropped). If we want the indexes to be sequential again, we can chain the .reset_index(drop=True) method to relabel them (the drop parameter makes sure we overwrite the old indexes rather than keeping them around as a new column).
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
| 1 | Alex | NaN | Statistics |
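Chaining the two calls on the same assumed `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# Drop rows with a NaN Major, then relabel the index 0, 1, 2, ...
# drop=True discards the old index instead of keeping it as a column
cleaned = df.dropna(subset=['Major']).reset_index(drop=True)
print(cleaned)
```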
Removing Columns with NaN Values
If we want to remove any columns with NaN values, we can use the dropna(axis=1) method. This may be helpful if we don't want to consider an attribute at all when it is missing for one or more of the entries.
| | Name |
|---|---|
| 0 | John |
| 1 | Jessica |
| 2 | Alex |
Since all the columns except for the Name column have at least one NaN value in them, only the Name column will remain in the DataFrame.
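On the assumed `df`, that looks like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# axis=1 drops columns (rather than rows) that contain any NaN value
names_only = df.dropna(axis=1)
print(names_only)
```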
Replacing NaN Values with Data
If we want to replace NaN with data, we can use the fillna() method. This method will fill in the NaN values with a specified value.
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
| 1 | Jessica | 2023.0 | :( |
| 2 | Alex | :( | Statistics |
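The table above could be produced like this, assuming `df` holds the original DataFrame (the ":(" placeholder is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# fillna() returns a new DataFrame with every NaN replaced by the given value
filled = df.fillna(':(')
print(filled)
```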
Realistically, we would want to fill in the NaN values with a value that makes sense. This might include replacing NaN values in the Year column with the average Year among valid entries, and "Not Known" for NaN values in the Major column.
| | Name | Year | Major |
|---|---|---|---|
| 0 | John | 2025.0 | Economics |
| 1 | Jessica | 2023.0 | Not Known |
| 2 | Alex | 2024.0 | Statistics |
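One way to sketch this per-column fill is to pass fillna() a dictionary mapping column names to fill values (the Year mean here is 2024.0, the average of the two valid years):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jessica', 'Alex'],
    'Year': [2025.0, 2023.0, np.nan],
    'Major': ['Economics', np.nan, 'Statistics'],
})

# Fill each column's NaNs with a value that makes sense for that column:
# the mean year for Year, and a "Not Known" placeholder for Major
filled = df.fillna({'Year': df['Year'].mean(), 'Major': 'Not Known'})
print(filled)
```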