Sorting a DataFrame Using Pandas
The sort_values
method of a DataFrame is used to sort a DataFrame by the data in a column.
To explore sorting, we'll use a DataFrame of five cities, across three states, with various populations:
import pandas as pd
# Creates a DataFrame of "city", "state", and "population" columns:
df = pd.DataFrame( [{"city": "Los Angeles", "state": "CA", "population": 3898747},
{"city": "Urbana", "state": "IL", "population": 42461},
{"city": "New York", "state": "NY", "population": 8804190},
{"city": "Washington D.C.", "population": 689545}, # Note: No "state" for D.C.
{"city": "Chicago", "state": "IL", "population": 2746388}] )
city | state | population | |
---|---|---|---|
0 | Los Angeles | CA | 3898747 |
1 | Urbana | IL | 42461 |
2 | New York | NY | 8804190 |
3 | Washington D.C. | NaN | 689545 |
4 | Chicago | IL | 2746388 |
Sorting by a Single Column
Usually, we need to sort data by the values contained within a single column. We can provide the column name and pandas will return a new DataFrame:
df.sort_values("population")
city | state | population | |
---|---|---|---|
1 | Urbana | IL | 42461 |
3 | Washington D.C. | NaN | 689545 |
4 | Chicago | IL | 2746388 |
0 | Los Angeles | CA | 3898747 |
2 | New York | NY | 8804190 |
Sorting by Multiple Columns
A list of column names can be provided to sort by multiple columns. Your DataFrame will be sorted by the first column first and, when multiple rows have the same value for the first column, those rows are then sorted by their second column.
df.sort_values(["state", "population"])
# => Sorts by "state" first,
# then "population" for rows with identical "state" values.
city | state | population | |
---|---|---|---|
0 | Los Angeles | CA | 3898747 |
1 | Urbana | IL | 42461 |
4 | Chicago | IL | 2746388 |
2 | New York | NY | 8804190 |
3 | Washington D.C. | NaN | 689545 |
Sorting in Reverse Order (Descending Order)
By default, the ascending
parameter is True
indicating that the sorting will be done in ascending order (smallest value first, largest value last). The ascending
parameter can be set to False
to sort in descending order.
df.sort_values("population", ascending=False)
# => Sorts in reverse (descending) order
city | state | population | |
---|---|---|---|
2 | New York | NY | 8804190 |
0 | Los Angeles | CA | 3898747 |
4 | Chicago | IL | 2746388 |
3 | Washington D.C. | NaN | 689545 |
1 | Urbana | IL | 42461 |
Using a Stable Sort
When a sort is a stable, the sort guarantees to not change the order among rows with identical values. For example, our initial dataset has "Urbana"
listed before "Chicago"
. A stable sort that sorts the list only on state
will keep "Urbana"
listed before "Chicago"
since that is the way it appears in the original list.
By default, the sorting will be done as fast as possible using a quicksort algorithm. The quicksort algorithm is NOT stable. The kind
parameter can specify a "stable"
to guarantee a stable sort is used.
df.sort_values("state", kind="stable")
# => Guarantees a "stable" sort
city | state | population | |
---|---|---|---|
0 | Los Angeles | CA | 3898747 |
1 | Urbana | IL | 42461 |
4 | Chicago | IL | 2746388 |
2 | New York | NY | 8804190 |
3 | Washington D.C. | NaN | 689545 |
Location of NaN
Values
When sorting a DataFrame, missing values (NaN
) can be placed at the beginning or end of your DataFrame. By default, the na_position
parameter is set to "last"
. The na_position
can be set to "first"
to place missing values at the beginning of your DataFrame.
df.sort_values("state", na_position="first")
# => Sorts the missing values (NaN) to the first part of the DataFrame
city | state | population | |
---|---|---|---|
3 | Washington D.C. | NaN | 689545 |
0 | Los Angeles | CA | 3898747 |
1 | Urbana | IL | 42461 |
4 | Chicago | IL | 2746388 |
2 | New York | NY | 8804190 |
Pandas Documentation
The full documentation for sort_values
is available in the pandas documentation.