Combining DataFrames by Joining


pandas has a feature called join that allows you to join in columns from one or more other DataFrames by either their index or on a "key" column that you specify.

Say we have two DataFrames with contents as follows:

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Howard', 'Favorite Color': 'Green', 'Show': 'BCS'}
])
df2 = pd.DataFrame([
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Jesse', 'Favorite Color': 'Maroon', 'Show': 'BB'},
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'}
])
Our example DataFrames

Our data between the two columns overlap, so we might be inclined to combine the two DataFrames. The join function in pandas allows us to do this and specify how exactly we want this join to occur.

Using Join

The simplest operation is too preform a standard join (which is a left join), which will simply tack on the values of the second DataFrame to the right of the first one. By default this will join on matching indexes.

df1.join(df2, rsuffix='(right)')
Name(right)Favorite Color(right)Show(right)NameFavorite ColorShow
0SaulMaroonBCSSaulMaroonBCS
1WalterBlueBBWalterBlueBB
2KimRedBCSKimRedBCS
3HowardGreenBCSJesseMaroonBB
Preforming a default join

Here we also specify a rsuffix since our DataFrames have matching column names. This makes sure that all of our columns have unique names by adding a suffix to the right DataFrame. (you can also specify a lsuffix to append to column names on the left side)

Sometimes, you might want to join on something different that just the index, so we can also specify a "key" column to join on like this (we'll also specify both a left and a right suffix for we can tell where data is coming from):

df1.join(df2.set_index('Name'), on='Name', rsuffix='(right)', lsuffix='(left)')
NameFavorite Color(left)Show(left)Favorite Color(right)Show(right)
0SaulMaroonBCSMaroonBCS
1WalterBlueBBBlueBB
2KimRedBCSRedBCS
3HowardGreenBCSNaNNaN
Joining on a custom column

⬆ We'll need to set the index of the right DataFrame to the Name column to allow for the matching to occur.

Note: You'll also see that is data that doesn't exist prior to the join is populated with NaN, which is something you may have to deal with in order to use your DataFrame properly

Types of Joins

You might notice that in the after the previous join command, we have rows with "Kim", "Saul", and "Walter", who appear is both DataFrames, and a row with "Howard" who appears in the left DataFrame. However "Jesse", who only appears in the right DataFrame is dropped. This is because by default the join command in pandas will perform a left join (i.e. keep every value in the left and the intersection).

There are 4 types of joins, which can be visualized using a venn diagram:



So if we wanted a join that kept every value in the right DataFrame and discarded any value not in the right DataFrame (including its intersection with the left DataFrame) we can specify a right join like this:

df1.join(df2.set_index('Name'), on='Name', how='right', rsuffix='(right)', lsuffix='(left)')
NameFavorite Color(left)Show(left)Favorite Color(right)Show(right)
0.0SaulMaroonBCSMaroonBCS
1.0WalterBlueBBBlueBB
2.0KimRedBCSRedBCS
NaNJesseNaNNaNMaroonBB
Right join

The inner join will only join on the overlap between the two DataFrames:

df1.join(df2.set_index('Name'), on='Name', how='inner', rsuffix='(right)', lsuffix='(left)')
NameFavorite Color(left)Show(left)Favorite Color(right)Show(right)
0SaulMaroonBCSMaroonBCS
1WalterBlueBBBlueBB
2KimRedBCSRedBCS
Inner join

The outer join will include all data in both:

df1.join(df2.set_index('Name'), on='Name', how='outer', rsuffix='(right)', lsuffix='(left)')
NameFavorite Color(left)Show(left)Favorite Color(right)Show(right)
0.0SaulMaroonBCSMaroonBCS
1.0WalterBlueBBBlueBB
2.0KimRedBCSRedBCS
3.0HowardGreenBCSNaNNaN
NaNJesseNaNNaNMaroonBB
Outer join

Another useful property of join is that if there are multiple matches in a key column, it will return the combinations:

df1.join(df2.set_index('Show'), on='Show', how='outer', rsuffix='(right)', lsuffix='(left)')
Name(left)Favorite Color(left)ShowName(right)Favorite Color(right)
0SaulMaroonBCSSaulMaroon
0SaulMaroonBCSKimRed
2KimRedBCSSaulMaroon
2KimRedBCSKimRed
3HowardGreenBCSSaulMaroon
3HowardGreenBCSKimRed
1WalterBlueBBWalterBlue
1WalterBlueBBJesseMaroon
Multiple matches

Here we'll get rows of every combination of matching show with the other DataFrame.

You can read more about join in the pandas documentation here: pandas documentation.