Combining DataFrames by Merging


Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is merging, which combines rows together by matching DataFrames based on matching the contents of the data from columns.

Consider the following DataFrames:

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
    {'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},
])

df2 = pd.DataFrame([
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
])

Merging DataFrames With Matching Columns

The default pd.merge operation merges two DataFrames by finding matching values in all columns with identical names and creating a DataFrame with only the matches that were found.

In our example above, the Name column appears in both df1 and df2. The rows with identical data for the Name are combined and only Saul, Walter, and Kim appears since only those three people are found in both DataFrames:

Reset Code Python Output:
Name Job Age Favorite Color Show
0 Saul Lawyer 60 Maroon BCS
1 Walter Teacher 35 Blue BB
2 Kim Lawyer 42 Red BCS

Keeping Unmatched Rows

When merging, you can specify how the merging occurs. The default merge in pandas is an "inner merge" that keeps only the rows that are found in both the DataFrames.

By using how='outer' to preform an "outer merge", all rows are kept from both DataFrames and NaN values are used when no match for a specific column was found:

Reset Code Python Output:
Name Job Age Favorite Color Show
0 Saul Lawyer 60.0 Maroon BCS
1 Walter Teacher 35.0 Blue BB
2 Kim Lawyer 42.0 Red BCS
3 Gus Distributor 71.0 NaN NaN
4 Jesse NaN NaN Green BB

Merging by Specifying Specific Columns

By default, Python merges on ALL columns with identical names. If you have multiple columns with multiple names, the on parameter allows a single column (or list of columns) to be specified to merge. For example, the following DataFrames have identical records for two people with results for the same metric from two different dates:

day1 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-07", "wpm": 134},
  {"author": "Karle", "date": "2022-05-07", "wpm": 148},
])

day2 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-08", "wpm": 179},
  {"author": "Karle", "date": "2022-05-08", "wpm": 165},
])
DataFrame with two days of data, including all identical columns.

A default merge will try to match unique authors and dates, find no matches, and return an empty DataFrame:

Reset Code Python Output:
author date wpm

By specifying a merge on='author':

  • The merge is done only using the author column, and
  • All identical column names will be updated; the first (left-most) DataFrame provided to pd.merge will have all columns names appended with _x (ex: date becomes date_x and wpm becomes wpm_x) and the other DataFrame provided will have all column names appended with _y.
Reset Code Python Output:
author date_x wpm_x date_y wpm_y
0 Wade 2022-05-07 134 2022-05-08 179
1 Karle 2022-05-07 148 2022-05-08 165