Combining DataFrames by Merging

Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is merging, which combines rows together by matching DataFrames based on matching the contents of the data from columns.

Consider the following DataFrames:

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
    {'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},

df2 = pd.DataFrame([
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},

Merging DataFrames With Matching Columns

The default pd.merge operation merges two DataFrames by finding matching values in all columns with identical names and creating a DataFrame with only the matches that were found.

In our example above, the Name column appears in both df1 and df2. The rows with identical data for the Name are combined and only Saul, Walter, and Kim appears since only those three people are found in both DataFrames:

pd.merge(df1, df2)
NameJobAgeFavorite ColorShow
Merging DataFrames Based on a Column

Keeping Unmatched Rows

When merging, you can specify how the merging occurs. The default merge in pandas is an "inner merge" that keeps only the rows that are found in both the DataFrames.

By using how='outer' to preform an "outer merge", all rows are kept from both DataFrames and NaN values are used when no match for a specific column was found:

pd.merge(df1, df2, how='outer')
NameJobAgeFavorite ColorShow
Merging DataFrames Based on a Column

Merging by Specifying Specific Columns

By default, Python merges on ALL columns with identical names. If you have multiple columns with multiple names, the on parameter allows a single column (or list of columns) to be specified to merge. For example, the following DataFrames have identical records for two people with results for the same metric from two different dates:

day1 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-07", "wpm": 134},
  {"author": "Karle", "date": "2022-05-07", "wpm": 148},

day2 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-08", "wpm": 179},
  {"author": "Karle", "date": "2022-05-08", "wpm": 165},
DataFrame with two days of data, including all identical columns.

A default merge will try to match unique authors and dates, find no matches, and return an empty DataFrame:

pd.merge(day1, day2)
An outer merge on all matching columns on the DataFrames results in no matches.

By specifying a merge on='author':

  • The merge is done only using the author column, and
  • All identical column names will be updated; the first (left-most) DataFrame provided to pd.merge will have all columns names appended with _x (ex: date becomes date_x and wpm becomes wpm_x) and the other DataFrame provided will have all column names appended with _y.
pd.merge(day1, day2, on='author')
An outer merge on "author" merges the data for each unique author.