Combining DataFrames by Merging

Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is merging, which combines rows together by matching DataFrames based on matching the contents of the data from columns.

Consider the following DataFrames:

df1 = pd.DataFrame([
    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
    {'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},
])

df2 = pd.DataFrame([
    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},
    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
])

Merging DataFrames With Matching Columns

The default pd.merge operation merges two DataFrames by finding matching values in all columns with identical names and creating a DataFrame with only the matches that were found.

In our example above, the Name column appears in both df1 and df2. The rows with identical data for the Name are combined and only Saul, Walter, and Kim appears since only those three people are found in both DataFrames:

import pandas as pd\ndf1 = pd.DataFrame([\n    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n    {'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},\n])\n&nbsp;\ndf2 = pd.DataFrame([\n    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},\n    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},\n    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},\n    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},\n])\npd.merge(df1, df2)

Reset Code Python Output:


  
    
      
      Name
      Job
      Age
      Favorite Color
      Show
    
  
  
    
      0
      Saul
      Lawyer
      60
      Maroon
      BCS
    
    
      1
      Walter
      Teacher
      35
      Blue
      BB
    
    
      2
      Kim
      Lawyer
      42
      Red
      BCS

	Name	Job	Age	Favorite Color	Show
0	Saul	Lawyer	60	Maroon	BCS
1	Walter	Teacher	35	Blue	BB
2	Kim	Lawyer	42	Red	BCS

Keeping Unmatched Rows

When merging, you can specify how the merging occurs. The default merge in pandas is an "inner merge" that keeps only the rows that are found in both the DataFrames.

By using how='outer' to preform an "outer merge", all rows are kept from both DataFrames and NaN values are used when no match for a specific column was found:

import pandas as pd\ndf1 = pd.DataFrame([\n    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n    {'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},\n])\n&nbsp;\ndf2 = pd.DataFrame([\n    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},\n    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},\n    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},\n    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},\n])\npd.merge(df1, df2, how='outer')

Reset Code Python Output:


  
    
      
      Name
      Job
      Age
      Favorite Color
      Show
    
  
  
    
      0
      Saul
      Lawyer
      60.0
      Maroon
      BCS
    
    
      1
      Walter
      Teacher
      35.0
      Blue
      BB
    
    
      2
      Kim
      Lawyer
      42.0
      Red
      BCS
    
    
      3
      Gus
      Distributor
      71.0
      NaN
      NaN
    
    
      4
      Jesse
      NaN
      NaN
      Green
      BB

	Name	Job	Age	Favorite Color	Show
0	Saul	Lawyer	60.0	Maroon	BCS
1	Walter	Teacher	35.0	Blue	BB
2	Kim	Lawyer	42.0	Red	BCS
3	Gus	Distributor	71.0	NaN	NaN
4	Jesse	NaN	NaN	Green	BB

Merging by Specifying Specific Columns

By default, Python merges on ALL columns with identical names. If you have multiple columns with multiple names, the on parameter allows a single column (or list of columns) to be specified to merge. For example, the following DataFrames have identical records for two people with results for the same metric from two different dates:

day1 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-07", "wpm": 134},
  {"author": "Karle", "date": "2022-05-07", "wpm": 148},
])

day2 = pd.DataFrame([
  {"author": "Wade", "date": "2022-05-08", "wpm": 179},
  {"author": "Karle", "date": "2022-05-08", "wpm": 165},
])

DataFrame with two days of data, including all identical columns.

A default merge will try to match unique authors and dates, find no matches, and return an empty DataFrame:

import pandas as pd\nday1 = pd.DataFrame([\n  {"author": "Wade", "date": "2022-05-07", "wpm": 134},\n  {"author": "Karle", "date": "2022-05-07", "wpm": 148},\n])\n&nbsp;\nday2 = pd.DataFrame([\n  {"author": "Wade", "date": "2022-05-08", "wpm": 179},\n  {"author": "Karle", "date": "2022-05-08", "wpm": 165},\n])\npd.merge(day1, day2)

Reset Code Python Output:


  
    
      
      author
      date
      wpm

By specifying a merge on='author':

The merge is done only using the author column, and
All identical column names will be updated; the first (left-most) DataFrame provided to pd.merge will have all columns names appended with _x (ex: date becomes date_x and wpm becomes wpm_x) and the other DataFrame provided will have all column names appended with _y.

import pandas as pd\nday1 = pd.DataFrame([\n  {"author": "Wade", "date": "2022-05-07", "wpm": 134},\n  {"author": "Karle", "date": "2022-05-07", "wpm": 148},\n])\n&nbsp;\nday2 = pd.DataFrame([\n  {"author": "Wade", "date": "2022-05-08", "wpm": 179},\n  {"author": "Karle", "date": "2022-05-08", "wpm": 165},\n])\npd.merge(day1, day2, on='author')

Reset Code Python Output:

	author	date_x	wpm_x	date_y	wpm_y
0	Wade	2022-05-07	134	2022-05-08	179
1	Karle	2022-05-07	148	2022-05-08	165