Combining DataFrames by Merging
Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is merging, which combines rows together by matching DataFrames based on matching the contents of the data from columns.
Consider the following DataFrames:
df1 = pd.DataFrame([
{'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
{'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
{'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
{'Name': 'Gus', 'Job': 'Distributor', 'Age': 71},
])
df2 = pd.DataFrame([
{'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},
{'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},
{'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},
{'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},
])
Merging DataFrames With Matching Columns
The default pd.merge
operation merges two DataFrames by finding matching values in all columns with identical names and creating a DataFrame with only the matches that were found.
In our example above, the Name
column appears in both df1
and df2
. The rows with identical data for the Name
are combined and only Saul, Walter, and Kim appears since only those three people are found in both DataFrames:
pd.merge(df1, df2)
Name | Job | Age | Favorite Color | Show | |
---|---|---|---|---|---|
0 | Saul | Lawyer | 60 | Maroon | BCS |
1 | Walter | Teacher | 35 | Blue | BB |
2 | Kim | Lawyer | 42 | Red | BCS |
Keeping Unmatched Rows
When merging, you can specify how the merging occurs. The default merge in pandas is an "inner merge" that keeps only the rows that are found in both the DataFrames.
By using how='outer'
to preform an "outer merge", all rows are kept from both DataFrames and NaN
values are used when no match for a specific column was found:
pd.merge(df1, df2, how='outer')
Name | Job | Age | Favorite Color | Show | |
---|---|---|---|---|---|
0 | Saul | Lawyer | 60.0 | Maroon | BCS |
1 | Walter | Teacher | 35.0 | Blue | BB |
2 | Kim | Lawyer | 42.0 | Red | BCS |
3 | Gus | Distributor | 71.0 | NaN | NaN |
4 | Jesse | NaN | NaN | Green | BB |
Merging by Specifying Specific Columns
By default, Python merges on ALL columns with identical names. If you have multiple columns with multiple names, the on
parameter allows a single column (or list of columns) to be specified to merge. For example, the following DataFrames have identical records for two people with results for the same metric from two different dates:
A default merge will try to match unique authors and dates, find no matches, and return an empty DataFrame:
pd.merge(day1, day2)
author | date | wpm |
---|
By specifying a merge on='author'
:
- The merge is done only using the
author
column, and - All identical column names will be updated; the first (left-most) DataFrame provided to
pd.merge
will have all columns names appended with_x
(ex:date
becomesdate_x
andwpm
becomeswpm_x
) and the other DataFrame provided will have all column names appended with_y
.
pd.merge(day1, day2, on='author')
author | date_x | wpm_x | date_y | wpm_y | |
---|---|---|---|---|---|
0 | Wade | 2022-05-07 | 134 | 2022-05-08 | 179 |
1 | Karle | 2022-05-07 | 148 | 2022-05-08 | 165 |