Combining DataFrames by Concatenation


Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is concatenation, which will combine the DataFrames end-to-end.

Concatenation is a great way to combine DataFrames with identical columns (for example, you may have data that was observed on two separate dates and stored in separate files that needs to be joined together). Concatenation does not look at the contents of the data at all and only joins the DataFrame end-to-end.

Instead, you want to merge data based on the contents (for example, when you have two files both with information about the same person and their name appears in both files), you need to merge DataFrames together instead of concatenating.

Consider the following DataFrames that are ideal for concatenation:

df1 = pd.DataFrame([
  {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
  {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
  {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
])

df2 = pd.DataFrame([
  {'Name': 'Gus', 'Job': 'Manager', 'Age': 45},
  {'Name': 'Hank', 'Job': 'Special Agent', 'Age': 39},
  {'Name': 'Jesse', 'Job': 'Distributor', 'Age': 20}
])

Concatenating DataFrames Vertically

The pd.concat() function combines a list of DataFrames end-to-end. To combine df1 to the end of df2:

Reset Code Python Output:
Name Job Age
0 Saul Lawyer 60
1 Walter Teacher 35
2 Kim Lawyer 42
0 Gus Manager 45
1 Hank Special Agent 39
2 Jesse Distributor 20

You can add as many elements in the list, and place them in the order you'd like them to be combined in the resulting DataFrame.

Re-generating Index Row Labels After Concatenating

The concatenation of two DataFrames adds the rows end-to-end, leaving the row labels unchanged *(notice the repeating 0, 1, 2 labels). The .reset_index() function reset the row labels to index values to match the row number.

Reset Code Python Output:
Name Job Age
0 Saul Lawyer 60
1 Walter Teacher 35
2 Kim Lawyer 42
3 Gus Manager 45
4 Hank Special Agent 39
5 Jesse Distributor 20

Concatenating Non-Identical Columns

Since concatenating two DataFrames does not do any merging of data, when your DataFrames do not have identical column names the missing columns in any DataFrames that are being concatenated are marked with an NaN. To combine data based on merging data, use a merging technique instead of concatenation.

Reset Code Python Output:
index Name Job Age Favorite Color Show
0 0 Saul Lawyer 60.0 NaN NaN
1 1 Walter Teacher 35.0 NaN NaN
2 2 Kim Lawyer 42.0 NaN NaN
3 0 Kim NaN NaN Red BCS
4 1 Walter NaN NaN Blue BB
5 2 Jesse NaN NaN Green BB
6 3 Saul NaN NaN Maroon BCS

Concatenating DataFrame Horizontally

If you wanted to combine the two DataFrames horizontally, you can use .concat() with the parameter axis=1. This might be useful if data extends across multiple columns in the two DataFrames.

You can think of this as extending the columns of the first DataFrame, as opposed to extending the rows.

Reset Code Python Output:
Name Job Age Favorite Color Show
0 Saul Lawyer 60 Maroon BCS
1 Walter Teacher 35 Blue BB
2 Kim Lawyer 42 Red BCS

Identical to concatenating the data vertically, the columns are added to row end-to-end. Be careful, as this can be dangerous if you are not absolutely certain your data is in the identical order. If you have a identifier column (for example, the Name columns) in both DataFrames, you should merge the DataFrames based on the contents of the column instead of concatenating.