Combining DataFrames by Concatenation

Sometimes when you're dealing with multiple DataFrames, you might want to combine them together. One method of combining DataFrames is concatenation, which will combine the DataFrames end-to-end.

Concatenation is a great way to combine DataFrames with identical columns (for example, you may have data that was observed on two separate dates and stored in separate files that needs to be joined together). Concatenation does not look at the contents of the data at all and only joins the DataFrame end-to-end.

Instead, you want to merge data based on the contents (for example, when you have two files both with information about the same person and their name appears in both files), you need to merge DataFrames together instead of concatenating.

Consider the following DataFrames that are ideal for concatenation:

df1 = pd.DataFrame([
  {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},
  {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},
  {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},
])

df2 = pd.DataFrame([
  {'Name': 'Gus', 'Job': 'Manager', 'Age': 45},
  {'Name': 'Hank', 'Job': 'Special Agent', 'Age': 39},
  {'Name': 'Jesse', 'Job': 'Distributor', 'Age': 20}
])

Concatenating DataFrames Vertically

The pd.concat() function combines a list of DataFrames end-to-end. To combine df1 to the end of df2:

import pandas as pd\ndf1 = pd.DataFrame([\n  {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n  {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n  {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n])\n&nbsp;\ndf2 = pd.DataFrame([\n  {'Name': 'Gus', 'Job': 'Manager', 'Age': 45},\n  {'Name': 'Hank', 'Job': 'Special Agent', 'Age': 39},\n  {'Name': 'Jesse', 'Job': 'Distributor', 'Age': 20}\n])\npd.concat([df1, df2])

Reset Code Python Output:


  
    
      
      Name
      Job
      Age
    
  
  
    
      0
      Saul
      Lawyer
      60
    
    
      1
      Walter
      Teacher
      35
    
    
      2
      Kim
      Lawyer
      42
    
    
      0
      Gus
      Manager
      45
    
    
      1
      Hank
      Special Agent
      39
    
    
      2
      Jesse
      Distributor
      20

	Name	Job	Age
0	Saul	Lawyer	60
1	Walter	Teacher	35
2	Kim	Lawyer	42
0	Gus	Manager	45
1	Hank	Special Agent	39
2	Jesse	Distributor	20

You can add as many elements in the list, and place them in the order you'd like them to be combined in the resulting DataFrame.

Re-generating Index Row Labels After Concatenating

The concatenation of two DataFrames adds the rows end-to-end, leaving the row labels unchanged *(notice the repeating 0, 1, 2 labels). The .reset_index() function reset the row labels to index values to match the row number.

import pandas as pd\ndf1 = pd.DataFrame([\n  {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n  {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n  {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n])\n&nbsp;\ndf2 = pd.DataFrame([\n  {'Name': 'Gus', 'Job': 'Manager', 'Age': 45},\n  {'Name': 'Hank', 'Job': 'Special Agent', 'Age': 39},\n  {'Name': 'Jesse', 'Job': 'Distributor', 'Age': 20}\n])\npd.concat([df1, df2]).reset_index()

Reset Code Python Output:


  
    
      
      Name
      Job
      Age
    
  
  
    
      0
      Saul
      Lawyer
      60
    
    
      1
      Walter
      Teacher
      35
    
    
      2
      Kim
      Lawyer
      42
    
    
      3
      Gus
      Manager
      45
    
    
      4
      Hank
      Special Agent
      39
    
    
      5
      Jesse
      Distributor
      20

	Name	Job	Age
0	Saul	Lawyer	60
1	Walter	Teacher	35
2	Kim	Lawyer	42
3	Gus	Manager	45
4	Hank	Special Agent	39
5	Jesse	Distributor	20

Concatenating Non-Identical Columns

Since concatenating two DataFrames does not do any merging of data, when your DataFrames do not have identical column names the missing columns in any DataFrames that are being concatenated are marked with an NaN. To combine data based on merging data, use a merging technique instead of concatenation.

import pandas as pd\ndf1 = pd.DataFrame([\n  {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n  {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n  {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n])\n# The "Name" column appears in df1, but not "Favorite Color" and "Show" does not appear in df1\ndf3 = pd.DataFrame([\n    {'Name': 'Kim', 'Favorite Color': 'Red', 'Show': 'BCS'},\n    {'Name': 'Walter', 'Favorite Color': 'Blue', 'Show': 'BB'},\n    {'Name': 'Jesse', 'Favorite Color': 'Green', 'Show': 'BB'},\n    {'Name': 'Saul', 'Favorite Color': 'Maroon', 'Show': 'BCS'},\n])\n&nbsp;\n# Combine `df1` and our new `df3`, and using .reset_index():\npd.concat([df1, df3]).reset_index()

Reset Code Python Output:


  
    
      
      index
      Name
      Job
      Age
      Favorite Color
      Show
    
  
  
    
      0
      0
      Saul
      Lawyer
      60.0
      NaN
      NaN
    
    
      1
      1
      Walter
      Teacher
      35.0
      NaN
      NaN
    
    
      2
      2
      Kim
      Lawyer
      42.0
      NaN
      NaN
    
    
      3
      0
      Kim
      NaN
      NaN
      Red
      BCS
    
    
      4
      1
      Walter
      NaN
      NaN
      Blue
      BB
    
    
      5
      2
      Jesse
      NaN
      NaN
      Green
      BB
    
    
      6
      3
      Saul
      NaN
      NaN
      Maroon
      BCS

	index	Name	Job	Age	Favorite Color	Show
0	0	Saul	Lawyer	60.0	NaN	NaN
1	1	Walter	Teacher	35.0	NaN	NaN
2	2	Kim	Lawyer	42.0	NaN	NaN
3	0	Kim	NaN	NaN	Red	BCS
4	1	Walter	NaN	NaN	Blue	BB
5	2	Jesse	NaN	NaN	Green	BB
6	3	Saul	NaN	NaN	Maroon	BCS

Concatenating DataFrame Horizontally

If you wanted to combine the two DataFrames horizontally, you can use .concat() with the parameter axis=1. This might be useful if data extends across multiple columns in the two DataFrames.

You can think of this as extending the columns of the first DataFrame, as opposed to extending the rows.

import pandas as pd\ndf4 = pd.DataFrame([\n    {'Name': 'Saul', 'Job': 'Lawyer', 'Age': 60},\n    {'Name': 'Walter', 'Job': 'Teacher', 'Age': 35},\n    {'Name': 'Kim', 'Job': 'Lawyer', 'Age': 42},\n])\ndf5 = pd.DataFrame([\n    {'Favorite Color': 'Maroon', 'Show': 'BCS'}, # Saul's continued record\n    {'Favorite Color': 'Blue', 'Show': 'BB'},    # Walter's continued record\n    {'Favorite Color': 'Red', 'Show': 'BCS'},    # Kim's continued record\n])\n&nbsp;\npd.concat([df4, df5], axis=1)

Reset Code Python Output:


  
    
      
      Name
      Job
      Age
      Favorite Color
      Show
    
  
  
    
      0
      Saul
      Lawyer
      60
      Maroon
      BCS
    
    
      1
      Walter
      Teacher
      35
      Blue
      BB
    
    
      2
      Kim
      Lawyer
      42
      Red
      BCS

	Name	Job	Age	Favorite Color	Show
0	Saul	Lawyer	60	Maroon	BCS
1	Walter	Teacher	35	Blue	BB
2	Kim	Lawyer	42	Red	BCS

Identical to concatenating the data vertically, the columns are added to row end-to-end. Be careful, as this can be dangerous if you are not absolutely certain your data is in the identical order. If you have a identifier column (for example, the Name columns) in both DataFrames, you should merge the DataFrames based on the contents of the column instead of concatenating.