Creating a DataFrame from an HTML table using Pandas


HTML tables can be found on many different websites and can contain useful data we may want to analyze. In this guide, we will learn how to create a pandas DataFrame from a table on a website, first creating one from a page containing only an HTML table, and then from a website such as a Wikipedia page, where the table is only part of the page.

Reading a HTML table from a URL

First, we will work with the UIUC Graybook dataset. This dataset contains salary data from administrators and faculty at the university. The most current dataset that contains salaries for the 2021-22 school year can be found here. The data is broken down by college, so we will look at data from the School of Liberal Arts & Sciences.

To read in the data, input the URL into the read_html function.

import pandas as pd
pages = pd.read_html("https://www.trustees.uillinois.edu/trustees/resources/21-22-Graybook/KV.html")
[                       Employee Name                       Job Title  \
 0       KV - Liberal Arts & Sciences    KV - Liberal Arts & Sciences   
 1     303 - African American Studies  303 - African American Studies   
 2                  Anderson, James D                            PROF   
 3                  Anderson, James D  Employee Total for All Jobs...   
 4             Bailey, Ronald William                            HEAD   
 ...                              ...                             ...   
 2915             Weinberg, Shelley E  Employee Total for All Jobs...   
 2916                 Weissman, Terri                      ASSOC PROF   
 2917                 Weissman, Terri  Employee Total for All Jobs...   
 2918                   Wilson, David                            PROF   
 2919                   Wilson, David  Employee Total for All Jobs...   

                               Tenure                      Empl Class  \
 0       KV - Liberal Arts & Sciences    KV - Liberal Arts & Sciences   
 1     303 - African American Studies  303 - African American Studies   
 2                                NaN                              AA   
 3     Employee Total for All Jobs...  Employee Total for All Jobs...   
 4                                NaN                              BC   
 ...                              ...                             ...   
 2915  Employee Total for All Jobs...  Employee Total for All Jobs...   
 2916                             NaN                              AA   
 2917  Employee Total for All Jobs...  Employee Total for All Jobs...   
 2918                             NaN                              AA   
 2919  Employee Total for All Jobs...  Employee Total for All Jobs...   
 
                          Present FTE                    Proposed FTE  \
 0       KV - Liberal Arts & Sciences    KV - Liberal Arts & Sciences   
 1     303 - African American Studies  303 - African American Studies   
 2                               0.00                            0.00   
 3                               1.00                            1.00   
 4                               0.00                            0.00   
 ...                              ...                             ...   
 2915                            1.00                            1.00   
 2916                            0.00                            0.00   
 2917                            1.00                            1.00   
 2918                            0.00                            0.00   
 2919                            1.00                            1.00   
 
                       Present Salary                 Proposed Salary  
 0       KV - Liberal Arts & Sciences    KV - Liberal Arts & Sciences  
 1     303 - African American Studies  303 - African American Studies  
 2                              $0.00                           $0.00  
 3                        $319,633.33                     $325,797.33  
 4                          $4,400.00                       $4,400.00  
 ...                              ...                             ...  
 2915                      $86,660.00                      $88,393.00  
 2916                           $0.00                           $0.00  
 2917                      $85,870.00                      $87,670.00  
 2918                           $0.00                           $0.00  
 2919                     $125,875.00                     $128,392.50  
 
 [2920 rows x 8 columns]]
Loading in a list of dataframes

The read_html function returns a list of DataFrames. In this case, we only have one table on the screen. This means that the data frame we're looking for is the first one in the list. Python list elements can be accessed by their index, meaning our DataFrame is pages[0], or the first element in the list.

All together, we have:

import pandas as pd
pages = pd.read_html("https://www.trustees.uillinois.edu/trustees/resources/21-22-Graybook/KV.html")
df = pages[0]
Employee NameJob TitleTenureEmpl ClassPresent FTEProposed FTEPresent SalaryProposed Salary
0KV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & SciencesKV - Liberal Arts & Sciences
1303 - African American Studies303 - African American Studies303 - African American Studies303 - African American Studies303 - African American Studies303 - African American Studies303 - African American Studies303 - African American Studies
2Anderson, James DPROFNaNAA0.000.00$0.00$0.00
3Anderson, James DEmployee Total for All Jobs...Employee Total for All Jobs...Employee Total for All Jobs...1.001.00$319,633.33$325,797.33
4Bailey, Ronald WilliamHEADNaNBC0.000.00$4,400.00$4,400.00
...........................
2915Weinberg, Shelley EEmployee Total for All Jobs...Employee Total for All Jobs...Employee Total for All Jobs...1.001.00$86,660.00$88,393.00
2916Weissman, TerriASSOC PROFNaNAA0.000.00$0.00$0.00
2917Weissman, TerriEmployee Total for All Jobs...Employee Total for All Jobs...Employee Total for All Jobs...1.001.00$85,870.00$87,670.00
2918Wilson, DavidPROFNaNAA0.000.00$0.00$0.00
2919Wilson, DavidEmployee Total for All Jobs...Employee Total for All Jobs...Employee Total for All Jobs...1.001.00$125,875.00$128,392.50
Loading a HTML table from a URL

Reading a HTML table from a Wikipedia Page

For our second example, let's do something a bit more complicated: converting a html table from a wikipedia page. The wikipedia page for the National Dog Show features a table containing the previous winners of the competition. This will be our target data frame.

Let's start off as we did before, by inputting the URL of the wikipedia page into the read_html function.

import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/National_Dog_Show")
[      National Dog Show                                National Dog Show.1
 0                   NaN                                                NaN
 1         Also known as          The National Dog Show Presented by Purina
 2                 Genre                                           Dog show
 3            Created by                  Kennel Club of Philadelphia (KCP)
 4          Presented by                            John O'HurleyDavid Frei
 5                Judges              Frank DePaulo (Show Chairman & Judge)
 6     Country of origin                                      United States
 7            Production                                         Production
 8   Production location  Greater Philadelphia Expo Center at Oaks in Oa...
 9          Running time                                            2 hours
 10          Distributor                             NBC Sports Productions
 11              Release                                            Release
 12     Original network                                                NBC
 13     Original release                        November 2002; 19 years ago,
     Year                 Breed                                  Name  \
 0   2001    Labrador Retriever                        Gustavus Brady   
 1   2002       Standard Poodle             Ale Kai Mikimoto On Fifth   
 2   2003              Doberman                Blue Chip Purple Reign   
 3   2004    Smooth Fox Terrier              Aimhi Avalon Renaissance   
 4   2005  Colored Bull Terrier              Rocky Top's Sundance Kid   
 5   2006            Toy Poodle                Smash JP Win A Victory   
 6   2007   Australian Shepherd         Buff Cap Creslane Arctic Mist   
 7   2008               Pointer         Cookieland Seasyde Hollyberry   
 8   2009      Scottish Terrier    Ch. Roundtown Mercedes of Maryscot   
 9   2010          Irish Setter                 Windntide Mr. Sandman   
 10  2011      Wire Fox Terrier                 GCH Steele Your Heart   
 11  2012      Wire Fox Terrier             Afterall Painting the Sky   
 12  2013     American Foxhound           Gch. Kiarry's Pandora's Box   
 13  2014            Bloodhound       Flessner's International S’cess   
 14  2015          Skye Terrier    GCH CH Cragsmoor Good Time Charlie   
 15  2016             Greyhound         GCHS CH Grandcru Giaconda CGC   
 16  2017      Brussels Griffon          GCH Somerset Wynzall Hashtag   
 17  2018               Whippet       GCHG Pinnacle Tennessee Whiskey   
 18  2019               Bulldog  GCHG Diamond Gold Majesu Pisko Bulls   
 19  2020    Scottish Deerhound   GCH Foxcliffe Claire Randall Fraser   
 20  2021    Scottish Deerhound   GCH Foxcliffe Claire Randall Fraser   
 
    Call Name                Handler Reference  
 0        Gus                    NaN       NaN  
 1       Miki           Karen LeFrak       NaN  
 2     Raisin                    NaN       NaN  
 3     Gracie                    NaN       [7]  
 4      Rufus         Barbara Bishop       [8]  
 5      Vikki             Kaz Hosaka       [7]  
 6    Swizzle           Nancy Gagnon       [9]  
 7      Holly  Sean & Tammy McCarthy      [10]  
 8      Sadie          Amelia Musser       [7]  
 9    Clooney           Peter Kubacz      [11]  
 10      Eira         Gabriel Rangel       NaN  
 11       Sky         Gabriel Rangel      [12]  
 12     Jewel            Lisa Miller      [13]  
 13    Nathan         Heather Helmer      [14]  
 14   Charlie        Larry Cornelius      [15]  
 15       Gia         Melanie Steele      [16]  
 16    Newton         Keith Jacobson      [17]  
 17   Whiskey         Justin Smithey      [18]  
 18      Thor          Eduardo Paris  [19][20]  
 19    Claire           Angela Lloyd      [21]  
 20    Claire           Angela Lloyd      [22]  ,
 .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteThanksgiving  \
 0                               History andtraditions
 1                                              Canada
 2                                       United States
 3                                             Cuisine
 4                                               Songs
 5                                    Associatedevents
 6                                            Cultural
 7                                             Parades
 8                                            Protests
 9                                              Sports
 10                                           Football
 11                                         Basketball
 12                                       Turkey Trots
 13                                             Others
 
 .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteThanksgiving.1  
 0   Canada Samuel de Champlain Martin Frobisher Ha...
 1   Samuel de Champlain Martin Frobisher Halifax F...
 2   Pilgrims Mayflower Plymouth Colony Plymouth, M...
 3   Thanksgiving dinner Turkey Deep-fried Tofurkey...
 4   "Alice's Restaurant" "Bless This House" "Bring...
 5   Cultural Christmas and holiday season Harvest ...
 6   Christmas and holiday season Harvest festivals...
 7   Novant Health (Charlotte) Chicago America's (D...
 8   National Day of Mourning Unthanksgiving Day Bu...
 9   Football NFL on Thanksgiving Day CFL Thanksgiv...
 10  NFL on Thanksgiving Day CFL Thanksgiving Class...
 11  AdvoCare Invitational Battle 4 Atlantis Wooden...
 12  Berwick Run for the Diamonds Buffalo Turkey Tr...
 13  National Dog Show Pumpkin chucking Turkey bowl...,
                0                                                  1
 0         Canada  Samuel de Champlain Martin Frobisher Halifax F...
 1  United States  Pilgrims Mayflower Plymouth Colony Plymouth, M...
 2        Cuisine  Thanksgiving dinner Turkey Deep-fried Tofurkey...
 3          Songs  "Alice's Restaurant" "Bless This House" "Bring...,
           0                                                  1
 0  Cultural  Christmas and holiday season Harvest festivals...
 1   Parades  Novant Health (Charlotte) Chicago America's (D...
 2  Protests  National Day of Mourning Unthanksgiving Day Bu...,
               0                                                  1
 0      Football  NFL on Thanksgiving Day CFL Thanksgiving Class...
 1    Basketball  AdvoCare Invitational Battle 4 Atlantis Wooden...
 2  Turkey Trots  Berwick Run for the Diamonds Buffalo Turkey Tr...
 3        Others  National Dog Show Pumpkin chucking Turkey bowl...]
Loading in the initial Wikipedia page

Looking at our output, we can see that the read_html function is breaking up the webpage into different DataFrames. Because of this we need to look through the output and find which DataFrame contains the table we are looking for. In this case, data we are looking for, the table with the dogs who won best in show, is the second dataframe in the list.

We can access the second dataframe by its index, pages[1].

Putting it all together, we have:

import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/National_Dog_Show")
df = pages[1]
YearBreedNameCall NameHandlerReference
02001Labrador RetrieverGustavus BradyGusNaNNaN
12002Standard PoodleAle Kai Mikimoto On FifthMikiKaren LeFrakNaN
22003DobermanBlue Chip Purple ReignRaisinNaNNaN
32004Smooth Fox TerrierAimhi Avalon RenaissanceGracieNaN[7]
42005Colored Bull TerrierRocky Top's Sundance KidRufusBarbara Bishop[8]
52006Toy PoodleSmash JP Win A VictoryVikkiKaz Hosaka[7]
62007Australian ShepherdBuff Cap Creslane Arctic MistSwizzleNancy Gagnon[9]
72008PointerCookieland Seasyde HollyberryHollySean & Tammy McCarthy[10]
82009Scottish TerrierCh. Roundtown Mercedes of MaryscotSadieAmelia Musser[7]
92010Irish SetterWindntide Mr. SandmanClooneyPeter Kubacz[11]
102011Wire Fox TerrierGCH Steele Your HeartEiraGabriel RangelNaN
112012Wire Fox TerrierAfterall Painting the SkySkyGabriel Rangel[12]
122013American FoxhoundGch. Kiarry's Pandora's BoxJewelLisa Miller[13]
132014BloodhoundFlessner's International S’cessNathanHeather Helmer[14]
142015Skye TerrierGCH CH Cragsmoor Good Time CharlieCharlieLarry Cornelius[15]
152016GreyhoundGCHS CH Grandcru Giaconda CGCGiaMelanie Steele[16]
162017Brussels GriffonGCH Somerset Wynzall HashtagNewtonKeith Jacobson[17]
172018WhippetGCHG Pinnacle Tennessee WhiskeyWhiskeyJustin Smithey[18]
182019BulldogGCHG Diamond Gold Majesu Pisko BullsThorEduardo Paris[19][20]
192020Scottish DeerhoundGCH Foxcliffe Claire Randall FraserClaireAngela Lloyd[21]
202021Scottish DeerhoundGCH Foxcliffe Claire Randall FraserClaireAngela Lloyd[22]
Loading in a HTML table from a Wikipedia website.

This method can be used on any webpage with a table on it, not just for Wikipedia articles.

Pandas Documentation

The full documentation for read_html is available in the pandas documentation.