Creating a DataFrame from an HTML table using Pandas
HTML tables can be found on many different websites and can contain useful data we may want to analyze. In this guide, we will learn how to create a pandas DataFrame from a table on a website, first creating one from a page containing only an HTML table, and then from a website such as a Wikipedia page, where the table is only part of the page.
Reading a HTML table from a URL
First, we will work with the UIUC Graybook dataset. This dataset contains salary data from administrators and faculty at the university. The most current dataset that contains salaries for the 2021-22 school year can be found here. The data is broken down by college, so we will look at data from the School of Liberal Arts & Sciences.
To read in the data, input the URL into the read_html function.
import pandas as pd
pages = pd.read_html("https://www.trustees.uillinois.edu/trustees/resources/21-22-Graybook/KV.html")
[ Employee Name Job Title \
0 KV - Liberal Arts & Sciences KV - Liberal Arts & Sciences
1 303 - African American Studies 303 - African American Studies
2 Anderson, James D PROF
3 Anderson, James D Employee Total for All Jobs...
4 Bailey, Ronald William HEAD
... ... ...
2915 Weinberg, Shelley E Employee Total for All Jobs...
2916 Weissman, Terri ASSOC PROF
2917 Weissman, Terri Employee Total for All Jobs...
2918 Wilson, David PROF
2919 Wilson, David Employee Total for All Jobs...
Tenure Empl Class \
0 KV - Liberal Arts & Sciences KV - Liberal Arts & Sciences
1 303 - African American Studies 303 - African American Studies
2 NaN AA
3 Employee Total for All Jobs... Employee Total for All Jobs...
4 NaN BC
... ... ...
2915 Employee Total for All Jobs... Employee Total for All Jobs...
2916 NaN AA
2917 Employee Total for All Jobs... Employee Total for All Jobs...
2918 NaN AA
2919 Employee Total for All Jobs... Employee Total for All Jobs...
Present FTE Proposed FTE \
0 KV - Liberal Arts & Sciences KV - Liberal Arts & Sciences
1 303 - African American Studies 303 - African American Studies
2 0.00 0.00
3 1.00 1.00
4 0.00 0.00
... ... ...
2915 1.00 1.00
2916 0.00 0.00
2917 1.00 1.00
2918 0.00 0.00
2919 1.00 1.00
Present Salary Proposed Salary
0 KV - Liberal Arts & Sciences KV - Liberal Arts & Sciences
1 303 - African American Studies 303 - African American Studies
2 $0.00 $0.00
3 $319,633.33 $325,797.33
4 $4,400.00 $4,400.00
... ... ...
2915 $86,660.00 $88,393.00
2916 $0.00 $0.00
2917 $85,870.00 $87,670.00
2918 $0.00 $0.00
2919 $125,875.00 $128,392.50
[2920 rows x 8 columns]]
The read_html function returns a list of DataFrames. In this case, we only have one table on the screen. This means that the data frame we're looking for is the first one in the list. Python list elements can be accessed by their index, meaning our DataFrame is pages[0], or the first element in the list.
All together, we have:
import pandas as pd
pages = pd.read_html("https://www.trustees.uillinois.edu/trustees/resources/21-22-Graybook/KV.html")
df = pages[0]| Employee Name | Job Title | Tenure | Empl Class | Present FTE | Proposed FTE | Present Salary | Proposed Salary | |
|---|---|---|---|---|---|---|---|---|
| 0 | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences | KV - Liberal Arts & Sciences |
| 1 | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies | 303 - African American Studies |
| 2 | Anderson, James D | PROF | NaN | AA | 0.00 | 0.00 | $0.00 | $0.00 |
| 3 | Anderson, James D | Employee Total for All Jobs... | Employee Total for All Jobs... | Employee Total for All Jobs... | 1.00 | 1.00 | $319,633.33 | $325,797.33 |
| 4 | Bailey, Ronald William | HEAD | NaN | BC | 0.00 | 0.00 | $4,400.00 | $4,400.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2915 | Weinberg, Shelley E | Employee Total for All Jobs... | Employee Total for All Jobs... | Employee Total for All Jobs... | 1.00 | 1.00 | $86,660.00 | $88,393.00 |
| 2916 | Weissman, Terri | ASSOC PROF | NaN | AA | 0.00 | 0.00 | $0.00 | $0.00 |
| 2917 | Weissman, Terri | Employee Total for All Jobs... | Employee Total for All Jobs... | Employee Total for All Jobs... | 1.00 | 1.00 | $85,870.00 | $87,670.00 |
| 2918 | Wilson, David | PROF | NaN | AA | 0.00 | 0.00 | $0.00 | $0.00 |
| 2919 | Wilson, David | Employee Total for All Jobs... | Employee Total for All Jobs... | Employee Total for All Jobs... | 1.00 | 1.00 | $125,875.00 | $128,392.50 |
Reading a HTML table from a Wikipedia Page
For our second example, let's do something a bit more complicated: converting a html table from a wikipedia page. The wikipedia page for the National Dog Show features a table containing the previous winners of the competition. This will be our target data frame.
Let's start off as we did before, by inputting the URL of the wikipedia page into the read_html function.
import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/National_Dog_Show")
[ National Dog Show National Dog Show.1
0 NaN NaN
1 Also known as The National Dog Show Presented by Purina
2 Genre Dog show
3 Created by Kennel Club of Philadelphia (KCP)
4 Presented by John O'HurleyDavid Frei
5 Judges Frank DePaulo (Show Chairman & Judge)
6 Country of origin United States
7 Production Production
8 Production location Greater Philadelphia Expo Center at Oaks in Oa...
9 Running time 2 hours
10 Distributor NBC Sports Productions
11 Release Release
12 Original network NBC
13 Original release November 2002; 19 years ago,
Year Breed Name \
0 2001 Labrador Retriever Gustavus Brady
1 2002 Standard Poodle Ale Kai Mikimoto On Fifth
2 2003 Doberman Blue Chip Purple Reign
3 2004 Smooth Fox Terrier Aimhi Avalon Renaissance
4 2005 Colored Bull Terrier Rocky Top's Sundance Kid
5 2006 Toy Poodle Smash JP Win A Victory
6 2007 Australian Shepherd Buff Cap Creslane Arctic Mist
7 2008 Pointer Cookieland Seasyde Hollyberry
8 2009 Scottish Terrier Ch. Roundtown Mercedes of Maryscot
9 2010 Irish Setter Windntide Mr. Sandman
10 2011 Wire Fox Terrier GCH Steele Your Heart
11 2012 Wire Fox Terrier Afterall Painting the Sky
12 2013 American Foxhound Gch. Kiarry's Pandora's Box
13 2014 Bloodhound Flessner's International S’cess
14 2015 Skye Terrier GCH CH Cragsmoor Good Time Charlie
15 2016 Greyhound GCHS CH Grandcru Giaconda CGC
16 2017 Brussels Griffon GCH Somerset Wynzall Hashtag
17 2018 Whippet GCHG Pinnacle Tennessee Whiskey
18 2019 Bulldog GCHG Diamond Gold Majesu Pisko Bulls
19 2020 Scottish Deerhound GCH Foxcliffe Claire Randall Fraser
20 2021 Scottish Deerhound GCH Foxcliffe Claire Randall Fraser
Call Name Handler Reference
0 Gus NaN NaN
1 Miki Karen LeFrak NaN
2 Raisin NaN NaN
3 Gracie NaN [7]
4 Rufus Barbara Bishop [8]
5 Vikki Kaz Hosaka [7]
6 Swizzle Nancy Gagnon [9]
7 Holly Sean & Tammy McCarthy [10]
8 Sadie Amelia Musser [7]
9 Clooney Peter Kubacz [11]
10 Eira Gabriel Rangel NaN
11 Sky Gabriel Rangel [12]
12 Jewel Lisa Miller [13]
13 Nathan Heather Helmer [14]
14 Charlie Larry Cornelius [15]
15 Gia Melanie Steele [16]
16 Newton Keith Jacobson [17]
17 Whiskey Justin Smithey [18]
18 Thor Eduardo Paris [19][20]
19 Claire Angela Lloyd [21]
20 Claire Angela Lloyd [22] ,
.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteThanksgiving \
0 History andtraditions
1 Canada
2 United States
3 Cuisine
4 Songs
5 Associatedevents
6 Cultural
7 Parades
8 Protests
9 Sports
10 Football
11 Basketball
12 Turkey Trots
13 Others
.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteThanksgiving.1
0 Canada Samuel de Champlain Martin Frobisher Ha...
1 Samuel de Champlain Martin Frobisher Halifax F...
2 Pilgrims Mayflower Plymouth Colony Plymouth, M...
3 Thanksgiving dinner Turkey Deep-fried Tofurkey...
4 "Alice's Restaurant" "Bless This House" "Bring...
5 Cultural Christmas and holiday season Harvest ...
6 Christmas and holiday season Harvest festivals...
7 Novant Health (Charlotte) Chicago America's (D...
8 National Day of Mourning Unthanksgiving Day Bu...
9 Football NFL on Thanksgiving Day CFL Thanksgiv...
10 NFL on Thanksgiving Day CFL Thanksgiving Class...
11 AdvoCare Invitational Battle 4 Atlantis Wooden...
12 Berwick Run for the Diamonds Buffalo Turkey Tr...
13 National Dog Show Pumpkin chucking Turkey bowl...,
0 1
0 Canada Samuel de Champlain Martin Frobisher Halifax F...
1 United States Pilgrims Mayflower Plymouth Colony Plymouth, M...
2 Cuisine Thanksgiving dinner Turkey Deep-fried Tofurkey...
3 Songs "Alice's Restaurant" "Bless This House" "Bring...,
0 1
0 Cultural Christmas and holiday season Harvest festivals...
1 Parades Novant Health (Charlotte) Chicago America's (D...
2 Protests National Day of Mourning Unthanksgiving Day Bu...,
0 1
0 Football NFL on Thanksgiving Day CFL Thanksgiving Class...
1 Basketball AdvoCare Invitational Battle 4 Atlantis Wooden...
2 Turkey Trots Berwick Run for the Diamonds Buffalo Turkey Tr...
3 Others National Dog Show Pumpkin chucking Turkey bowl...]
Looking at our output, we can see that the read_html function is breaking up the webpage into different DataFrames. Because of this we need to look through the output and find which DataFrame contains the table we are looking for. In this case, data we are looking for, the table with the dogs who won best in show, is the second dataframe in the list.
We can access the second dataframe by its index, pages[1].
Putting it all together, we have:
import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/National_Dog_Show")
df = pages[1]| Year | Breed | Name | Call Name | Handler | Reference | |
|---|---|---|---|---|---|---|
| 0 | 2001 | Labrador Retriever | Gustavus Brady | Gus | NaN | NaN |
| 1 | 2002 | Standard Poodle | Ale Kai Mikimoto On Fifth | Miki | Karen LeFrak | NaN |
| 2 | 2003 | Doberman | Blue Chip Purple Reign | Raisin | NaN | NaN |
| 3 | 2004 | Smooth Fox Terrier | Aimhi Avalon Renaissance | Gracie | NaN | [7] |
| 4 | 2005 | Colored Bull Terrier | Rocky Top's Sundance Kid | Rufus | Barbara Bishop | [8] |
| 5 | 2006 | Toy Poodle | Smash JP Win A Victory | Vikki | Kaz Hosaka | [7] |
| 6 | 2007 | Australian Shepherd | Buff Cap Creslane Arctic Mist | Swizzle | Nancy Gagnon | [9] |
| 7 | 2008 | Pointer | Cookieland Seasyde Hollyberry | Holly | Sean & Tammy McCarthy | [10] |
| 8 | 2009 | Scottish Terrier | Ch. Roundtown Mercedes of Maryscot | Sadie | Amelia Musser | [7] |
| 9 | 2010 | Irish Setter | Windntide Mr. Sandman | Clooney | Peter Kubacz | [11] |
| 10 | 2011 | Wire Fox Terrier | GCH Steele Your Heart | Eira | Gabriel Rangel | NaN |
| 11 | 2012 | Wire Fox Terrier | Afterall Painting the Sky | Sky | Gabriel Rangel | [12] |
| 12 | 2013 | American Foxhound | Gch. Kiarry's Pandora's Box | Jewel | Lisa Miller | [13] |
| 13 | 2014 | Bloodhound | Flessner's International S’cess | Nathan | Heather Helmer | [14] |
| 14 | 2015 | Skye Terrier | GCH CH Cragsmoor Good Time Charlie | Charlie | Larry Cornelius | [15] |
| 15 | 2016 | Greyhound | GCHS CH Grandcru Giaconda CGC | Gia | Melanie Steele | [16] |
| 16 | 2017 | Brussels Griffon | GCH Somerset Wynzall Hashtag | Newton | Keith Jacobson | [17] |
| 17 | 2018 | Whippet | GCHG Pinnacle Tennessee Whiskey | Whiskey | Justin Smithey | [18] |
| 18 | 2019 | Bulldog | GCHG Diamond Gold Majesu Pisko Bulls | Thor | Eduardo Paris | [19][20] |
| 19 | 2020 | Scottish Deerhound | GCH Foxcliffe Claire Randall Fraser | Claire | Angela Lloyd | [21] |
| 20 | 2021 | Scottish Deerhound | GCH Foxcliffe Claire Randall Fraser | Claire | Angela Lloyd | [22] |
This method can be used on any webpage with a table on it, not just for Wikipedia articles.
Pandas Documentation
The full documentation for read_html is available in the pandas documentation.