Creating a DataFrame from an HTML table using Pandas
HTML tables can be found on many different websites and can contain useful data we may want to analyze. In this guide, we will learn how to create a pandas DataFrame from a table on a website, first creating one from a page containing only an HTML table, and then from a website such as a Wikipedia page, where the table is only part of the page.
Reading a HTML table from a URL
First, we will work with the UIUC Graybook dataset. This dataset contains salary data from administrators and faculty at the university. The most current dataset that contains salaries for the 2021-22 school year can be found here. The data is broken down by college, so we will look at data from the School of Liberal Arts & Sciences.
To read in the data, input the URL into the read_html function.
The read_html function returns a list of DataFrames. In this case, we only have one table on the screen. This means that the data frame we're looking for is the first one in the list. Python list elements can be accessed by their index, meaning our DataFrame is pages[0], or the first element in the list.
All together, we have:
import pandas as pd
pages = pd.read_html("https://www.trustees.uillinois.edu/trustees/resources/21-22-Graybook/KV.html")
df = pages[0]
Employee Name
Job Title
Tenure
Empl Class
Present FTE
Proposed FTE
Present Salary
Proposed Salary
0
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
KV - Liberal Arts & Sciences
1
303 - African American Studies
303 - African American Studies
303 - African American Studies
303 - African American Studies
303 - African American Studies
303 - African American Studies
303 - African American Studies
303 - African American Studies
2
Anderson, James D
PROF
NaN
AA
0.00
0.00
$0.00
$0.00
3
Anderson, James D
Employee Total for All Jobs...
Employee Total for All Jobs...
Employee Total for All Jobs...
1.00
1.00
$319,633.33
$325,797.33
4
Bailey, Ronald William
HEAD
NaN
BC
0.00
0.00
$4,400.00
$4,400.00
...
...
...
...
...
...
...
...
...
2915
Weinberg, Shelley E
Employee Total for All Jobs...
Employee Total for All Jobs...
Employee Total for All Jobs...
1.00
1.00
$86,660.00
$88,393.00
2916
Weissman, Terri
ASSOC PROF
NaN
AA
0.00
0.00
$0.00
$0.00
2917
Weissman, Terri
Employee Total for All Jobs...
Employee Total for All Jobs...
Employee Total for All Jobs...
1.00
1.00
$85,870.00
$87,670.00
2918
Wilson, David
PROF
NaN
AA
0.00
0.00
$0.00
$0.00
2919
Wilson, David
Employee Total for All Jobs...
Employee Total for All Jobs...
Employee Total for All Jobs...
1.00
1.00
$125,875.00
$128,392.50
Loading a HTML table from a URL
Reading a HTML table from a Wikipedia Page
For our second example, let's do something a bit more complicated: converting a html table from a wikipedia page. The wikipedia page for the National Dog Show features a table containing the previous winners of the competition. This will be our target data frame.
Let's start off as we did before, by inputting the URL of the wikipedia page into the read_html function.
Looking at our output, we can see that the read_html function is breaking up the webpage into different DataFrames. Because of this we need to look through the output and find which DataFrame contains the table we are looking for. In this case, data we are looking for, the table with the dogs who won best in show, is the second dataframe in the list.
We can access the second dataframe by its index, pages[1].
Putting it all together, we have:
import pandas as pd
pages = pd.read_html("https://en.wikipedia.org/wiki/National_Dog_Show")
df = pages[1]
Year
Breed
Name
Call Name
Handler
Reference
0
2001
Labrador Retriever
Gustavus Brady
Gus
NaN
NaN
1
2002
Standard Poodle
Ale Kai Mikimoto On Fifth
Miki
Karen LeFrak
NaN
2
2003
Doberman
Blue Chip Purple Reign
Raisin
NaN
NaN
3
2004
Smooth Fox Terrier
Aimhi Avalon Renaissance
Gracie
NaN
[7]
4
2005
Colored Bull Terrier
Rocky Top's Sundance Kid
Rufus
Barbara Bishop
[8]
5
2006
Toy Poodle
Smash JP Win A Victory
Vikki
Kaz Hosaka
[7]
6
2007
Australian Shepherd
Buff Cap Creslane Arctic Mist
Swizzle
Nancy Gagnon
[9]
7
2008
Pointer
Cookieland Seasyde Hollyberry
Holly
Sean & Tammy McCarthy
[10]
8
2009
Scottish Terrier
Ch. Roundtown Mercedes of Maryscot
Sadie
Amelia Musser
[7]
9
2010
Irish Setter
Windntide Mr. Sandman
Clooney
Peter Kubacz
[11]
10
2011
Wire Fox Terrier
GCH Steele Your Heart
Eira
Gabriel Rangel
NaN
11
2012
Wire Fox Terrier
Afterall Painting the Sky
Sky
Gabriel Rangel
[12]
12
2013
American Foxhound
Gch. Kiarry's Pandora's Box
Jewel
Lisa Miller
[13]
13
2014
Bloodhound
Flessner's International S’cess
Nathan
Heather Helmer
[14]
14
2015
Skye Terrier
GCH CH Cragsmoor Good Time Charlie
Charlie
Larry Cornelius
[15]
15
2016
Greyhound
GCHS CH Grandcru Giaconda CGC
Gia
Melanie Steele
[16]
16
2017
Brussels Griffon
GCH Somerset Wynzall Hashtag
Newton
Keith Jacobson
[17]
17
2018
Whippet
GCHG Pinnacle Tennessee Whiskey
Whiskey
Justin Smithey
[18]
18
2019
Bulldog
GCHG Diamond Gold Majesu Pisko Bulls
Thor
Eduardo Paris
[19][20]
19
2020
Scottish Deerhound
GCH Foxcliffe Claire Randall Fraser
Claire
Angela Lloyd
[21]
20
2021
Scottish Deerhound
GCH Foxcliffe Claire Randall Fraser
Claire
Angela Lloyd
[22]
Loading in a HTML table from a Wikipedia website.
This method can be used on any webpage with a table on it, not just for Wikipedia articles.