Basic Data Visualization in Python
The pandas library makes it easy to create basic data visualizations and provides built-in methods for the most common plot types:
df.plot.bar(...), to create a bar plot (or add an h, df.plot.barh(...), for a horizontal bar chart)
df.plot.line(...), to create a line plot
df.plot.scatter(...), to create a scatter plot
df.plot.hist(...), to create a histogram
df.plot.box(...), to create a boxplot
df.plot.hexbin(...), to create a hexbin plot, and more.
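As a quick illustration of the methods listed above, here is a minimal sketch. The DataFrame and its values are invented for this example (they are not the real dataset), and each call simply uses the default options:

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "IlliniScore": [24, 31, 17, 42],
    "OpponentScore": [21, 35, 10, 14],
})

# Each call draws a new figure and returns the matplotlib Axes it drew on.
ax_bar = df.plot.bar()     # vertical bars, one group per row
ax_barh = df.plot.barh()   # the same data as horizontal bars
ax_line = df.plot.line()   # one line per numeric column
ax_hist = df.plot.hist()   # histogram of the values in each column
ax_box = df.plot.box()     # one box per numeric column
```

In a notebook each figure appears under the cell; in a plain script you would typically call matplotlib's plt.show() to display them.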
Selecting Columns for Visualizations
By default, each visualization will display all numeric columns of data -- which is often A LOT of data. For example, the Illini Football Dataset contains four numeric columns:
This means a default visualization will display all four columns:
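To see this behavior for ourselves, the sketch below uses a small made-up stand-in for the dataset. The Opponent, Season, and Attendance columns and all of the values are assumptions for illustration only. The default box plot keeps every numeric column and skips the text column:

```python
import pandas as pd

# Made-up stand-in data; Opponent, Season, and Attendance are hypothetical columns.
df = pd.DataFrame({
    "Opponent": ["A", "B", "C"],
    "Season": [2019, 2020, 2021],
    "IlliniScore": [24, 31, 17],
    "OpponentScore": [21, 35, 10],
    "Attendance": [40000, 36000, 38000],
})

# The columns pandas treats as numeric (the text column is left out).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# With no columns selected, the box plot draws one box per numeric column.
ax = df.plot.box()
```

Because the columns are on very different scales, the scores are squashed flat by the larger values, which is one reason to select columns before plotting.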
To plot only some of the columns, we first create a DataFrame containing just those columns. To do this, we provide a list of column names as the index to the DataFrame, as shown below:
df[ ["IlliniScore", "OpponentScore"] ]
Notice that there are two sets of square brackets!
- The first set tells us we're working within the DataFrame: df[ ... ].
- The second set is the list of column names, where each column name is separated by a comma.
- Together, they make the full command to select a subset of columns from our DataFrame.
A default box plot with only the two columns can now be created:
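A minimal sketch of that selection followed by the box plot, using invented values and a hypothetical Season column to stand in for the columns we want to leave out:

```python
import pandas as pd

# Invented values; Season is a hypothetical extra numeric column.
df = pd.DataFrame({
    "Season": [2019, 2020, 2021, 2022],
    "IlliniScore": [24, 31, 17, 42],
    "OpponentScore": [21, 35, 10, 14],
})

# Outer brackets index the DataFrame; inner brackets are the list of column names.
scores = df[["IlliniScore", "OpponentScore"]]

# The default box plot now shows only the two selected columns.
ax = scores.plot.box()
```

The selection and the plot can also be chained in one line: df[["IlliniScore", "OpponentScore"]].plot.box().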
x and y Column Values
Some visualizations require a single column to be plotted on the x axis and a single column on the y axis. For example, there is no default scatter plot, and Python informs us that both x and y are required:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-c4b381c47468> in <module>()
----> 1 df.plot.scatter()

TypeError: scatter() missing 2 required positional arguments: 'x' and 'y'
When Python informs us that we're "missing 2 required positional arguments", we need to specify them in the function call. For all visualizations, the x and y values will be the names of the columns to be used. If we wanted to create a scatter plot of IlliniScore versus another column, such as OpponentScore, we would pass both column names as x and y.
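The sketch below reproduces the error and then supplies the two arguments. The data is invented, and putting IlliniScore on x and OpponentScore on y is an assumption made for this example:

```python
import pandas as pd

# Invented values for illustration.
df = pd.DataFrame({
    "IlliniScore": [24, 31, 17, 42, 28],
    "OpponentScore": [21, 35, 10, 14, 30],
})

# Calling scatter() with no arguments raises a TypeError naming 'x' and 'y'.
message = ""
try:
    df.plot.scatter()
except TypeError as err:
    message = str(err)
    print(message)

# Passing column names for x and y fixes the call (axis choice is an assumption).
ax = df.plot.scatter(x="IlliniScore", y="OpponentScore")
```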
While this visualization is good, there's almost too much data for a scatter plot! Another plot, a hexbin, provides a heat map of the density of each region of a scatter plot. Switching out the .scatter for .hexbin, we get a completely different visualization:
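Here is a sketch of that swap. Since a hexbin only makes sense with many points, the scores are random numbers generated for this example rather than real game results:

```python
import numpy as np
import pandas as pd

# Random stand-in scores (not real data), seeded so the sketch is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "IlliniScore": rng.integers(0, 60, size=500),
    "OpponentScore": rng.integers(0, 60, size=500),
})

# Same x and y arguments as the scatter plot; only the method name changes.
ax = df.plot.hexbin(x="IlliniScore", y="OpponentScore")
```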
Yikes -- this visualization is not great. I can barely see any details.
Every visualization has dozens of individual options to customize -- color, style, and function! The technical documentation for each graph will display all of the different options. The easiest way to find the technical documentation is often to just search for the function:
df.plot.hexbin and look for the pandas documentation:
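As an alternative to searching online, the same text is usually available from inside Python, because pandas attaches the documentation to the function as its docstring. This sketch assumes docstrings have not been stripped from your installation and simply prints the lines that mention gridsize:

```python
import pandas as pd

# The plotting methods carry their documentation as a docstring.
doc = pd.DataFrame.plot.hexbin.__doc__

# Keep only the lines that mention the gridsize parameter.
gridsize_lines = [line.strip() for line in doc.splitlines() if "gridsize" in line]
for line in gridsize_lines:
    print(line)
```

Running help(pd.DataFrame.plot.hexbin) prints the full documentation instead of a few lines.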
All pandas documentation shows a description and a list of parameters with descriptions. The df.plot.hexbin documentation has several options specific to hexbin, including gridsize. The documentation for gridsize states that it sets "The number of hexagons in the x-direction." and has a "default [of] 100". What happens if we specify just 15, instead of 100, hexagons in the x-direction?
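One way to try this is to pass gridsize as a keyword argument. The sketch below again uses random stand-in scores, so the picture will differ from the real dataset, but the hexagons should be noticeably larger and easier to read:

```python
import numpy as np
import pandas as pd

# Random stand-in scores (not real data), seeded so the sketch is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "IlliniScore": rng.integers(0, 60, size=500),
    "OpponentScore": rng.integers(0, 60, size=500),
})

# gridsize=15 asks for 15 hexagons in the x-direction instead of the default 100.
ax = df.plot.hexbin(x="IlliniScore", y="OpponentScore", gridsize=15)
```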
Example Walk-Throughs with Worksheets
Video 1: Histograms, Bar Charts, and Box Plots
Video 2: Anscombe's Quartet and Scatter Plots
Practice Questions
Q1: You have a DataFrame df with columns Name, GPA, ExamScore, and Major. How would you create the following scatter plot?
Q2: You have a DataFrame df with columns Name, GPA, Exam1Score, Exam2Score, and Major. How would you create a hexbin with 20 hexagons in the x direction, GPA on the y-axis, and Exam1Score on the x-axis?
Q3: What is an additional argument you could supply to df.plot.scatter(x = "Exam1Score", y = "GPA")?
Q4: What's wrong with the following Python command, which attempts to create a scatter plot: df.plot.scatter()
Q5: You have a DataFrame df with columns Name, GPA, Exam1Score, Exam2Score, and Major. How would you create a boxplot comparing Exam1Score to Exam2Score in Python?