MicroProject #1: Trends in High School GPA

In this MicroProject, you will do real data science in less than an hour, and you will add this MicroProject's card to your collection when you fully complete it! 🎉

Data Source: Common Data Set

The Common Data Set (CDS) is an annual report published by nearly every major college and university in the United States, with the goal of achieving the "development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item".

As part of the Common Data Set, universities report the "percentage of all enrolled, degree-seeking, first-time, first-year students [...] high school grade-point averages" in the following GPA ranges:

  • Percentage of incoming freshmen with a high school GPA of 4.00
  • Percentage of incoming freshmen with a high school GPA of 3.75 or greater
  • Percentage of incoming freshmen with a high school GPA of 3.50 or greater
  • ...and so on...

For example, the University of Wisconsin-Madison's 2024 CDS reports the following high school GPAs for the freshman class entering in Fall 2023:

| High School GPA | Percentage of Freshman Class |
|---|---|
| = 4.00 | 47.8% |
| $\ge$ 3.75 (includes 4.00s) | 83.8% |
| $\ge$ 3.50 | 95.9% |
| $\ge$ 3.25 | 98.8% |
| $\ge$ 3.00 | 99.7% |
| $\ge$ 2.50 | 100% |
| $\ge$ 2.00 | 100% |
| $\ge$ 1.00 | 100% |
| $\ge$ 0.00 | 100% |

In January 2025, we compiled all of the high school GPA data from the Common Data Sets published by all Big Ten universities and provide it as a dataset, cds-high-school-gpas.csv. In this MicroProject, you will nerd out with this data and explore any trends in the high school GPAs of freshmen at Big Ten schools. Let's nerd out! 🎉

Part 1: Importing the CDS Big Ten High School GPAs Dataset

You can find the cds-high-school-gpas.csv dataset at the following URL: https://waf-server-01.cs.illinois.edu/static/cds-high-school-gpas.csv

Load the dataset from this URL into a new DataFrame stored in a variable named df:
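As a minimal sketch of the loading technique: pandas' `read_csv` accepts a URL string, a local path, or a file-like object. To stay runnable offline, the example below parses a tiny inline CSV. Its column names and values are invented for illustration and are assumptions, not the real dataset's contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: these column names and values are
# made up for illustration and may not match cds-high-school-gpas.csv.
sample_csv = io.StringIO(
    "School,Year,Pct GPA 4.00\n"
    "Example University,2022,40.0\n"
    "Example University,2023,42.5\n"
)

# read_csv parses the text into a DataFrame, using the first row as the header.
df_sample = pd.read_csv(sample_csv)

# For the real dataset, the same call would take the URL given in this section:
# df = pd.read_csv("https://waf-server-01.cs.illinois.edu/static/cds-high-school-gpas.csv")
print(df_sample.shape)
```

Printing `df.columns` after loading the real file is a quick way to see the actual column names, which later parts depend on.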

Once you have the dataset loaded, you can display the DataFrame by placing the variable name on the last line of any Python cell. For example, running a cell that contains only the variable name df displays the DataFrame you just loaded.

Part 2: Exploring the Data

One of the first steps in Data Science is to understand your data and explore the dataset!

Part 2.1: Highest Percentage of Freshmen with 4.00 High School GPAs

First, find the ten rows that have the highest percentage of incoming freshmen with a 4.00 GPA. Store those ten rows in a new variable named df_highest400.

Helpful Tips:

  • You will need to reference Part 1 to understand the structure of the dataset to find the exact name of the column.
  • You can review the DISCOVERY page: "Row Selection with DataFrames" to find out how to select rows with the largest values.
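One common way to select the rows with the largest values in a column is `DataFrame.nlargest`. The sketch below uses toy data, and the column name "Pct 4.00" is a hypothetical placeholder rather than the dataset's real column name.

```python
import pandas as pd

# Toy data; "Pct 4.00" is a hypothetical column name, not the dataset's.
df_toy = pd.DataFrame({
    "School": ["A", "B", "C", "D", "E"],
    "Pct 4.00": [12.5, 47.8, 30.1, 8.0, 41.2],
})

# nlargest(n, column) returns the n rows with the largest values in that
# column, ordered from largest to smallest.
df_top3 = df_toy.nlargest(3, "Pct 4.00")
print(df_top3)
```

The same call with a different `n` and the real column name would apply to the full dataset.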
⚙️ Test Case: Highest Percentage of Freshmen with 4.00 High School GPAs

Part 2.2: Exploring UW-Madison Data

In the introduction, we previewed the Fall 2023 freshman class at the University of Wisconsin-Madison. To understand if there's a trend in the data, we want to select ALL years of data from the University of Wisconsin-Madison.

To do this, store the rows with data about the University of Wisconsin-Madison in a new variable named df_wisconsin:
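Selecting every row for one school is typically done with a boolean mask. The sketch below shows the pattern on toy data: the school names and the "Year" column are invented, and the exact spelling of each university in the real dataset has to be checked in the data itself.

```python
import pandas as pd

# Toy data with made-up school names; only the selection pattern matters here.
df_toy = pd.DataFrame({
    "School": ["Alpha University", "Beta College", "Alpha University"],
    "Year": [2022, 2022, 2023],
})

# The comparison yields a True/False value per row; indexing the DataFrame
# with that mask keeps only the rows where the value is True.
mask = df_toy["School"] == "Alpha University"
df_alpha = df_toy[mask]
print(df_alpha)
```

The comparison is an exact string match, so any difference in spelling or capitalization yields zero rows.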

⚙️ Test Case: Exploring UW-Madison Data

Part 2.3: Exploring Data From U-Michigan

An extremely helpful tool is to list ALL the unique values for a given column by using the command:

df["Column"].unique()
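As a small illustration of that command, the sketch below calls `.unique()` on a toy column with made-up values. It returns each distinct value once, in order of first appearance.

```python
import pandas as pd

# Toy data with invented school names.
df_toy = pd.DataFrame({
    "School": ["Alpha University", "Beta College", "Alpha University"],
})

# unique() returns an array of the distinct values in the column.
schools = df_toy["School"].unique()
print(schools)
```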

To list all of the different universities that appear in the dataset, we would need to find the unique values for the column School. Using the syntax above, list all of the unique Big Ten universities stored in the DataFrame df:

Once you have a list of all the University names, find the exact spelling and capitalization for The University of Michigan as it appears in the dataset. Then, just like in Part 2.2, select all the rows from the original DataFrame (df) that contain data about The University of Michigan.

Store your data in a new variable called df_michigan:

⚙️ Test Case: Exploring Data From U-Michigan

Part 2.4: Exploring Data From Michigan State

Finally, let's create a variable for one final school! Find all the rows about Michigan State and store those rows in a variable called df_michiganState:

⚙️ Test Case: Exploring Data From Michigan State

Part 3: Data Visualization

If you look back over the three DataFrames you created in Part 2.2 (Wisconsin), 2.3 (U-Michigan), and 2.4 (Michigan State), you can examine the tables for trends. However, it's often more impactful to see the data visually!

For any visualization, we generally need to specify:

  • What data do we want on the x-axis?
  • What data do we want on the y-axis?

Part 3.1: X-Axis Values

To understand a trend over time, it's common to visualize the year on the x-axis. To help create this visualization, find the name of the column in your DataFrame that contains the year and store it in the Python variable x_column below:

Part 3.2: Y-Axis Values

In addition, we need to decide which data we want to plot to show the trend over time.

  • Do we want to visualize only students who have a 3.00?
  • ...or those with at least a 3.50?
  • ...or a perfect 4.0?
  • ...or something else entirely?
  • We can choose any column in our dataset!

To start, find the name of the column that contains the percentage of all freshmen who have a high school GPA of 3.50 or greater. Store that column name in the variable y_column:
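Column names are plain strings, so storing one in a variable is an ordinary string assignment. The sketch below uses a toy DataFrame, and both column names in it are hypothetical. The real names should be copied exactly from `df.columns`.

```python
import pandas as pd

# Toy DataFrame; both column names are invented placeholders.
df_toy = pd.DataFrame({
    "Year": [2022, 2023],
    "Pct GPA >= 3.50": [90.1, 91.4],
})

# df.columns lists every column name exactly as it is spelled in the data.
print(list(df_toy.columns))

# Store the chosen names as strings (hypothetical values shown here).
x_column = "Year"
y_column = "Pct GPA >= 3.50"

# A stored name can then be used to select that column.
y_values = df_toy[y_column].tolist()
print(y_values)
```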

⚙️ Test Case: Axis Column Names

Part 3.3: Create the visualization!

In Module 2 of DISCOVERY (the very next module!), you will learn how to create a visualization -- but take a peek at the code below before you run it.

  • In the first line, we use the df_wisconsin DataFrame that you created in Part 2.2. This is how we source data about the high school GPAs of the freshman class at the University of Wisconsin-Madison.
  • Immediately after df_wisconsin, we use .plot.line to create a line graph with the data! The result of this line is stored in the variable ax, which is referenced later in the code to add modifications to the original line graph.
  • This is repeated in a very similar way in Line #2 and Line #3 below for df_michigan and df_michiganState. Note the ax= parameter in these lines, which adds each university's line to the original Wisconsin-Madison line graph stored in the variable ax.
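The pattern described in these bullets can be sketched as follows. This is not the course's provided cell, only a minimal illustration under assumptions: the DataFrames df_a, df_b, and df_c stand in for df_wisconsin, df_michigan, and df_michiganState, and the column names and numbers are invented toy values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical column names and toy values, not taken from the real dataset.
x_column = "Year"
y_column = "Pct GPA >= 3.50"
df_a = pd.DataFrame({x_column: [2021, 2022, 2023], y_column: [80.0, 82.0, 84.0]})
df_b = pd.DataFrame({x_column: [2021, 2022, 2023], y_column: [70.0, 71.0, 73.0]})
df_c = pd.DataFrame({x_column: [2021, 2022, 2023], y_column: [60.0, 64.0, 65.0]})

# Line 1: the first plot creates the graph and returns its Axes object.
ax = df_a.plot.line(x=x_column, y=y_column, label="School A")

# Lines 2 and 3: passing ax= draws each new line on that same Axes
# instead of creating a separate figure.
df_b.plot.line(x=x_column, y=y_column, label="School B", ax=ax)
df_c.plot.line(x=x_column, y=y_column, label="School C", ax=ax)

# The stored Axes can be modified afterwards, for example to label the y-axis.
ax.set_ylabel(y_column)
```

Without the `ax=` argument, each call would produce its own separate graph, which makes the schools harder to compare.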

You'll create line plots like this, along with dozens of other visualizations, when we cover Data Visualization in DISCOVERY in Module 2! However, for now, we've provided it for you. :)

When you're ready, run the code to create the visualization:

Part 3.4: Modify the Data Being Visualized

Finally, return to Part 3.2 and modify y_column to visualize the percentage of the freshman class that has a high school GPA of 3.75 or greater.

⚠️ - You must modify the y_column cell and re-run that cell AND generate a new graph by re-running the visualization cell to pass the next test case! - ⚠️

⚙️ Test Case: Data Visualization

Earn the MicroProject Collectable Card!

Congratulations on finishing the MicroProject! 🎉🎉

To validate your entire project, all of your code will run from top to bottom on this page and each test case will be validated one final time. If everything looks good, you'll earn the card for completing this MicroProject!