MicroProject #1: Trends in High School GPA
In this MicroProject, you will do real data science in less than an hour, and you will add this MicroProject's card to your collection when you fully complete it! 🎉
Data Source: Common Data Set
The Common Data Set (CDS) is an annual report published by nearly every major college and university in the United States. Its goal is the "development of clear, standard data items and definitions in order to determine a specific cohort relevant to each item".
As part of the Common Data Set, universities report the "percentage of all enrolled, degree-seeking, first-time, first-year students [...] high school grade-point averages" in the following GPA ranges:
- Percentage of incoming freshmen with a high school GPA of 4.00
- Percentage of incoming freshmen with a high school GPA of 3.75 or greater
- Percentage of incoming freshmen with a high school GPA of 3.50 or greater
- ...and so on...
For example, the University of Wisconsin-Madison's 2024 CDS reports the following high school GPAs for the freshman class entering in Fall 2023:
| High School GPA | Percentage of Freshman Class |
|---|---|
| =4.00 | 47.8% |
| $\ge$ 3.75 (includes 4.00s) | 83.8% |
| $\ge$ 3.50 | 95.9% |
| $\ge$ 3.25 | 98.8% |
| $\ge$ 3.00 | 99.7% |
| $\ge$ 2.50 | 100% |
| $\ge$ 2.00 | 100% |
| $\ge$ 1.00 | 100% |
| $\ge$ 0.00 | 100% |
In January 2025, we compiled the high school GPA data from the Common Data Sets published by all Big Ten universities and provide it as the dataset cds-high-school-gpas.csv. In this MicroProject, you will nerd out with this data and explore any trends in the high school GPAs of freshmen at Big Ten schools. Let's nerd out! 🎉
Part 1: Importing the CDS Big Ten High School GPAs Dataset
You can find the cds-high-school-gpas.csv dataset at the following URL: https://waf-server-01.cs.illinois.edu/static/cds-high-school-gpas.csv
Load the dataset at that URL into a new DataFrame stored in a variable named df:
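As a minimal sketch of the general technique (not the project's solution cell), `pd.read_csv` reads CSV data from a file path, a URL string, or a file-like object. The example below reads from an in-memory string so it runs offline. Only the School column name is confirmed by this page; the other headers (Year, GPA 4.00, GPA 3.75+, GPA 3.50+), the Year value, and the two "Example University" rows are hypothetical and invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file. The Wisconsin percentages come
# from the table in the introduction; every other header and value is invented.
csv_text = """School,Year,GPA 4.00,GPA 3.75+,GPA 3.50+
University of Wisconsin-Madison,2023,47.8,83.8,95.9
Example University A,2023,30.0,70.0,90.0
Example University B,2023,20.0,60.0,85.0
"""

# pd.read_csv accepts a file path, a URL string, or a file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

# In a notebook, a bare variable name on the last line displays the DataFrame.
toy_df
```

Passing a URL string in place of `io.StringIO(csv_text)` should work the same way, as long as the URL points directly at a CSV file.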
Once you have the dataset loaded, you can display the DataFrame by placing the variable name on the last line of any Python cell. For example, run the following cell that just contains the variable named df to see the DataFrame you just loaded:
Part 2: Exploring the Data
One of the first steps in Data Science is to understand your data and explore the dataset!
Part 2.1: Highest Percentage of Freshmen with 4.00 High School GPAs
First, find the ten rows that have the highest percentage of incoming freshmen with a 4.00 GPA. Store those ten rows in a new variable named df_highest400.
Helpful Tips:
- You will need to reference Part 1 to understand the structure of the dataset to find the exact name of the column.
- You can review the DISCOVERY page: "Row Selection with DataFrames" to find out how to select rows with the largest values.
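One common pandas approach to selecting the rows with the largest values is `DataFrame.nlargest`, sketched below on a toy table. This may or may not be the method the DISCOVERY page teaches. The column name "Pct 4.00" and all values are hypothetical; the real dataset's column names have to be read from the DataFrame itself.

```python
import pandas as pd

# Toy data: school labels and percentages are invented for illustration.
toy_df = pd.DataFrame({
    "School": ["A", "B", "C", "D"],
    "Pct 4.00": [12.5, 40.0, 33.1, 8.0],  # hypothetical column name
})

# Keep the 2 rows with the largest values in the chosen column,
# returned in descending order of that column.
toy_top2 = toy_df.nlargest(2, "Pct 4.00")
print(toy_top2)
```

Changing the first argument changes how many rows are kept.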
⚙️ Test Case: Highest Percentage of Freshmen with 4.00 High School GPAs
Part 2.2: Exploring UW-Madison Data
In the introduction, we previewed the Fall 2023 freshman class at the University of Wisconsin-Madison. To understand if there's a trend in the data, we want to select ALL years of data from the University of Wisconsin-Madison.
To do this, store the rows with data about the University of Wisconsin-Madison in a new variable named df_wisconsin:
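Selecting every row for one school is typically done with a boolean mask: compare a column against a value, then use the resulting True/False Series to index the DataFrame. The sketch below uses a toy table with invented school names and a hypothetical Year column.

```python
import pandas as pd

# Toy data with two schools across two years (all values invented).
toy_df = pd.DataFrame({
    "School": ["Example University A", "Example University B",
               "Example University A", "Example University B"],
    "Year": [2022, 2022, 2023, 2023],  # hypothetical column name
})

# The comparison produces one True/False value per row.
mask = toy_df["School"] == "Example University A"

# Indexing with the mask keeps only the rows where it is True.
toy_a = toy_df[mask]
print(toy_a)
```

The comparison is an exact, case-sensitive string match, so the value must be spelled exactly as it appears in the column.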
⚙️ Test Case: Exploring UW-Madison Data
Part 2.3: Exploring Data From U-Michigan
An extremely helpful tool is to list ALL the unique values for a given column by using the command:
df["Column"].unique()
To list all of the different universities that appear in the dataset, we would need to find the unique values for the column School. Using the syntax above, list all of the unique Big Ten universities stored in the DataFrame df:
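To illustrate what `.unique()` returns, here is a small sketch on a toy School column. The school names are invented and are not the spellings used in the real dataset.

```python
import pandas as pd

# Toy column with repeated values (names invented for illustration).
toy_df = pd.DataFrame({
    "School": ["Example University A", "Example University B",
               "Example University A", "Example University C"],
})

# .unique() returns each distinct value once, in order of first appearance.
toy_names = toy_df["School"].unique()
print(toy_names)
```

Reading this output is a reliable way to copy a value's exact spelling and capitalization before filtering on it.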
Once you have a list of all the University names, find the exact spelling and capitalization for The University of Michigan as it appears in the dataset. Then, just like in Part 2.2, select all the rows from the original DataFrame (df) that contain data about The University of Michigan.
Store your data in a new variable called df_michigan:
⚙️ Test Case: Exploring Data From U-Michigan
Part 2.4: Exploring Data From Michigan State
Finally, let's create a variable for one final school! Find all the rows about Michigan State and store those rows in a variable called df_michiganState:
⚙️ Test Case: Exploring Data From Michigan State
Part 3: Data Visualization
If you look back over the three DataFrames you created in Part 2.2 (Wisconsin), 2.3 (U-Michigan), and 2.4 (Michigan State), you can examine the tables for trends. However, it's often more impactful to see the data visually!
For any visualization, we generally need to specify:
- What data do we want on the x-axis?
- What data do we want on the y-axis?
Part 3.1: X-Axis Values
To understand a trend over time, it's common to visualize the year on the x-axis. To help create this visualization, find the column name in your DataFrame that contains the year of the data and store it in the Python variable x_column below:
Part 3.2: Y-Axis Values
In addition, we need to decide which data we want to visualize over time.
- Do we want to visualize only students who have a 3.00?
- ...or those with at least a 3.50?
- ...or a perfect 4.0?
- ...or something else entirely?
- We can choose any column in our dataset!
To start, find the name of the column that contains the percentage of all freshmen who have a high school GPA of 3.50 or greater. Store that column name in the variable y_column:
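A column name is just a Python string, so it can be stored in a variable and reused wherever a column is needed. The sketch below lists a toy DataFrame's column names and then selects a column through a variable. The names "Year" and "Pct 3.50+" are hypothetical and are not the real dataset's headers.

```python
import pandas as pd

# Toy data with hypothetical column names and invented values.
toy_df = pd.DataFrame({
    "Year": [2021, 2022],
    "Pct 3.50+": [90.0, 91.5],
})

# List every column name exactly as pandas stores it.
print(list(toy_df.columns))

# Store a column name as a string, then use the variable to select the column.
toy_y = "Pct 3.50+"
toy_values = toy_df[toy_y]
print(toy_values)
```

Keeping the name in a variable means a later cell can switch to a different column by changing a single line.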
⚙️ Test Case: Axis Column Names
Part 3.3: Create the visualization!
In Module 2 of DISCOVERY (the very next module!), you will learn how to create a visualization -- but take a peek at the code below before you run it.
- In the first line, you will find we use your df_wisconsin that you created in Part 2.2. This is how we source data about the high school GPAs of the freshman class at The University of Wisconsin.
- Immediately after df_wisconsin, we use .plot.line to create a line graph with the data! The result of this line is stored in the variable ax, which is referenced later in the code to add modifications to the original line graph.
- This is repeated in a very similar way in Line #2 and Line #3 below for df_michigan and df_michiganState. Note the ax= parameter in these lines, which adds each university's line to the original Wisconsin-Madison line graph stored in the variable ax.
You'll create line plots like this, along with dozens of other visualizations, when we cover Data Visualization in DISCOVERY in Module 2! However, for now, we've provided it for you. :)
When you're ready, run the code to create the visualization:
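For reference, the pattern described in the bullets above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the project's provided cell: the three toy DataFrames, the column names "Year" and "Pct 3.50+", the labels, and all values are invented. It needs matplotlib installed, since pandas uses it for plotting.

```python
import pandas as pd

# Three toy DataFrames standing in for the per-school DataFrames (values invented).
toy_a = pd.DataFrame({"Year": [2021, 2022, 2023], "Pct 3.50+": [90.0, 92.0, 95.0]})
toy_b = pd.DataFrame({"Year": [2021, 2022, 2023], "Pct 3.50+": [85.0, 86.5, 88.0]})
toy_c = pd.DataFrame({"Year": [2021, 2022, 2023], "Pct 3.50+": [70.0, 73.0, 75.5]})

toy_x = "Year"        # hypothetical x-axis column name
toy_y = "Pct 3.50+"   # hypothetical y-axis column name

# The first call creates the line graph and returns its matplotlib Axes object.
ax = toy_a.plot.line(x=toy_x, y=toy_y, label="School A")

# Passing ax=ax draws each additional line on that same graph.
toy_b.plot.line(x=toy_x, y=toy_y, ax=ax, label="School B")
toy_c.plot.line(x=toy_x, y=toy_y, ax=ax, label="School C")

# The stored Axes can be modified after the lines are drawn.
ax.set_ylabel(toy_y)
```

Without the `ax=` argument, each call would create its own separate figure instead of sharing one.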
Part 3.4: Modify the Data Being Visualized
Finally, return to Part 3.2 and modify the y_column to visualize the percentage of the freshman class that has a high school GPA of 3.75 or greater.
⚠️ - You must modify the y_column cell and re-run that cell AND generate a new graph by re-running the visualization cell in order to pass the next test case! - ⚠️
⚙️ Test Case: Data Visualization
Earn the MicroProject Collectable Card!
Congratulations on finishing the MicroProject! 🎉🎉
To validate your entire project, all of your code will run from top to bottom on this page and each test case will be validated one final time. If everything looks good, you'll earn the card for completing this MicroProject: