Data Science MicroProjects

MicroProjects are detailed but small projects designed to do real Data Science in under an hour!


MicroProject: Image Steganography with DataFrames

Steganography describes the technique of hiding data within secondary, usually ordinary, data to avoid detection. For example, an ordinary PNG image might look like a picture to us -- but, hidden inside of the image, is a special encoding that reveals hidden data that otherwise goes undetected.

In this MicroProject, you will explore steganography by decoding a message secretly hidden in an image just for you. Let's nerd out!

MicroProject: NCAA March Madness

The NCAA March Madness tournament is an annual college basketball event in the United States that determines the national champion among Division I men's and women's teams. It is a single-elimination tournament with 64 teams. In 2024, UIUC is a 3 seed, ranked 13 overall out of the 64 teams!

A bracket in the March Madness tournament visually represents the matchups between teams in each round of the single-elimination competition. Fans can fill out their own brackets with predictions before the tournament starts, attempting to forecast the outcomes of all games, including which team will ultimately win the championship.

MicroProject: Infinite Money in Roulette (Martingale Betting System)

The "Martingale Betting System" is a specific strategy that, when applied to the game of Roulette, involves the following actions:
  • You will initially start by betting a small amount (ex: $1.00) on red.
  • If you win (the wheel lands on red), you will win your bet. You have made $1.00 and you can repeat again with $1.00.
  • Each time you lose, you double your bet (initially, after one loss, a $2.00 bet; then $4.00; then $8.00; and so on) until you win.

Using this strategy, every winning bet recovers all previous losses and ALWAYS nets a win of $1.00. (Ex: A $4.00 bet that wins only happens after a $1.00 and $2.00 loss, still resulting in a +$1.00 net win.)

Therefore, mathematically, this betting strategy will result in an infinitely increasing amount of money so long as you play the game long enough! In this MicroProject, we will explore this claim and use simulation to play Roulette using the Martingale Betting System. Let's nerd out! :)
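The betting rules above can be sketched as a short simulation. This is only a minimal sketch, not the MicroProject's own solution: the function name `play_martingale`, the American-wheel odds of 18 red pockets out of 38, and the unlimited bankroll are all assumptions made for illustration.

```python
import random


def play_martingale(rounds, base_bet=1.00, p_red=18 / 38, seed=None):
    """Simulate `rounds` spins, always betting on red with the Martingale system.

    Assumes an American wheel (18 red pockets out of 38) and an unlimited
    bankroll. Returns the running net winnings after each spin and the
    largest single bet that was placed.
    """
    rng = random.Random(seed)
    net = 0.0
    bet = base_bet
    largest_bet = bet
    history = []
    for _ in range(rounds):
        largest_bet = max(largest_bet, bet)
        if rng.random() < p_red:
            # Win: collect the bet, then start over with the base bet.
            net += bet
            bet = base_bet
        else:
            # Loss: lose the bet, then double it for the next spin.
            net -= bet
            bet *= 2
        history.append(net)
    return history, largest_bet


if __name__ == "__main__":
    history, largest_bet = play_martingale(1_000, seed=2024)
    print("net winnings after 1,000 spins:", history[-1])
    print("lowest point reached:", min(history))
    print("largest single bet needed:", largest_bet)
```

Looking at the lowest point and the largest single bet, and not only the final total, is one way to start testing the "infinite money" claim.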

MicroProject: Dungeons & Dragons

Dungeons & Dragons is a tabletop role-playing game in which you and your friends play a group of adventurers and seek to slay monsters, meet NPCs, and become more powerful in a fantasy world. One of the most recognizable aspects is the dice used to determine the outcome of events. The dice are designated by "d" followed by the number of sides. In this MicroProject, you will create simulations of dice rolls and explore the statistics of the rolls.
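A minimal sketch of a dice-roll simulation is shown here. The helper names `roll` and `simulate_totals` are hypothetical, and the optional leading count (as in "2d6" for two six-sided dice) is a common extension of the "d" notation that is assumed here rather than taken from the description above.

```python
import random
from collections import Counter
from statistics import mean


def roll(notation, rng=random):
    """Roll dice written as 'd20' or '2d6' and return the individual rolls."""
    count_text, sides_text = notation.lower().split("d")
    count = int(count_text) if count_text else 1  # 'd20' means one die
    sides = int(sides_text)
    return [rng.randint(1, sides) for _ in range(count)]


def simulate_totals(notation, trials, seed=None):
    """Roll `notation` `trials` times and return the total of each roll."""
    rng = random.Random(seed)
    return [sum(roll(notation, rng)) for _ in range(trials)]


if __name__ == "__main__":
    totals = simulate_totals("2d6", 10_000, seed=1)
    # The theoretical mean of 2d6 is 7; the simulated mean should be close.
    print("simulated mean:", mean(totals))
    print("frequency of each total:", sorted(Counter(totals).items()))
```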

MicroProject: Simulation for Ten Heads in a Row

Simulation is a powerful tool that allows us to run an event with a probabilistic outcome millions of times in under a second. In this MicroProject, you will use simple simulation to flip a coin a million times and discover how to find trends in the simulated data. After writing this simulation, you will do analysis that compiles data over multiple observations -- a simple form of time-series analysis -- to find out whether the theoretical probability of an event matches the simulated probability.
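One way to sketch this comparison uses only the standard library. The function names are hypothetical, and counting every overlapping window of ten flips is an assumption about how "ten heads in a row" is measured; the MicroProject itself may define it differently.

```python
import random


def flip_coins(n_flips, seed=None):
    """Return a list of fair coin flips, where True means heads."""
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(n_flips)]


def count_heads_runs(flips, run_length=10):
    """Count the flips at which the previous `run_length` flips were all heads.

    Windows overlap, so 12 heads in a row contain 3 runs of length 10.
    """
    count = 0
    streak = 0
    for is_heads in flips:
        streak = streak + 1 if is_heads else 0
        if streak >= run_length:
            count += 1
    return count


def simulated_probability(flips, run_length=10):
    """Fraction of length-`run_length` windows that are all heads."""
    windows = len(flips) - run_length + 1
    return count_heads_runs(flips, run_length) / windows


if __name__ == "__main__":
    flips = flip_coins(1_000_000, seed=42)
    print("simulated:  ", simulated_probability(flips))
    print("theoretical:", 0.5 ** 10)  # 1/1024 for any single window
```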

MicroProject: Bechdel Test 🎥

The Bechdel Test, or Bechdel-Wallace Test, is a simple way of measuring the representation of women in a film or other work of fiction. To pass the Bechdel Test, a work must meet all three criteria: (1) the work must have at least two women in it, (2) who talk to each other, (3) about something other than a man.

The test was popularized by Alison Bechdel's comic, in a 1985 strip called "The Rule". The website BechdelTest.com provides a searchable database of films and their Bechdel Test results, allowing users to explore and analyze patterns in gender representation in cinema.

MicroProject: Choropleth Maps from DataFrames 🗺️

Geographical data visualizations are some of the most impactful forms of visualization since they make it easy for users to locate places familiar to them. One popular geographical visualization is a choropleth map -- a visualization of data on a map where geographical regions are shaded to visually encode data about the region as a whole. For example, population density maps and per-capita income maps are common choropleth maps.

In this MicroProject, you will learn about the folium Python library to create choropleth maps from a DataFrame! Let's nerd out! :)

MicroProject: DataFrame of Your NWS Weather Forecast 🌩️

The National Weather Service allows, for free, *"developers access to critical forecasts, alerts, and observations, along with other weather data."* In this MicroProject, you will explore using the [weather.gov API service](https://www.weather.gov/documentation/services-web-api) to get the weather for your location.

MicroProject: Exploring COVID-19 Data from GitHub

Since before COVID-19 was detected in the United States, the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University has provided daily updates of COVID-19 case data as clean, structured CSV files on GitHub as a free public service to the world. In this MicroProject, you will explore how to find a dataset on GitHub and use that for Data Science analysis!

MicroProject: Valentine's Day 💗

The NRF (National Retail Federation) is the world's largest retail trade association. Its members include department stores, specialty, discount, catalog, Internet, and independent retailers, chain restaurants, grocery stores, and multi-level marketing companies. For over a decade, NRF has annually surveyed consumers about how they plan to celebrate Valentine’s Day. The survey covers consumer spending, gifts purchased, and more!

MicroProject: Highest Mountains in the World ⛰️

Wikipedia is an absolutely amazing source of information about almost every topic you can imagine! In this MicroProject, you will explore how to easily use data in Wikipedia tables as datasets, and perform row selection based on the contents of strings in your DataFrame.

MicroProject: Illini Football

The [University of Illinois' Fighting Illini Historical Football Scores Dataset](https://github.com/wadefagen/datasets/tree/master/illini-football) contains a "collection of final scores of every known Fighting Illini football game since 1892, with data on location, homecoming, and national bowl games." In this MicroProject, you will explore the history of Illini football games through row selection, data groups, and creating basic data visualization.

MicroProject: United States Congress 🏛️

The @unitedstates project (https://theunitedstates.io/) maintains various high-quality datasets about the United States government. Specifically, the `congress-legislators` dataset contains every member "of the United States Congress (1789-Present), congressional committees (1973-Present), committee membership (current only), and presidents and vice presidents of the United States in YAML, JSON, and CSV format."

MicroProject: World University Rankings

There are hundreds of organizations that rank universities, including US News and World Report, QS World University Rankings, Times Higher Education (THE), and many others.

Times Higher Education (THE) provides a clean, well-documented CSV of its rankings, which are based on "performance data on universities for students and their families, academics, university leaders, governments and industry". The 2020 dataset includes almost 1,400 universities across 92 countries and 13 performance indicators that measure an institution’s performance across teaching, research, knowledge transfer and international outlook. Additional details on this dataset are available on their website: https://www.timeshighereducation.com/content/world-university-rankings

In this MicroProject, you will explore basic DataFrame operations on the Times Higher Education university rankings.

MicroProject: AI versus Human: Response Analysis

Detecting whether something was generated by a human or by AI can be difficult. This MicroProject explores this topic by looking at 100 identical questions answered by both ChatGPT and a human.

Hugging Face is a company that has developed a platform for natural language processing (NLP) applications. They have created and shared a large collection of pre-trained models, datasets, and learning resources which are open-source and available for the public to use. Using Hello-SampleAI from their HC3 dataset, this MicroProject explores the length of responses, sentiment score, and subjectivity scores of the responses.

MicroProject: Building a Scene Recognition Model from Video Frames

Visual images are an important part of all media, and Data Scientists often use images as data sources. In this MicroProject, you will create a simple model to detect the amount of time spent in two different "scenes" we used when creating office-hour style videos for Data Science DISCOVERY. To do this, you will learn how to import an entire folder of images, perform image analysis, and create your own model without using a pre-built library.

MicroProject: FIFA World Cup

The FIFA World Cup is a global football (soccer) competition contested by the senior men's national teams which occurs every 4 years. It is likely the most popular sporting event in the world, drawing billions of television viewers every tournament. This MicroProject explores thousands of football matches, with a specific focus on World Cup games.

MicroProject: Open Policing Project

The Stanford Open Policing Project is an initiative by the Stanford Computational Journalism Lab to collect, analyze, and provide access to traffic stop data from across the United States. According to the Open Policing Project, *"on a typical day in the United States, police officers make more than 50,000 traffic stops"*. This MicroProject looks only at the data from the state of Illinois and explores one possible racial disparity among traffic stops in Champaign-Urbana.

MicroProject: Custom Discrete Distribution in Python

In statistics and data science, random variables are used to model events that have uncertain outcomes. For example, in DISCOVERY, we explore the binomial distribution to model flipping a coin, drawing from a deck of cards, guessing on a multiple choice exam, and many other events with a single, fixed probability of success. However, what if there are multiple different outcomes? This MicroProject will explore creating custom discrete distributions in Python to model complex events!
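As a minimal sketch of the idea, a custom discrete distribution can be written as a list of outcomes and a matching list of probabilities. The outcome labels, the probabilities, and the helper names below are made up for illustration and do not come from the MicroProject.

```python
import random
from collections import Counter


def sample(outcomes, probabilities, n, seed=None):
    """Draw `n` independent samples from a custom discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    rng = random.Random(seed)
    return rng.choices(outcomes, weights=probabilities, k=n)


def expected_value(values, probabilities):
    """Expected value of a numeric discrete random variable."""
    return sum(v * p for v, p in zip(values, probabilities))


if __name__ == "__main__":
    # Hypothetical example: three outcomes with unequal probabilities.
    outcomes = ["common", "uncommon", "rare"]
    probabilities = [0.70, 0.25, 0.05]

    draws = sample(outcomes, probabilities, 10_000, seed=7)
    for outcome, count in Counter(draws).most_common():
        print(outcome, count / len(draws))
```

With enough draws, the simulated frequencies should land close to the probabilities that define the distribution.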

MicroProject: United Nations (UNHCR) Refugee Data

The United Nations High Commissioner for Refugees (UNHCR) is a United Nations agency mandated to aid and protect refugees, forcibly displaced communities, and stateless people, and to assist in their voluntary repatriation, local integration or resettlement to a third country.

The UNHCR has a database of refugees and internally displaced persons (IDPs) around the world. The data is updated daily and includes information on the number of refugees and IDPs, the countries they are from, the countries they are in, and the number of people who have been displaced by conflict or natural disasters.