# Welcome to Data Science Discovery!

Data Science Discovery is an open-access data science resource created by The University of Illinois and used as the basis for STAT/CS/IS 107: Data Science Discovery and several other courses. Our mission is to create the most valuable Data Science resource available, both for students at The University of Illinois and for any other learner online.

You will find hundreds of pages of high-quality content, primarily divided up into four sections:

- **Data Science Lessons** each contain a deep dive into a core Data Science topic, divided up into six modules. Every lesson contains conversational "office hour"-style lectures with Professors Wade Fagen-Ulmschneider and Karle Flanagan, written explanations, example worksheets, practice questions, and more!
- **Data Science Guides** are short, solution-focused examples of common tasks in Data Science. We create several new guides each week, so there is constantly something new!
- **Data Science Datasets** are clean, documented, and relevant datasets for Data Science. As we use a new dataset in any of our courses or research, we add the dataset here for you to use!
- **Data Science MicroProjects** are guided, detailed projects that provide "micro" exploration of a new dataset. Each MicroProject is designed to give you a real data science experience in Python in under an hour!

## Guides

**Data Science Guides** are our short, solution-focused examples of common tasks in Data Science. Here are a few of the latest:

- 3 Ways to Calculate the RMSE in Python (May 31, 2024)
- Simulations in Python (May 24, 2024)
- Python Data Types (May 17, 2024)
- Removing Duplicates from a DataFrame using pandas (April 26, 2024)
- View All Guides >>

## μProjects

- FIFA World Cup (April 26, 2024)
- Boston Marathon (April 22, 2024)
- Building a Scene Recognition Model from Video Frames (April 19, 2024)
- View All MicroProjects >>

## Datasets

**Data Science Datasets** are clean, documented, and relevant datasets for Data Science. Here are the latest datasets we've been nerding out with:

- Perception of Probability Words Dataset (August 17, 2022)
- Perception of People at a Party (Fixed Size Dataset) (August 17, 2022)
- GPA Dataset (Spring 2010 through Spring 2020) (August 17, 2022)
- Illini Football Dataset (1892-2020) (August 17, 2022)
- View All Datasets >>

## Learn Data Science!

Begin learning Data Science by diving into the six "badges", each with many sections that explore individual topics:

### Module 1: Basics of Data Science with Python

"Basics of Data Science with Python" provides a strong introduction to the field of Data Science. You will understand best practices in designing good, great, and ideal experiments, use Python to load data into a DataFrame, and manipulate DataFrames in Python to explore subsets of data.

- `1-00` What is Data Science?
- `1-01` Types of Data
- `1-02` Experimental Design and Blocking
- `1-03` Python for Data Science: Introduction to DataFrames
- `1-04` Row Selection with DataFrames
- `1-05` Observational Studies, Confounders, and Stratification
- `1-06` Simpson's Paradox
- `1-07` DataFrames with Conditionals
- `1-08` Software Version Control with git

### Module 2: Exploratory Data Analysis

"Exploratory Data Analysis" teaches the tools and techniques you need to begin exploratory data analysis on real-world datasets. You will learn several methods of analyzing statistical properties of the data and how to calculate and apply these properties using Python. Finally, you will create simple data visualizations showing an overview of the data.

- `2-01` Exploratory Data Analysis Overview
- `2-02` Descriptive Statistics
- `2-04` Grouping Data in Python
- `2-05` Histograms
- `2-06` Quartiles and Box Plots
- `2-07` Basic Data Visualization in Python

### Module 3: Simulation and Distributions

"Simulation and Distributions" provides an exploration into the world of computer simulations. Beginning with simulating simple events, like rolling a die where the expected outcome is known, you gradually build increasingly complex simulations. You will find many simulations result in common distributions, such as the Normal Distribution, which you will learn has many interesting properties all its own.

- `3-01` Overview of Simulation
- `3-02` Random Numbers in Python
- `3-03` For-Loops in Python
- `3-04` Simple Simulations in Python
- `3-05` Sample Space
- `3-06` Conditionals in Python
- `3-07` Functions in Python
- `3-08` Normal Distribution
- `3-09` Law of Large Numbers
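
To give a flavor of the simple simulations this module starts with, here is a minimal sketch of rolling a fair six-sided die many times and comparing the average to the known expected value of 3.5. The function name `simulate_die_rolls`, the seed, and the number of rolls are illustrative assumptions and are not taken from the lessons themselves.

```python
import random


def simulate_die_rolls(n_rolls, seed=None):
    """Return a list of n_rolls simulated rolls of a fair six-sided die.

    Hypothetical helper for illustration only; the lessons may structure
    their simulations differently.
    """
    rng = random.Random(seed)  # a seeded generator keeps the run repeatable
    return [rng.randint(1, 6) for _ in range(n_rolls)]


# Roll the die 10,000 times and compare the average to the expected 3.5.
rolls = simulate_die_rolls(10_000, seed=107)
average = sum(rolls) / len(rolls)
print(f"Average of {len(rolls)} rolls: {average:.3f} (expected value: 3.5)")
```

With more rolls, the average tends to settle closer to 3.5, which is the idea explored in the Law of Large Numbers lesson.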

### Module 4: Prediction and Probability

"Prediction and Probability" begins with a deep dive into probability and using probabilities to make informed predictions about future events. You will complete dozens of problems on basic probability, explore how to describe dependent probabilistic events, and use Python to make predictions under uncertainty.

- `4-01` Probability Introduction
- `4-02` The Monty Hall Problem
- `4-03` Multi-event Probability: Multiplication Rule
- `4-04` Multi-event Probability: Addition Rule
- `4-05` Conditional Probability
- `4-06` Bayes' Theorem

### Module 5: Polling, Confidence Intervals, and the Normal Distribution

"Polling, Confidence Intervals, and the Normal Distribution" starts with an exploration of different sampling techniques. You will learn how bias and sampling variability can affect the results of surveys. From there, you will learn how to use expectation and inference as a way to make predictions and decisions under uncertainty.

- `5-01` Random Variables
- `5-02` Bernoulli & Binomial Random Variables
- `5-03` Python Functions for Random Distributions
- `5-04` Central Limit Theorem
- `5-05` Polling and Sampling
- `5-06` Confidence Intervals
- `5-07` Hypothesis Testing

### Module 6: Towards Machine Learning

"Towards Machine Learning" applies all of the foundational knowledge developed in the previous modules to modern techniques that help computers discover similarities in data and predict future outcomes based on previously seen events. Completion of this and all other modules provides you with the ability to advance to dedicated machine learning courses.

- `6-01` Overview of Machine Learning
- `6-02` Correlation
- `6-03` Linear Regression
- `6-04` Machine Learning Models in Python with sk-learn
- `6-05` Clustering
- `6-06` Towards Machine Learning in Python

### Data Science Exploration

At The University of Illinois, the second-semester course in the core data science sequence is Data Science Exploration (STAT 207). Data Science Exploration builds on the foundations developed here in DISCOVERY.