Project 2: You and Data Science (Part I)

Due: Monday, November 17th at 11:59pm

Throughout this semester, you have been growing into an amazing Data Scientist! You have seen dozens of datasets that we provided and used Data Science to analyze them. For the final project, we want you to teach us something -- we want to learn about something you are passionate about!

This project is going to be broken into two parts:

Dataset and Exploratory Data Analysis (Part I).
Data Science (Part II).

For this final project in Data Science DISCOVERY, you will use Data Science to explore something you are passionate about or interested in learning more about.

For Part I of You and Data Science, you have to

Set up your workspace and create your own Jupyter notebook.
Find a dataset.
Perform Exploratory Data Analysis (EDA) on that dataset.

Setting Up Your Project Workspace

To complete this project, there is no starter code or starter files -- you are building it from scratch!

However, we do want to nerd out with your work so we need you to place it in a specific spot in your stat107 directory so you can turn it in and so that we can find it:

In your stat107/netid directory (the directory that contains all of your labs, extra credit microprojects, etc.), create a new folder called project2.
You will want to complete ALL your work related to project2 in your project2 folder (directory).
At the end, you'll turn in your whole folder. :)

Dataset

The idea of this project is that you will use a dataset you are passionate about. It can be anything -- a dataset from another class (e.g., any data you were given in Excel), a dataset you found online, or a dataset you gather yourself. Some ideas include but are not limited to:

A dataset about a hobby you are interested in (e.g., vacation destinations, best beaches, fashion trends, Instagram, music, etc.)
A dataset about something you enjoy doing or watching (e.g., swimming, volleyball, Rocket League, Illini Football, etc.)
A dataset about a topic related to your major (economics, communications, political science, etc.)
Any dataset that means something to you.

IMPORTANT: Keep in mind that the dataset that you select for Part I of You and Data Science will also be used for Part II.

Once you find a dataset you can either download it and move it into your stat107/netid/project2 folder, or use a URL to read it in. You can use any dataset as long as the following requirements are met.

You must use a non-trivial dataset. The dataset must have at least 200 data points (this could be 20 rows with 10 columns, 50 rows with 4 columns, etc).
The dataset you use must NOT be a dataset we have used in class or lab (details on how to find datasets are in the "Online Data Sources" section below).
You must create your own Python notebook. You will turn in both code and analysis, and the exact questions you answer and code you write will be up to you!
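
As a rough sketch of how you might check the 200-data-point requirement once your data is loaded: the example below builds a small made-up CSV in memory (the column names and values are hypothetical stand-ins for your own file) and counts rows times columns with pandas. In your notebook you would instead pass your own file name or URL to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for your own dataset. In your notebook you would
# call pd.read_csv("your_file.csv") or pd.read_csv("https://...") instead.
csv_text = "player,team,points,assists\n" + "\n".join(
    f"player{i},team{i % 5},{10 + i},{i % 7}" for i in range(60)
)
df = pd.read_csv(io.StringIO(csv_text))

# DataFrame.shape is (rows, columns); DataFrame.size is rows * columns,
# which is the number of data points.
rows, cols = df.shape
data_points = df.size

print("Rows:", rows, "Columns:", cols)
print("Column names:", list(df.columns))
print("Data points:", data_points)
print("Meets the 200-point requirement:", data_points >= 200)
```

The same `shape` and `columns` calls are also a reasonable starting point for describing the characteristics of your dataset in Section 1 of the notebook.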

With students from so many different majors in Data Science DISCOVERY, we are excited for everything we are going to learn from you! :)

Online Data Sources

The best data is data that you personally care about. This may be data from a club you are part of or data about something you're passionate about that you already have available.

If you have no datasets at all, here are several websites that many people use as sources for datasets:

Government-Provided Datasets

City of Chicago Data Portal, https://data.cityofchicago.org/
State of Illinois Data Portal, https://data.illinois.gov/
U.S. Government's Open Data, https://data.gov/

Collections of University-Provided Datasets

UIUC Division of Management Information (DMI) Student Enrollment Data, https://www.dmi.illinois.edu/stuenr/
UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/
Stanford Large Network Dataset Collection, http://snap.stanford.edu/data/index.html

Third-Party Data Sources

(Note: These sites listed here are generally third-party data providers. This means that they do not collect the data themselves, but simply pass on data that others have provided. Some datasets provided may be high-quality and trusted, others may be completely made up data.)

BigML Blog List of Datasets, http://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
Kaggle, http://www.kaggle.com/
League of Legends Match Data Downloads, https://oracleselixir.com/tools/downloads
Sports Reference, https://www.sports-reference.com/

Project Notebook

The major deliverable for this project is a notebook of your analysis and summaries of your findings.

To create your notebook:

Open Visual Studio Code and then choose "File -> Open Folder". Go into your project2 folder you created earlier (Desktop -> stat107 -> netid -> project2).
Once you have the project2 space open inside of Visual Studio Code, choose "File -> New File..."
In the options dialog, choose Jupyter Notebook. You now have a blank notebook.
We recommend immediately saving it and calling it project2.ipynb.

Jupyter Notebook Format

You will need to add a combination of "Code" (Python) and "Markdown" cells to complete your notebook. You can hover your mouse below each cell to see the + Code and + Markdown options to add new blocks of certain types.

Code blocks are used for Python programming. Everything in a Code block will be read as Python.
Markdown blocks are used for writing. Everything in a Markdown block will be read as Markdown (formatted text).

You can learn about the options available for Markdown on the "Basic Syntax" guide for Markdown or any other source for Markdown documentation.

Deliverable

The Jupyter notebook is your only deliverable. The requirements are:

You must have four sections in your notebook. Each section MUST start with a clearly identifiable Markdown cell that contains a "Header 1" of your current section. (See the "Basic Syntax" guide for Markdown to understand what "Header 1" means in Markdown. Make sure you do this correctly! You will lose points if you don't.)
The four sections for Part I must be:
- Section 1: Dataset: In Markdown, explain what dataset you chose and why you chose it. Include why it is meaningful to you, how you went about finding it, and what you want to discover by using Data Science on this specific dataset. Then, in Python, load your dataset into a DataFrame and write a few simple commands to see the characteristics of the dataset (e.g., number of rows and columns, names of columns, etc.).
  
  REMINDER: Keep in mind that the dataset that you select for Part I of You and Data Science will also be used for Part II.
- Section 2: Exploratory Data Analysis: In Markdown, explain which descriptive statistics can help you give a broad overview of the data (e.g., measures of center, measures of spread, size (rows/columns), other interesting descriptive statistics, etc.). In Python, do this exploratory data analysis. Then report your observations about your results in Markdown.
  
  Reference Labs: lab_intro, lab_pandas, lab_exp_design, lab_simpsons_paradox
- Section 3: Exploratory Data Visualization: In Python, create data visualizations using plots. These plots should help you tell a story about your data and be graphs that are meaningful and easy to understand. Specifically:
  - Your graph must be easy to read and inform a reader about something specific to your dataset. It cannot be a random plot of meaningless data.
  - Your graph must use the title, xlabel, and ylabel parameters to show a title, an x-axis label, and a y-axis label.
  - You may have multiple graphs. We prefer fewer high-quality graphs over many low-quality graphs.
  - Finally, in Markdown, provide a summary of what your visualizations show.
    
    Reference Labs: lab_plots, lab_gpa
- Section 4: Planning for Part II: In Markdown, write about:
  - Some observations that you've made throughout your Exploratory Data Analysis about your dataset.
  - Some ideas about the specific questions you'd like to analyze and explore in Part II of your project. You do not have to explain or provide the methods that you are going to use, but your plans should at least make sense and be non-trivial. You are not fully committing to these ideas for Part II, but they can be the starting point of your Data Science section of the project.
    Reference: You can get ideas of what we'll see later in the course by checking out Modules 5 and 6 of DISCOVERY (https://discovery.cs.illinois.edu/learn/)
Your audience is going to be Prof. Wade, Prof. Karle, and/or your lab TA. You do NOT need to explain Python or Data Science to us, but you should assume we know nothing about your specific interest/passion/dataset.
Make sure to save your work and submit it to GitHub before 11:59pm on the due date.
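
To make Sections 2 and 3 a little more concrete, here is a minimal sketch under stated assumptions: the DataFrame and its column names (`hours_practiced`, `score`) are made up for illustration, and the labels are set through Matplotlib's Axes methods, which is only one of several ways to add a title and axis labels. Your own statistics and plots will depend on your dataset.

```python
import pandas as pd

# Made-up example data; replace it with the DataFrame you loaded in Section 1.
df = pd.DataFrame({
    "hours_practiced": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 83],
})

# Section 2: measures of center and spread for one column.
mean_score = df["score"].mean()
median_score = df["score"].median()
std_score = df["score"].std()
print(df.describe())

# Section 3: a scatter plot with a title, an x-axis label, and a y-axis label.
# (Plotting through pandas requires matplotlib to be installed.)
ax = df.plot.scatter(x="hours_practiced", y="score")
ax.set_title("Score vs. Hours Practiced")
ax.set_xlabel("Hours practiced")
ax.set_ylabel("Score")
```

In a Jupyter notebook the figure appears below the cell when it runs. Follow each code cell like this with a Markdown cell that states what the numbers or the plot tell you about your data.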

Submission

Make sure you have saved your notebook. Once it is saved, you will turn in your work with git as you have for your other assignments, but note that this submission process is different than usual: you will add ONLY your notebook file!!

In your stat107/netid directory, add your project2 notebook ONLY by running the following command:

git add project2/project2.ipynb

Make sure you don't have any errors. This command will add your project2/project2.ipynb notebook, and it works only if you run it from your stat107/netid directory. Once you've added the project2/project2.ipynb file, turn it in with the following commands:

git commit -m "project 2 submission"
git push

Make absolutely sure your files are turned in by checking your repository on GitHub found here: https://github.com/orgs/stat107-illinois/repositories.

GitHub Error: File Size Too Large

If you added a dataset that is too large for GitHub (over 100 MB), you will not be able to commit your work. There are several ways to resolve this.
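
If you are not sure which file is causing the problem, a short Python check like the sketch below may help. It is only an illustration: the helper name `find_large_files` is hypothetical, and the limit is assumed here to be 100 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: the per-file limit, taken here as 100 * 1024 * 1024 bytes.
GITHUB_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(folder, limit_bytes=GITHUB_LIMIT_BYTES):
    """Return (path, size_in_bytes) for every file under folder above the limit."""
    large = []
    for path in Path(folder).rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((path, size))
    return large


# Example: run this from your stat107/netid directory.
if Path("project2").is_dir():
    for path, size in find_large_files("project2"):
        print(f"{path}: {size / 1_000_000:.1f} MB")
```

Any file this prints is one you should avoid adding with git; the notebook itself is the only file you need to turn in.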

SUPER IMPORTANT: Before you run any of the commands, make a backup of your project2 notebook. To do this, use your file explorer/finder, go to your stat107/netid/project2 directory, and copy the project2.ipynb notebook file to your desktop to save a backup copy.

Once you have a backup, try any of these options:

Solution #1 (Simplest Solution): Reset and Re-Commit

The simplest solution is to reset your repo back in time by one commit by running:

git reset HEAD~1

Then, git add ONLY your notebook file. Then run the rest of the turn-in commands as normal (commit + push). If this is not successful, continue to the next solution.

Solution #2 (Always Works): Re-clone Your Repository

The robust solution that will always work is to re-clone your repository:

First, rename your netid folder to something like backup. This will create a folder that backs up all of your current work.
Re-clone your course repository -- you can do that by following the "Creating a Local Clone of Your STAT 107 Repository" section in the guide STAT 107 GitHub Setup.
With your newly cloned netid directory, copy your project2 folder from the backup folder into your netid directory.
After copying the project2 folder, git add ONLY your notebook file. Then run the rest of the turn-in commands as normal (commit + push).