Data Science Guides
These DISCOVERY guides are short, solution-focused examples of common tasks in Data Science. We create several new guides each week, so there is constantly something new!
Guides Using pandas DataFrames
DataFrame Fundamentals
DataFrame Indexing: .loc[] vs .iloc[]
The loc and iloc functions are commonly used to select certain groups of rows (and columns) of a pandas DataFrame.
Retrieve a Single Value From A DataFrame
Common functions used to retrieve single values from a DataFrame include the at, iat, loc, and iloc functions.
Working With Columns and Series in a DataFrame
Manipulating columns, sometimes called "variables", is a foundational data science skill.
Run a Custom Function on Every Row in a DataFrame
We can use the apply function to run a function on every row of a DataFrame.
Finding Descriptive Statistics for Columns in a DataFrame
A great way to familiarize ourselves with all the new information is to look at descriptive statistics (sometimes known as summary statistics) for all applicable variables.
Finding Quantiles of a Column in a DataFrame
We can find many different quantiles for sets of numbers using the quantile function of a DataFrame.
Using Previous Observations when Computing Values in a DataFrame
When you're analyzing data reported on a regular basis (e.g., daily cases, monthly reports, etc.), it is common to need the values from one or more previous observations in your calculation. The df.column.shift(1) expression reports the value for a column from one observation earlier.
Starting Your Own Data Science Project
Starting your own data science project has never been easier!
Removing Duplicates from a DataFrame using pandas
A brief guide on removing duplicates from a DataFrame.
Simulations in Python
Sometimes it's hard and time-consuming to carry out experiments in person, and simulation is there to help!
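As a rough illustration of the indexing and previous-observation ideas described in the DataFrame Fundamentals guides, here is a minimal sketch. The DataFrame, its labels, and its numbers are made up for this example and do not come from any of the guides.

```python
import pandas as pd

# Hypothetical daily case counts (made-up values), indexed by day label.
df = pd.DataFrame(
    {"cases": [10, 15, 12, 20]},
    index=["mon", "tue", "wed", "thu"],
)

# loc selects by label, iloc selects by integer position.
by_label = df.loc["tue", "cases"]
by_position = df.iloc[1, 0]

# at retrieves a single value by label.
single_value = df.at["wed", "cases"]

# shift(1) moves each value down one row, so every row sees the previous observation.
df["previous_cases"] = df["cases"].shift(1)

# The first row has no earlier observation, so its change is NaN.
df["change"] = df["cases"] - df["previous_cases"]

print(df)
```

Here loc and iloc return the same value because "tue" is the second row; with a different index the two can disagree, which is the distinction the indexing guide covers.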
Reading and Importing Data into DataFrames
Creating a DataFrame from an Excel file using Pandas
Many datasets are provided in an Excel file format (file extension .xlsx). The pd.read_excel function provides two primary ways to read an Excel file.
Creating a DataFrame from an HTML table using Pandas
HTML tables can be found on many different websites and can contain useful data we may want to analyze.
Creating a DataFrame from a Fixed-Width File using Pandas
Some datasets are provided in a fixed-width file format (common extension is .txt, but includes many others as well).
Creating a DataFrame from a CSV file using Pandas
Many datasets are provided in a comma-separated value file format (file extension .csv). The pd.read_csv function provides two primary ways to read a CSV file.
Combining DataFrames
Combining DataFrames by Merging
A detailed guide with examples of combining DataFrames based on matching the contents of the data from columns, using pd.merge.
Combining DataFrames by Concatenation
Concatenation is a great way to combine DataFrames with identical columns. Concatenation does not look at the contents of the data at all and only joins the DataFrames end-to-end.
Combining DataFrames by Joining
A brief guide to combining DataFrames together in pandas with join.
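The three ways of combining DataFrames listed here can be compared side by side. The following is only a sketch under assumed data: the students and grades tables, their column names, and their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share an "id" column (made-up values).
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
grades = pd.DataFrame({"id": [2, 3, 4], "grade": [90, 85, 70]})

# Merging matches rows on the contents of a column.
# The default is an inner merge, so only ids present in both tables remain.
merged = pd.merge(students, grades, on="id")

# Concatenation ignores the contents and stacks DataFrames end-to-end.
more_students = pd.DataFrame({"id": [4], "name": ["Di"]})
stacked = pd.concat([students, more_students], ignore_index=True)

# join matches on the index; by default it keeps every row of the left DataFrame.
joined = students.set_index("id").join(grades.set_index("id"))

print(merged)
print(stacked)
print(joined)
```

In this sketch the merge keeps two rows, the concatenation keeps four, and the join keeps all three students with a missing grade for the student that has no match.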
Row Selection using DataFrames
Select Rows From A DataFrame
There are numerous ways to select rows from a DataFrame. One method is to select rows based on the content of its columns. To do this, we can use conditions.
Finding Minimum and Maximum Values in a DataFrame Column
It's often helpful to know a few specific values for each column (aka variable) in a DataFrame, mainly the highest value, lowest value, and all unique values.
Slice Objects and DataFrames
When working with data from a pandas DataFrame, oftentimes we want to select a range of cells rather than specific ones. To do this, we can use slice objects.
Selecting Rows that are IN and NOT IN a DataFrame
The .isin function can be used to select rows from a DataFrame that are or are not in another DataFrame.
Selecting DataFrame Rows Based on String Contents
When working with text, it is often useful to select rows that contain a specific string. The .str.contains function allows us to test each row's data to determine if a specific string exists in the text.
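The row-selection techniques named in this group (conditions, minimum and maximum values, .isin, and .str.contains) might look like the following sketch. The menu data and the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical menu data (made-up values).
df = pd.DataFrame({
    "name": ["apple pie", "banana bread", "apple tart", "cherry cake"],
    "price": [5, 4, 6, 7],
})
# A second hypothetical DataFrame to compare against.
other = pd.DataFrame({"name": ["banana bread", "cherry cake"]})

# Select rows with a condition on a column.
pricey = df[df["price"] > 4]

# Highest and lowest values of a column.
lowest = df["price"].min()
highest = df["price"].max()

# Rows whose name IS in the other DataFrame, and rows whose name is NOT.
in_other = df[df["name"].isin(other["name"])]
not_in_other = df[~df["name"].isin(other["name"])]

# Rows whose text contains a specific string.
apples = df[df["name"].str.contains("apple")]

print(pricey, in_other, not_in_other, apples, sep="\n\n")
```

The ~ operator inverts the boolean mask, which is one common way to express "not in".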
Modifying DataFrames
Creating New Columns in a DataFrame
There are two primary methods of creating new columns in a DataFrame: creating a new column calculated from data you already have or using Python to create new data.
Sorting a DataFrame Using Pandas
The sort_values method of a DataFrame is used to sort a DataFrame by the data in a column.
Removing Rows from a DataFrame
The drop function is used to remove rows or columns from a pandas DataFrame.
Removing Columns in a DataFrame
To make the data less cluttered, you can remove a column from your DataFrame using pandas.
Handling Missing Data in Pandas
While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.
Grouping Data by column in a DataFrame
The groupby function can be used to combine rows in a DataFrame to help better analyze large DataFrames.
Defining your Own Aggregation Function in Pandas
Write your own aggregation function which can be used in combination with Pandas groupby.
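A short sketch of several of the modifications listed in this group: creating a column, sorting, dropping, and grouping with a custom aggregation. The data and the value_range helper are hypothetical and only illustrate the general pattern.

```python
import pandas as pd

# Hypothetical scores for two teams (made-up values).
df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [3, 9, 7, 1]})

# Create a new column calculated from existing data.
df["double_score"] = df["score"] * 2

# Sort by the data in a column (ascending by default).
sorted_df = df.sort_values("score")

# drop returns a new DataFrame without the given row label or column.
without_first_row = df.drop(index=0)
without_column = df.drop(columns=["double_score"])


def value_range(values):
    """Hypothetical custom aggregation: largest value minus smallest value."""
    return values.max() - values.min()


# groupby combines rows by team, then applies the custom aggregation to each group.
ranges = df.groupby("team")["score"].agg(value_range)

print(sorted_df)
print(ranges)
```

Note that sort_values and drop leave the original DataFrame unchanged unless the result is assigned back.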
Data Visualization
Creating Simple Data Visualizations in Python using matplotlib
The matplotlib library in Python provides an extremely simple way to create professional Data Visualizations. This guide explores the Python needed to create scatter plots, bar charts, pie charts, and line charts!
Generating Emojis in Python
Learn how to use Unicode escape sequences and the emoji library to generate emojis in Python.
Enhancing Data Visualizations with Matplotlib's Color Options
Learn how to use color names, color codes, RGB values, and hexadecimal codes to enhance your visualizations.
Saving and Exporting DataFrames
Saving a DataFrame to a CSV file using Pandas
An easy way to save your dataset is to export it to a CSV file that can then be shared. This can be done with the pandas to_csv function.
Saving a DataFrame to an Excel file using Pandas
An easy way to save your dataset is to export it to an Excel file that can then be shared. This can be done with the pandas to_excel function.
Guides for Python Fundamentals
Python Fundamentals
Unlocking the Mysteries of Python's 'random' Library
This guide illuminates the dynamic capabilities of Python's 'random' library, empowering you to infuse your code with randomness and make unpredictable choices.
Parentheses, Square Brackets and Curly Braces in Python
A brief description of when to use parentheses (), square brackets [], and curly braces {} in Python.
Python Data Types
This tutorial covers the basics of Python data types, type conversion, and the difference between string literals and variables.
Guides Using Statistics
Classic Problems
Monty Hall Problem - Interactive Game and Three Intuitive Solutions
An interactive version of the classic Monty Hall Problem and three intuitive solutions explained.
Chaos Game - Creating Fractals using Simulation
Using random simulation and scatter plots, we can create beautiful fractal shapes with unbelievably intricate details.
Descriptive Statistics
Modifying Values in Data and its Effect on Descriptive Statistics
Mathematical examples of modifying values in a dataset and how common statistics change as a result.
Introduction to Exploratory Data Analysis
A quick bit of advice on how to tackle some tough questions with exploratory data analysis in Python.
Probability
Conditional Data Simulation Examples in Python
Three simple examples of using Python and pandas to simulate real world scenarios.
Seven Detailed Examples Using The Addition Rule
Mathematical and Python examples of using the addition rule to calculate the probability of multiple events occurring.
Python Functions for Bernoulli and Binomial Distribution
Using functions from the scipy.stats library to represent Bernoulli and Binomial distributions in Python.
Six Detailed Examples Using The Multiplication Rule
Mathematical and Python examples of using the multiplication rule to calculate the probability of multiple events occurring.
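To make the probability topics in this group concrete, here is a small sketch of the addition rule, the multiplication rule for independent events, and the scipy.stats Bernoulli and binomial distributions. The die and coin scenarios are assumptions chosen for this example and are not taken from the guides.

```python
from fractions import Fraction

from scipy.stats import bernoulli, binom

# Addition rule on one fair six-sided die:
# P(even or greater than 4) = P(even) + P(>4) - P(even and >4).
p_even = Fraction(3, 6)           # outcomes 2, 4, 6
p_greater_than_4 = Fraction(2, 6)  # outcomes 5, 6
p_both = Fraction(1, 6)           # outcome 6 only
p_either = p_even + p_greater_than_4 - p_both

# Multiplication rule for two independent fair coin flips:
# P(heads and heads) = P(heads) * P(heads).
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

# Bernoulli: probability of a single success when p = 0.3.
p_success = bernoulli.pmf(1, 0.3)

# Binomial: probability of exactly 2 successes in 3 trials when p = 0.5.
p_two_of_three = binom.pmf(2, 3, 0.5)

print(p_either, p_two_heads, p_success, p_two_of_three)
```

Using Fraction keeps the rule-based results exact, while the scipy.stats functions return floating-point values.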
Statistics Formulas
Correlated, Uncorrelated, and Independent Random Variables
A pair of random variables can be correlated, uncorrelated, or independent.
Statistics with Python
Calculating Standard Deviation in Python
When we're presented with numerical data, we often find descriptive statistics to better understand it. One of these statistics is called the standard deviation, which measures the spread of our data around the mean (average).
Cross Validation in Python
In the absence of a test dataset, cross validation is a helpful approach to get an idea of how well the model performs and what level of flexibility is appropriate.
3 Ways to Calculate the RMSE in Python
Three simple methods for calculating the Root Mean Square Error, or RMSE, in Python.
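For readers who want a quick feel for the standard deviation and RMSE mentioned in this group, the sketch below uses NumPy with made-up numbers. It shows one possible way to compute each quantity and is not necessarily one of the methods the guides present.

```python
import numpy as np

# Hypothetical data (made-up values) with mean 5.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation (divides by n); NumPy's default is ddof=0.
population_std = values.std()

# Sample standard deviation (divides by n - 1).
sample_std = values.std(ddof=1)

# RMSE: square the errors, take their mean, then take the square root.
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

print(population_std, sample_std, rmse)
```

The sample standard deviation is always at least as large as the population version for the same data, since it divides by a smaller number.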
Other Guides
Ethics in Data Science
Ethics of Data with Human Subjects
Conducting social science research typically requires using data from human research subjects, which is both incredibly useful for improving lives across the globe and full of ethical conundrums. This guide will help you understand key ethical principles for human subject research and related data analysis.
System Setup
Setup Your System for Data Science
As you begin your journey as a Data Scientist, it is important to get familiar with tools on your own system in addition to tools in your web browser.
Your System's Terminal
Every operating system contains a Command Line Interface (CLI), known as a terminal, that lets you interact with your computer using a keyboard. You can do everything you already do on a computer via the terminal, but you can also do a whole lot more!
First Time Setup for MicroProjects
A detailed guide for getting set up to start programming MicroProjects!