Data Science Guides
These DISCOVERY guides are short, solution-focused examples of common tasks in Data Science. We create several new guides each week, so there is always something new!
Guides Using pandas DataFrames
DataFrame Indexing: .loc vs .iloc
The .loc and .iloc indexers are commonly used to select certain groups of rows (and columns) of a pandas DataFrame.
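As a minimal sketch of the difference, assuming a small made-up DataFrame (the column names and row labels below are hypothetical), .loc selects by label while .iloc selects by integer position:

```python
import pandas as pd

# Hypothetical example data; row labels are letters so they differ from positions.
df = pd.DataFrame(
    {"name": ["Ann", "Bo", "Cy"], "score": [81, 92, 77]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "score"]   # label-based: row "b", column "score"
by_position = df.iloc[1, 1]       # position-based: second row, second column

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["a":"b"]
position_slice = df.iloc[0:2]
```

Both slices above return the same two rows, even though one ends at "b" and the other ends at position 2.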
Retrieve a Single Value From A DataFrame
Several common functions can be used to retrieve single values from a DataFrame.
Working With Columns and Series in a DataFrame
Manipulating columns — sometimes called "variables" — is a foundational data science skill.
Run a Custom Function on Every Row in a DataFrame
We can use the apply function to run a function on every row of a DataFrame.
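A short sketch of the idea, assuming a made-up DataFrame and a hypothetical helper named area; passing axis=1 tells apply to hand each row to the function:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"width": [2, 3, 4], "height": [5, 6, 7]})

def area(row):
    # Each row arrives as a Series indexed by column name.
    return row["width"] * row["height"]

# axis=1 runs the function once per row instead of once per column.
df["area"] = df.apply(area, axis=1)
```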
Finding Descriptive Statistics for Columns in a DataFrame
A great way to familiarize ourselves with all the new information is to look at descriptive statistics (sometimes known as summary statistics) for all applicable variables.
Finding Quantiles of a Column in a DataFrame
We can find many different quantiles for sets of numbers using the quantile function of a DataFrame.
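For illustration, assuming a hypothetical score column, quantile accepts either a single fraction or a list of fractions between 0 and 1:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

# A single quantile returns one number (0.5 is the median).
median = df["score"].quantile(0.5)

# A list of quantiles returns a Series indexed by the requested fractions.
quartiles = df["score"].quantile([0.25, 0.5, 0.75])
```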
Using Previous Observations when Computing Values in a DataFrame
When you're analyzing data reported on a regular basis (e.g., daily cases, monthly reports), it is common to need the values from one or more previous observations in your calculation. The df.column.shift(1) expression reports the value of a column from one observation earlier.
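A minimal sketch, assuming a hypothetical running total of cases: shifting the column by one row lines each observation up with the one before it, so subtracting gives the change per day.

```python
import pandas as pd

# Hypothetical cumulative totals reported once per day.
df = pd.DataFrame({"day": [1, 2, 3, 4], "total_cases": [5, 8, 15, 21]})

# shift(1) moves every value down one row; the first row has no predecessor (NaN).
df["previous_total"] = df["total_cases"].shift(1)

# New cases per day are the difference from the previous observation.
df["new_cases"] = df["total_cases"] - df["previous_total"]
```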
Reading and Importing Data into DataFrames
Creating a DataFrame from an Excel file using Pandas
Many datasets are provided in an Excel file format. The pd.read_excel function provides two primary ways to read an Excel file.
Creating a DataFrame from an HTML table using Pandas
HTML tables can be found on many different websites and can contain useful data we may want to analyze.
Creating a DataFrame from a Fixed-Width File using Pandas
Some datasets are provided in a fixed-width file format (a common extension is .txt, but many others are used as well).
Creating a DataFrame from a CSV file using Pandas
Many datasets are provided in a comma-separated value (CSV) file format. The pd.read_csv function provides two primary ways to read a CSV file.
Combining DataFrames by Merging
A detailed guide with examples of combining DataFrames based on matching the contents of their columns.
Combining DataFrames by Concatenation
Concatenation is a great way to combine DataFrames with identical columns. Concatenation does not look at the contents of the data at all and only joins the DataFrame end-to-end.
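As a rough sketch, assuming two made-up monthly DataFrames with the same columns, pd.concat stacks them end-to-end without comparing their contents:

```python
import pandas as pd

# Hypothetical example data with identical columns.
january = pd.DataFrame({"item": ["a", "b"], "sales": [1, 2]})
february = pd.DataFrame({"item": ["a", "c"], "sales": [3, 4]})

# Rows are appended in order; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([january, february], ignore_index=True)
```

Note that item "a" appears twice in the result, because concatenation does not match rows on their values.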
Combining DataFrames by Joining
A brief guide to combining DataFrames in pandas by joining.
Row Selection using DataFrames
Select Rows From A DataFrame
There are numerous ways to select rows from a DataFrame. One method is to select rows based on the content of its columns. To do this, we can use conditions.
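A small sketch of selecting rows with conditions, assuming hypothetical name, age, and state columns. Each condition produces a column of True/False values, and conditions are combined with & (and) or | (or), each wrapped in parentheses:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {"name": ["Ann", "Bo", "Cy"], "age": [19, 25, 31], "state": ["IL", "IL", "WI"]}
)

# One condition: keep rows where age is greater than 20.
over_20 = df[df["age"] > 20]

# Two conditions: both must be True for a row to be kept.
over_20_in_il = df[(df["age"] > 20) & (df["state"] == "IL")]
```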
Finding Minimum and Maximum Values in a DataFrame Column
It's often helpful to know a few specific values for each column (aka variable) in a DataFrame -- mainly the highest value, lowest value, and all unique values.
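For example, assuming a hypothetical grade column, the three values mentioned above could be found like this:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"grade": [88, 92, 75, 92]})

lowest = df["grade"].min()        # smallest value in the column
highest = df["grade"].max()       # largest value in the column
distinct = df["grade"].unique()   # each value once, in order of first appearance
```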
Slice Objects and DataFrames
When working with data from a pandas DataFrame, oftentimes we want to select a range of cells rather than specific ones. To do this, we can use slice objects.
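A brief sketch using made-up data: a slice can be written inline with start:stop:step notation or built explicitly with Python's slice function, and either form works with .iloc.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]})

# An explicit slice object: positions 1, 2, and 3 (the stop position is excluded).
middle_rows = slice(1, 4)
subset = df.iloc[middle_rows, 0:1]   # those rows, first column only

# Inline slice notation with a step: every other row.
every_other = df.iloc[::2]
```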
Selecting Rows that are IN and NOT IN a DataFrame
The .isin function can be used to select rows from a DataFrame that are or are not in another DataFrame.
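A minimal sketch, assuming two hypothetical DataFrames that share an id column: .isin builds a True/False mask, and the ~ operator flips it to select the rows that are not in the other DataFrame.

```python
import pandas as pd

# Hypothetical example data.
students = pd.DataFrame({"id": [1, 2, 3, 4]})
enrolled = pd.DataFrame({"id": [2, 4]})

# True where the student's id appears in the enrolled DataFrame.
mask = students["id"].isin(enrolled["id"])

in_enrolled = students[mask]        # rows that ARE in the other DataFrame
not_in_enrolled = students[~mask]   # rows that are NOT in the other DataFrame
```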
Selecting DataFrame Rows Based on String Contents
When working with text, it is often useful to select rows that contain a specific string. The .str.contains function allows us to test each row's data to determine if a specific string exists in the text.
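As an illustration with made-up titles, .str.contains returns True or False for each row, and the result can be used to select the matching rows. The case=False argument makes the match ignore capitalization:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"title": ["Data Science 101", "Intro to Art", "data mining"]})

# True for every row whose title contains "data" in any capitalization.
has_data = df["title"].str.contains("data", case=False)

matches = df[has_data]
```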
Creating New Columns in a DataFrame
There are two primary methods of creating new columns in a DataFrame: creating a new column calculated from data you already have or using Python to create new data.
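Both approaches can be sketched with a small made-up DataFrame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 1]})

# Method 1: calculate a new column from columns we already have.
df["total"] = df["price"] * df["quantity"]

# Method 2: create new data in Python; the list needs one value per row.
df["checked"] = [True, False]
```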
Sorting a DataFrame Using Pandas
The sort_values method of a DataFrame is used to sort a DataFrame by the data in a column.
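A short sketch with hypothetical data: sort_values returns a new, sorted DataFrame and sorts from smallest to largest unless ascending=False is given.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "score": [70, 90, 80]})

lowest_first = df.sort_values("score")                     # ascending (default)
highest_first = df.sort_values("score", ascending=False)   # descending
```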
Removing Rows from a DataFrame
The drop function is used to remove rows or columns from a pandas DataFrame.
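A minimal sketch using made-up data: the index argument names rows to remove and the columns argument names columns to remove. By default, drop returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical example data with labeled rows.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

without_row = df.drop(index="y")       # remove the row labeled "y"
without_column = df.drop(columns="b")  # remove the column named "b"
```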
Removing Columns in a DataFrame
To make the data less cluttered, you can remove a column from your DataFrame using pandas.
Handling Missing Data in Pandas
While it would be nice if our datasets all had the values we expect, it's not always the case. Oftentimes certain cells in a DataFrame will be empty, or contain a value that we don't want.
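As one possible sketch, assuming a hypothetical temperature column with an empty cell, pandas can count missing values, drop the rows that contain them, or fill them with a replacement. The fill value of 0 here is only for illustration; the right choice depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; np.nan marks a missing value.
df = pd.DataFrame({"temp": [70.0, np.nan, 65.0]})

missing_count = df["temp"].isna().sum()  # how many cells are missing
dropped = df.dropna()                    # remove rows with any missing value
filled = df.fillna(0)                    # replace missing values with 0
```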
Grouping Data by column in a DataFrame
The groupby function can be used to combine rows in a DataFrame to help better analyze large DataFrames.
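A small sketch with hypothetical department data: groupby collects the rows that share a value in one column, and an aggregation such as sum then combines each group into a single row.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {"dept": ["CS", "CS", "STAT"], "enrollment": [100, 150, 80]}
)

# One total per department; the result is a Series indexed by dept.
totals = df.groupby("dept")["enrollment"].sum()
```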
Creating Simple Data Visualizations in Python using matplotlib
The matplotlib library in Python provides an extremely simple way to create professional Data Visualizations. This guide explores the Python needed to create scatter plots, bar charts, pie charts, and line charts!
Saving and Exporting DataFrames
Saving a DataFrame to a CSV file using Pandas
An easy way to save your dataset is to export it to a CSV file that can then be shared. This can be done with pandas.
Saving a DataFrame to an Excel file using Pandas
An easy way to save your dataset is to export it to an Excel file that can then be shared. This can be done with pandas.
Guides Using Statistics
Monty Hall Problem - Interactive Game and Three Intuitive Solutions
An interactive version of the classic Monty Hall Problem and three intuitive solutions explained.
Seven Detailed Examples Using The Addition Rule
Mathematical and Python examples of using the addition rule to calculate the probability of multiple events occurring.
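As one illustrative sketch (the die-rolling scenario is our own assumption, not taken from the guide), the addition rule P(A or B) = P(A) + P(B) - P(A and B) can be checked with exact fractions:

```python
from fractions import Fraction

# Rolling one fair six-sided die.
outcomes = set(range(1, 7))
even = {2, 4, 6}
greater_than_four = {5, 6}

def prob(event):
    # Probability of an event when every outcome is equally likely.
    return Fraction(len(event), len(outcomes))

# Addition rule: subtract the overlap so it is not counted twice.
p_either = prob(even) + prob(greater_than_four) - prob(even & greater_than_four)
```

The result matches counting the combined event {2, 4, 5, 6} directly, which gives 4/6.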
Python Functions for Bernoulli and Binomial Distribution
Using functions from the scipy.stats library to represent Bernoulli and Binomial distributions in Python.
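A brief sketch, assuming a fair coin (the scenario is hypothetical): scipy.stats provides bernoulli for a single trial and binom for a fixed number of trials, each with pmf and cdf functions.

```python
from scipy.stats import bernoulli, binom

p = 0.5  # assumed probability of heads on one flip

# Bernoulli: probability of success (1) on a single trial.
p_heads = bernoulli.pmf(1, p)

# Binomial: probability of exactly 2 heads in 4 flips.
p_two_of_four = binom.pmf(2, 4, p)

# Binomial: probability of at most 1 head in 4 flips.
p_at_most_one = binom.cdf(1, 4, p)
```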
Six Detailed Examples Using The Multiplication Rule
Mathematical and Python examples of using the multiplication rule to calculate the probability of multiple events occurring.
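As one illustrative sketch (the card-drawing scenario is our own assumption, not taken from the guide), the multiplication rule P(A and B) = P(A) × P(B given A) can be computed with exact fractions:

```python
from fractions import Fraction

# Drawing two cards from a standard 52-card deck without replacement.
p_first_ace = Fraction(4, 52)               # 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # 3 aces left among 51 cards

# Multiplication rule: both events must occur.
p_two_aces = p_first_ace * p_second_ace_given_first
```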
Statistics with Python
Calculating Standard Deviation in Python
When we're presented with numerical data, we often find descriptive statistics to better understand it. One of these statistics is called the standard deviation, which measures the spread of our data around the mean (average).
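A minimal sketch with made-up numbers: Python's statistics module and pandas can both compute the standard deviation. Note that pandas divides by n - 1 (the sample standard deviation) unless ddof=0 is given.

```python
import statistics

import pandas as pd

# Hypothetical example data with a mean of 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation (divides by n).
population_sd = statistics.pstdev(values)

# pandas defaults to the sample standard deviation (divides by n - 1).
sample_sd = pd.Series(values).std()

# Passing ddof=0 makes pandas match the population version.
pandas_population_sd = pd.Series(values).std(ddof=0)
```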
Setup Your System for Data Science
As you begin your journey as a Data Scientist, it is important to get familiar with tools on your own system in addition to tools in your web browser.
Your System's Terminal
Every operating system contains a Command Line Interface (CLI), known as a terminal, that lets you interact with your computer using a keyboard. You can do everything you already do on a computer via the terminal, but you can also do a whole lot more!
First Time Setup for MicroProjects
A detailed guide for getting set up to start programming MicroProjects!