Descriptive Statistics


When you think of descriptive statistics, the first thing that often comes to mind is measures of center and spread!

Measures of center can be things like mean, median, and mode, whereas measures of spread can be range or standard deviation. Below we will look at what all of these are and how to calculate them.

Measures of Center

Measures of center summarize a dataset with a single, typical value: the mean (average) value of the dataset, the median value of the dataset, or the most frequently occurring value (the mode).

Measure of Center: Mean / Average

To find the mean (also commonly called the average) of a list of numbers, you sum all the numbers and divide by how many numbers are in the list. For example, we can find the mean of the list of numbers: 2, 2, 3, 4, 8, 10:

Calculation of the mean (average) of the list **2, 2, 3, 4, 8, 10**.

Using Python, we can find the mean of ALL values in our data for any specific variable (column):

df['ColumnName'].mean()
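
As a small sketch of the idea, the example list can be placed in a DataFrame and averaged both by hand and with pandas. The DataFrame and the column name "Value" are made up here for illustration; substitute your own data and column name.

```python
import pandas as pd

# Hypothetical DataFrame: the column name "Value" is an assumption for this example.
df = pd.DataFrame({"Value": [2, 2, 3, 4, 8, 10]})

# By hand: add up all the values and divide by how many there are.
values = df["Value"].tolist()
mean_by_hand = sum(values) / len(values)  # 29 / 6

# With pandas: the column's .mean() method gives the same result.
mean_pandas = df["Value"].mean()

print(mean_by_hand, mean_pandas)  # both are about 4.83
```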

Important note: If you change any number in your list, the mean will change! The mean is especially sensitive to extreme values.

Measure of Center: Median

To find the median, list the numbers in order and find the middle number. (Note that with an even number of data points, we really have two "middle" numbers. When there are two "middle numbers", take the average of the two middle numbers to find the median.)

Using the same list of numbers, we can find the median as well:

Calculation of the median of the list **2, 2, 3, 4, 8, 10**.

Using Python, we can find the median of ALL values in our data for any specific variable (column):

df['ColumnName'].median()
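
The "sort, then take the middle" rule can be sketched in a few lines and compared with pandas. The DataFrame and the column name "Value" are again hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical DataFrame: the column name "Value" is an assumption for this example.
df = pd.DataFrame({"Value": [2, 2, 3, 4, 8, 10]})

# By hand: put the values in order and locate the middle.
ordered = sorted(df["Value"].tolist())
n = len(ordered)
if n % 2 == 1:
    # Odd number of values: there is exactly one middle number.
    median_by_hand = ordered[n // 2]
else:
    # Even number of values: average the two middle numbers (here 3 and 4).
    median_by_hand = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# With pandas: the column's .median() method handles both cases.
median_pandas = df["Value"].median()

print(median_by_hand, median_pandas)  # 3.5 3.5
```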

Important note: Unlike the mean, the median is not sensitive to changes in extreme values. For example, if you add 10 to the biggest number on a list, the median doesn’t change.
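
To see the difference in sensitivity, here is a minimal sketch that adds 10 to the biggest number in the example list and compares the mean and median before and after. It uses Python's built-in statistics module rather than a DataFrame, purely to keep the example short.

```python
from statistics import mean, median

original = [2, 2, 3, 4, 8, 10]
changed = [2, 2, 3, 4, 8, 20]  # the biggest number increased by 10

# The mean shifts noticeably, but the median stays where it was.
print(mean(original), median(original))  # about 4.83, and 3.5
print(mean(changed), median(changed))    # 6.5, and 3.5
```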

Measure of Center: Mode

The mode is the "most common" number on the list!

Calculation of the mode of the list **2, 2, 3, 4, 8, 10**.
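
In pandas, one way to find the mode is the .mode() method. The sketch below uses a hypothetical DataFrame with a made-up column name "Value". Note that .mode() returns a Series rather than a single number, because a list can have more than one most common value.

```python
import pandas as pd

# Hypothetical DataFrame: the column name "Value" is an assumption for this example.
df = pd.DataFrame({"Value": [2, 2, 3, 4, 8, 10]})

# .mode() returns every value tied for most common, as a Series.
modes = df["Value"].mode()

# When there is a single mode, take the first (and only) entry.
most_common = modes.iloc[0]

print(modes.tolist(), most_common)  # [2] 2
```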

Measures of Spread

Measures of spread use a single value to give us insight into how "spread out" the data in a dataset is. When a measure of spread is small, the data is packed near the center; when it is large, the data is spread out and not concentrated near the center.

Measure of Spread: Range

The simplest measure of the spread of a list of numbers is the range. The range is defined as the difference between the lowest and highest values.

Calculation of the range of the list **2, 2, 3, 4, 8, 10**.

Using Python, there is no df.range() function. Instead, we calculate the range by taking the maximum value (.max()) and subtracting the minimum value (.min()):

df['ColumnName'].max() - df['ColumnName'].min()
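
Applied to the example list, the calculation might look like the following sketch. The DataFrame and the column name "Value" are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame: the column name "Value" is an assumption for this example.
df = pd.DataFrame({"Value": [2, 2, 3, 4, 8, 10]})

# Range = highest value minus lowest value.
highest = df["Value"].max()  # 10
lowest = df["Value"].min()   # 2
data_range = highest - lowest

print(data_range)  # 8
```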

Measure of Spread: Variance and Standard Deviation (SD)

The variance and standard deviation (commonly abbreviated as SD) are measures of the spread around the average. Both are calculated by finding the deviation of each value from the mean (average) of all the values. To calculate both values, we follow five steps:

1. Calculate the mean (average) of all of the values.
2. Subtract the mean (average) from each value to find each value's deviation from the mean.
3. Find the squared deviation for each value by squaring each deviation that was found in the previous step.
4. Find the variance by adding all of the squared deviations together and then dividing that sum by n - 1. This is the variance, denoted mathematically as σ².
5. Find the standard deviation, a related measure of spread, by taking the square root of the variance. The standard deviation is denoted mathematically as σ and has the same units as the data.
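
The five steps above can be sketched in plain Python for the example list. The variable names are chosen here for illustration only.

```python
import math

values = [2, 2, 3, 4, 8, 10]
n = len(values)

# Step 1: calculate the mean of all of the values.
mean = sum(values) / n

# Step 2: subtract the mean from each value to get each deviation.
deviations = [x - mean for x in values]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: add the squared deviations and divide by n - 1 to get the variance.
variance = sum(squared_deviations) / (n - 1)

# Step 5: take the square root of the variance to get the standard deviation.
sd = math.sqrt(variance)

print(round(variance, 2), round(sd, 2))  # 11.37 3.37
```

The deviations in Step 2 always add up to zero (up to rounding), which is why they are squared before being combined.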

As an example, we have worked out finding the variance and standard deviation of our list:

Calculation of the variance and standard deviation of the list **2, 2, 3, 4, 8, 10**.

A low SD means that most of the numbers are very close to the average. A high SD means that the numbers are spread out.

Using Python for Measures of Spread

Using Python, the function to return the standard deviation is .std() and variance is .var():

df['ColumnName'].std()

df['ColumnName'].var()
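
By default, pandas divides by n - 1 in both .std() and .var() (the ddof=1 setting), which matches the hand calculation described on this page. The sketch below, using a hypothetical DataFrame with a made-up column name "Value", also shows ddof=0, which divides by n instead.

```python
import pandas as pd

# Hypothetical DataFrame: the column name "Value" is an assumption for this example.
df = pd.DataFrame({"Value": [2, 2, 3, 4, 8, 10]})

# Default behavior (ddof=1): the sum of squared deviations is divided by n - 1.
sample_var = df["Value"].var()
sample_sd = df["Value"].std()

# With ddof=0, the sum of squared deviations is divided by n instead.
population_var = df["Value"].var(ddof=0)
population_sd = df["Value"].std(ddof=0)

print(round(sample_var, 2), round(sample_sd, 2))          # 11.37 3.37
print(round(population_var, 2), round(population_sd, 2))  # 9.47 3.08
```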

Mathematical Formula
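
Written as formulas, the five steps described earlier would read as follows. This is a sketch that assumes $x_1, \dots, x_n$ denote the $n$ values and $\bar{x}$ their mean; these symbols are chosen here for illustration, while σ² and σ follow the notation used on this page.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
\qquad
\sigma = \sqrt{\sigma^2}
```

For the list 2, 2, 3, 4, 8, 10, this gives $\bar{x} = 29/6 \approx 4.83$, $\sigma^2 = 341/30 \approx 11.37$, and $\sigma \approx 3.37$.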

Example Walk-Throughs with Worksheets

Each video has a blank worksheet (PDF) that you can follow along with to work through the problem:

Video 1: Mean, Median, and Mode
Video 2: Discovering Properties of Center
Video 3: Range and Standard Deviation
Video 4: Discovering Properties of Spread

Practice Questions

Q1: Which of the following is the correct order of steps when solving for the standard deviation of a list of numbers?

Q2: What is the standard deviation of the following list of numbers: 20, 22, 24, 26, 28, 30

Q3: Find the average of this list of numbers: 92, 17, 84, 29, 71, 47, 63, 21

Q4: Find the median of this list of numbers: 2, 5, 17, 9, 25, 6, 8, 12

Q5: What is the mode of the following exam scores: 90, 90, 81, 73, 94, 99, 94, 81, 94

Q6: What is the range of the following mobile phone bills (in dollars): $45, $22, $105, $79, $80, $112, $90, $64

Q7: What is the relationship between the variance and the standard deviation?
