Modifying Values in Data and its Effect on Descriptive Statistics


When we are given numerical or quantitative data, one of the first tools of exploratory data analysis is to find basic descriptive statistics including measures of center (mean, median) and spread (standard deviation, variance).

In this guide, we'll explore how the measures of center and spread change when your dataset is modified:

  • What happens when you add (or subtract) a constant value to all observations in your data?
  • What happens when you multiply (or divide) all observations in your data by a constant value?

Adding (or Subtracting) a Constant To All Values

Let's break down what happens when you add or subtract a constant from each value in a dataset. While this example focuses on addition, keep in mind that for subtraction it will be the same. For example, if we subtract 3 from all values, we are really adding (-3) to every value.

The effect on the mean

The formula for the mean is the sum of all the individual values divided by the total number of values. Mathematically:

$\mu$ = ${(x_1 + x_2 + ... + x_n) \over n}$

If we add any constant -- for example, the value $\textcolor{red}{3}$ -- to every single number, the new mean would be:

$\mu_{+3}$ = ${((x_1 + \textcolor{red}{3}) + (x_2 + \textcolor{red}{3}) + ... + (x_n + \textcolor{red}{3})) \over n}$

$\mu_{+3}$ = ${(x_1 + x_2 + ... + x_n) + (\textcolor{red}{3}n) \over n}$

$\mu_{+3}$ = ${(x_1 + x_2 + ... + x_n) \over n} + {(\textcolor{red}{3}n) \over n}$

$\mu_{+3}$ = ${(x_1 + x_2 + ... + x_n) \over n} + \textcolor{red}{3}$

$\mu_{+3}$ = $\mu + \textcolor{red}{3}$

Therefore, we can see that the mean will increase by exactly the amount you add to each value. If you add 5 to all values, the mean increases by 5; if you add 42 to all values, the mean increases by 42.
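As a quick numerical check of this result, here is a minimal Python sketch using the standard library's `statistics` module. The sample values and the names `data`, `shift`, and `shifted` are illustrative assumptions, not part of the derivation above.

```python
from statistics import mean

# Illustrative sample values (an assumption for this sketch)
data = [2, 4, 5, 9, 10]
shift = 3  # the constant added to every value

# Add the constant to every observation
shifted = [x + shift for x in data]

original_mean = mean(data)     # (2 + 4 + 5 + 9 + 10) / 5 = 6
shifted_mean = mean(shifted)   # (5 + 7 + 8 + 12 + 13) / 5 = 9

# The new mean should equal the old mean plus the constant
print(shifted_mean == original_mean + shift)
```

Changing `shift` to a negative number covers the subtraction case in the same way.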

The effect on the median

The same is true for the median, but for a different reason. The median is the center or midpoint value in a dataset, so when we add a constant to all our values the positions don't change. Here's a demonstration using a simple array of numbers:

Initial Values
[2, 4, 5, 9, 10]
median = 5

Modified Values (Initial + 3)
[2+3, 4+3, 5+3, 9+3, 10+3]
[5, 7, 8, 12, 13]
median = 8 = 5+3
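The same demonstration can be reproduced in code. This is a small sketch using Python's `statistics.median`; the variable names are illustrative assumptions.

```python
from statistics import median

initial = [2, 4, 5, 9, 10]

# Add 3 to every value; the sorted order of the values is preserved
modified = [x + 3 for x in initial]

initial_median = median(initial)    # middle value of the initial list: 5
modified_median = median(modified)  # middle value of the modified list: 8

print(modified)
print(modified_median == initial_median + 3)
```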

The effect on the standard deviation and variance

For standard deviation ($\sigma$) and variance ($\sigma^2$), let's start by looking at it mathematically. Our formula for standard deviation is:

$SD = \sigma = \sqrt{{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2} \over {n-1}}$

If we add 3 to each value, like before, our new mean is $(\mu + \textcolor{red}{3})$. Our new standard deviation formula looks like:

$\sigma_{+3}$ = $\sqrt{{((x_1 + \textcolor{red}{3})-(\mu + \textcolor{red}{3}))^2 + ((x_2 + \textcolor{red}{3})-(\mu + \textcolor{red}{3}))^2 + ... + ((x_n + \textcolor{red}{3})-(\mu + \textcolor{red}{3}))^2} \over {n-1}}$

$\sigma_{+3}$ = $\sqrt{{(x_1 + \textcolor{red}{3} - \mu \textcolor{red}{- 3})^2 + (x_2 + \textcolor{red}{3} - \mu \textcolor{red}{- 3})^2 + ... + (x_n + \textcolor{red}{3} - \mu \textcolor{red}{- 3})^2} \over {n-1}}$

$\sigma_{+3}$ = $\sqrt{{(x_1 - \mu + \textcolor{red}{3} \textcolor{red}{- 3})^2 + (x_2 - \mu + \textcolor{red}{3} \textcolor{red}{- 3})^2 + ... + (x_n - \mu + \textcolor{red}{3} \textcolor{red}{- 3})^2} \over {n-1}}$

$\sigma_{+3}$ = $\sqrt{{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2} \over {n-1}}$

$\sigma_{+3}$ = $\sigma$

We find that adding any constant to all the values does not change the standard deviation -- we get back to our original formula! Logically, this makes sense, too. The standard deviation measures the typical distance, or spread, of the values from the mean. When we add the same value to each data point, the mean moves by that value too, so the distances from the mean do not change.

Since there is no change to the standard deviation, there will also be no change to the square of the standard deviation (the variance):

${(\sigma_{+3})}^2$ = $(\sigma)^2$
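A short numerical check of this result, sketched with Python's `statistics.stdev` and `statistics.variance`, both of which divide by $(n-1)$ like the formula above. The sample values and variable names are illustrative assumptions.

```python
from math import isclose
from statistics import stdev, variance

# Illustrative sample values (an assumption for this sketch)
data = [2, 4, 5, 9, 10]
shifted = [x + 3 for x in data]  # add 3 to every value

sd_before = stdev(data)
sd_after = stdev(shifted)

var_before = variance(data)
var_after = variance(shifted)

# Both comparisons should hold: shifting the data leaves the spread alone
print(isclose(sd_before, sd_after))
print(isclose(var_before, var_after))
```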

Analysis

When we add (or subtract) a constant to all values in our data, the "center" (mean and median) of the data shifts by that amount. However, the "spread" (standard deviation and variance) of the data does not change at all!

Multiplying (or Dividing) All Values By a Constant

Now let's break down what happens when you multiply or divide each value in a dataset by a constant. While this example focuses on multiplication, keep in mind that for division it will be the same.

For example, if we divide all values by 5, we are really multiplying every value by ${1 \over 5}$.

The effect on the mean

The formula for the mean is the sum of all the individual values divided by the total number of values. Mathematically:

$\mu$ = ${(x_1 + x_2 + ... + x_n) \over n}$

If we multiply all values by a constant value, say 5, we want to determine what our new mean will be. Let's break this down algebraically:

$\mu_{\times 5}$ = $(\textcolor{red}{5}x_1) + (\textcolor{red}{5}x_2) + ... + (\textcolor{red}{5}x_n) \over n$

$\mu_{\times 5}$ = $\textcolor{red}{5}(x_1 + x_2 + ... + x_n) \over n$

$\mu_{\times 5}$ = $\textcolor{red}{5} \times {(x_1 + x_2 + ... + x_n) \over n}$

$\mu_{\times 5}$ = $\textcolor{red}{5} \times \mu$

We can see that the mean is multiplied by the same constant that we multiply each value by.
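To see that this does not depend on the particular constant 5, here is a minimal Python sketch that tries a few multipliers. The sample values, the chosen constants, and the name `scaled_means` are illustrative assumptions.

```python
from statistics import mean

# Illustrative sample values (an assumption for this sketch)
data = [2, 4, 5, 9, 10]
constants = [5, 2, 10]  # hypothetical multipliers to try

original_mean = mean(data)

# For each constant c, compute the mean of the scaled data
scaled_means = {c: mean([c * x for x in data]) for c in constants}

for c, new_mean in scaled_means.items():
    # Each new mean should equal c times the original mean
    print(c, new_mean, new_mean == c * original_mean)
```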

The effect on the median

The same is true for the median, but for a different reason. The median is the center or midpoint value in a dataset, so when we multiply every value by a positive constant the positions don't change. (A negative constant reverses the order of the values, but the middle value is still in the middle.) Here's a demonstration using a simple array of numbers:

Initial Values
[2, 4, 5, 9, 10]
median = 5

Modified Values (Initial × 5)
[2×5, 4×5, 5×5, 9×5, 10×5]
[10, 20, 25, 45, 50]
median = 25 = 5×5
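The demonstration above can be reproduced with Python's `statistics.median`. As an extra check that goes beyond the example, this sketch also tries a negative constant: the sorted order flips, but the middle value remains the middle value. The variable names are illustrative assumptions.

```python
from statistics import median

initial = [2, 4, 5, 9, 10]

# Multiply every value by 5 (order preserved)
times_five = [x * 5 for x in initial]

# Multiply every value by -5 (order reversed)
times_minus_five = [x * -5 for x in initial]

median_initial = median(initial)                    # 5
median_times_five = median(times_five)              # 25
median_times_minus_five = median(times_minus_five)  # -25

print(median_times_five == 5 * median_initial)
print(median_times_minus_five == -5 * median_initial)
```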

The effect on the standard deviation and variance

For standard deviation ($\sigma$) and variance ($\sigma^2$), let's start by looking at it mathematically. Our formula for standard deviation is:

$SD = \sigma = \sqrt{{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2} \over {n-1}}$

If we multiply each value by 5, like before, our new mean is $(\textcolor{red}{5}\mu)$. Substituting the scaled values and the scaled mean into the formula, we can derive the new standard deviation:

$\sigma_{5\times}$ = $\sqrt{{((\textcolor{red}{5}x_1)-(\textcolor{red}{5}\mu))^2 + ((\textcolor{red}{5}x_2)-(\textcolor{red}{5}\mu))^2 + ... + ((\textcolor{red}{5}x_n)-(\textcolor{red}{5}\mu))^2} \over {n-1}}$

$\sigma_{5\times}$ = $\sqrt{{(\textcolor{red}{5}(x_1 - \mu))^2 + (\textcolor{red}{5}(x_2 - \mu))^2 + ... + (\textcolor{red}{5}(x_n - \mu))^2} \over {n-1}}$

$\sigma_{5\times}$ = $\sqrt{{(\textcolor{red}{25}(x_1 - \mu)^2) + (\textcolor{red}{25}(x_2 - \mu)^2) + ... + (\textcolor{red}{25}(x_n - \mu)^2)} \over {n-1}}$

$\sigma_{5\times}$ = $\sqrt{{\textcolor{red}{25}((x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2)} \over {n-1}}$

$\sigma_{5\times}$ = $\textcolor{red}{5}\sqrt{{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2} \over {n-1}}$

$\sigma_{5\times}$ = $\textcolor{red}{5} \times \sqrt{{(x_1-\mu)^2 + (x_2-\mu)^2 + ... + (x_n-\mu)^2} \over {n-1}}$

$\sigma_{5\times}$ = $\textcolor{red}{5} \times \sigma$

We can find the change in the variance by squaring the standard deviation:

$(\sigma_{5\times})^2$ = $(\textcolor{red}{5} \times \sigma)^2$

$(\sigma_{5\times})^2$ = $\textcolor{red}{5}^2 \times \sigma^2$

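We can see that the standard deviation is multiplied by the same constant that we multiply all values by, and the variance is multiplied by the square of that constant. (For a negative constant, the standard deviation is multiplied by its absolute value, since a standard deviation is never negative.)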
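As a numerical check, here is a minimal Python sketch that compares the spread before and after multiplying by 5. It uses `statistics.stdev` and `statistics.variance`, which divide by $(n-1)$; the sample values and variable names are illustrative assumptions.

```python
from math import isclose
from statistics import stdev, variance

# Illustrative sample values (an assumption for this sketch)
data = [2, 4, 5, 9, 10]
scaled = [5 * x for x in data]  # multiply every value by 5

# Ratio of new spread to old spread
sd_ratio = stdev(scaled) / stdev(data)
var_ratio = variance(scaled) / variance(data)

# The standard deviation should scale by 5 and the variance by 5**2 = 25
print(isclose(sd_ratio, 5))
print(isclose(var_ratio, 25))
```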

Analysis

When we multiply (or divide) all values in our data by a constant, the "center" (mean and median) of the data is multiplied by that same constant. For the standard deviation, when we multiply every value in the dataset by a positive constant, the new SD is $New = Old * Change$ and the new variance is $New = Old * Change^2$.
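The division case can be checked the same way by treating "divide by 5" as a multiplication by ${1 \over 5}$. The sketch below is a hedged illustration: the sample values and the names `change`, `predicted_sd`, and `predicted_var` are assumptions, and the comparisons use `math.isclose` because dividing produces floating-point values.

```python
from math import isclose
from statistics import stdev, variance

# Illustrative sample values (an assumption for this sketch)
data = [2, 4, 5, 9, 10]
change = 1 / 5  # dividing by 5 is multiplying by 1/5

divided = [x * change for x in data]

# Predictions from New = Old * Change and New = Old * Change^2
predicted_sd = stdev(data) * change
predicted_var = variance(data) * change ** 2

# Values computed directly from the divided data
actual_sd = stdev(divided)
actual_var = variance(divided)

print(isclose(predicted_sd, actual_sd))
print(isclose(predicted_var, actual_var))
```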

A lecture covering the mean, median, standard deviation and variance is part of the DISCOVERY course content: Descriptive Statistics