Correlation


Now we are going to look at the linear relationship between two variables! Visually, we can use a scatter plot to show the relationship between two variables (X and Y). By convention, the variable on the x-axis is called the independent variable and the variable on the y-axis is called the dependent variable.

We can use df.plot.scatter() to create scatter plots in Python.
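
As a minimal sketch of what such a call can look like, assuming a small made-up DataFrame (the column names hours_studied and exam_score and their values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data: each row is one student.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 58, 61, 70, 74, 79],
})

# x is the column drawn on the horizontal axis, y the column on the vertical axis.
# pandas uses matplotlib for plotting, so matplotlib must be installed.
ax = df.plot.scatter(x="hours_studied", y="exam_score",
                     title="Exam score vs. hours studied")
```

In a notebook the plot appears under the cell. In a plain script you would typically also import matplotlib.pyplot and call its show() function.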

In addition to looking at two variables graphically, we can also calculate a statistic that summarizes the linear relationship between X and Y. The Correlation Coefficient measures the strength and direction of this linear relationship.

CORRELATION COEFFICIENT (r)

The correlation coefficient (often represented by the letter r) measures the strength of the linear association between two variables (X and Y). It measures how tightly points are clustered around a line. It is most meaningful when the scatter plot forms a linear trend.

The correlation coefficient is always between -1 and 1.
The closer the points hug a line with a positive slope, the closer r is to +1. The closer the points hug a line with a negative slope, the closer r is to -1.

If there is no linear association between X and Y, then the correlation coefficient is close to 0 and the scatter plot has no linear pattern.

In other words,

  • A correlation of 1 or -1 means you can perfectly predict one variable knowing the other.
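  • A correlation of 0 means that a straight line through the points tells you nothing about one variable from the other. The variables can still be related in a nonlinear way (for example, a U-shaped pattern).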

Here are some examples below:
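
As a small numeric illustration, here is a sketch using made-up toy data (not taken from the course figures). It relies on the pandas Series.corr() method, which computes the Pearson correlation by default. The third case shows that a perfect but nonlinear relationship can still have a correlation of about 0.

```python
import pandas as pd

# Toy x-values, symmetric around 0 (made up for illustration).
x = pd.Series([-2, -1, 0, 1, 2])

r_pos = x.corr(2 * x + 1)    # points lie exactly on a line with positive slope
r_neg = x.corr(-3 * x + 4)   # points lie exactly on a line with negative slope
r_curve = x.corr(x ** 2)     # perfect U-shaped relationship, but not a linear one

print(r_pos, r_neg, r_curve)
```

The three values come out at (or extremely close to) +1, -1, and 0, respectively.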

How to mathematically calculate the correlation coefficient:

In words:

  1. Convert x-values and y-values to standard units (z-scores). Z-scores tell you how many SDs a value is above or below average. Use the sample SD (the one that divides by n-1) so that it matches step 3.
  2. Multiply each x-value (in standard units) by its corresponding y-value (in standard units).
  3. The correlation coefficient is the sum of the products divided by n-1.
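
The three steps above can be sketched directly in Python. This is only an illustration under a few assumptions: the function name correlation and the toy lists hours and scores are hypothetical, and the sample SD comes from the standard-library function statistics.stdev, which divides by n-1.

```python
import statistics

def correlation(xs, ys):
    """Correlation coefficient computed with the three steps described above."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # statistics.stdev is the sample SD (divides by n - 1), matching step 3.
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)

    # Step 1: convert every value to standard units (z-scores).
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]

    # Step 2: multiply each x z-score by its corresponding y z-score.
    products = [a * b for a, b in zip(zx, zy)]

    # Step 3: sum the products and divide by n - 1.
    return sum(products) / (n - 1)

# Made-up toy data for illustration.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = correlation(hours, scores)
print(round(r, 4))  # 0.7746
```

For these toy values the sum of the products of deviations is 6, and the sums of squared deviations are 10 and 6, so r = 6 / sqrt(10 × 6) ≈ 0.7746.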

In symbols:
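
Written out from the three steps above, and assuming \bar{x} and \bar{y} denote the sample means and s_x and s_y the sample standard deviations (computed with n-1), the calculation can be expressed as:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor in parentheses is a z-score (step 1), the product of the two factors is step 2, and the sum divided by n-1 is step 3.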

Correlation in Python

In Python, the following code will display the correlation coefficient for every pair of numeric columns (variables) in a DataFrame:

df.corr()

The output is called a correlation matrix. Finding the correlation matrix can be an important part of Exploratory Data Analysis to see whether there are any linear relationships between pairs of variables!
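
Here is a minimal sketch of a correlation matrix on a small made-up DataFrame (all column names and values are hypothetical). One caveat worth knowing: in recent pandas versions (2.0 and later), calling df.corr() on a DataFrame that contains text columns raises an error instead of skipping them. Selecting the numeric columns first with select_dtypes() avoids this across versions; passing numeric_only=True to corr() is an alternative in pandas 1.5 or later.

```python
import pandas as pd

# Made-up toy data with three numeric columns and one categorical column.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180],
    "weight_kg": [52, 58, 63, 70, 79],
    "age": [31, 24, 45, 38, 29],
    "group": ["a", "b", "a", "b", "a"],
})

# Keep only the numeric columns, then compute all pairwise correlations.
corr_matrix = df.select_dtypes(include="number").corr()

print(corr_matrix.round(2))
```

The result has one row and one column per numeric variable, and it is symmetric: the entry for (height_cm, weight_kg) equals the entry for (weight_kg, height_cm).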


Example Walk-Throughs with Worksheets

Video 1: Correlation Examples

Follow along with the worksheet to work through the problem:

Video 2: Outliers Impact on Correlation

Follow along with the worksheet to work through the problem:

Video 3: Correlation Coefficient in Python

Follow along with the worksheet to work through the problem:

Practice Questions

Q1: Which of the following cannot be a correlation coefficient?
Q2: The diamond dataset has 10 variables in total (including 3 categorical variables and 7 numeric variables). What is the sum of all elements on the diagonal of the correlation matrix?