The coefficient of determination is not the square of the correlation coefficient

Graphical explanation of the squared Pearson correlation coefficient and coefficient of determination to help you spot statistical lies

Difference between the Pearson correlation coefficient and the coefficient of determination. Image by author.

Picture this: you are a stock analyst responsible for predicting Walmart’s stock price ahead of its quarterly earnings report. You are hard at work when your data scientist walks in, saying they have discovered a little-known data stream of daily Walmart parking lot occupancy that seems well correlated with Walmart’s historical revenues. You are understandably excited. You ask them to use the parking lot data alongside other standard metrics in a machine learning model to forecast Walmart’s stock price.

So far so good.

The data scientist returns in a few hours claiming that after careful validation of the model, its predictions are strongly correlated with the true stock price. Do you accept the model without any further investigations?

I hope not.

Correlations are good for identifying patterns in data, but almost meaningless for quantifying a model’s performance, especially for complex models (like machine learning models). This is because correlations only tell you whether two things follow each other (e.g., parking lot occupancy and Walmart’s stock), but not how well they match each other (e.g., predicted and actual stock price). For that, model performance metrics like the coefficient of determination (R²) can help.
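To make that concrete, here is a minimal sketch (with made-up numbers, not real Walmart data) of predictions that follow the truth perfectly, and therefore correlate strongly, while still missing it by a wide margin:

```python
import numpy as np

# Made-up example: "true" prices and predictions that follow the truth
# (same ups and downs) but sit $20 above it, so they do not match it.
y_true = np.array([50.0, 52.0, 51.0, 55.0, 54.0, 57.0])
y_pred = y_true + 20.0  # tracks the truth perfectly, but offset by +20

# Pearson correlation only asks whether the two series move together.
r = np.corrcoef(y_true, y_pred)[0, 1]

# The coefficient of determination compares predictions to the observations
# themselves, so the constant offset is heavily penalized.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"r = {r:.2f}")    # 1.00 -- "strongly correlated"
print(f"R² = {r2:.2f}")  # hugely negative -- the predictions are way off
```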

In this article, we will learn:

  1. What are the correlation coefficient (r) and its square (r²)?
  2. What is the coefficient of determination (R²)?
  3. When to use each of the above?

1. Correlation coefficient: “How good is this predictor?”

The shorter the sum of the blue lines, the closer the correlation coefficient is to +1. Image by author.

Correlation coefficients help quantify mutual relationships or connections between two things. Some well-known correlated quantities are the height and weight of humans, the value of a house and its area, and, as we saw in the example above, a store’s revenue and its parking lot occupancy.

One of the most widely used correlation coefficients is the Pearson correlation coefficient (usually denoted by r). Graphically, this can be understood as “how close is the data to the line of best fit?”

r ranges from −1 to +1. Grey line is the line that fits the data the best. Image by author.
  1. If the points are scattered far from the line, r is close to 0
  2. If the points are very close to the line and the line is sloping upward, r is close to +1
  3. If the points are very close to the line and the line is sloping downward, r is close to −1

Notice how the figure above has missing numbers on the axes? That is because the Pearson correlation coefficient is independent of the magnitude of the numbers; it is sensitive to relative changes only. This property is usually desirable since variables rarely have the same magnitudes. For example, Walmart’s stock price is in the tens of dollars, whereas the number of cars parked in front of its stores is in the thousands.
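A quick sketch of that scale invariance (the numbers are made up): rescaling one of the variables leaves r untouched.

```python
import numpy as np

# Made-up data: parked cars (in the thousands) vs. a stock price (tens of dollars).
cars = np.array([4200, 4500, 3900, 5100, 4800, 5300])
price = np.array([48.0, 51.0, 46.0, 55.0, 53.0, 57.0])

r = np.corrcoef(cars, price)[0, 1]
r_rescaled = np.corrcoef(cars / 1000.0, price)[0, 1]  # same data, different units

print(round(r, 3), round(r_rescaled, 3))  # identical: r ignores magnitude
```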

However, due to its insensitivity to actual magnitude, the Pearson correlation coefficient can be misused to give a false sense of confidence when two things are indeed expected to have the same magnitude.

To make matters worse, some people take the square of the Pearson correlation coefficient to bring it between 0 and +1 and call it r². But this is not to be confused with the coefficient of determination (R²) which is explained below.

2. Coefficient of determination: “How good is this model?”

The longer the sum of the orange lines, the lower the coefficient of determination. Image by author.

Unlike the Pearson correlation coefficient, the coefficient of determination measures how well the predicted values match (and not just follow) the observed values. It depends on the distance between the points and the 1:1 line (and not the best-fit line) as shown above. The closer the data to the 1:1 line, the higher the coefficient of determination.

The coefficient of determination is often denoted by R². Despite the notation, however, it is not the square of anything: it can range from negative infinity to +1.

R² can range from negative infinity to +1. Grey line is the line where the quantities on both axes are equal (also known as 1:1 line). Image by author.
  1. R² = +1 indicates that the predictions match the observations perfectly
  2. R² = 0 indicates that the predictions are no better than simply guessing the mean of the observed values
  3. Negative R² indicates that the predictions are worse than just guessing the mean

Since R² indicates the distance of points from the 1:1 line, it does depend on the magnitude of the numbers (unlike r²).
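Here is a minimal sketch of how R² is computed (the standard 1 − SS_res / SS_tot definition, which is also what scikit-learn’s r2_score computes), reproducing the three cases above with made-up numbers:

```python
import numpy as np

def coefficient_of_determination(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # distance from the 1:1 line
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # spread around the mean
    return 1 - ss_res / ss_tot

y = np.array([48.0, 51.0, 46.0, 55.0, 53.0, 57.0])    # made-up observations

print(coefficient_of_determination(y, y))                          #  1.0: perfect match
print(coefficient_of_determination(y, np.full_like(y, y.mean())))  #  0.0: guessing the mean
print(coefficient_of_determination(y, 2 * y))  # negative: right pattern, wrong magnitude (r² is still 1)
```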

3. When to use what?

The Pearson correlation coefficient (r) is used to identify patterns in things whereas the coefficient of determination (R²) is used to identify the strength of a model.

By taking the square of r, you get the squared Pearson correlation coefficient (r²), which is completely different from the coefficient of determination (R²), except in the specific case of simple linear regression with an intercept, evaluated on its own training data (when the two grey lines from the figures above merge, making the blue and orange lines equivalent).
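Below is a small sketch of that special case (ordinary least-squares with an intercept, scored on its own simulated training data), and of how quickly r² and R² diverge once a constant bias sneaks into the predictions:

```python
import numpy as np
from sklearn.metrics import r2_score  # computes 1 - SS_res / SS_tot

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)          # simulated data

# Simple least-squares line with an intercept, evaluated on its training data:
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r = np.corrcoef(y, y_hat)[0, 1]
print(round(r**2, 4), round(r2_score(y, y_hat), 4))       # identical

# The same predictions shifted by a constant bias: r² does not move,
# but R² drops sharply (here it even turns negative).
print(round(r**2, 4), round(r2_score(y, y_hat + 10), 4))
```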

Thus, the Pearson correlation coefficient or its square should rarely be used to evaluate a model’s performance. This is explained using 3 examples in the figure below.

Model predictions from 3 different models for Walmart’s stock price. Image by author.
  1. Model 1: R² = 0.99 indicates that it almost perfectly predicts stock prices.
  2. Model 2: R² = 0.59 indicates that it predicts stock prices poorly. However, if you looked at r² only, you would have been overly optimistic. This kind of biased prediction is extremely common with machine learning models. It is thus all the more important to visualize your predictions rather than just summarize them using statistics.
  3. Model 3: R² = −0.98 indicates that it is worse than randomly guessing the stock price around $50. But again if you had just looked at r², you might have lost all your money! Side note: Believe it or not, stock predictions opposite to actual trends are quite common. It has also given rise to a whole new field called Contrarian Investing.
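And a made-up sketch of the Model 3 situation: predictions that move exactly opposite to the truth still produce an impressive-looking r², while R² correctly flags them as worse than useless.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([48.0, 51.0, 46.0, 55.0, 53.0, 57.0])  # made-up "true" prices
y_pred = 100.0 - y_true                                   # mirrors the truth exactly

r = np.corrcoef(y_true, y_pred)[0, 1]
print(round(r, 2), round(r**2, 2))          # r = -1.0, so r² = 1.0 ("perfect"!)
print(round(r2_score(y_true, y_pred), 2))   # strongly negative
```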

Recap

  1. Correlations are useful to find patterns and relationships in data but mostly useless to evaluate predictions.
  2. To evaluate predictions, use metrics like the coefficient of determination which captures how well predictions match observations, or how much of the variation in observed data is explained by the predictions.
  3. The squared Pearson correlation coefficient is usually not equal to the coefficient of determination (or r² ≠ R²).

If you want a math-y explanation of the difference between r² and R², check out this excellent article by Deepak Khandelwal.

Is the correlation coefficient the same as the coefficient of determination?

No. The Pearson correlation coefficient (r) measures how strongly two quantities follow each other, whereas the coefficient of determination (R²) measures how well a model’s predictions match the observations.

Is the coefficient of correlation calculated as the square of the slope?

No. The steepness (slope) of the best-fit line does not determine the correlation coefficient. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.
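A tiny sketch of that point (made-up numbers): stretching the data changes the slope of the best-fit line a hundredfold but leaves r untouched.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

for scale in (1, 100):                       # same relationship, much steeper line
    slope = np.polyfit(x, y * scale, 1)[0]   # slope of the best-fit line
    r = np.corrcoef(x, y * scale)[0, 1]      # Pearson correlation coefficient
    print(f"slope = {slope:7.1f}, r = {r:.3f}")
```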

Is R² the square of the correlation coefficient?

Only in the special case of simple linear regression with an intercept, evaluated on its own training data. In general, the coefficient of determination (R²) is not the square of the Pearson correlation coefficient: r² tells you how strongly the predictions and observations follow each other, while R² tells you how well they actually match.