Picture this: you are a stock analyst responsible for predicting Walmart’s stock price ahead of its quarterly earnings report. You are hard at work when your data scientist walks in, saying they have discovered a little-known data stream of daily Walmart parking lot occupancy that seems well correlated with Walmart’s historic revenues. You are understandably excited. You ask them to use the parking lot data alongside other standard metrics in a machine learning model to forecast Walmart’s stock price. So far so good.
The data scientist returns in a few hours, claiming that after careful validation of the model, its predictions are strongly correlated with the true stock price. Do you accept the model without any further investigation? I hope not.
Correlations are good for identifying patterns in data, but almost meaningless for quantifying a model’s performance, especially for complex models (like machine learning models). This is because a correlation only tells you whether two things follow each other (e.g., parking lot occupancy and Walmart’s stock price), not how closely they match each other (e.g., predicted and actual stock price). For that, model performance metrics like the coefficient of determination (R²) can help. In this article, we will learn:
1. Correlation coefficient: “How good is this predictor?”
[Figure: The shorter the sum of the blue lines, the closer the correlation coefficient is to +1. Image by author.]
Correlation coefficients help quantify the mutual relationship between two things. Some well-known correlated quantities are the weight and height of humans, a house’s value and its area, and, as we saw in the example above, a store’s revenue and its parking lot occupancy. One of the most widely used correlation coefficients is the Pearson correlation coefficient (usually denoted by r). Graphically, it can be understood as “how close is the data to the line of best fit?”
[Figure: r ranges from −1 to +1. The grey line is the line that fits the data best. Image by author.]
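To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient from its standard formula (the covariance of the two variables divided by the product of their spreads) and compares it with NumPy's built-in `np.corrcoef`. The occupancy and revenue numbers are made up for illustration; they are not real Walmart data, and the helper name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values, invented for this example only
occupancy = np.array([1200, 1500, 1100, 1800, 1600, 2000])  # cars parked per day
revenue = np.array([1.30, 1.52, 1.18, 1.81, 1.59, 2.05])    # arbitrary revenue units

r = pearson_r(occupancy, revenue)
r_numpy = np.corrcoef(occupancy, revenue)[0, 1]  # same quantity via NumPy

print(f"r from definition: {r:.4f}")
print(f"r from np.corrcoef: {r_numpy:.4f}")
```

Both values should agree up to floating-point error, and both stay within −1 and +1 regardless of the units used for either variable.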
Notice how the figure above has no numbers on its axes? That is because the Pearson correlation coefficient is independent of the magnitude of the numbers; it is sensitive to relative changes only. Shifting either variable by a constant, or rescaling it by a positive factor, leaves r unchanged. This property is usually desirable since variables rarely have the same magnitudes. E.g., Walmart’s stock price is in the tens of dollars, whereas the number of cars parked in front of its stores is in the thousands. However, because it is insensitive to actual magnitude, the Pearson correlation coefficient can be misused to give a false sense of confidence when two things are expected to have the same magnitude. To make matters worse, some people take the square of the Pearson correlation coefficient to bring it between 0 and +1 and call it r². This is not to be confused with the coefficient of determination (R²), which is explained below.
2. Coefficient of determination: “How good is this model?”
[Figure: The longer the sum of the orange lines, the lower the coefficient of determination. Image by author.]
Unlike the Pearson correlation coefficient, the coefficient of determination measures how well the predicted values match (and not just follow) the observed values. It depends on the distance between the points and the 1:1 line (and not the best-fit line), as shown above. The closer the data are to the 1:1 line, the higher the coefficient of determination. The coefficient of determination is often denoted by R². However, it is not the square of anything. It can range from any negative number to +1.
[Figure: R² can range from negative infinity to +1. The grey line is the line where the quantities on both axes are equal (also known as the 1:1 line). Image by author.]
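The contrast between "following" and "matching" can be sketched in a few lines. The example below assumes the standard definition of the coefficient of determination, R² = 1 − SS_res / SS_tot, where SS_res sums the squared distances between observed and predicted values (that is, distances from the 1:1 line) and SS_tot sums the squared distances of the observed values from their mean. The stock prices are hypothetical, and `r_squared` is a helper written for this illustration.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = ((observed - predicted) ** 2).sum()       # squared distance from the 1:1 line
    ss_tot = ((observed - observed.mean()) ** 2).sum()  # spread of the observations
    return float(1.0 - ss_res / ss_tot)

# Hypothetical stock prices (in dollars), invented for this example
observed = np.array([50.0, 52.0, 55.0, 53.0, 58.0, 60.0])
predicted = observed + np.array([0.5, -0.4, 0.3, -0.6, 0.2, -0.1])
inflated = 2.0 * predicted  # same pattern, but twice the magnitude

r_close = np.corrcoef(observed, predicted)[0, 1]
r_inflated = np.corrcoef(observed, inflated)[0, 1]
r2_close = r_squared(observed, predicted)
r2_inflated = r_squared(observed, inflated)

print(f"Pearson r:  close = {r_close:.4f}, inflated = {r_inflated:.4f}")
print(f"R squared:  close = {r2_close:.4f}, inflated = {r2_inflated:.4f}")
```

Doubling the predictions leaves the Pearson correlation coefficient untouched but moves every point far from the 1:1 line, so R² drops well below zero.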
Since R² indicates the distance of points from the 1:1 line, it does depend on the magnitude of the numbers (unlike r²).
3. When to use what?
The Pearson correlation coefficient (r) is used to identify patterns in data, whereas the coefficient of determination (R²) is used to quantify the strength of a model. By taking the square of r, you get the squared Pearson correlation coefficient (r²), which is completely different from the coefficient of determination (R²), except in very specific cases of linear regression (when both the grey lines from the above figures merge, making the blue and orange lines equivalent). Thus, the Pearson correlation coefficient or its square should rarely be used to evaluate a model’s performance. This is explained using three examples in the figure below.
[Figure: Model predictions from 3 different models for Walmart’s stock price. Image by author.]
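As a rough numerical companion to this comparison, the sketch below scores three hypothetical sets of predictions with both r² and R². These are not the three models from the author's figure; the prices and model names are invented. The last few lines check the special case mentioned above: when the predictions come from a least-squares straight-line fit with an intercept, R² of the fitted values equals r².

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = ((observed - predicted) ** 2).sum()
    ss_tot = ((observed - observed.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual stock prices and three made-up sets of predictions
actual = np.array([50.0, 52.0, 55.0, 53.0, 58.0, 60.0])
models = {
    "close to 1:1": actual + np.array([0.5, -0.4, 0.3, -0.6, 0.2, -0.1]),
    "constant offset": actual + 10.0,  # follows perfectly, always $10 too high
    "wrong scale": 0.5 * actual,       # follows perfectly, half the magnitude
}

results = {}
for name, predicted in models.items():
    r = np.corrcoef(actual, predicted)[0, 1]
    results[name] = (r ** 2, r_squared(actual, predicted))
    print(f"{name:16s} r^2 = {results[name][0]:.3f}   R^2 = {results[name][1]:.3f}")

# Special case: a least-squares line fitted to (x, actual), with an intercept
x = models["close to 1:1"]
slope, intercept = np.polyfit(x, actual, 1)
fitted = slope * x + intercept
r2_pearson = np.corrcoef(x, actual)[0, 1] ** 2
r2_fit = r_squared(actual, fitted)
print(f"linear fit       r^2 = {r2_pearson:.3f}   R^2 = {r2_fit:.3f}")
```

The offset and wrongly scaled predictions score an r² of essentially 1 while their R² is strongly negative, which is why r² alone says little about whether a model's predictions can be trusted at face value.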
Recap
If you want a math-y explanation of the difference between r² and R², check out this excellent article by Deepak Khandelwal.
Is the correlation coefficient the same as the coefficient of determination?
No. The Pearson correlation coefficient (r) is used to identify patterns in data, whereas the coefficient of determination (R²) is used to quantify the strength of a model.
Is the coefficient of correlation calculated as the square of the slope?
No, the steepness of the line isn't related to the magnitude of the correlation coefficient; only the sign of the slope (upward or downward) matches the sign of r. The correlation coefficient tells you how closely your data fit a line, so two datasets with the same correlation coefficient can have very different slopes.
Is R
The correlation coefficient formula tells you how strong the linear relationship between two variables is. In a simple least-squares linear regression, R squared equals the square of the correlation coefficient, r (hence the term r squared); for other models the two can differ, as discussed above.