Which of the following is true when the coefficient of determination is equal to 1?


What is the Coefficient of Determination?

The coefficient of determination (R², or r-squared) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variable. In other words, the coefficient of determination tells you how well the data fit the model (the goodness of fit).

Although the coefficient of determination provides some useful insights regarding the regression model, one should not rely solely on this measure when assessing a statistical model. It does not disclose information about the causal relationship between the independent and dependent variables, and it does not indicate the correctness of the regression model. Therefore, the user should always draw conclusions about the model by analyzing the coefficient of determination together with other diagnostics of the statistical model.

The coefficient of determination can take any value between 0 and 1. In addition, the statistical metric is frequently expressed as a percentage.

Interpretation of the Coefficient of Determination (R²)

The most common interpretation of the coefficient of determination is how well the regression model fits the observed data. For example, a coefficient of determination of 60% means that 60% of the variation in the dependent variable is explained by the regression model. Generally, a higher coefficient indicates a better fit for the model.

However, it is not always the case that a high r-squared is good for the regression model. The quality of the coefficient depends on several factors, including the units of measure of the variables, the nature of the variables employed in the model, and the applied data transformation. Thus, sometimes, a high coefficient can indicate issues with the regression model.

No universal rule governs how to incorporate the coefficient of determination in the assessment of a model. The context of the forecast or experiment is extremely important, and in different scenarios the insights from the statistical metric can vary.

Calculation of the Coefficient

Mathematically, the coefficient of determination can be found using the following formula:

\[R^2=\frac{SS_{\text{regression}}}{SS_{\text{total}}}\]

Where:

  • \(SS_{\text{regression}}\) – the sum of squares due to regression (explained sum of squares)
  • \(SS_{\text{total}}\) – the total sum of squares

Although the terms “total sum of squares” and “sum of squares due to regression” seem confusing, the variables’ meanings are straightforward.

The total sum of squares measures the variation in the observed data (data used in regression modeling). The sum of squares due to regression measures how well the regression model represents the data that were used for modeling.
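
To make the calculation concrete, here is a minimal Python sketch of the formula above. The data are invented for illustration, and the fitted line comes from an ordinary least-squares fit:

    import numpy as np

    # Invented sample data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit a simple linear regression y = b0 + b1*x by least squares
    # (np.polyfit returns the highest-degree coefficient first)
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b0 + b1 * x

    ss_regression = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    ss_total = np.sum((y - y.mean()) ** 2)           # total sum of squares

    print(ss_regression / ss_total)  # the coefficient of determination, R²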

Let's start our investigation of the coefficient of determination, r², by looking at two different examples: one in which the relationship between the response y and the predictor x is very weak, and a second in which the relationship is fairly strong. If our measure is going to work well, it should be able to distinguish between these two very different situations.

Consider first a plot illustrating a very weak relationship between y and x. There are two lines on the plot: a horizontal line placed at the average response, \(\bar{y}\), and a shallow-sloped estimated regression line, \(\hat{y}\). Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much change in the average response y. Also note that the data points do not "hug" the estimated regression line. For this dataset:

\(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1\)

\(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5\)

\(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6\)

These calculations show contrasting "sums of squares" values:

  • SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, \(\hat{y}_i\), is from the horizontal "no relationship line," the sample mean or \(\bar{y}\).
  • SSE is the "error sum of squares" and quantifies how much the data points, \(y_i\), vary around the estimated regression line, \(\hat{y}_i\).
  • SSTO is the "total sum of squares" and quantifies how much the data points, \(y_i\), vary around their mean, \(\bar{y}\).

Note that SSTO = SSR + SSE (119.1 + 1708.5 = 1827.6). The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6, or 0.065; as defined below, this quantity is the r² value for this dataset.

Contrast the above example with a second dataset showing a fairly convincing relationship between y and x. Here, the slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And here the data points do "hug" the estimated regression line:

\(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3\)

\(SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5\)

\(SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8\)

The sums of squares for this dataset tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3), not just due to random error (SSE = 1708.5). And SSR divided by SSTO is 6679.3/8487.8, or 0.799, which again is the r² value for this dataset.

The previous two examples suggest how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted r², is the regression sum of squares divided by the total sum of squares. Alternatively, since SSTO = SSR + SSE, the quantity r² also equals one minus the ratio of the error sum of squares to the total sum of squares:

\[r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}\]
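
As a quick numerical check, the following Python sketch (again with invented data) verifies both the decomposition SSTO = SSR + SSE and the equivalence of the two forms of the formula:

    import numpy as np

    # Invented sample data, for illustration only
    x = np.array([4.0, 8.0, 12.0, 16.0, 20.0])
    y = np.array([10.0, 17.0, 30.0, 35.0, 47.0])

    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x

    ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)          # error sum of squares
    ssto = np.sum((y - y.mean()) ** 2)      # total sum of squares

    print(np.isclose(ssto, ssr + sse))   # True: SSTO = SSR + SSE
    print(ssr / ssto, 1 - sse / ssto)    # both expressions give the same r²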

Here are some basic characteristics of the measure:

  • Since r² is a proportion, it is always a number between 0 and 1.
  • If r² = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!
  • If r² = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y! Both extremes are illustrated in the sketch below.
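
The two extreme cases are easy to reproduce. In this small sketch, with invented data, the first dataset falls exactly on a line (r² = 1) and the second produces a perfectly horizontal fitted line (r² = 0):

    import numpy as np

    def r_squared(x, y):
        """Compute r² for a simple linear regression of y on x."""
        slope, intercept = np.polyfit(x, y, deg=1)
        y_hat = intercept + slope * x
        return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

    x = np.array([1.0, 2.0, 3.0, 4.0])

    print(r_squared(x, 2 * x + 1))                       # ~1.0: points fall exactly on a line
    print(r_squared(x, np.array([3.0, 5.0, 5.0, 3.0])))  # ~0.0: fitted line is horizontal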

We've learned the interpretation for the two easy cases, when r² = 0 or r² = 1, but how do we interpret r² when it is some number between 0 and 1, like 0.23 or 0.57? Here are two similar, yet slightly different, ways in which the coefficient of determination r² can be interpreted. We say either:

"r2 ×100 percent of the variation in y is reduced by taking into account predictor x"

or:

"r2 ×100 percent of the variation in y is "explained by" the variation in predictor x."

Many statisticians prefer the first interpretation; I tend to favor the second. The risk with the second interpretation (hence the quotation marks around "explained by") is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation: just because a dataset has a large r-squared value does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "r² × 100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "What's considered a large r-squared value?" It depends on the research area. Social scientists, who are often trying to learn something about the huge variation in human behavior, will tend to find it very hard to get r-squared values much above, say, 25% or 30%. Engineers, on the other hand, who tend to study more exact systems, would likely find an r-squared value of just 30% unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Let's revisit the skin cancer mortality example (skincancer.txt). Any statistical software that performs simple linear regression analysis will report the r-squared value for you, which in this case is 67.98%, or 68% to the nearest whole number.

We can say that 68% of the variation in the skin cancer mortality rate is reduced by taking into account latitude. Or, we can say — with knowledge of what it really means — that 68% of the variation in skin cancer mortality is "explained by" latitude.
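
For reference, here is a minimal sketch of how one might compute this with standard Python tooling. It assumes skincancer.txt is a whitespace-delimited text file with columns named "Lat" (latitude) and "Mort" (mortality rate); the column names are an assumption here, so adjust them to match the actual file:

    import pandas as pd
    from scipy import stats

    # Assumed layout: whitespace-delimited columns including "Lat" and "Mort"
    data = pd.read_csv("skincancer.txt", sep=r"\s+")

    result = stats.linregress(data["Lat"], data["Mort"])
    r_squared = result.rvalue ** 2          # square the correlation to get r²
    print(f"r-squared = {r_squared:.4f}")   # reported above as 0.6798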

Can the coefficient of determination be 1?

Yes. The coefficient of determination is a number between 0 and 1 that measures how well a statistical model predicts an outcome; it equals 1 only when the model's predictions match the observed values exactly.

If the coefficient of determination is equal to 1, what is the correlation coefficient?

The coefficient of determination is the square of the correlation (r), so it ranges from 0 to 1. With simple linear regression, the coefficient of determination equals the square of the correlation between the x and y variables, so a coefficient of determination of 1 implies a correlation coefficient of +1 or −1.
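
This relationship is easy to verify numerically. A short Python sketch with invented data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented data, for illustration
    y = np.array([1.8, 4.1, 5.9, 8.2, 9.9])

    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient

    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x
    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

    print(np.isclose(r ** 2, r2))   # True: r² equals the squared correlation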

Which of the following is true of the coefficient of determination?

It always lies between 0 and 1; it can never be negative.

What does an R² value of 1 mean?

R-squared, otherwise known as R², typically has a value in the range of 0 to 1. A value of 1 indicates that the predictions are identical to the observed values; it is not possible to have an R² value of more than 1.
