THE DERIVATION IN THIS POST IS INCORRECT, PLEASE SEE AN UPDATED ONE INSTEAD.

Coefficient of determination (a.k.a. $R^2$)

Consider the ordinary least squares (OLS) model:

$$y = X\beta + \epsilon$$

After fitting the model to the data (X, y), let $\hat{\beta}$ denote the estimated coefficients and define the fitted values

$$\hat{y} = X\hat{\beta}$$

We would like to understand the relationship between the variance of $y$ and that of $\hat{y}$.

$$V[y] = V[\hat{y} + \epsilon] = V[\hat{y}] + V[\epsilon] + 2\,\mathrm{Cov}(\hat{y}, \epsilon) = V[\hat{y}] + V[\epsilon]$$

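See the Supplemental for a proof of the second equality. The third equality holds because the covariance between $\hat{y}$ and $\epsilon$ is 0: reading $\epsilon$ as the fitted residual $y - \hat{y}$, OLS with an intercept leaves the residuals uncorrelated with the fitted values.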
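The decomposition above can be checked numerically. The following is a minimal sketch, assuming a single predictor with an intercept, synthetic data, and population (1/n) variances; the helper names `mean` and `cov` and the data-generating parameters are illustrative and not from the original post. Here the error term is taken to be the fitted residual, y minus the fitted value.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Population (1/n) covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


# Synthetic data (assumed for illustration only).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [2.0 + 3.0 * xi + random.gauss(0, 1) for xi in x]

# Closed-form OLS fit for one predictor plus an intercept.
slope = cov(x, y) / cov(x, x)
intercept = mean(y) - slope * mean(x)

y_hat = [intercept + slope * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, y_hat)]

var_y = cov(y, y)
var_fit = cov(y_hat, y_hat)
var_res = cov(resid, resid)
cov_fit_res = cov(y_hat, resid)

print("Cov(y_hat, resid):", cov_fit_res)
print("V[y]:", var_y)
print("V[y_hat] + V[resid]:", var_fit + var_res)
```

Up to floating-point error, the covariance term should vanish and the two variance totals should agree.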

Define

$$R^2 = \frac{V[\hat{y}]}{V[y]} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Therefore, $R^2$ is the ratio of $V[\hat{y}]$ to $V[y]$, and it's commonly interpreted as the fraction of the variance in $y$ that can be explained by the OLS model.
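As a rough illustration of this definition, the sketch below computes the variance ratio on synthetic data and compares it with the residual-based form, 1 minus the residual sum of squares over the total sum of squares. That second form is not used in this post; it is included here only as a cross-check, and the two agree for an OLS fit with an intercept. The data and variable names are assumptions made for the example.

```python
import random


def mean(v):
    return sum(v) / len(v)


def variance(v):
    """Population (1/n) variance."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


# Synthetic data (assumed for illustration only).
random.seed(42)
x = [random.uniform(0, 10) for _ in range(300)]
y = [4.0 + 0.7 * xi + random.gauss(0, 1.5) for xi in x]

# Closed-form OLS fit for one predictor plus an intercept.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
y_hat = [intercept + slope * a for a in x]

# R^2 as defined in the text: explained variance over total variance.
r2_ratio = variance(y_hat) / variance(y)

# Cross-check: 1 - SS_res / SS_tot.
ss_res = sum((b - f) ** 2 for b, f in zip(y, y_hat))
ss_tot = sum((b - my) ** 2 for b in y)
r2_resid = 1.0 - ss_res / ss_tot

print("R^2 (variance ratio):", r2_ratio)
print("R^2 (1 - SS_res/SS_tot):", r2_resid)
```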

Pearson correlation coefficient

$$\rho = \frac{\mathrm{Cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}}$$

See definition on Wikipedia.

Relationship between $\rho$ and $R^2$

Now that we've defined both the coefficient of determination and the Pearson correlation coefficient, let's see how they are related.

Note

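$$\mathrm{Cov}(y, \hat{y}) = \mathrm{Cov}(\hat{y} + \epsilon, \hat{y}) = \mathrm{Cov}(\hat{y}, \hat{y}) + \mathrm{Cov}(\epsilon, \hat{y}) = V[\hat{y}]$$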

See the Supplemental for a proof of the second equality; the third uses the fact that the covariance between $\epsilon$ and $\hat{y}$ is 0. So

$$\rho^2 = \frac{V[\hat{y}]^2}{V[y]\,V[\hat{y}]} = \frac{V[\hat{y}]}{V[y]} = R^2$$

Therefore, $\rho^2 = R^2$, neat!

Here is a notebook I wrote demonstrating this result.
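Separately from that notebook, the identity can be sanity-checked in a few lines. This is a minimal sketch under stated assumptions: one predictor with an intercept, synthetic data, and population (1/n) moments. The helper names `mean`, `cov`, and `fit_line` are hypothetical and not taken from the notebook.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Population (1/n) covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def fit_line(x, y):
    """Closed-form OLS for one predictor plus an intercept."""
    slope = cov(x, y) / cov(x, x)
    intercept = mean(y) - slope * mean(x)
    return intercept, slope


# Synthetic data (assumed for illustration only).
random.seed(1)
x = [random.uniform(-5, 5) for _ in range(500)]
y = [1.5 - 0.8 * xi + random.gauss(0, 2) for xi in x]

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]

# Coefficient of determination as the variance ratio.
r2 = cov(y_hat, y_hat) / cov(y, y)

# Pearson correlation between observed and fitted values.
rho = cov(y, y_hat) / math.sqrt(cov(y, y) * cov(y_hat, y_hat))

print("R^2:", r2)
print("rho^2:", rho ** 2)
```

The two printed values should match up to floating-point error.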

Supplemental

Note it’s straightforward to prove

$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \tag{3}$$

This fact will be used in both proofs below.
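For completeness, the covariance identity follows from expanding the product and using linearity of expectation, treating $E[X]$ and $E[Y]$ as constants:

$$
\begin{aligned}
\mathrm{Cov}(X, Y) &= E[(X - E[X])(Y - E[Y])] \\
&= E\big[XY - X\,E[Y] - E[X]\,Y + E[X]E[Y]\big] \\
&= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\
&= E[XY] - E[X]E[Y]
\end{aligned}
$$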

Proof for V[X+Y]=V[X]+V[Y]+2Cov(X,Y)

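$$
\begin{aligned}
V[X+Y] &= E[(X+Y)^2] - E[X+Y]^2 \\
&= E[(X+Y)^2] - (E[X] + E[Y])^2 \\
&= E[X^2] + E[Y^2] + E[2XY] - E[X]^2 - E[Y]^2 - 2E[X]E[Y] \\
&= (E[X^2] - E[X]^2) + (E[Y^2] - E[Y]^2) + 2(E[XY] - E[X]E[Y]) \\
&= V[X] + V[Y] + 2\,\mathrm{Cov}(X, Y)
\end{aligned}
$$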

The final equality uses Eq. (3).
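The identity also holds exactly for sample moments, which makes it easy to check. A small sketch, assuming synthetic correlated data and population (1/n) moments; the helper and variable names are illustrative only.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Population (1/n) covariance; cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


# Two correlated synthetic samples (assumed for illustration only).
random.seed(2)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [0.5 * a + random.gauss(0, 1) for a in xs]
sums = [a + b for a, b in zip(xs, ys)]

lhs = cov(sums, sums)                                # V[X + Y]
rhs = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)    # V[X] + V[Y] + 2 Cov(X, Y)

print("V[X+Y]:", lhs)
print("V[X] + V[Y] + 2Cov(X,Y):", rhs)
```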

Proof for Cov(X,(Y+Z))=Cov(X,Y)+Cov(X,Z)

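$$
\begin{aligned}
\mathrm{Cov}(X, Y+Z) &= E[X(Y+Z)] - E[X]E[Y+Z] \\
&= E[XY] + E[XZ] - E[X]E[Y] - E[X]E[Z] \\
&= (E[XY] - E[X]E[Y]) + (E[XZ] - E[X]E[Z]) \\
&= \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)
\end{aligned}
$$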

The 1st and 4th equalities used Eq. (3).
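Additivity of covariance in its second argument can be checked the same way on sample moments. A minimal sketch, assuming three synthetic samples and population (1/n) covariance; all names and parameters here are illustrative.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Population (1/n) covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


# Three synthetic samples, with Y and Z both depending on X
# (assumed for illustration only).
random.seed(3)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [2.0 * a + random.gauss(0, 1) for a in xs]
zs = [-0.5 * a + random.gauss(0, 1) for a in xs]
y_plus_z = [b + c for b, c in zip(ys, zs)]

lhs = cov(xs, y_plus_z)            # Cov(X, Y + Z)
rhs = cov(xs, ys) + cov(xs, zs)    # Cov(X, Y) + Cov(X, Z)

print("Cov(X, Y+Z):", lhs)
print("Cov(X,Y) + Cov(X,Z):", rhs)
```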

References:

  • https://economictheoryblog.com/2014/11/05/the-coefficient-of-determination-latex-r2/
  • https://economictheoryblog.com/2014/11/05/proof/