Relationship between R squared and Pearson correlation coefficient
THE DERIVATION IN THIS POST IS INCORRECT, PLEASE SEE AN UPDATED ONE INSTEAD.
Coefficient of determination (a.k.a. $R^2$)
Consider the ordinary least squares (OLS) model:
$$y = X\beta + \epsilon$$
After fitting the model to the data $(X, y)$, let
$$\hat{y} = X\hat{\beta}$$
where $\hat{\beta}$ is the fitted coefficient vector. We would like to understand the relationship between the variance of $y$ and that of $\hat{y}$.
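$$V[y] = V[\hat{y} + \epsilon] = V[\hat{y}] + V[\epsilon] + 2\,\mathrm{Cov}(\hat{y}, \epsilon) = V[\hat{y}] + V[\epsilon]$$
Here $\epsilon$ stands for the residual $y - \hat{y}$. See the proof for the second equality in the Supplemental. The third equality holds because $\hat{y}$ and $\epsilon$ are uncorrelated, so their covariance is 0; for an OLS fit that includes an intercept, the fitted values and the residuals have exactly zero sample covariance.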
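The decomposition above can be checked numerically. The following is a minimal sketch, assuming a single predictor plus an intercept and synthetic data; the helper names (`mean`, `cov`, `var`, `fit_line`) and the data-generating numbers are made up for illustration and are not taken from the original post or its notebook.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length sequences (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def var(a):
    return cov(a, a)


def fit_line(x, y):
    """Closed-form OLS fit of y on one predictor x with an intercept."""
    slope = cov(x, y) / var(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


random.seed(0)
# Synthetic data: the true coefficients and noise level are arbitrary choices.
x = [random.uniform(0, 10) for _ in range(200)]
y = [1.5 + 2.0 * xi + random.gauss(0, 3) for xi in x]

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
resid = [yi - yhi for yi, yhi in zip(y, y_hat)]

# Covariance between fitted values and residuals: zero up to rounding error.
print("Cov(y_hat, resid):", cov(y_hat, resid))
# The two sides of V[y] = V[y_hat] + V[resid].
print("V[y]:", var(y))
print("V[y_hat] + V[resid]:", var(y_hat) + var(resid))
```

Dropping the intercept from `fit_line` would generally break the zero-covariance property, which is why the sketch fits one explicitly.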
Define
$$R^2 = \frac{V[\hat{y}]}{V[y]} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Therefore, $R^2$ measures the ratio of $V[\hat{y}]$ to $V[y]$, and it's commonly interpreted as the fraction of the variance in $y$ that is explained by the OLS model.
Pearson correlation coefficient
$$\rho = \frac{\mathrm{Cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}}$$
See definition on Wikipedia.
Relationship between $\rho$ and $R^2$
Now that we've defined both the coefficient of determination and the Pearson correlation coefficient, let's see how they relate.
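Note that
$$\mathrm{Cov}(y, \hat{y}) = \mathrm{Cov}(\hat{y} + \epsilon, \hat{y}) = \mathrm{Cov}(\hat{y}, \hat{y}) + \mathrm{Cov}(\epsilon, \hat{y}) = V[\hat{y}]$$
See the proof for the second equality in the Supplemental; the third equality uses $\mathrm{Cov}(\epsilon, \hat{y}) = 0$. So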
$$\rho^2 = \frac{V[\hat{y}]^2}{V[y]\,V[\hat{y}]} = \frac{V[\hat{y}]}{V[y]} = R^2$$
Therefore, $\rho^2 = R^2$, neat!
Here is a notebook I wrote demonstrating this result.
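Separately from that notebook, the identity can be spot-checked with a few lines of standard-library Python. This is only a sketch under assumptions: one predictor, an intercept in the fit, and synthetic data whose coefficients are arbitrary; the helper names are hypothetical and chosen for this example.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length sequences (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


random.seed(1)
# Synthetic data; the slope, intercept and noise level are made up.
x = [random.uniform(-5, 5) for _ in range(500)]
y = [0.7 - 1.3 * xi + random.gauss(0, 2) for xi in x]

# Closed-form OLS fit of y on x with an intercept.
slope = cov(x, y) / cov(x, x)
intercept = mean(y) - slope * mean(x)
y_hat = [intercept + slope * xi for xi in x]

# R^2 as the ratio of variances, following the definition in this post.
r_squared = cov(y_hat, y_hat) / cov(y, y)

# Pearson correlation between y and the fitted values.
rho = cov(y, y_hat) / math.sqrt(cov(y, y) * cov(y_hat, y_hat))

print("R^2   :", r_squared)
print("rho^2 :", rho ** 2)
```

The two printed numbers should agree up to floating-point rounding. The check relies on the fit including an intercept; without one, the variance-ratio form of $R^2$ and $\rho^2$ need not coincide.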
Supplemental
Note it’s straightforward to prove
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \tag{3}$$
This fact is used in both proofs below.
Proof for V[X+Y]=V[X]+V[Y]+2Cov(X,Y)
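$$\begin{aligned}
V[X+Y] &= E[(X+Y)^2] - E[X+Y]^2 \\
&= E[(X+Y)^2] - (E[X] + E[Y])^2 \\
&= E[X^2] + E[Y^2] + E[2XY] - E[X]^2 - E[Y]^2 - 2E[X]E[Y] \\
&= (E[X^2] - E[X]^2) + (E[Y^2] - E[Y]^2) + 2(E[XY] - E[X]E[Y]) \\
&= V[X] + V[Y] + 2\,\mathrm{Cov}(X, Y)
\end{aligned}$$
The last equality uses Eq. (3).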
Proof for Cov(X,(Y+Z))=Cov(X,Y)+Cov(X,Z)
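$$\begin{aligned}
\mathrm{Cov}(X, (Y+Z)) &= E[X(Y+Z)] - E[X]E[Y+Z] \\
&= E[XY] + E[XZ] - E[X]E[Y] - E[X]E[Z] \\
&= (E[XY] - E[X]E[Y]) + (E[XZ] - E[X]E[Z]) \\
&= \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Z)
\end{aligned}$$
The 1st and 4th equalities use Eq. (3).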
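Both identities proved in this Supplemental also hold exactly for sample moments, so they are easy to sanity-check. The sketch below uses made-up data in which `x` and `y` are deliberately correlated, so the covariance term is not trivially zero; the helper names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length sequences (divides by n)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


random.seed(2)
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]  # correlated with x on purpose
z = [random.uniform(-1, 1) for _ in range(n)]

x_plus_y = [xi + yi for xi, yi in zip(x, y)]
y_plus_z = [yi + zi for yi, zi in zip(y, z)]

# V[X+Y] = V[X] + V[Y] + 2 Cov(X, Y)
lhs_var = cov(x_plus_y, x_plus_y)
rhs_var = cov(x, x) + cov(y, y) + 2 * cov(x, y)

# Cov(X, Y+Z) = Cov(X, Y) + Cov(X, Z)
lhs_cov = cov(x, y_plus_z)
rhs_cov = cov(x, y) + cov(x, z)

print("variance identity  :", lhs_var, rhs_var)
print("covariance identity:", lhs_cov, rhs_cov)
```

Each pair of printed values should match up to floating-point rounding, whatever seed or distributions are used.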
References:
- https://economictheoryblog.com/2014/11/05/the-coefficient-of-determination-latex-r2/
- https://economictheoryblog.com/2014/11/05/proof/