Analytical solutions to different sum-of-squares loss functions for linear regression
Introduction
The linear model:
\[\begin{align} X \hat{\mathbf{w}} &= \hat{\mathbf{y}} \end{align}\]where
- $X$ is an $N \times D$ data matrix, where $N$ is the number of data points and $D$ is the number of feature dimensions,
- $\hat{\mathbf{w}}$ is the vector of coefficients to be estimated, and
- $\hat{\mathbf{y}}$ is the vector of predicted targets.
Suppose the ground truth targets are represented by $\mathbf{y}$.
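As a quick illustration, here is a minimal NumPy sketch of the model $\hat{\mathbf{y}} = X\hat{\mathbf{w}}$; the shapes and random values are assumptions chosen only to make the snippet runnable.

```python
import numpy as np

# Minimal sketch of the linear model y_hat = X @ w_hat.
# N, D and the random values below are illustrative assumptions.
rng = np.random.default_rng(0)
N, D = 100, 3                  # number of data points, feature dimensions

X = rng.normal(size=(N, D))    # N x D data matrix
w_hat = rng.normal(size=D)     # D-dimensional coefficient vector
y_hat = X @ w_hat              # N-dimensional vector of predicted targets
```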
Regular sum-of-squares loss
\[\begin{align} L &= \frac{1}{2} \left(\hat{\mathbf{y}} - \mathbf{y} \right)^T \left(\hat{\mathbf{y}} - \mathbf{y} \right) \\ &= \frac{1}{2} \left(X\hat{\mathbf{w}} - \mathbf{y} \right)^T \left(X\hat{\mathbf{w}} - \mathbf{y} \right) \end{align}\]Take the derivative of $L$ with respect to $\hat{\mathbf{w}}$ and set it to zero,
\[\begin{align} \frac{\partial L}{\partial \hat{\mathbf{w}}} &= X^T \left(X\hat{\mathbf{w}} - \mathbf{y} \right) = 0 \\ X^TX\hat{\mathbf{w}} &= X^T \mathbf{y} \\ \hat{\mathbf{w}} &= (X^TX)^{-1}X^T \mathbf{y} \end{align}\]So the predictions are
\[\begin{align} \hat{\mathbf{y}} &= X(X^TX)^{-1}X^T \mathbf{y} \end{align}\]
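As a sanity check, here is one way the closed-form solution above could be computed in NumPy. The function name `ols_fit` and the synthetic data are assumptions for illustration; solving the normal equations with `np.linalg.solve` sidesteps forming the explicit inverse of $X^TX$.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form ordinary least squares: w_hat = (X^T X)^{-1} X^T y.

    Solves the normal equations X^T X w = X^T y directly instead of
    computing the matrix inverse.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)   # ground truth targets with noise

w_hat = ols_fit(X, y)
y_hat = X @ w_hat                           # predictions X (X^T X)^{-1} X^T y
```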
Weighted sum-of-squares loss
\[\begin{align} L &= \frac{1}{2} \left(\hat{\mathbf{y}} - \mathbf{y} \right)^T R \left(\hat{\mathbf{y}} - \mathbf{y} \right) \\ &= \frac{1}{2} \left(X\hat{\mathbf{w}} - \mathbf{y} \right)^T R \left(X\hat{\mathbf{w}} - \mathbf{y} \right) \end{align}\]where $R$ is a diagonal matrix containing a weight for each data point, of the form \(\begin{bmatrix} r_1 & & \\ & \ddots & \\ & & r_N \end{bmatrix}\).
As in the regular sum-of-squares case, take the derivative of $L$ with respect to $\hat{\mathbf{w}}$ and set it to zero,
\[\begin{align} \frac{\partial L}{\partial \hat{\mathbf{w}}} &= X^T R \left(X\hat{\mathbf{w}} - \mathbf{y} \right) = 0 \\ X^T R X\hat{\mathbf{w}} &= X^T R \mathbf{y} \\ \hat{\mathbf{w}} &= (X^T R X)^{-1}X^T R \mathbf{y} \end{align}\]and the predictions are
\[\begin{align} \hat{\mathbf{y}} &= X(X^T R X)^{-1}X^T R \mathbf{y} \end{align}\]
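The weighted solution can be computed the same way. The sketch below is an assumption-laden illustration (the function name `wls_fit`, the weights, and the synthetic data are not from the text); it scales the rows of $X$ and the entries of $\mathbf{y}$ by the weights rather than building the full $N \times N$ diagonal matrix $R$.

```python
import numpy as np

def wls_fit(X, y, r):
    """Closed-form weighted least squares: w_hat = (X^T R X)^{-1} X^T R y,
    where R = diag(r) holds one weight per data point.
    """
    XtR = X.T * r                          # equivalent to X.T @ np.diag(r)
    return np.linalg.solve(XtR @ X, XtR @ y)

# Synthetic example reusing the setup above (assumed, for illustration only).
rng = np.random.default_rng(1)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
r = rng.uniform(0.5, 2.0, size=N)          # per-data-point weights r_1, ..., r_N

w_hat = wls_fit(X, y, r)
y_hat = X @ w_hat                          # predictions X (X^T R X)^{-1} X^T R y
```

Note that setting every weight $r_i = 1$ reduces this to the regular sum-of-squares solution, since $R$ becomes the identity matrix.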