Here, I summarize what I’ve learned about logistic regression from three sources.

TL;DR:

With the posterior set to the logistic (sigmoid) form (Eq. \eqref{eq:logistic}), logistic regression finds a (linear) hyperplane that maximizes the likelihood of observing the given data, based on the distances from the data points to the hyperplane.

# From Ali Ghodsi’s lecture

The lecture introduces logistic regression by modeling the posterior directly.

\begin{equation} p(y=1|x;\beta) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}} \label{eq:logistic} \end{equation}

Its complement is

\begin{equation} p(y=0|x;\beta) = 1 - p(y=1|x;\beta) = \frac{1}{1 + e^{\beta^T x}} \end{equation}
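As a quick sanity check of this posterior in code (a minimal NumPy sketch; the point $x$, the coefficients $\beta$, and the function name are made up for illustration):

```python
import numpy as np

def posterior(x, beta):
    # p(y=1 | x; beta) in the logistic (sigmoid) form above.
    z = beta @ x
    # For large z this form can overflow; 1 / (1 + exp(-z)) is the
    # numerically safer way to write the same sigmoid.
    return np.exp(z) / (1.0 + np.exp(z))

x = np.array([1.0, 2.0, -1.0])     # one point; the leading 1 is the intercept term
beta = np.array([0.5, -0.3, 0.8])
p1 = posterior(x, beta)
print(p1, 1.0 - p1)                # p(y=1|x) and p(y=0|x) sum to 1
```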

Combining the two,

\begin{equation} p(y|x;\beta) = \left(\frac{e^{\beta^T x}}{1 + e^{\beta^T x}}\right)^y \left(\frac{1}{1 + e^{\beta^T x}}\right)^{1-y} \end{equation}

Given $x$ and $y$ from the data, we want to estimate $\beta$ with maximum likelihood.

\begin{equation} L(\beta) = \prod_{i}^{n}p(y_i|x_i;\beta) \end{equation}

To maximize $L$ is equivalent to maximizing its logarithm, the log-likelihood

\begin{equation} l(\beta) = \sum_{i}^{n} y_i \log(p(x_i|\beta)) + (1 - y_i) \log(1 - p(x_i|\beta)) \label{eq:log-likelihood} \end{equation}

where $p(x_i|\beta)$ is shorthand for $p(y=1|x_i;\beta)$. Taking its derivative,

\begin{align} \frac {\partial l(\beta)}{\partial \beta} &= \sum_{i}^{n} (y_i - p(x_i|\beta))\, x_i \end{align}
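A minimal NumPy sketch of these two quantities (assuming $X$ is the $n \times (d+1)$ data matrix with a leading column of ones, $y$ is a 0/1 label vector, and the function names are my own):

```python
import numpy as np

def log_likelihood(beta, X, y):
    # l(beta) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # p_i = p(x_i | beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    # dl/dbeta = sum_i (y_i - p_i) x_i, i.e. X^T (y - P) in matrix form
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p)
```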

Setting this derivative to zero has no closed-form solution, so the equation is solved iteratively with the Newton-Raphson method (a.k.a. Newton's method), which also requires the second derivative:

\begin{align} \frac {\partial^2 l(\beta)}{\partial \beta \partial \beta ^T} &= -\sum_{i}^{n} p(x_i|\beta)\, (1 - p(x_i|\beta))\, x_i x_i^T \end{align}

Then update $\beta$ until convergence:

\begin{align} \beta^{\mathrm{new}} \leftarrow \beta ^{\mathrm{old}} - \left[\frac {\partial^2 l(\beta)}{\partial \beta \partial \beta ^T}\right]^{-1} \frac {\partial l(\beta)}{\partial \beta} \end{align}

Now put the Newton-Raphson method in matrix form.

Define

- the label vector $y$ of shape $n \times 1$,
- the data matrix $X$ of shape $n \times (d+1)$ (the extra column of ones accounts for the intercept),
- the vector $P$ of shape $n \times 1$, where $P_i = p(x_i|\beta)$,
- and the diagonal matrix $W$, where $W_{ii} = p(x_i|\beta)\,(1 - p(x_i|\beta))$.

Then,

\begin{align} \frac {\partial l(\beta)}{\partial \beta} = X^T (y - P) \end{align}

is of shape $(d+1) \times 1$.

\begin{align} \frac {\partial^2 l(\beta)}{\partial \beta \partial \beta ^T} = -X^T W X \end{align}

is of shape $(d+1) \times (d+1)$.

So

\begin{align} \beta^{\mathrm{new}} \leftarrow \beta ^{\mathrm{old}} + (X^T W X)^{-1} X^T(y - P) \end{align}
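Putting everything above into one loop, here is a minimal Newton-Raphson sketch in NumPy (the iteration cap `n_iter`, the tolerance `tol`, and the synthetic-data check are my own choices, not from the lecture):

```python
import numpy as np

def newton_logistic(X, y, n_iter=25, tol=1e-8):
    # X: (n, d+1) data matrix with a leading column of ones; y: (n,) 0/1 labels
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-X @ beta))  # P_i = p(x_i | beta)
        W = np.diag(P * (1.0 - P))           # W_ii = p (1 - p)
        # Newton step (X^T W X)^{-1} X^T (y - P): solve the linear system
        # rather than forming the matrix inverse explicitly.
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - P))
        beta = beta + step
        if np.linalg.norm(step) < tol:       # stop once the update stalls
            break
    return beta

# Quick check on synthetic data: the estimate should be close to true_beta.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(newton_logistic(X, y))
```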

# From Andrew Ng’s course

Also summarized here.

The posterior is set to be

\begin{equation} p(y=1|x;\beta) = \frac{1}{1 + e^{-\beta^T x}} \end{equation}

which is the same sigmoid as Eq. \eqref{eq:logistic}, just written with a negated exponent.

Logistic regression is solved by minimizing a cost function (the log loss)

\begin{equation} J(\beta) = - \sum_{i}^{n} \left[ y_i \log (p(x_i|\beta)) + (1 - y_i) \log (1 - p(x_i|\beta)) \right] \end{equation}

or simply

\begin{equation} J(\beta) = - \sum \left[ y \log (p) + (1 - y) \log (1 - p) \right] \label{eq:logloss} \end{equation}

and $J(\beta)$ can be minimized with gradient descent.
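A minimal gradient-descent sketch for Eq. \eqref{eq:logloss} (again NumPy; the learning rate `lr`, the iteration count, and averaging the gradient over the $n$ points are my own choices):

```python
import numpy as np

def gd_logistic(X, y, lr=0.5, n_iter=5000):
    # Minimize the log loss J(beta) by batch gradient descent.
    # dJ/dbeta = -X^T (y - P), the negative of the log-likelihood gradient.
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-X @ beta))
        grad_J = -X.T @ (y - P) / n  # averaged over n for a stable step size
        beta = beta - lr * grad_J
    return beta
```

It should land near the same $\beta$ as the Newton-Raphson version above, just after many more (but cheaper) iterations.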

Comparing Eq. \eqref{eq:logloss} with Eq. \eqref{eq:log-likelihood}, the log loss is exactly the negative of the log-likelihood. That minus sign is also why the log-likelihood is maximized while the log loss is minimized.