Softmax Regression
== Introduction ==

'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalisation of logistic regression to problems where there are more than two class labels. An example is classifying the digits from the MNIST data set, where each input can be labelled with one of 10 possible class labels.

== Mathematical form ==

Formally, we consider the classification problem where we have <math>m</math> <math>d</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes. Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that

<math>
\begin{align}
h(x^{(i)}) = \begin{bmatrix} P(y^{(i)} = 1 | x^{(i)}) \\ P(y^{(i)} = 2 | x^{(i)}) \\ \vdots \\ P(y^{(i)} = n | x^{(i)}) \end{bmatrix}
= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix}
\end{align}
</math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>d</math>-dimensional column vectors that constitute the parameters of our hypothesis. Note that '''strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes''', but for convenience, we use <math>n</math> parameters in our derivation. We will explore this further in the later section on parameters.

Our objective is to maximise the likelihood of the data, <math>L(\theta; x, y) = \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) }</math>. For convenience, we consider the log-likelihood of the data,

<math>
\begin{align}
\ell(\theta) &= \ln L(\theta; x, y) \\
&= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\
&= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\
&= \sum_{i=1}^{m}{ \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) }
\end{align}
</math>

To find <math>\theta</math> such that <math>\ell(\theta)</math> is maximised, we first find the derivative of <math>\ell(\theta)</math> with respect to each <math>\theta_k</math>, the parameter vector for class <math>k</math>:

<math>
\begin{align}
\frac{\partial \ell(\theta)}{\partial \theta_k}
&= \sum_{i=1}^{m}{ \frac{\partial}{\partial \theta_k} \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) } \\
&= \sum_{i=1}^{m}{ \left( I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{ e^{ \theta_k^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } x^{(i)} \right) } \qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{ is 1 when } y^{(i)} = k \text{ and 0 otherwise)} \\
&= \sum_{i=1}^{m}{ \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) x^{(i)} }
\end{align}
</math>

With this, we can now find a set of parameters that maximises <math>\ell(\theta)</math>, for instance by using gradient ascent.
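To make this concrete, here is a minimal NumPy sketch of the hypothesis, the log-likelihood, and the gradient derived above, followed by a plain gradient-ascent loop. The function names, the toy data, and the step size are illustrative choices rather than part of the tutorial; labels are stored as <math>0, \ldots, n-1</math> rather than <math>1, \ldots, n</math>, and the per-row maximum is subtracted before exponentiating purely for numerical stability (a detail not discussed in the derivation).

<source lang="python">
import numpy as np

def softmax_probs(Theta, X):
    """Hypothesis h(x): row i holds P(y^{(i)} = 1 | x^{(i)}), ..., P(y^{(i)} = n | x^{(i)}).

    Theta -- (n, d) array whose rows are theta_1, ..., theta_n
    X     -- (m, d) array whose rows are the inputs x^{(1)}, ..., x^{(m)}
    """
    scores = X @ Theta.T                          # theta_j^T x^{(i)}, shape (m, n)
    scores -= scores.max(axis=1, keepdims=True)   # shift per row for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """ell(theta) = sum_i ln P(y^{(i)} | x^{(i)}); labels y are stored as 0, ..., n-1."""
    probs = softmax_probs(Theta, X)
    return np.sum(np.log(probs[np.arange(len(y)), y]))

def gradient(Theta, X, y):
    """d ell / d theta_k = sum_i (I{y^{(i)} = k} - P(y^{(i)} = k | x^{(i)})) x^{(i)}."""
    m, n = X.shape[0], Theta.shape[0]
    probs = softmax_probs(Theta, X)
    one_hot = np.zeros((m, n))
    one_hot[np.arange(m), y] = 1.0                # the indicator I{y^{(i)} = k}
    return (one_hot - probs).T @ X                # shape (n, d): one row per theta_k

# Gradient ascent on the log-likelihood (toy data and step size are arbitrary).
rng = np.random.default_rng(0)
m, d, n = 200, 5, 3
X = rng.normal(size=(m, d))
y = rng.integers(0, n, size=m)
Theta = np.zeros((n, d))
for _ in range(500):
    Theta += 0.01 * gradient(Theta, X, y)
print(log_likelihood(Theta, X, y))
</source>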
== Parameters ==

We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes. To see why this is so, consider our hypothesis again:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix} \\
&= \frac{e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \cdot \frac{1}{e^{ \theta_n^T x^{(i)} } } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix} \\
&= \frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j^T - \theta_n^T) x^{(i)} }} } \begin{bmatrix} e^{ (\theta_1^T - \theta_n^T) x^{(i)} } \\ e^{ (\theta_2^T - \theta_n^T) x^{(i)} } \\ \vdots \\ e^{ (\theta_n^T - \theta_n^T) x^{(i)} } \end{bmatrix}
\end{align}
</math>

Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2, \ldots, n - 1</math> gives

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ e^{ \Theta_2^T x^{(i)} } \\ \vdots \\ 1 \end{bmatrix}
\end{align}
</math>

showing that only <math>n - 1</math> parameter vectors are required.

[[TODO: Explain why overparameterisation may be a good thing?]]

=== Logistic regression ===

In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \end{bmatrix} \\
&= \frac{e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } } \cdot \frac{1}{e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \end{bmatrix} \\
&= \frac{1}{ e^{ -\Theta_1^T x^{(i)} } + 1 } \begin{bmatrix} 1 \\ e^{ -\Theta_1^T x^{(i)} } \end{bmatrix}
\end{align}
</math>

so the probability of the first class is the familiar sigmoid <math>\frac{1}{1 + e^{ -\Theta_1^T x^{(i)} }}</math>.
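Both claims in this section can be checked numerically. The short sketch below (reusing the illustrative <code>softmax_probs</code> helper from the earlier snippet, with toy data) verifies that subtracting <math>\theta_n</math> from every parameter vector leaves the hypothesis unchanged, and that with the second parameter vector fixed at zero the first output of the softmax is exactly the logistic sigmoid.

<source lang="python">
import numpy as np

# Numerical check of the two claims above, reusing softmax_probs from the earlier sketch.
rng = np.random.default_rng(1)
d, n = 4, 3
Theta = rng.normal(size=(n, d))
X = rng.normal(size=(10, d))

# 1. Redundancy: subtracting theta_n from every theta_j leaves h(x^{(i)}) unchanged.
Theta_shifted = Theta - Theta[-1]
assert np.allclose(softmax_probs(Theta, X), softmax_probs(Theta_shifted, X))

# 2. n = 2 special case: with the second parameter vector fixed at zero, the first
#    output of the softmax equals the logistic sigmoid 1 / (1 + exp(-Theta_1^T x)).
Theta2 = np.vstack([rng.normal(size=d), np.zeros(d)])
probs = softmax_probs(Theta2, X)
sigmoid = 1.0 / (1.0 + np.exp(-(X @ Theta2[0])))
assert np.allclose(probs[:, 0], sigmoid)
</source>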