Softmax Regression
From Ufldl
== Introduction ==

'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalisation of logistic regression to problems with more than two class labels. Recall that in logistic regression, our hypothesis was of the form:

<math>\begin{align} h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, \end{align}</math>

where we trained the logistic regression weights to maximise the log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In softmax regression, we are interested in multi-class problems where each example is assigned to one of <math>n</math> labels. One example of a multi-class classification problem is classifying digits on the MNIST dataset, where each example is given one of 10 possible labels (i.e., whether it is the digit 0, 1, ... or 9). To extend the logistic regression framework, which only outputs a single probability value, we consider a hypothesis that outputs <math>n</math> values (summing to 1) that represent the predicted probability distribution over the classes.

== Mathematical form ==

Formally, we consider the classification problem where we have <math>m</math> <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes. Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that

<math> \begin{align} h(x^{(i)}) = \begin{bmatrix} P(y^{(i)} = 1 | x^{(i)}) \\ P(y^{(i)} = 2 | x^{(i)}) \\ \vdots \\ P(y^{(i)} = n | x^{(i)}) \end{bmatrix} = \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \\ \end{bmatrix} \end{align} </math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis.
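The hypothesis above can be sketched in a few lines of NumPy. This is an illustrative implementation, not from the original text; note the common trick of subtracting the maximum score before exponentiating, which leaves the probabilities unchanged but avoids overflow:

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Return the vector h(x) of class probabilities.

    Theta : (n, k) array, one k-dimensional parameter vector per class.
    x     : (k,) input vector.
    (Names are illustrative, not from the original text.)
    """
    scores = Theta @ x                    # theta_j^T x for each class j
    scores = scores - scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # entries sum to 1
```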
Note that '''strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes''', but for convenience, we use <math>n</math> parameters in our derivation. We will explore this further in the later section on parameters.

Our objective is to maximise the likelihood of the data, <math>L(\theta; x, y) = \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) }</math>. For convenience, we consider the log-likelihood of the data,

<math> \begin{align} \ell(\theta) &= \ln L(\theta; x, y) \\ &= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\ &= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\ &= \sum_{i=1}^{m}{ \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) } \end{align} </math>

To find <math>\theta</math> such that <math>\ell(\theta)</math> is maximised, we first find the derivative of <math>\ell(\theta)</math> with respect to <math>\theta_{k}</math>, the parameter vector for class <math>k</math>:

<math> \begin{align} \frac{\partial \ell(\theta)}{\partial \theta_k} &= \frac{\partial}{\partial \theta_k} \sum_{i=1}^{m}{ \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) } \\ &= \sum_{i=1}^{m}{ \left( I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{ e^{ \theta_k^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \, x^{(i)} \right) } \qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{ is 1 when } y^{(i)} = k \text{ and 0 otherwise)} \\ &= \sum_{i=1}^{m}{ x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) } \end{align} </math>

With this, we can now find a set of parameters that maximises <math>\ell(\theta)</math>, for instance by using gradient ascent.

=== Weight decay ===

When using softmax in practice, you might find that the weights sometimes balloon up to very large numbers. This can create numerical difficulties and other issues during training or when the trained weights are used in other settings (as in a stacked autoencoder). Why should the weights balloon up?
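The gradient derived above can be computed for all classes at once. The sketch below is a vectorised illustration (variable names are my own): the rows of the returned array are the per-class gradients <math>\sum_i x^{(i)} ( I_{\{y^{(i)} = k\}} - P(y^{(i)} = k | x^{(i)}) )</math>, with labels taken as 0-indexed for convenience:

```python
import numpy as np

def log_likelihood_grad(Theta, X, y):
    """Gradient of the log-likelihood w.r.t. each theta_k.

    Theta : (n, k) parameters; X : (m, k) inputs; y : (m,) labels in {0..n-1}.
    Returns an (n, k) array whose k-th row is
    sum_i x^(i) * (I{y^(i) = k} - P(y^(i) = k | x^(i))).
    """
    scores = X @ Theta.T                              # (m, n) class scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_s = np.exp(scores)
    P = exp_s / exp_s.sum(axis=1, keepdims=True)      # P(y = k | x) per example
    I = np.zeros_like(P)
    I[np.arange(len(y)), y] = 1.0                     # indicator I{y^(i) = k}
    return (I - P).T @ X                              # (n, k) gradient
```

A gradient-ascent step is then simply `Theta += alpha * log_likelihood_grad(Theta, X, y)` for some learning rate `alpha`.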
You can check for yourself that if our current parameters <math>\theta</math> classify the examples perfectly, then multiplying each of the parameters by a large constant increases the log-likelihood of the data under the parameters. To combat this, when using softmax in practice, it may be useful to include a weight decay term to keep the weights small. The weight decay term takes the form:

<math> \begin{align} w(\theta) = \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } } \end{align} </math>

This is combined with the log-likelihood function to give a cost function, <math>J(\theta)</math>, which we want to '''minimise''' (observe that we have '''negated the log-likelihood''' so that minimising the cost function maximises the log-likelihood):

<math> \begin{align} J(\theta) = -\ell(\theta) + w(\theta) \end{align} </math>

The gradients with respect to the cost function must then be adjusted to account for the negation and the weight decay term:

<math> \begin{align} \frac{\partial J(\theta)}{\partial \theta_k} &= -\sum_{i=1}^{m}{ x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) } + \lambda \theta_k \end{align} </math>

Minimising <math>J(\theta)</math> will now maximise the log-likelihood while keeping the weights small.

== Parameters ==

We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes.
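Putting the pieces together, the cost <math>J(\theta)</math> and its gradient can be sketched as below. This is an illustrative implementation under my own naming conventions (0-indexed labels, one parameter vector per row of `Theta`), suitable for handing to a gradient-based minimiser:

```python
import numpy as np

def cost_and_grad(Theta, X, y, lam):
    """Cost J(theta) = -log-likelihood + (lam/2) * sum of squared weights,
    together with its gradient.

    Theta : (n, k) parameters; X : (m, k) inputs; y : (m,) labels in {0..n-1}.
    """
    m = X.shape[0]
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    ll = log_probs[np.arange(m), y].sum()                # log-likelihood
    J = -ll + 0.5 * lam * (Theta ** 2).sum()             # negated ll + decay
    P = np.exp(log_probs)
    I = np.zeros_like(P)
    I[np.arange(m), y] = 1.0
    grad = -(I - P).T @ X + lam * Theta                  # (n, k) gradient of J
    return J, grad
```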
To see why this is so, consider our hypothesis again:

<math> \begin{align} h(x^{(i)}) &= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \\ \end{bmatrix} \\ &= \frac{e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \cdot \frac{1}{e^{ \theta_n^T x^{(i)} } } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \\ \end{bmatrix} \\ &= \frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j^T - \theta_n^T) x^{(i)} }} } \begin{bmatrix} e^{ (\theta_1^T - \theta_n^T) x^{(i)} } \\ e^{ (\theta_2^T - \theta_n^T) x^{(i)} } \\ \vdots \\ e^{ (\theta_n^T - \theta_n^T) x^{(i)} } \\ \end{bmatrix} \\ \end{align} </math>

Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2, \ldots, n - 1</math> (note that <math>\Theta_n = 0</math>, so the last term in the denominator and the last entry of the vector become 1) gives

<math> \begin{align} h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ e^{ \Theta_2^T x^{(i)} } \\ \vdots \\ 1 \\ \end{bmatrix} \\ \end{align} </math>

showing that only <math>n-1</math> parameter vectors are required.

=== Logistic regression ===

In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression:

<math> \begin{align} h(x^{(i)}) &= \frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \\ \end{bmatrix} \\ &= \frac{e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } } \cdot \frac{1}{e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \\ \end{bmatrix} \\ &= \frac{1}{ e^{ -\Theta_1^T x^{(i)} } + 1 } \begin{bmatrix} 1 \\ e^{ -\Theta_1^T x^{(i)} } \\ \end{bmatrix} \\ \end{align} </math>

The first entry is exactly the logistic regression hypothesis <math>\frac{1}{1 + e^{ -\Theta_1^T x^{(i)} }}</math>, and the second is its complement.
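The redundancy argument above rests on the fact that subtracting the same vector from every <math>\theta_j</math> leaves the predicted probabilities unchanged. A quick numerical check of that invariance (with arbitrary example values of my own choosing):

```python
import numpy as np

def softmax(scores):
    """Stable softmax of a score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Subtracting theta_n from every theta_j leaves h(x) unchanged,
# so one of the n parameter vectors is redundant.
Theta = np.array([[0.5, -1.0],
                  [2.0,  0.3],
                  [-0.7, 1.2]])          # arbitrary (n=3, k=2) parameters
x = np.array([1.0, 2.0])

p_full = softmax(Theta @ x)              # probabilities with all n vectors
p_shifted = softmax((Theta - Theta[-1]) @ x)  # Theta_j = theta_j - theta_n
# p_full and p_shifted are identical up to floating-point error
```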