Softmax Regression
== Introduction ==

'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalization of logistic regression to problems with more than two class labels. Recall that in logistic regression, our hypothesis was of the form:

<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)},
\end{align}</math>

where we trained the logistic regression weights to maximize the log-likelihood of the dataset using <math>P(y = 1 | x) = h_\theta(x)</math>.

In softmax regression, we are interested in multi-class problems where each example is assigned to one of <math>n</math> labels. One example of a multi-class classification problem is classifying digits in the MNIST dataset, where each example is given one of 10 possible labels (i.e., whether it is the digit 0, 1, ..., or 9). To extend the logistic regression framework, which outputs only a single probability value, we consider a hypothesis that outputs <math>n</math> values (summing to 1) representing the predicted probability distribution over the classes.

Formally, let us consider the classification problem where we have <math>m</math> <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes. Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that

<math>
\begin{align}
h(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 | x^{(i)}) \\
P(y^{(i)} = 2 | x^{(i)}) \\
\vdots \\
P(y^{(i)} = n | x^{(i)})
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that the factor <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution so that it sums to one. ''Strictly speaking, we only need <math>n - 1</math> parameter vectors for <math>n</math> classes, but for convenience we use <math>n</math> of them in our derivation.''

This hypothesis defines a predicted probability distribution given an input <math>x^{(i)}</math>, namely <math>P(y | x^{(i)}) = h(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.
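As a quick illustration of the hypothesis above, here is a minimal NumPy sketch that computes <math>h(x)</math>. The function name <tt>softmax_hypothesis</tt> and the convention of storing each <math>\theta_j</math> as a row of <tt>Theta</tt> are illustrative choices (the tutorial's exercises use Matlab), and subtracting the maximum score before exponentiating is a standard trick to avoid overflow; it does not change the resulting probabilities.

<pre>
import numpy as np

def softmax_hypothesis(Theta, x):
    """Return h(x): a length-n vector of class probabilities.

    Theta : (n, k) array whose row j is theta_j.
    x     : (k,) input vector.
    """
    scores = Theta @ x                    # theta_j^T x for each class j
    scores = scores - scores.max()        # numerical stability before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize so the entries sum to one

# Example with made-up sizes: 10 classes, 784-dimensional inputs (as for MNIST digits).
Theta = np.random.randn(10, 784) * 0.01
x = np.random.randn(784)
probs = softmax_hypothesis(Theta, x)
print(probs.sum())                        # 1.0, up to floating-point error
</pre>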
== Optimizing Softmax Regression ==

Expanding the log-likelihood expression, we find that:

<math>
\begin{align}
\ell(\theta) &= \ln L(\theta; x, y) \\
&= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\
&= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\
&= \sum_{i=1}^{m}{ \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) }
\end{align}
</math>

Unfortunately, there is no closed-form solution to this optimization problem (although it is concave), so we usually use an off-the-shelf optimization method (e.g., L-BFGS, stochastic gradient descent) to find the optimal parameters. Using these optimization methods requires computing the gradient of <math>\ell(\theta)</math> with respect to each <math>\theta_{k}</math>, which can be derived as follows (written here for a single example <math>(x^{(i)}, y^{(i)})</math>; the full gradient sums these terms over all <math>m</math> examples):

<math>
\begin{align}
\frac{\partial \ell(\theta)}{\partial \theta_k}
&= \frac{\partial}{\partial \theta_k} \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) \\
&= I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \cdot e^{ \theta_k^T x^{(i)} } \cdot x^{(i)} \qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{ is 1 when } y^{(i)} = k \text{ and 0 otherwise)} \\
&= x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right)
\end{align}
</math>

With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.

=== Weight decay ===

When using softmax in practice, you might find that the weights sometimes balloon up to very large numbers. This can create numerical difficulties and other issues during training, or when the trained weights are used in other settings (as in a stacked autoencoder). Why should the weights balloon up? You can check for yourself that if our current parameters <math>\theta</math> classify the examples perfectly, then multiplying each of the parameters by a large constant further increases the log-likelihood of the data under the parameters. To combat this, it is often useful to include a weight decay term that keeps the weights small. The weight decay term takes the form:

<math>
\begin{align}
w(\theta) = \frac{\lambda}{2} \sum_{i}{ \sum_{j}{ \theta_{ij}^2 } }
\end{align}
</math>

This is combined with the log-likelihood function to give a cost function, <math>J(\theta)</math>, which we want to '''minimize''' (observe that we have '''negated the log-likelihood''' so that minimizing the cost function maximizes the log-likelihood):

<math>
\begin{align}
J(\theta) = -\ell(\theta) + w(\theta)
\end{align}
</math>

The gradient of the cost function must then be adjusted to account for the negation and the weight decay term (again written for a single example):

<math>
\begin{align}
\frac{\partial J(\theta)}{\partial \theta_k}
&= -x^{(i)} \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) + \lambda \theta_k
\end{align}
</math>

Minimizing <math>J(\theta)</math> will now maximize the log-likelihood while keeping the weights small.
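Putting the cost <math>J(\theta)</math> and its gradient together, the following NumPy sketch implements the negated, weight-decayed log-likelihood and the per-class gradients stacked as rows. It assumes zero-indexed labels <math>0, \ldots, n-1</math> and stores each <math>\theta_j</math> as a row of <tt>Theta</tt>; these conventions, like the function name, are illustrative rather than the tutorial's Matlab starter code. A cost/gradient pair like this (after flattening <tt>Theta</tt>) is what you would hand to an off-the-shelf optimizer such as L-BFGS, playing the role of minFunc.

<pre>
import numpy as np

def softmax_cost_and_grad(Theta, X, y, lam):
    """Cost J(Theta) = -l(Theta) + w(Theta) and its gradient.

    Theta : (n, k) array, row j holds theta_j.
    X     : (m, k) array of inputs, one example per row.
    y     : (m,) integer labels in {0, ..., n-1}.
    lam   : weight decay parameter lambda.
    """
    m = X.shape[0]
    scores = X @ Theta.T                            # (m, n): theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)       # P(y^(i) = j | x^(i))

    log_likelihood = np.log(probs[np.arange(m), y]).sum()
    cost = -log_likelihood + 0.5 * lam * np.sum(Theta ** 2)

    indicator = np.zeros_like(probs)
    indicator[np.arange(m), y] = 1.0                # I{y^(i) = k}
    # Row k of grad is  -sum_i x^(i) (I{y^(i)=k} - P(y^(i)=k|x^(i)))  +  lambda * theta_k
    grad = -(indicator - probs).T @ X + lam * Theta
    return cost, grad
</pre>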
== Parameters ==

We noted earlier that we actually only need <math>n - 1</math> parameter vectors to model <math>n</math> classes. To see why this is so, consider our hypothesis again:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\
&= \frac{ e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \cdot \frac{1}{ e^{ \theta_n^T x^{(i)} } }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} } \\
\end{bmatrix} \\
&= \frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j - \theta_n)^T x^{(i)} }} }
\begin{bmatrix}
e^{ (\theta_1 - \theta_n)^T x^{(i)} } \\
e^{ (\theta_2 - \theta_n)^T x^{(i)} } \\
\vdots \\
e^{ (\theta_n - \theta_n)^T x^{(i)} } \\
\end{bmatrix} \\
\end{align}
</math>

Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2, \ldots, n - 1</math> gives

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
e^{ \Theta_2^T x^{(i)} } \\
\vdots \\
1 \\
\end{bmatrix},
\end{align}
</math>

which shows that only <math>n - 1</math> parameter vectors are required.

=== Logistic regression ===

In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\
&= \frac{ e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } } \cdot \frac{1}{ e^{ \Theta_1^T x^{(i)} } }
\begin{bmatrix}
e^{ \Theta_1^T x^{(i)} } \\
1 \\
\end{bmatrix} \\
&= \frac{1}{ 1 + e^{ -\Theta_1^T x^{(i)} } }
\begin{bmatrix}
1 \\
e^{ -\Theta_1^T x^{(i)} } \\
\end{bmatrix}, \\
\end{align}
</math>

so the first entry, <math>P(y^{(i)} = 1 | x^{(i)}) = \frac{1}{1 + e^{-\Theta_1^T x^{(i)}}}</math>, is exactly the logistic regression hypothesis with parameter vector <math>\Theta_1</math>.
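To see this reduction numerically, here is a small NumPy check (with arbitrary, made-up parameter values) that the two-class softmax probability of class 1 equals the logistic sigmoid applied to <math>\Theta_1^T x = (\theta_1 - \theta_2)^T x</math>:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative check with n = 2 classes and random 3-dimensional parameters.
rng = np.random.default_rng(0)
theta_1, theta_2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=3)

scores = np.array([theta_1 @ x, theta_2 @ x])
softmax_p1 = np.exp(scores[0]) / np.exp(scores).sum()   # P(y = 1 | x) from softmax
logistic_p1 = sigmoid((theta_1 - theta_2) @ x)           # sigmoid with Theta_1 = theta_1 - theta_2

print(np.isclose(softmax_p1, logistic_p1))               # True
</pre>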