Softmax Regression
== Introduction ==

'''Softmax regression''', also known as '''multinomial logistic regression''', is a generalisation of logistic regression to problems where there are more than two class labels. An example is classifying the digits from the MNIST data set, where each input can be labelled with one of 10 possible class labels.

== Mathematical form ==

Formally, we consider the classification problem where we have <math>m</math> <math>d</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with <math>n</math> being the number of classes. Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that

<math>
\begin{align}
h(x^{(i)}) = \begin{bmatrix} P(y^{(i)} = 1 | x^{(i)}) \\ P(y^{(i)} = 2 | x^{(i)}) \\ \vdots \\ P(y^{(i)} = n | x^{(i)}) \end{bmatrix}
= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix}
\end{align}
</math>

where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>d</math>-dimensional column vectors that constitute the parameters of our hypothesis. Note that '''strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes''', but for convenience, we use <math>n</math> parameters in our derivation. We will explore this further in the later section on parameters.

Our objective is to maximise the likelihood of the data, <math>L(\theta; x, y) = \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) }</math>. For convenience, we consider the log-likelihood of the data,

<math>
\begin{align}
\ell(\theta) &= \ln L(\theta; x, y) \\
&= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\
&= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\
&= \sum_{i=1}^{m}{ \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) }
\end{align}
</math>

To find <math>\theta</math> such that <math>\ell(\theta)</math> is maximised, we first find the derivative of <math>\ell(\theta)</math> with respect to each <math>\theta_k</math>, the parameter vector for class <math>k</math>:

<math>
\begin{align}
\frac{\partial \ell(\theta)}{\partial \theta_k}
&= \sum_{i=1}^{m}{ \frac{\partial}{\partial \theta_k} \left( \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right) } \\
&= \sum_{i=1}^{m}{ \left( I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{ e^{ \theta_k^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } x^{(i)} \right) } \qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{ is 1 when } y^{(i)} = k \text{ and 0 otherwise)} \\
&= \sum_{i=1}^{m}{ \left( I_{ \{ y^{(i)} = k\} } - P(y^{(i)} = k | x^{(i)}) \right) x^{(i)} }
\end{align}
</math>

With this, we can now find a set of parameters that maximises <math>\ell(\theta)</math>, for instance by using gradient ascent.
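To make this concrete, here is a minimal NumPy sketch of the hypothesis, the log-likelihood, and the gradient derived above, followed by a plain gradient-ascent loop. The function names, the toy data, and the step size are illustrative choices rather than part of the tutorial; labels are stored as <math>0, \ldots, n-1</math> rather than <math>1, \ldots, n</math>, and the per-row maximum is subtracted before exponentiating purely for numerical stability (a detail not discussed in the derivation).

<source lang="python">
import numpy as np

def softmax_probs(Theta, X):
    """Hypothesis h(x): row i holds P(y^{(i)} = 1 | x^{(i)}), ..., P(y^{(i)} = n | x^{(i)}).

    Theta -- (n, d) array whose rows are theta_1, ..., theta_n
    X     -- (m, d) array whose rows are the inputs x^{(1)}, ..., x^{(m)}
    """
    scores = X @ Theta.T                          # theta_j^T x^{(i)}, shape (m, n)
    scores -= scores.max(axis=1, keepdims=True)   # shift per row for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """ell(theta) = sum_i ln P(y^{(i)} | x^{(i)}); labels y are stored as 0, ..., n-1."""
    probs = softmax_probs(Theta, X)
    return np.sum(np.log(probs[np.arange(len(y)), y]))

def gradient(Theta, X, y):
    """d ell / d theta_k = sum_i (I{y^{(i)} = k} - P(y^{(i)} = k | x^{(i)})) x^{(i)}."""
    m, n = X.shape[0], Theta.shape[0]
    probs = softmax_probs(Theta, X)
    one_hot = np.zeros((m, n))
    one_hot[np.arange(m), y] = 1.0                # the indicator I{y^{(i)} = k}
    return (one_hot - probs).T @ X                # shape (n, d): one row per theta_k

# Gradient ascent on the log-likelihood (toy data and step size are arbitrary).
rng = np.random.default_rng(0)
m, d, n = 200, 5, 3
X = rng.normal(size=(m, d))
y = rng.integers(0, n, size=m)
Theta = np.zeros((n, d))
for _ in range(500):
    Theta += 0.01 * gradient(Theta, X, y)
print(log_likelihood(Theta, X, y))
</source>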
== Parameters ==

We noted earlier that we actually only need <math>n - 1</math> parameters to model <math>n</math> classes. To see why this is so, consider our hypothesis again:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix} \\
&= \frac{e^{ \theta_n^T x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } \cdot \frac{1}{e^{ \theta_n^T x^{(i)} } } \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_n^T x^{(i)} } \end{bmatrix} \\
&= \frac{1}{ \sum_{j=1}^{n}{e^{ (\theta_j^T - \theta_n^T) x^{(i)} }} } \begin{bmatrix} e^{ (\theta_1^T - \theta_n^T) x^{(i)} } \\ e^{ (\theta_2^T - \theta_n^T) x^{(i)} } \\ \vdots \\ e^{ (\theta_n^T - \theta_n^T) x^{(i)} } \end{bmatrix}
\end{align}
</math>

Letting <math>\Theta_j = \theta_j - \theta_n</math> for <math>j = 1, 2, \ldots, n - 1</math> gives

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + \sum_{j=1}^{n-1}{e^{ \Theta_j^T x^{(i)} }} } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ e^{ \Theta_2^T x^{(i)} } \\ \vdots \\ 1 \end{bmatrix}
\end{align}
</math>

showing that only <math>n - 1</math> parameter vectors are required.

[[TODO: Explain why overparameterisation may be a good thing?]]

=== Logistic regression ===

In the special case where <math>n = 2</math>, softmax regression reduces to logistic regression:

<math>
\begin{align}
h(x^{(i)}) &= \frac{1}{ 1 + e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \end{bmatrix} \\
&= \frac{e^{ \Theta_1^T x^{(i)} } }{ 1 + e^{ \Theta_1^T x^{(i)} } } \cdot \frac{1}{e^{ \Theta_1^T x^{(i)} } } \begin{bmatrix} e^{ \Theta_1^T x^{(i)} } \\ 1 \end{bmatrix} \\
&= \frac{1}{ e^{ -\Theta_1^T x^{(i)} } + 1 } \begin{bmatrix} 1 \\ e^{ -\Theta_1^T x^{(i)} } \end{bmatrix}
\end{align}
</math>

so the probability of the first class is the familiar sigmoid <math>\frac{1}{1 + e^{ -\Theta_1^T x^{(i)} }}</math>.
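Both claims in this section can be checked numerically. The short sketch below (reusing the illustrative <code>softmax_probs</code> helper from the earlier snippet, with toy data) verifies that subtracting <math>\theta_n</math> from every parameter vector leaves the hypothesis unchanged, and that with the second parameter vector fixed at zero the first output of the softmax is exactly the logistic sigmoid.

<source lang="python">
import numpy as np

# Numerical check of the two claims above, reusing softmax_probs from the earlier sketch.
rng = np.random.default_rng(1)
d, n = 4, 3
Theta = rng.normal(size=(n, d))
X = rng.normal(size=(10, d))

# 1. Redundancy: subtracting theta_n from every theta_j leaves h(x^{(i)}) unchanged.
Theta_shifted = Theta - Theta[-1]
assert np.allclose(softmax_probs(Theta, X), softmax_probs(Theta_shifted, X))

# 2. n = 2 special case: with the second parameter vector fixed at zero, the first
#    output of the softmax equals the logistic sigmoid 1 / (1 + exp(-Theta_1^T x)).
Theta2 = np.vstack([rng.normal(size=d), np.zeros(d)])
probs = softmax_probs(Theta2, X)
sigmoid = 1.0 / (1.0 + np.exp(-(X @ Theta2[0])))
assert np.allclose(probs[:, 0], sigmoid)
</source>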