Softmax Regression
== Introduction ==
In these notes, we describe the '''softmax regression''' model, a generalization of logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values. This will be useful for problems such as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will shortly be using it in conjunction with our deep learning/unsupervised feature learning methods.
Recall that in logistic regression, we had a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples. We considered the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:
<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>
and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>
In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two. Thus, in our training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>, we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value <math>j = 1, \ldots, k</math>. In other words, we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values. Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities.
Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 | x^{(i)}; \theta) \\
P(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
where <math>\theta_1, \theta_2, \ldots, \theta_k</math> are each <math>n</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.
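
To make the hypothesis concrete, here is a small NumPy sketch of evaluating <math>h_\theta(x)</math>. This is an illustration only, not part of the original notes; the names <tt>Theta</tt> (a k-by-n matrix whose rows are <math>\theta_1, \ldots, \theta_k</math>) and <tt>softmax_hypothesis</tt> are our own, and the max-subtraction step is a standard numerical-stability device that does not change the result.

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Evaluate h_theta(x): a k-vector of estimated class probabilities.

    Theta -- k x n matrix whose rows are the parameter vectors theta_1..theta_k
    x     -- n-dimensional input vector
    """
    scores = Theta @ x                    # theta_j^T x for each class j
    scores = scores - scores.max()        # shift by a constant for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize so the k entries sum to 1

# A small 3-class example with 2-dimensional inputs:
Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
probs = softmax_hypothesis(Theta, np.array([1.0, -1.0]))
```

The output is a probability vector: every entry is positive and the entries sum to one, exactly as the normalization term in the formula above guarantees.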
- | |||
- | + | <!-- | |
+ | To extend the logistic regression framework which only outputs a single | ||
+ | probability value, we consider a hypothesis that outputs K values (summing to | ||
+ | 1) that represent the predicted probability distribution. Formally, let us | ||
+ | consider the classification problem where we have <math>m</math> | ||
+ | <math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots, | ||
+ | x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, | ||
+ | y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with | ||
+ | <math>n</math> being the number of classes. | ||
- | + | where we trained the logistic regression weights to optimize the (conditional) | |
- | + | log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In | |
- | + | softmax regression, we are interested in multi-class problems where each | |
+ | example (input image) is assigned to one of <tt>k</tt> labels. One example of a | ||
+ | multi-class classification problem would be classifying digits on the MNIST | ||
+ | dataset where each example has label 1 of 10 possible labels (i.e., where it is | ||
+ | the digit 0, 1, ... or 9). | ||
- | = | + | If you have seen (in a different course) with the justification of logistic regression as |
+ | maximum likelihood estimation, you have have recognized this as | ||
+ | maximizing <math> \sum_i \log p(y^{(i)}|x^{(i)};\theta) = - J(\theta)</math>. | ||
- | + | ||
+ | ''Strictly speaking, we only need <math>n - 1</math> parameters for | ||
+ | <math>n</math> classes, but for convenience, we use <math>n</math> parameters | ||
+ | in our derivation.'' | ||
+ | |||
+ | Now, this hypothesis defines a predicted probability distribution given some | ||
+ | <tt>x</tt>, <math>P(y | x^{(i)}) = h(x^{(i)}) </math>. Thus to train the | ||
+ | model, a natural choice is to maximize the (conditional) log-likelihood of the | ||
+ | data, <math>l(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) | ||
+ | }</math>. | ||
+ | |||
+ | {{Quote| Motivation: One reason for selecting this form of hypotheses comes | ||
+ | from connections to linear discriminant analysis. In particular, if one assumes | ||
+ | a generative model for the data in the form <math>p(x,y) = p(y) \times p(x | | ||
+ | y)</math> and selects for <math>p(x | y)</math> a member of the exponential | ||
+ | family (which includes Gaussians, Poissons, etc.) it is possible to show that | ||
+ | the conditional probability <math>p(y | x)</math> has the same form as our | ||
+ | chosen hypotheses <math>h(x)</math>. For more details, see the | ||
+ | [http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2 | ||
+ | Notes]. }} | ||
+ | !--> | ||
+ | |||
+ | |||
== Cost Function ==

We now present the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is the '''indicator function''', so that <math>1\{\hbox{a true statement}\}=1</math> and <math>1\{\hbox{a false statement}\}=0</math>. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \sum_{i=1}^{m} \sum_{j=1}^{k} \left[ 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
Notice that this generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) = - \sum_{i=1}^{m} \sum_{j=0}^{1} \left[ 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label. Note also that in softmax regression, we have
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.

There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = -\sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
where as usual <math>p(y^{(i)} = j | x^{(i)} ; \theta) = e^{\theta_j^T x^{(i)}}/(\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} )</math>. Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>.

When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.
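
The cost and gradient formulas above translate directly into vectorized code. The following NumPy sketch is our own illustration, not code from the original notes: it encodes the indicator <math>1\{y^{(i)} = j\}</math> as a one-hot matrix, and for convenience the labels are taken in {0, ..., k-1} rather than {1, ..., k}.

```python
import numpy as np

def softmax_cost_and_grad(Theta, X, y, k):
    """Unregularized softmax cost J(theta) and its gradient.

    Theta -- k x n parameter matrix (row j is theta_j)
    X     -- m x n matrix of inputs (row i is x^{(i)})
    y     -- length-m vector of labels in {0, ..., k-1}
    Returns (J, grad), where grad is the k x n matrix whose row j is
    the gradient of J with respect to theta_j.
    """
    scores = X @ Theta.T                         # m x k matrix of theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability; result unchanged
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)    # p(y^{(i)} = j | x^{(i)}; theta)
    onehot = np.eye(k)[y]                        # m x k matrix of indicators 1{y^{(i)} = j}
    J = -np.sum(onehot * np.log(probs))
    grad = -(onehot - probs).T @ X               # row j: -sum_i x^{(i)} (1{y=j} - p_ij)
    return J, grad

# Tiny example: m = 2 examples, n = 2 features, k = 3 classes.
Theta0 = np.array([[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]])
X0 = np.array([[1.0, 2.0], [0.5, -1.0]])
y0 = np.array([0, 2])
J0, grad0 = softmax_cost_and_grad(Theta0, X0, y0, 3)
```

A useful check on such an implementation is to compare the returned gradient against a finite-difference approximation of <math>J(\theta)</math>; the two should agree to several decimal places.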
- | |||
- | + | === Properties of softmax regression parameterization === | |
Softmax regression has an unusual property: its parameterization is "redundant," in the sense that it has more parameters than it needs. Concretely, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>. Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized''', meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to the same hypothesis output.

Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the minimizer of <math>J(\theta)</math> is not unique. (Interestingly, however, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\vec{0}</math> (the vector of all 0's) without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>) without harming the representational power of our hypothesis. Indeed, rather than optimizing over the <math>kn</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^n</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)n</math> remaining parameters, and this would work fine.

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.
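
The invariance described above is easy to verify numerically. The sketch below (our own illustration; the particular numbers are arbitrary) subtracts the same vector <math>\psi</math> from every parameter vector and checks that the predicted probabilities are unchanged.

```python
import numpy as np

def predict_probs(Theta, x):
    """p(y = j | x; theta) for each class j, computed stably."""
    scores = Theta @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

Theta = np.array([[0.5, -1.0],
                  [1.5,  0.2],
                  [-0.3, 0.7]])
x = np.array([2.0, 1.0])
psi = np.array([10.0, -3.0])

p_original = predict_probs(Theta, x)
p_shifted = predict_probs(Theta - psi, x)  # every theta_j replaced by theta_j - psi
```

The two probability vectors agree to machine precision, as the algebra above predicts.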

=== Weight Decay ===
We will modify the cost function by adding a weight decay term
<math>\frac{\lambda}{2} \sum_{i} \sum_{j} \theta_{ij}^2</math>
(the sum of the squares of all the parameter entries), which penalizes large values of the parameters. Our cost function is now
<math>
\begin{align}
J(\theta) = - \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i} \sum_{j} \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> remains convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To implement these optimization algorithms, we also need the derivative, which works out to be:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = -\sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] } + \lambda \theta_j
\end{align}
</math>

Minimizing <math>J(\theta)</math> now performs regularized softmax regression.
- | |||
- | + | ||
+ | <!-- | ||
+ | Expanding the log-likelihood expression, we find that: | ||
<math> | <math> | ||
- | \begin{align} | + | \begin{align} |
- | + | \ell(\theta) &= \ln L(\theta; x, y) \\ | |
+ | &= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\ | ||
+ | &= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\ | ||
+ | &= \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]} | ||
+ | \end{align} | ||
+ | </math> | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | &= | + | <math> |
+ | \begin{align} | ||
+ | \frac{\partial \ell(\theta)}{\partial \theta_k} &= \frac{\partial}{\partial \theta_k} \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]} \\ | ||
- | \ | + | &= \sum_{i=1}^{m}{ \left[ I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } |
\cdot | \cdot | ||
- | + | e^{ \theta_k^T x^{(i)} } | |
- | \ | + | \cdot |
- | + | x^{(i)} \right]} | |
- | + | \qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{is 1 when } y^{(i)} = k \text{ and 0 otherwise) } \\ | |
- | + | ||
- | + | ||
- | + | ||
- | &= | + | &= \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = k\} - P(y^{(i)} = k | x^{(i)}) ) \right] } |
- | + | ||
- | + | ||
- | \ | + | |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | + | ||
\end{align} | \end{align} | ||
</math> | </math> | ||
- | + | With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc. | |
+ | !--> | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | |||
- | + | === Generalizing Logistic Regression === | |
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k = 2</math>, the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis is overparameterized, and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameter vectors, giving us
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>, the same as logistic regression.
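
This reduction can also be confirmed numerically. In the sketch below (our own illustration; the parameter values are arbitrary), the first entry of the two-class softmax output coincides with the logistic function evaluated with <math>\theta' = \theta_2 - \theta_1</math>.

```python
import numpy as np

theta1 = np.array([0.3, -0.8])
theta2 = np.array([1.1,  0.4])
x = np.array([2.0, -1.0])

# Two-class softmax output:
e1, e2 = np.exp(theta1 @ x), np.exp(theta2 @ x)
softmax_p = np.array([e1, e2]) / (e1 + e2)

# Logistic regression with theta' = theta2 - theta1:
theta_prime = theta2 - theta1
sigmoid_p = 1.0 / (1.0 + np.exp(theta_prime @ x))
```

Here <tt>softmax_p[0]</tt> equals <tt>sigmoid_p</tt> and <tt>softmax_p[1]</tt> equals <tt>1 - sigmoid_p</tt>, matching the derivation above.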

=== Softmax Regression vs. k Binary Classifiers ===

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to detect. Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?

This will depend on whether the classes are ''mutually exclusive''. For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>. (If there are also some examples that belong to none of these four classes, then you can set <math>k=5</math> and also have a "none of the above" class.)

If, however, your categories are has_vocals, dance, soundtrack, and pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 separate binary logistic regression classifiers, so that for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.