Softmax Regression
Recall that for logistic regression, our cost function could be written as:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values
of the class label. Note also that in softmax regression, we have that

<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}
</math>.
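To make this concrete, here is a minimal NumPy sketch of computing these class probabilities for a batch of examples. (This code is ours, not part of the original tutorial; the function and variable names are illustrative.) Subtracting the per-example maximum score before exponentiating leaves the ratio unchanged but avoids numerical overflow.

<pre>
import numpy as np

def softmax_probs(Theta, X):
    """Return the (m, k) matrix with entries p(y^(i) = j | x^(i); Theta).

    Theta : (k, n+1) matrix whose rows are theta_1, ..., theta_k.
    X     : (m, n+1) matrix whose rows are the inputs x^(i)
            (with the intercept term already appended).
    """
    scores = X @ Theta.T                         # (m, k) matrix of theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)  # stability trick; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # normalize each row to sum to 1
</pre>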
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative
optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{\left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]}
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this derivation of the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration
we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
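As an illustration, the cost, its gradient, and the gradient descent update can be written in a few lines of NumPy. This is a sketch under our own conventions (labels encoded as integers <math>0,\ldots,k-1</math>, and the softmax_probs helper from the earlier sketch), not code from the tutorial.

<pre>
import numpy as np

def cost_and_grad(Theta, X, y, k):
    """Return J(Theta) and the (k, n+1) matrix whose j-th row is grad_{theta_j} J.

    y is an (m,) array of integer class labels in {0, ..., k-1}.
    """
    m = X.shape[0]
    P = softmax_probs(Theta, X)               # class probabilities, defined earlier
    Y = np.eye(k)[y]                          # one-hot indicators 1{y^(i) = j}
    J = -np.mean(np.log(P[np.arange(m), y]))  # -(1/m) sum_i log p(y^(i) | x^(i))
    grad = -(Y - P).T @ X / m                 # row j: -(1/m) sum_i x^(i)(1{y^(i)=j} - p_ij)
    return J, grad

# One gradient descent iteration with learning rate alpha:
#   J, grad = cost_and_grad(Theta, X, y, k)
#   Theta = Theta - alpha * grad
</pre>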
When implementing softmax regression, we will typically use a modified version of the cost function described above;
specifically, one that incorporates weight decay. We describe the motivation and details below.
== Properties of softmax regression parameterization ==
Softmax regression has an unusual property that it has a "redundant" set of parameters. To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>. Our hypothesis
now estimates the class label probabilities as

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis'
predictions at all! This shows that softmax
regression's parameters are "redundant." More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.
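One can check this numerically. The sketch below (our own code, reusing the illustrative softmax_probs helper from earlier) subtracts the same vector <math>\psi</math> from every parameter vector and verifies that the predicted probabilities do not change:

<pre>
import numpy as np

rng = np.random.default_rng(0)
k, d, m = 4, 6, 5                      # d = n+1 parameters per class
Theta = rng.normal(size=(k, d))
X = rng.normal(size=(m, d))
psi = rng.normal(size=d)

P1 = softmax_probs(Theta, X)           # original parameters
P2 = softmax_probs(Theta - psi, X)     # every theta_j replaced by theta_j - psi
assert np.allclose(P1, P2)             # identical predictions
</pre>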
Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the
minimizer of <math>J(\theta)</math> is not unique. (Interestingly,
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into a local optimum. But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.)
Notice also that by setting <math>\psi = \theta_1</math>, one can always
replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all
0's), without affecting the hypothesis. Thus, one could "eliminate" the vector
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could indeed set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.
In practice, however, it is often cleaner and simpler to implement the version which keeps
all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without
arbitrarily setting one of them to zero. But we will
make one change to the cost function: Adding weight decay. This will take care of
the numerical problems associated with softmax regression's overparameterized representation.
== Weight Decay ==
We will modify the cost function by adding a weight decay term
<math>\frac{\lambda}{2} \sum_{i=1}^k \sum_{j=1}^{n+1} \theta_{ij}^2</math>
which penalizes large values of the parameters. Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=1}^{n+1} \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function
<math>J(\theta)</math> is now strictly convex, and is guaranteed to have a
unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> is
convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed
to converge to the global minimum.
To implement these optimization algorithms, we also need the derivative of this
new definition of <math>J(\theta)</math>. One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{\left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]} + \lambda \theta_j
\end{align}
</math>

By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
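For instance, one could hand the regularized cost and gradient to an off-the-shelf optimizer. The sketch below is our own illustration (reusing the softmax_probs helper and label conventions from the earlier sketches) using scipy.optimize.minimize with L-BFGS; it is not the tutorial's reference implementation.

<pre>
import numpy as np
from scipy.optimize import minimize

def reg_cost_and_grad(theta_flat, X, y, k, lam):
    """Weight-decayed cost J and its gradient, flattened for the optimizer."""
    m, d = X.shape
    Theta = theta_flat.reshape(k, d)
    P = softmax_probs(Theta, X)                # defined earlier
    Y = np.eye(k)[y]
    J = (-np.mean(np.log(P[np.arange(m), y]))
         + 0.5 * lam * np.sum(Theta ** 2))     # data term + weight decay
    grad = -(Y - P).T @ X / m + lam * Theta    # gradient, including the lambda * theta_j term
    return J, grad.ravel()

# res = minimize(reg_cost_and_grad, np.zeros(k * X.shape[1]),
#                args=(X, y, k, lam), jac=True, method='L-BFGS-B')
# Theta_opt = res.x.reshape(k, X.shape[1])
</pre>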
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us
<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}
</math>
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find
that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
same as logistic regression.
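This equivalence is easy to verify numerically. In the sketch below (our own code, not from the tutorial), the two-class softmax probability of the first class matches the logistic form with <math>\theta' = \theta_2 - \theta_1</math>:

<pre>
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = rng.normal(size=(2, 5))
x = rng.normal(size=5)

# Two-class softmax probability of class 1:
p1 = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))

# Logistic-regression form with theta' = theta2 - theta1:
theta_prime = theta2 - theta1
assert np.isclose(p1, 1.0 / (1.0 + np.exp(theta_prime @ x)))
</pre>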
== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are
<math>k</math> types of music that you are trying to recognize. Should you use a
softmax classifier, or should you build <math>k</math> separate binary classifiers using
logistic regression?
This will depend on whether your classes are mutually exclusive. For example, suppose
your four classes are classical, country, rock, and jazz. If you assume that each
of your training examples is labeled with exactly one of these four class labels,
you should build a softmax classifier with <math>k=4</math>.
(If there are also some examples that are none of the above four classes,
then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)
If however your categories are has_vocals, dance, soundtrack, pop, then the
classes are not mutually exclusive; for example, there can be a piece of pop
music that comes from a soundtrack and in addition has vocals. In this case, it
would be more appropriate to build 4 binary logistic regression classifiers.
This way, for each new musical piece, your algorithm can separately decide whether
it falls into each of the four categories.
+ | |||
+ | Now, consider a computer vision example, where you're trying to classify images into | ||
+ | three different classes. (i) Suppose that your classes are indoor_scene, | ||
+ | outdoor_urban_scene, and outdoor_wilderness_scene. Would you use sofmax regression | ||
+ | or three logistic regression classifiers? (ii) Now suppose your classes are | ||
+ | indoor_scene, black_and_white_image, and image_has_people. Would you use softmax | ||
+ | regression of multiple logistic regression classifiers? | ||
+ | |||
+ | In the first case, the classes are mutually exclusive, so a softmax regression | ||
+ | classifier would be appropriate. In the second case, it would be more appropriate to build | ||
+ | three separate logistic regression classifiers. |