Softmax Regression
For convenience, we will also write
<math>\theta</math> to denote all the
parameters of our model. When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
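
As an illustration of this layout, here is a minimal NumPy sketch (the names <code>Theta</code> and <code>X</code> are just illustrative) that stores <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix and computes the class probabilities <math>p(y^{(i)} = j | x^{(i)}; \theta)</math> for every example at once:

<pre>
import numpy as np

def softmax_probs(Theta, X):
    # Theta : (k, n+1) array; row j holds theta_j.
    # X     : (m, n+1) array; row i is the example x^(i), intercept term included.
    # Returns an (m, k) array whose (i, j) entry is p(y^(i) = j | x^(i); theta).
    scores = X @ Theta.T                          # theta_j^T x^(i) for every i, j
    scores -= scores.max(axis=1, keepdims=True)   # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
</pre>

Subtracting the row-wise maximum does not change the resulting probabilities (this is exactly the parameter redundancy discussed below), but it keeps <code>np.exp</code> from overflowing.
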
We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
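
Continuing the sketch above and reusing <code>softmax_probs</code>, this cost could be computed as follows; the labels <code>y</code> are assumed to be 0-indexed (<math>0, \ldots, k-1</math>) rather than the <math>1, \ldots, k</math> convention used in the text:

<pre>
def softmax_cost(Theta, X, y):
    # J(theta) as written above (no weight decay term yet).
    m = X.shape[0]
    probs = softmax_probs(Theta, X)               # (m, k) matrix of class probabilities
    # The indicator 1{y^(i) = j} picks out the probability of the true class,
    # so the double sum reduces to one log-probability per example.
    return -np.log(probs[np.arange(m), y]).mean()
</pre>
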
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
where <math>p(y^{(i)} = j | x^{(i)} ; \theta) = e^{\theta_j^T x^{(i)}}/(\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} )</math>.

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>.
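
As a sketch, the gradient and a plain gradient descent step <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> might be written as follows (again reusing <code>softmax_probs</code>; the learning rate <code>alpha</code> is an illustrative name):

<pre>
def softmax_grad(Theta, X, y):
    # Returns a (k, n+1) array whose row j is nabla_{theta_j} J(theta).
    m, k = X.shape[0], Theta.shape[0]
    probs = softmax_probs(Theta, X)               # p(y^(i) = j | x^(i); theta)
    indicators = np.zeros((m, k))
    indicators[np.arange(m), y] = 1.0             # 1{y^(i) = j}
    return -(indicators - probs).T @ X / m

# One gradient descent step for every j at once:
# Theta = Theta - alpha * softmax_grad(Theta, X, y)
</pre>
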
Softmax regression has an unusual property that it has a "redundant" set of parameters. To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>
(for every <math>j=1, \ldots, k</math>). Our hypothesis
now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not change our hypothesis'
predictions at all. This shows that softmax
regression's parameters are "redundant." More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.

Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the
minimizer of <math>J(\theta)</math> is not unique. (Interestingly,
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into local optima problems. But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.)
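
As a quick numerical check of this redundancy, one can evaluate the <code>softmax_probs</code> sketch from above before and after shifting every <math>\theta_j</math> by the same <math>\psi</math> (the sizes below are arbitrary):

<pre>
rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 6))                   # k = 4 classes, n + 1 = 6
X = rng.normal(size=(10, 6))                      # 10 examples
psi = rng.normal(size=6)                          # some fixed vector psi

p_original = softmax_probs(Theta, X)
p_shifted = softmax_probs(Theta - psi, X)         # subtract psi from every theta_j
print(np.allclose(p_original, p_shifted))         # True: predictions are unchanged
</pre>
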
One could therefore "eliminate" the vector
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.
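
For instance, a minimal sketch of this approach, reusing <code>softmax_cost</code> from above, is to optimize a <math>(k-1)</math>-by-<math>(n+1)</math> array of free parameters and rebuild the full matrix with <math>\theta_1 = \vec{0}</math> whenever the cost or gradient is evaluated:

<pre>
def expand_theta(Theta_free):
    # Theta_free : (k-1, n+1) array of the parameters actually being optimized.
    # Rebuilds the full (k, n+1) matrix with theta_1 pinned to the zero vector.
    zero_row = np.zeros((1, Theta_free.shape[1]))
    return np.vstack([zero_row, Theta_free])

# e.g. evaluate the cost as softmax_cost(expand_theta(Theta_free), X, y)
</pre>
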
We will modify the cost function by adding a weight decay term
<math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>
which penalizes large values of the parameters. Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now
strictly convex and has a unique minimum, so algorithms such as gradient descent are guaranteed
to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this
new definition of <math>J(\theta)</math>. One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] } + \lambda \theta_j
\end{align}
</math>
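
In the NumPy sketch from earlier, the weight decay term adds one extra term to both the cost and the gradient (<code>lam</code> stands in for <math>\lambda</math>; as in the sums above, every entry of <math>\theta</math> is penalized, including the intercept column):

<pre>
def softmax_cost_reg(Theta, X, y, lam):
    # Cost with the weight decay term (lambda / 2) * sum_ij theta_ij^2.
    return softmax_cost(Theta, X, y) + (lam / 2.0) * np.sum(Theta ** 2)

def softmax_grad_reg(Theta, X, y, lam):
    # The derivative of the weight decay term with respect to theta_j is lambda * theta_j.
    return softmax_grad(Theta, X, y) + lam * Theta
</pre>
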
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us
a hypothesis that, writing <math>\theta' = \theta_2 - \theta_1</math>, predicts one of the classes with probability
<math>\frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math> and the other with probability
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
the same as logistic regression.
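
A small standalone numerical check of this equivalence (with arbitrary parameter values):

<pre>
import numpy as np

rng = np.random.default_rng(1)
theta_1, theta_2, x = rng.normal(size=(3, 6))     # two parameter vectors and one input

# Softmax (k = 2) probability of the second class:
p_softmax = np.exp(theta_2 @ x) / (np.exp(theta_1 @ x) + np.exp(theta_2 @ x))

# Logistic regression with theta' = theta_2 - theta_1 (i.e. psi = theta_1):
theta_prime = theta_2 - theta_1
p_logistic = 1.0 / (1.0 + np.exp(-theta_prime @ x))

print(np.allclose(p_softmax, p_logistic))         # True
</pre>
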
- | |||
== Softmax Regression vs. k Binary Classifiers ==
or three logistic regression classifiers? (ii) Now suppose your classes are
indoor_scene, black_and_white_image, and image_has_people. Would you use softmax
regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression
classifier would be appropriate. In the second case, it would be more appropriate to build
three separate logistic regression classifiers.
+ | |||
+ | |||
+ | {{Softmax}} | ||
+ | |||
+ | |||
+ | {{Languages|Softmax回归|中文}} |