Softmax Regression
For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
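In code, this stacking might look like the following Python/NumPy sketch (our own illustration, not part of the original text; the sizes are hypothetical):

<pre>
import numpy as np

# Hypothetical sizes for illustration: k = 3 classes, n = 4 input features.
k, n = 3, 4

# theta_1, ..., theta_k: one parameter vector of length n+1 per class
# (the extra entry corresponds to the intercept term).
theta_rows = [np.zeros(n + 1) for _ in range(k)]

# Stack the k vectors as rows of a single k-by-(n+1) parameter matrix.
Theta = np.vstack(theta_rows)
assert Theta.shape == (k, n + 1)
</pre>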
Softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
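To see this concretely, note that subtracting some fixed vector <math>\psi</math> from every parameter vector <math>\theta_j</math> leaves the predicted probabilities unchanged, since the <math>e^{-\psi^T x^{(i)}}</math> factors cancel:

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l - \psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>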
We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters. Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
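As a concrete illustration, here is a short Python/NumPy sketch of this regularized cost (our own addition, not from the original text; the function name <code>softmax_cost</code> and the 0-indexed labels are our conventions):

<pre>
import numpy as np

def softmax_cost(Theta, X, y, lam):
    """Regularized softmax cost J(theta).

    Theta : (k, n+1) parameter matrix, one row theta_j per class.
    X     : (m, n+1) inputs; the first column is assumed to be the
            intercept term x_0 = 1.
    y     : (m,) integer labels, here 0-indexed as 0, ..., k-1.
    lam   : the weight decay parameter lambda.
    """
    m = X.shape[0]
    scores = X @ Theta.T                          # theta_j^T x^(i), shape (m, k)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize exp; see note below
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^(i) = j | x^(i); theta)
    data_term = -np.mean(np.log(probs[np.arange(m), y]))
    decay_term = 0.5 * lam * np.sum(Theta ** 2)   # lambda/2 * sum of theta_ij^2
    return data_term + decay_term
</pre>

Subtracting the row-wise maximum before exponentiating is the usual trick to avoid numerical overflow; by the cancellation shown earlier, it does not change the probabilities.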
Because <math>J(\theta)</math> is now strictly convex, optimization algorithms are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>. One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  } + \lambda \theta_j
\end{align}
</math>
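Continuing the sketch above (again our own illustration, with the same hypothetical conventions), the gradient with respect to the full parameter matrix can be computed as:

<pre>
import numpy as np

def softmax_grad(Theta, X, y, lam):
    """Gradient of the regularized softmax cost, shape (k, n+1);
    row j is the derivative with respect to theta_j."""
    m, k = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^(i) = j | x^(i); theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # the indicator 1{y^(i) = j}
    return -(indicator - probs).T @ X / m + lam * Theta
</pre>

A cost/gradient pair like this would typically be handed to an off-the-shelf optimizer (for example, <code>scipy.optimize.minimize</code> with <code>method="L-BFGS-B"</code>).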
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>

Thus, writing <math>\theta' = \theta_2 - \theta_1</math>, we find that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
and the probability of the other class as
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
same as logistic regression.
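A quick numerical sanity check of this reduction (our own illustration, not from the original text):

<pre>
import numpy as np

rng = np.random.default_rng(0)
theta_1 = rng.normal(size=4)      # hypothetical parameter vectors, n+1 = 4
theta_2 = rng.normal(size=4)
x = rng.normal(size=4)

# Two-class softmax probabilities.
scores = np.array([theta_1 @ x, theta_2 @ x])
p_softmax = np.exp(scores) / np.exp(scores).sum()

# Logistic regression with theta' = theta_2 - theta_1.
p_class1 = 1.0 / (1.0 + np.exp((theta_2 - theta_1) @ x))

assert np.isclose(p_softmax[0], p_class1)
assert np.isclose(p_softmax[1], 1.0 - p_class1)
</pre>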
- | |||
== Softmax Regression vs. k Binary Classifiers ==
Now, consider a computer vision example, where your task is to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, the classes are not mutually exclusive (a single image can be an indoor, black-and-white picture that contains people), so it would be more appropriate to build three separate logistic regression classifiers.
+ | |||
+ | |||
+ | {{Softmax}} | ||
+ | |||
+ | |||
+ | {{Languages|Softmax回归|中文}} |