Softmax Regression

We are given a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.
(In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be
<math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.)
With logistic regression, we were in the binary classification setting, so the labels
were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>
and the model parameters <math>\theta</math> were trained to minimize
the cost function
<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log \big(1 - h_\theta(x^{(i)})\big) \right].
\end{align}
</math>
For convenience, we will also write
<math>\theta</math> to denote all the
parameters of our model.  When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that
<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
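With this convention, each row of the matrix is one <math>\theta_j^T</math>, and the vector of class probabilities <math>e^{\theta_j^T x}/\sum_{l} e^{\theta_l^T x}</math> can be computed with a single matrix-vector product followed by a normalization.  A minimal NumPy sketch of this representation (our own illustration, not code from the tutorial; the function and variable names are made up):

<pre>
import numpy as np

def softmax_probs(Theta, x):
    """Class probabilities for one example.

    Theta : (k, n+1) array whose rows are theta_1^T, ..., theta_k^T
    x     : (n+1,) array with x[0] == 1 (the intercept term)
    """
    scores = Theta @ x                      # theta_j^T x for each class j
    scores = scores - np.max(scores)        # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)  # e^{theta_j^T x} / sum_l e^{theta_l^T x}
</pre>

Subtracting the maximum score before exponentiating does not change the result (a fact the overparameterization discussion below makes precise) and avoids numerical overflow.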
We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k}  1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
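For concreteness, this cost can be computed directly from the formula above.  A minimal NumPy sketch (our own code, not the tutorial's; it assumes labels are stored as integers <math>0, \ldots, k-1</math> rather than <math>1, \ldots, k</math>):

<pre>
import numpy as np

def softmax_cost(Theta, X, y):
    """Softmax regression cost J(theta), without weight decay.

    Theta : (k, n+1) parameter matrix
    X     : (m, n+1) design matrix, each row an example with x_0 = 1
    y     : (m,) integer labels in {0, ..., k-1}
    """
    m = X.shape[0]
    scores = X @ Theta.T                                 # (m, k) entries theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Only the log-probability of each example's true class contributes.
    return -np.mean(log_probs[np.arange(m), y])
</pre>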
Taking derivatives, one can show that the gradient is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
where <math>p(y^{(i)} = j | x^{(i)} ; \theta) = e^{\theta_j^T x^{(i)}}/(\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} )</math>.
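In a vectorized implementation, the gradients with respect to all of <math>\theta_1, \ldots, \theta_k</math> can be computed at once and stacked into the same <math>k</math>-by-<math>(n+1)</math> layout as <math>\theta</math>.  A minimal NumPy sketch (our own, paired with the cost sketch above):

<pre>
import numpy as np

def softmax_grad(Theta, X, y):
    """Gradient of J(theta), stacked as a (k, n+1) matrix whose
    j-th row is nabla_{theta_j} J(theta)."""
    m, k = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)  # p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0                  # 1{y^{(i)} = j}
    # -(1/m) * sum_i x^{(i)} (1{y^{(i)} = j} - p(y^{(i)} = j | x^{(i)}; theta))
    return -(indicator - probs).T @ X / m
</pre>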
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>.
Softmax regression has an unusual property: it has a "redundant" set of parameters.  To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>
(for every <math>j=1, \ldots, k</math>).  Our hypothesis
now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
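This invariance is easy to check numerically.  A small sketch with made-up values (our own, not part of the tutorial):

<pre>
import numpy as np

def softmax_probs(Theta, x):
    scores = Theta @ x
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 6))         # k = 4 classes, n + 1 = 6 features
x = np.append(1.0, rng.normal(size=5))  # x_0 = 1 intercept term
psi = rng.normal(size=6)                # an arbitrary fixed vector

# Subtracting psi from every theta_j leaves the probabilities unchanged.
print(np.allclose(softmax_probs(Theta, x), softmax_probs(Theta - psi, x)))  # True
</pre>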
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis'
predictions at all!  This shows that softmax
regression's parameters are "redundant."  More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.
Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the
minimizer of <math>J(\theta)</math> is not unique.  (Interestingly,
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into local optima problems.  But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.)
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with
<math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's) without affecting the predictions.  Thus, one
could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.
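A sketch of that idea, keeping only a <math>(k-1)</math>-by-<math>(n+1)</math> block of free parameters and treating <math>\theta_1</math> as the zero vector (our own illustration, not from the tutorial):

<pre>
import numpy as np

def probs_with_theta1_zero(Theta_rest, x):
    """Class probabilities when theta_1 is fixed to the zero vector.

    Theta_rest : (k-1, n+1) free parameters for classes 2, ..., k
    x          : (n+1,) input with x[0] == 1
    """
    scores = np.concatenate(([0.0], Theta_rest @ x))  # theta_1^T x is always 0
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()
</pre>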
== Weight Decay ==
We will modify the cost function by adding a weight decay term
<math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>
which penalizes large values of the parameters.  Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}  \right]
               + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
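Putting the pieces together, here is a minimal NumPy sketch of the weight-decayed cost and its gradient (our own code, not the tutorial's; note that, following the formula above, the intercept column <math>j=0</math> is penalized along with the other entries):

<pre>
import numpy as np

def softmax_cost_and_grad(Theta, X, y, lam):
    """Weight-decayed cost J(theta) and its gradient.

    Theta : (k, n+1) parameters, X : (m, n+1) inputs with x_0 = 1,
    y     : (m,) labels in {0, ..., k-1}, lam : weight decay parameter lambda.
    """
    m, k = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0

    cost = -np.sum(indicator * np.log(probs)) / m + 0.5 * lam * np.sum(Theta ** 2)
    grad = -(indicator - probs).T @ X / m + lam * Theta  # row j is nabla_{theta_j} J
    return cost, grad
</pre>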
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> becomes strictly convex, and gradient descent is guaranteed
to converge to the global minimum.
To apply an optimization algorithm, we also need the derivative of this
new definition of <math>J(\theta)</math>.  One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  } + \lambda \theta_j
\end{align}
</math>
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x}  + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us a hypothesis that predicts one class with probability
<math>\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } }</math> and the other with probability
<math>1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } }</math>.
Replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
and that of the other class as
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
same as logistic regression.
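A quick numerical check of this reduction (our own sketch): with <math>k=2</math> and <math>\theta_1</math> fixed to zero, the softmax probability of the second class equals the logistic sigmoid of <math>\theta_2^T x</math>.

<pre>
import numpy as np

rng = np.random.default_rng(1)
theta2 = rng.normal(size=6)             # plays the role of theta' after fixing theta_1 = 0
x = np.append(1.0, rng.normal(size=5))

# Two-class softmax probabilities with theta_1 = 0.
scores = np.array([0.0, theta2 @ x])
p = np.exp(scores) / np.exp(scores).sum()

# The logistic sigmoid of theta_2^T x gives the same probability for class 2.
sigmoid = 1.0 / (1.0 + np.exp(-(theta2 @ x)))
print(np.allclose(p[1], sigmoid))  # True
</pre>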
 
== Softmax Regression vs. k Binary Classifiers ==
Consider two examples.  (i) Suppose your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene.  Would you use softmax regression
or three logistic regression classifiers?  (ii) Now suppose your classes are
indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax
regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression
classifier would be appropriate.  In the second case, it would be more appropriate to build
three separate logistic regression classifiers.