Softmax Regression
For convenience, we will also write
<math>\theta</math> to denote all the
parameters of our model. When you implement softmax regression, it is usually
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
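
As an illustration of this layout, here is a minimal NumPy sketch (the names <code>Theta</code> and <code>X</code> are just illustrative) that stores <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix and computes the class probabilities <math>p(y^{(i)} = j | x^{(i)}; \theta)</math> for every example at once:

<pre>
import numpy as np

def softmax_probs(Theta, X):
    # Theta : (k, n+1) array; row j holds theta_j.
    # X     : (m, n+1) array; row i is the example x^(i), intercept term included.
    # Returns an (m, k) array whose (i, j) entry is p(y^(i) = j | x^(i); theta).
    scores = X @ Theta.T                          # theta_j^T x^(i) for every i, j
    scores -= scores.max(axis=1, keepdims=True)   # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
</pre>

Subtracting the row-wise maximum does not change the resulting probabilities (this is exactly the parameter redundancy discussed below), but it keeps <code>np.exp</code> from overflowing.
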
We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
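
Continuing the sketch above and reusing <code>softmax_probs</code>, this cost could be computed as follows; the labels <code>y</code> are assumed to be 0-indexed (<math>0, \ldots, k-1</math>) rather than the <math>1, \ldots, k</math> convention used in the text:

<pre>
def softmax_cost(Theta, X, y):
    # J(theta) as written above (no weight decay term yet).
    m = X.shape[0]
    probs = softmax_probs(Theta, X)               # (m, k) matrix of class probabilities
    # The indicator 1{y^(i) = j} picks out the probability of the true class,
    # so the double sum reduces to one log-probability per example.
    return -np.log(probs[np.arange(m), y]).mean()
</pre>
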
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }
\end{align}
</math>
where <math>p(y^{(i)} = j | x^{(i)} ; \theta) = e^{\theta_j^T x^{(i)}}/(\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} )</math>.

Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>.
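
As a sketch, the gradient and a plain gradient descent step <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> might be written as follows (again reusing <code>softmax_probs</code>; the learning rate <code>alpha</code> is an illustrative name):

<pre>
def softmax_grad(Theta, X, y):
    # Returns a (k, n+1) array whose row j is nabla_{theta_j} J(theta).
    m, k = X.shape[0], Theta.shape[0]
    probs = softmax_probs(Theta, X)               # p(y^(i) = j | x^(i); theta)
    indicators = np.zeros((m, k))
    indicators[np.arange(m), y] = 1.0             # 1{y^(i) = j}
    return -(indicators - probs).T @ X / m

# One gradient descent step for every j at once:
# Theta = Theta - alpha * softmax_grad(Theta, X, y)
</pre>
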
Softmax regression has an unusual property that it has a "redundant" set of parameters. To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>
(for every <math>j=1, \ldots, k</math>). Our hypothesis
now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not change our hypothesis'
predictions at all. This shows that softmax
regression's parameters are "redundant." More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.

Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the
minimizer of <math>J(\theta)</math> is not unique. (Interestingly,
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into local optima problems. But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.)
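
As a quick numerical check of this redundancy, one can evaluate the <code>softmax_probs</code> sketch from above before and after shifting every <math>\theta_j</math> by the same <math>\psi</math> (the sizes below are arbitrary):

<pre>
rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 6))                   # k = 4 classes, n + 1 = 6
X = rng.normal(size=(10, 6))                      # 10 examples
psi = rng.normal(size=6)                          # some fixed vector psi

p_original = softmax_probs(Theta, X)
p_shifted = softmax_probs(Theta - psi, X)         # subtract psi from every theta_j
print(np.allclose(p_original, p_shifted))         # True: predictions are unchanged
</pre>
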
One could therefore "eliminate" the vector
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.
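
For instance, a minimal sketch of this approach, reusing <code>softmax_cost</code> from above, is to optimize a <math>(k-1)</math>-by-<math>(n+1)</math> array of free parameters and rebuild the full matrix with <math>\theta_1 = \vec{0}</math> whenever the cost or gradient is evaluated:

<pre>
def expand_theta(Theta_free):
    # Theta_free : (k-1, n+1) array of the parameters actually being optimized.
    # Rebuilds the full (k, n+1) matrix with theta_1 pinned to the zero vector.
    zero_row = np.zeros((1, Theta_free.shape[1]))
    return np.vstack([zero_row, Theta_free])

# e.g. evaluate the cost as softmax_cost(expand_theta(Theta_free), X, y)
</pre>
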
We will modify the cost function by adding a weight decay term
<math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math>
which penalizes large values of the parameters. Our cost function is now
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now
strictly convex and has a unique minimum, so algorithms such as gradient descent are guaranteed
to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this
new definition of <math>J(\theta)</math>. One can show that the derivative is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] } + \lambda \theta_j
\end{align}
</math>
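
In the NumPy sketch from earlier, the weight decay term adds one extra term to both the cost and the gradient (<code>lam</code> stands in for <math>\lambda</math>; as in the sums above, every entry of <math>\theta</math> is penalized, including the intercept column):

<pre>
def softmax_cost_reg(Theta, X, y, lam):
    # Cost with the weight decay term (lambda / 2) * sum_ij theta_ij^2.
    return softmax_cost(Theta, X, y) + (lam / 2.0) * np.sum(Theta ** 2)

def softmax_grad_reg(Theta, X, y, lam):
    # The derivative of the weight decay term with respect to theta_j is lambda * theta_j.
    return softmax_grad(Theta, X, y) + lam * Theta
</pre>
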
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^Tx} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us
a hypothesis that, writing <math>\theta' = \theta_2 - \theta_1</math>, predicts one of the classes with probability
<math>\frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math> and the other with probability
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
the same as logistic regression.
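
A small standalone numerical check of this equivalence (with arbitrary parameter values):

<pre>
import numpy as np

rng = np.random.default_rng(1)
theta_1, theta_2, x = rng.normal(size=(3, 6))     # two parameter vectors and one input

# Softmax (k = 2) probability of the second class:
p_softmax = np.exp(theta_2 @ x) / (np.exp(theta_1 @ x) + np.exp(theta_2 @ x))

# Logistic regression with theta' = theta_2 - theta_1 (i.e. psi = theta_1):
theta_prime = theta_2 - theta_1
p_logistic = 1.0 / (1.0 + np.exp(-theta_prime @ x))

print(np.allclose(p_softmax, p_logistic))         # True
</pre>
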
- | |||
== Softmax Regression vs. k Binary Classifiers ==
or three logistic regression classifiers? (ii) Now suppose your classes are
indoor_scene, black_and_white_image, and image_has_people. Would you use softmax
regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression
classifier would be appropriate. In the second case, it would be more appropriate to build
three separate logistic regression classifiers.
+ | |||
+ | |||
+ | {{Softmax}} | ||
+ | |||
+ | |||
+ | {{Languages|Softmax回归|中文}} |