Softmax Regression
For convenience, we will also write <math>\theta</math> to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
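In code, this stacking might look like the following Python/NumPy sketch (our own illustration, not part of the original text; the sizes are hypothetical):

<pre>
import numpy as np

# Hypothetical sizes for illustration: k = 3 classes, n = 4 input features.
k, n = 3, 4

# theta_1, ..., theta_k: one parameter vector of length n+1 per class
# (the extra entry corresponds to the intercept term).
theta_rows = [np.zeros(n + 1) for _ in range(k)]

# Stack the k vectors as rows of a single k-by-(n+1) parameter matrix.
Theta = np.vstack(theta_rows)
assert Theta.shape == (k, n + 1)
</pre>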
Softmax regression's parameters are "redundant." More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
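To see this concretely, note that subtracting some fixed vector <math>\psi</math> from every parameter vector <math>\theta_j</math> leaves the predicted probabilities unchanged, since the <math>e^{-\psi^T x^{(i)}}</math> factors cancel:

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l - \psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>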
We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters. Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
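As a concrete illustration, here is a short Python/NumPy sketch of this regularized cost (our own addition, not from the original text; the function name <code>softmax_cost</code> and the 0-indexed labels are our conventions):

<pre>
import numpy as np

def softmax_cost(Theta, X, y, lam):
    """Regularized softmax cost J(theta).

    Theta : (k, n+1) parameter matrix, one row theta_j per class.
    X     : (m, n+1) inputs; the first column is assumed to be the
            intercept term x_0 = 1.
    y     : (m,) integer labels, here 0-indexed as 0, ..., k-1.
    lam   : the weight decay parameter lambda.
    """
    m = X.shape[0]
    scores = X @ Theta.T                          # theta_j^T x^(i), shape (m, k)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize exp; see note below
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^(i) = j | x^(i); theta)
    data_term = -np.mean(np.log(probs[np.arange(m), y]))
    decay_term = 0.5 * lam * np.sum(Theta ** 2)   # lambda/2 * sum of theta_ij^2
    return data_term + decay_term
</pre>

Subtracting the row-wise maximum before exponentiating is the usual trick to avoid numerical overflow; by the cancellation shown earlier, it does not change the probabilities.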
Because <math>J(\theta)</math> is now strictly convex, optimization algorithms are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>. One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  } + \lambda \theta_j
\end{align}
</math>
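Continuing the sketch above (again our own illustration, with the same hypothetical conventions), the gradient with respect to the full parameter matrix can be computed as:

<pre>
import numpy as np

def softmax_grad(Theta, X, y, lam):
    """Gradient of the regularized softmax cost, shape (k, n+1);
    row j is the derivative with respect to theta_j."""
    m, k = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # p(y^(i) = j | x^(i); theta)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0              # the indicator 1{y^(i) = j}
    return -(indicator - probs).T @ X / m + lam * Theta
</pre>

A cost/gradient pair like this would typically be handed to an off-the-shelf optimizer (for example, <code>scipy.optimize.minimize</code> with <code>method="L-BFGS-B"</code>).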
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix}
\end{align}
</math>

Thus, writing <math>\theta' = \theta_2 - \theta_1</math>, we find that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
and the probability of the other class as
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>,
same as logistic regression.
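A quick numerical sanity check of this reduction (our own illustration, not from the original text):

<pre>
import numpy as np

rng = np.random.default_rng(0)
theta_1 = rng.normal(size=4)      # hypothetical parameter vectors, n+1 = 4
theta_2 = rng.normal(size=4)
x = rng.normal(size=4)

# Two-class softmax probabilities.
scores = np.array([theta_1 @ x, theta_2 @ x])
p_softmax = np.exp(scores) / np.exp(scores).sum()

# Logistic regression with theta' = theta_2 - theta_1.
p_class1 = 1.0 / (1.0 + np.exp((theta_2 - theta_1) @ x))

assert np.isclose(p_softmax[0], p_class1)
assert np.isclose(p_softmax[1], 1.0 - p_class1)
</pre>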
- | |||
== Softmax Regression vs. k Binary Classifiers ==
Now, consider a computer vision example, where your task is to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, the classes are not mutually exclusive (a single image can be an indoor, black-and-white picture that contains people), so it would be more appropriate to build three separate logistic regression classifiers.
+ | |||
+ | |||
+ | {{Softmax}} | ||
+ | |||
+ | |||
+ | {{Languages|Softmax回归|中文}} |