Softmax Regression

== Introduction ==

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values.  This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.)  With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that our convention will be to index the classes starting from 1, rather than from 0.)  For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.
Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.  That is, we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values.  Thus, our hypothesis will output a <math>k</math> dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities.  Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model.  Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model.  When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
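To make this concrete, here is a minimal NumPy sketch of the hypothesis, assuming the parameters are stored in a hypothetical <math>k</math>-by-<math>(n+1)</math> array <code>theta</code> stacked as described above.  (Subtracting the largest score before exponentiating is a standard numerical-stability trick and does not change the result; the variable names and values are illustrative, not part of these notes.)

<pre>
import numpy as np

def h(theta, x):
    """Softmax hypothesis.  theta is k x (n+1), stacking theta_1, ..., theta_k in rows;
    x is an (n+1,)-vector with x[0] = 1 for the intercept term."""
    scores = theta.dot(x)                    # theta_j^T x for j = 1, ..., k
    scores = scores - np.max(scores)         # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # k estimated probabilities, summing to 1

# Illustrative values: k = 3 classes, n = 2 features plus the intercept term
theta = np.array([[0.1, -0.2,  0.3],
                  [0.0,  0.5, -0.1],
                  [0.2,  0.1,  0.4]])
x = np.array([1.0, 2.0, -1.0])
print(h(theta, x))                           # three probabilities that sum to 1
</pre>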
== Cost Function ==

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.  For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0.  Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
Notice that this generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m  (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label.  Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.
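For concreteness, here is a minimal NumPy sketch of this cost function, under the same hypothetical vectorized layout as the earlier sketch: <code>theta</code> is <math>k \times (n+1)</math>, <code>X</code> is an <math>m \times (n+1)</math> design matrix whose intercept column is all 1's, and <code>y</code> holds integer labels in <math>\{1, \ldots, k\}</math>.  These names and this layout are assumptions of the sketch, not part of the notes.

<pre>
import numpy as np

def softmax_cost(theta, X, y, k):
    """Softmax regression cost J(theta)."""
    m = X.shape[0]
    scores = X.dot(theta.T)                              # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    indicator = np.eye(k)[y - 1]                         # one-hot rows encoding 1{y^{(i)} = j}
    return -np.sum(indicator * log_probs) / m
</pre>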
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
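Continuing the same hypothetical layout as the sketches above, the gradient and the gradient descent update might be written as follows (an illustration under those assumptions, not a reference implementation):

<pre>
import numpy as np

def softmax_gradient(theta, X, y, k):
    """Gradient of J(theta): a k x (n+1) matrix whose j-th row is grad_{theta_j} J(theta)."""
    m = X.shape[0]
    scores = X.dot(theta.T)                              # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)     # m x k matrix of p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.eye(k)[y - 1]                         # one-hot rows encoding 1{y^{(i)} = j}
    return -(indicator - probs).T.dot(X) / m

def gradient_descent(theta, X, y, k, alpha=0.1, num_iters=500):
    """Batch gradient descent: theta_j := theta_j - alpha * grad_{theta_j} J(theta) each iteration."""
    for _ in range(num_iters):
        theta = theta - alpha * softmax_gradient(theta, X, y, k)
    return theta
</pre>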
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay.  We describe the motivation and details below.

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters.  To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>).  Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all!  This shows that softmax regression's parameters are "redundant."  More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
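A small, self-contained numerical check of this invariance (with made-up parameter values):

<pre>
import numpy as np

theta = np.array([[0.1, -0.2,  0.3],       # hypothetical k x (n+1) parameter matrix (k = 3)
                  [0.0,  0.5, -0.1],
                  [0.2,  0.1,  0.4]])
x = np.array([1.0, 2.0, -1.0])             # hypothetical input with x[0] = 1
psi = np.array([0.7, -0.3, 1.2])           # any fixed vector in R^{n+1}

def probs(t):
    e = np.exp(t.dot(x) - np.max(t.dot(x)))    # softmax of theta_j^T x
    return e / e.sum()

print(np.allclose(probs(theta), probs(theta - psi)))   # True: the predictions are unchanged
</pre>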
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the minimizer of <math>J(\theta)</math> is not unique.  (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems.  But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without arbitrarily setting one of them to zero.  But we will make one change to the cost function: adding weight decay.  This will take care of the numerical problems associated with softmax regression's overparameterized representation.
== Weight Decay ==

We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters.  Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
               + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution.  The Hessian is now invertible, and because <math>J(\theta)</math> is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.
To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>.  One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>
By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
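Under the same assumptions as the hypothetical <code>softmax_cost</code> and <code>softmax_gradient</code> sketches above, weight decay could be layered on like this, with <code>lam</code> standing in for <math>\lambda</math>:

<pre>
def softmax_cost_decay(theta, X, y, k, lam):
    """Weight-decayed cost: the original J(theta) plus (lam / 2) * sum of theta_ij^2."""
    return softmax_cost(theta, X, y, k) + (lam / 2.0) * np.sum(theta ** 2)

def softmax_gradient_decay(theta, X, y, k, lam):
    """Weight-decayed gradient: the original gradient plus lam * theta_j for each row j."""
    return softmax_gradient(theta, X, y, k) + lam * theta
</pre>

Following the equation above, the penalty sums over every entry of <math>\theta</math>, including the intercept column <math>j=0</math>.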
== Relationship to Logistic Regression ==

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.  This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x^{(i)}}  + e^{ \theta_2^T x^{(i)} } }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\vec{0}^T x^{(i)}}  + e^{ (\theta_2-\theta_1)^T x^{(i)} } }
\begin{bmatrix}
e^{ \vec{0}^T x^{(i)} } \\
e^{ (\theta_2-\theta_1)^T x^{(i)} }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x^{(i)} } } \\
\frac{e^{ (\theta_2-\theta_1)^T x^{(i)} }}{ 1 + e^{ (\theta_2-\theta_1)^T x^{(i)} } }
\end{bmatrix}
\end{align}
</math>
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1  + e^{ (\theta')^T x^{(i)} } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>, same as logistic regression.
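A quick numerical check of this reduction, with made-up values for <math>\theta_1</math> and <math>\theta_2</math>:

<pre>
import numpy as np

theta = np.array([[0.4, -1.0,  0.2],     # hypothetical theta_1
                  [1.1,  0.3, -0.5]])    # hypothetical theta_2
x = np.array([1.0, 0.5, 2.0])            # x[0] = 1 is the intercept term

e = np.exp(theta.dot(x))
softmax_p1 = e[0] / e.sum()                               # softmax probability of class 1
theta_prime = theta[1] - theta[0]                         # theta' = theta_2 - theta_1
logistic_p1 = 1.0 / (1.0 + np.exp(theta_prime.dot(x)))    # logistic form from the text
print(np.isclose(softmax_p1, logistic_p1))                # True
</pre>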
== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to recognize.  Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?
This choice depends on whether your classes are mutually exclusive.  For example, suppose your classes are classical, country, rock, and jazz.  If you assume that each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>.  (If there are also some examples that are none of the above four classes, then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)

If however your categories are has_vocals, dance, soundtrack, pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals.  In this case, it would be more appropriate to build 4 binary logistic regression classifiers.  This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.
Now, consider a computer vision example, where you're trying to classify images into three different classes.  (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene.  Would you use softmax regression or three logistic regression classifiers?  (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate.  In the second case, it would be more appropriate to build three separate logistic regression classifiers.