Softmax Regression
Recall that for logistic regression, our cost function could be written as:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values
of the class label. Note also that in softmax regression, we have that

<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}
</math>.
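To make this concrete, here is a minimal NumPy sketch of computing these class probabilities for a batch of examples. (This code is ours, not part of the original tutorial; the function and variable names are illustrative.) Subtracting the per-example maximum score before exponentiating leaves the ratio unchanged but avoids numerical overflow.

<pre>
import numpy as np

def softmax_probs(Theta, X):
    """Return the (m, k) matrix with entries p(y^(i) = j | x^(i); Theta).

    Theta : (k, n+1) matrix whose rows are theta_1, ..., theta_k.
    X     : (m, n+1) matrix whose rows are the inputs x^(i)
            (with the intercept term already appended).
    """
    scores = X @ Theta.T                         # (m, k) matrix of theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)  # stability trick; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # normalize each row to sum to 1
</pre>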
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative
optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{\left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]}
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math>
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this derivation of the derivative, one can then plug it into an algorithm such as gradient descent, and have it
minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration
we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
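As an illustration, the cost, its gradient, and the gradient descent update can be written in a few lines of NumPy. This is a sketch under our own conventions (labels encoded as integers <math>0,\ldots,k-1</math>, and the softmax_probs helper from the earlier sketch), not code from the tutorial.

<pre>
import numpy as np

def cost_and_grad(Theta, X, y, k):
    """Return J(Theta) and the (k, n+1) matrix whose j-th row is grad_{theta_j} J.

    y is an (m,) array of integer class labels in {0, ..., k-1}.
    """
    m = X.shape[0]
    P = softmax_probs(Theta, X)               # class probabilities, defined earlier
    Y = np.eye(k)[y]                          # one-hot indicators 1{y^(i) = j}
    J = -np.mean(np.log(P[np.arange(m), y]))  # -(1/m) sum_i log p(y^(i) | x^(i))
    grad = -(Y - P).T @ X / m                 # row j: -(1/m) sum_i x^(i)(1{y^(i)=j} - p_ij)
    return J, grad

# One gradient descent iteration with learning rate alpha:
#   J, grad = cost_and_grad(Theta, X, y, k)
#   Theta = Theta - alpha * grad
</pre>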
When implementing softmax regression, we will typically use a modified version of the cost function described above;
specifically, one that incorporates weight decay. We describe the motivation and details below.
== Properties of softmax regression parameterization ==
Softmax regression has an unusual property that it has a "redundant" set of parameters. To explain what this means,
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math>
from it, so that <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math>. Our hypothesis
now estimates the class label probabilities as

<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis'
predictions at all! This shows that softmax
regression's parameters are "redundant." More formally, we say that our
softmax model is '''overparameterized,''' meaning that for any hypothesis we might
fit to the data, there are multiple parameter settings that give rise to exactly
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math>
to the predictions.
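One can check this numerically. The sketch below (our own code, reusing the illustrative softmax_probs helper from earlier) subtracts the same vector <math>\psi</math> from every parameter vector and verifies that the predicted probabilities do not change:

<pre>
import numpy as np

rng = np.random.default_rng(0)
k, d, m = 4, 6, 5                      # d = n+1 parameters per class
Theta = rng.normal(size=(k, d))
X = rng.normal(size=(m, d))
psi = rng.normal(size=d)

P1 = softmax_probs(Theta, X)           # original parameters
P2 = softmax_probs(Theta - psi, X)     # every theta_j replaced by theta_j - psi
assert np.allclose(P1, P2)             # identical predictions
</pre>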
Further, if the cost function <math>J(\theta)</math> is minimized by some
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,
\theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the
minimizer of <math>J(\theta)</math> is not unique. (Interestingly,
<math>J(\theta)</math> is still convex, and thus gradient descent will
not run into a local optimum. But the Hessian is singular/non-invertible,
which causes a straightforward implementation of Newton's method to run into
numerical problems.)
Notice also that by setting <math>\psi = \theta_1</math>, one can always
replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all
0's), without affecting the hypothesis. Thus, one could "eliminate" the vector
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for
any single value of <math>j</math>), without harming the representational power
of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math>
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where
<math>\theta_j \in \Re^{n+1}</math>), one could indeed set <math>\theta_1 =
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math>
remaining parameters, and this would work fine.
In practice, however, it is often cleaner and simpler to implement the version which keeps
all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without
arbitrarily setting one of them to zero. But we will
make one change to the cost function: Adding weight decay. This will take care of
the numerical problems associated with softmax regression's overparameterized representation.
== Weight Decay ==
We will modify the cost function by adding a weight decay term
<math>\frac{\lambda}{2} \sum_{i=1}^k \sum_{j=1}^{n+1} \theta_{ij}^2</math>
which penalizes large values of the parameters. Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=1}^{n+1} \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function
<math>J(\theta)</math> is now strictly convex, and is guaranteed to have a
unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> is
convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed
to converge to the global minimum.
To implement these optimization algorithms, we also need the derivative of this
new definition of <math>J(\theta)</math>. One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{\left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]} + \lambda \theta_j
\end{align}
</math>

By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
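For instance, one could hand the regularized cost and gradient to an off-the-shelf optimizer. The sketch below is our own illustration (reusing the softmax_probs helper and label conventions from the earlier sketches) using scipy.optimize.minimize with L-BFGS; it is not the tutorial's reference implementation.

<pre>
import numpy as np
from scipy.optimize import minimize

def reg_cost_and_grad(theta_flat, X, y, k, lam):
    """Weight-decayed cost J and its gradient, flattened for the optimizer."""
    m, d = X.shape
    Theta = theta_flat.reshape(k, d)
    P = softmax_probs(Theta, X)                # defined earlier
    Y = np.eye(k)[y]
    J = (-np.mean(np.log(P[np.arange(m), y]))
         + 0.5 * lam * np.sum(Theta ** 2))     # data term + weight decay
    grad = -(Y - P).T @ X / m + lam * Theta    # gradient, including the lambda * theta_j term
    return J, grad.ravel()

# res = minimize(reg_cost_and_grad, np.zeros(k * X.shape[1]),
#                args=(X, y, k, lam), jac=True, method='L-BFGS-B')
# Theta_opt = res.x.reshape(k, X.shape[1])
</pre>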
== Relationship to Logistic Regression ==
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.
This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>,
the softmax regression hypothesis outputs

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}
</math>
Taking advantage of the fact that this hypothesis
is overparameterized and setting <math>\psi = \theta_1</math>,
we can subtract <math>\theta_1</math> from each of the two parameters, giving us
<math>
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}
</math>
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find
that softmax regression predicts the probability of one of the classes as
<math>\frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>,
same as logistic regression.
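This equivalence is easy to verify numerically. In the sketch below (our own code, not from the tutorial), the two-class softmax probability of the first class matches the logistic form with <math>\theta' = \theta_2 - \theta_1</math>:

<pre>
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = rng.normal(size=(2, 5))
x = rng.normal(size=5)

# Two-class softmax probability of class 1:
p1 = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))

# Logistic-regression form with theta' = theta2 - theta1:
theta_prime = theta2 - theta1
assert np.isclose(p1, 1.0 / (1.0 + np.exp(theta_prime @ x)))
</pre>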
== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are
<math>k</math> types of music that you are trying to recognize. Should you use a
softmax classifier, or should you build <math>k</math> separate binary classifiers using
logistic regression?
This will depend on whether your classes are mutually exclusive. For example, suppose
your four classes are classical, country, rock, and jazz. If you assume that each
of your training examples is labeled with exactly one of these four class labels,
you should build a softmax classifier with <math>k=4</math>.
(If there are also some examples that are none of the above four classes,
then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)
If however your categories are has_vocals, dance, soundtrack, pop, then the
classes are not mutually exclusive; for example, there can be a piece of pop
music that comes from a soundtrack and in addition has vocals. In this case, it
would be more appropriate to build 4 binary logistic regression classifiers.
This way, for each new musical piece, your algorithm can separately decide whether
it falls into each of the four categories.
+ | |||
+ | Now, consider a computer vision example, where you're trying to classify images into | ||
+ | three different classes. (i) Suppose that your classes are indoor_scene, | ||
+ | outdoor_urban_scene, and outdoor_wilderness_scene. Would you use sofmax regression | ||
+ | or three logistic regression classifiers? (ii) Now suppose your classes are | ||
+ | indoor_scene, black_and_white_image, and image_has_people. Would you use softmax | ||
+ | regression of multiple logistic regression classifiers? | ||
+ | |||
+ | In the first case, the classes are mutually exclusive, so a softmax regression | ||
+ | classifier would be appropriate. In the second case, it would be more appropriate to build | ||
+ | three separate logistic regression classifiers. |