Softmax Regression

== Introduction ==

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values.  This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  (In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be <math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.)  With logistic regression, we were in the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label <math>y</math> can take on <math>k</math> different values, rather than only two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that our convention will be to index the classes starting from 1, rather than from 0.)  For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.
Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.  That is, we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values.  Thus, our hypothesis will output a <math>k</math> dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities.  Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}
</math>
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model.  Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model.  When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
</math>
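To make this concrete, here is a minimal NumPy sketch of the hypothesis, assuming the parameters are stored in a hypothetical <math>k</math>-by-<math>(n+1)</math> array <code>theta</code> stacked as described above.  (Subtracting the largest score before exponentiating is a standard numerical-stability trick and does not change the result; the variable names and values are illustrative, not part of these notes.)

<pre>
import numpy as np

def h(theta, x):
    """Softmax hypothesis.  theta is k x (n+1), stacking theta_1, ..., theta_k in rows;
    x is an (n+1,)-vector with x[0] = 1 for the intercept term."""
    scores = theta.dot(x)                    # theta_j^T x for j = 1, ..., k
    scores = scores - np.max(scores)         # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()     # k estimated probabilities, summing to 1

# Illustrative values: k = 3 classes, n = 2 features plus the intercept term
theta = np.array([[0.1, -0.2,  0.3],
                  [0.0,  0.5, -0.1],
                  [0.2,  0.1,  0.4]])
x = np.array([1.0, 2.0, -1.0])
print(h(theta, x))                           # three probabilities that sum to 1
</pre>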
== Cost Function ==

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.  For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0.  Our cost function will be:

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>
Notice that this generalizes the logistic regression cost function, which could also have been written:
<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m  (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values of the class label.  Note also that in softmax regression, we have that
<math>
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }
</math>.
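For concreteness, here is a minimal NumPy sketch of this cost function, under the same hypothetical vectorized layout as the earlier sketch: <code>theta</code> is <math>k \times (n+1)</math>, <code>X</code> is an <math>m \times (n+1)</math> design matrix whose intercept column is all 1's, and <code>y</code> holds integer labels in <math>\{1, \ldots, k\}</math>.  These names and this layout are assumptions of the sketch, not part of the notes.

<pre>
import numpy as np

def softmax_cost(theta, X, y, k):
    """Softmax regression cost J(theta)."""
    m = X.shape[0]
    scores = X.dot(theta.T)                              # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    indicator = np.eye(k)[y - 1]                         # one-hot rows encoding 1{y^{(i)} = j}
    return -np.sum(indicator * log_probs) / m
</pre>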
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS.  Taking derivatives, one can show that the gradient is:
<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right]  }
\end{align}
</math>
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation.  In particular, <math>\nabla_{\theta_j} J(\theta)</math> is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>, the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>.

Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize <math>J(\theta)</math>.  For example, with the standard implementation of gradient descent, on each iteration we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).
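Continuing the same hypothetical layout as the sketches above, the gradient and the gradient descent update might be written as follows (an illustration under those assumptions, not a reference implementation):

<pre>
import numpy as np

def softmax_gradient(theta, X, y, k):
    """Gradient of J(theta): a k x (n+1) matrix whose j-th row is grad_{theta_j} J(theta)."""
    m = X.shape[0]
    scores = X.dot(theta.T)                              # m x k matrix of theta_j^T x^{(i)}
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)     # m x k matrix of p(y^{(i)} = j | x^{(i)}; theta)
    indicator = np.eye(k)[y - 1]                         # one-hot rows encoding 1{y^{(i)} = j}
    return -(indicator - probs).T.dot(X) / m

def gradient_descent(theta, X, y, k, alpha=0.1, num_iters=500):
    """Batch gradient descent: theta_j := theta_j - alpha * grad_{theta_j} J(theta) each iteration."""
    for _ in range(num_iters):
        theta = theta - alpha * softmax_gradient(theta, X, y, k)
    return theta
</pre>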
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay.  We describe the motivation and details below.

== Properties of softmax regression parameterization ==

Softmax regression has an unusual property: it has a "redundant" set of parameters.  To explain what this means, suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math> from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> (for every <math>j=1, \ldots, k</math>).  Our hypothesis now estimates the class label probabilities as
<math>
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}}  \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}
</math>
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math> does not affect our hypothesis' predictions at all!  This shows that softmax regression's parameters are "redundant."  More formally, we say that our softmax model is '''overparameterized,''' meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math> to the predictions.
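A small, self-contained numerical check of this invariance (with made-up parameter values):

<pre>
import numpy as np

theta = np.array([[0.1, -0.2,  0.3],       # hypothetical k x (n+1) parameter matrix (k = 3)
                  [0.0,  0.5, -0.1],
                  [0.2,  0.1,  0.4]])
x = np.array([1.0, 2.0, -1.0])             # hypothetical input with x[0] = 1
psi = np.array([0.7, -0.3, 1.2])           # any fixed vector in R^{n+1}

def probs(t):
    e = np.exp(t.dot(x) - np.max(t.dot(x)))    # softmax of theta_j^T x
    return e / e.sum()

print(np.allclose(probs(theta), probs(theta - psi)))   # True: the predictions are unchanged
</pre>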
Further, if the cost function <math>J(\theta)</math> is minimized by some setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots, \theta_k - \psi)</math> for any value of <math>\psi</math>.  Thus, the minimizer of <math>J(\theta)</math> is not unique.  (Interestingly, <math>J(\theta)</math> is still convex, and thus gradient descent will not run into local optima problems.  But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
Notice also that by setting <math>\psi = \theta_1</math>, one can always replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all 0's), without affecting the hypothesis.  Thus, one could "eliminate" the vector of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for any single value of <math>j</math>), without harming the representational power of our hypothesis.  Indeed, rather than optimizing over the <math>k(n+1)</math> parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where <math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 = \vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math> remaining parameters, and this would work fine.
In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without arbitrarily setting one of them to zero.  But we will make one change to the cost function: adding weight decay.  This will take care of the numerical problems associated with softmax regression's overparameterized representation.
== Weight Decay ==

We will modify the cost function by adding a weight decay term <math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math> which penalizes large values of the parameters.  Our cost function is now

<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
               + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}
</math>
With this weight decay term (for any <math>\lambda > 0</math>), the cost function <math>J(\theta)</math> is now strictly convex, and is guaranteed to have a unique solution.  The Hessian is now invertible, and because <math>J(\theta)</math> is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.
To apply an optimization algorithm, we also need the derivative of this new definition of <math>J(\theta)</math>.  One can show that the derivative is:

<math>
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\}  - p(y^{(i)} = j | x^{(i)}; \theta) ) \right]  } + \lambda \theta_j
\end{align}
</math>
By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.
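Under the same assumptions as the hypothetical <code>softmax_cost</code> and <code>softmax_gradient</code> sketches above, weight decay could be layered on like this, with <code>lam</code> standing in for <math>\lambda</math>:

<pre>
def softmax_cost_decay(theta, X, y, k, lam):
    """Weight-decayed cost: the original J(theta) plus (lam / 2) * sum of theta_ij^2."""
    return softmax_cost(theta, X, y, k) + (lam / 2.0) * np.sum(theta ** 2)

def softmax_gradient_decay(theta, X, y, k, lam):
    """Weight-decayed gradient: the original gradient plus lam * theta_j for each row j."""
    return softmax_gradient(theta, X, y, k) + lam * theta
</pre>

Following the equation above, the penalty sums over every entry of <math>\theta</math>, including the intercept column <math>j=0</math>.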
== Relationship to Logistic Regression ==

In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.  This shows that softmax regression is a generalization of logistic regression.  Concretely, when <math>k=2</math>, the softmax regression hypothesis outputs
<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x^{(i)}}  + e^{ \theta_2^T x^{(i)} } }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} }
\end{bmatrix}
\end{align}
</math>

Taking advantage of the fact that this hypothesis is overparameterized and setting <math>\psi = \theta_1</math>, we can subtract <math>\theta_1</math> from each of the two parameters, giving us

<math>
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\vec{0}^T x^{(i)}}  + e^{ (\theta_2-\theta_1)^T x^{(i)} } }
\begin{bmatrix}
e^{ \vec{0}^T x^{(i)} } \\
e^{ (\theta_2-\theta_1)^T x^{(i)} }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x^{(i)} } } \\
\frac{e^{ (\theta_2-\theta_1)^T x^{(i)} }}{ 1 + e^{ (\theta_2-\theta_1)^T x^{(i)} } }
\end{bmatrix}
\end{align}
</math>
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find that softmax regression predicts the probability of one of the classes as <math>\frac{1}{ 1  + e^{ (\theta')^T x^{(i)} } }</math>, and that of the other class as <math>1 - \frac{1}{ 1 + e^{ (\theta')^T x^{(i)} } }</math>, same as logistic regression.
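A quick numerical check of this reduction, with made-up values for <math>\theta_1</math> and <math>\theta_2</math>:

<pre>
import numpy as np

theta = np.array([[0.4, -1.0,  0.2],     # hypothetical theta_1
                  [1.1,  0.3, -0.5]])    # hypothetical theta_2
x = np.array([1.0, 0.5, 2.0])            # x[0] = 1 is the intercept term

e = np.exp(theta.dot(x))
softmax_p1 = e[0] / e.sum()                               # softmax probability of class 1
theta_prime = theta[1] - theta[0]                         # theta' = theta_2 - theta_1
logistic_p1 = 1.0 / (1.0 + np.exp(theta_prime.dot(x)))    # logistic form from the text
print(np.isclose(softmax_p1, logistic_p1))                # True
</pre>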
== Softmax Regression vs. k Binary Classifiers ==

Suppose you are working on a music classification application, and there are <math>k</math> types of music that you are trying to recognize.  Should you use a softmax classifier, or should you build <math>k</math> separate binary classifiers using logistic regression?
This choice depends on whether your classes are mutually exclusive.  For example, suppose your classes are classical, country, rock, and jazz.  If you assume that each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with <math>k=4</math>.  (If there are also some examples that are none of the above four classes, then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)

If however your categories are has_vocals, dance, soundtrack, pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals.  In this case, it would be more appropriate to build 4 binary logistic regression classifiers.  This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.
Now, consider a computer vision example, where you're trying to classify images into three different classes.  (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene.  Would you use softmax regression or three logistic regression classifiers?  (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people.  Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate.  In the second case, it would be more appropriate to build three separate logistic regression classifiers.