Softmax Regression

== Introduction ==

In these notes, we describe the '''Softmax regression''' model.  This model generalizes logistic regression to classification problems where the class label <math>y</math> can take on more than two possible values.  This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits.  Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

Recall that in logistic regression, we had a training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>.  We had considered the binary classification setting, so the labels were <math>y^{(i)} \in \{0,1\}</math>.  Our hypothesis took the form:
<math>
\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}
</math>

and the model parameters <math>\theta</math> were trained to minimize the cost function

<math>
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}
</math>

In the softmax regression setting, we are interested in multi-class classification, so the label <math>y</math> can take on <math>k</math> different values, rather than only two.  Thus, in our training set
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>.  (Note that our convention will be to index the classes starting from 1, rather than from 0.)  For example, in the MNIST digit recognition task, we would have <math>k=10</math> different classes.

Given a test input <math>x</math>, we want our hypothesis to estimate the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.  In other words, we want to estimate the probability of the class label taking on each of the <math>k</math> different possible values.  Thus, our hypothesis will output a <math>k</math>-dimensional vector (whose elements sum to 1) giving us our <math>k</math> estimated probabilities.  Concretely, our hypothesis <math>h_{\theta}(x)</math> takes the form:
<math>
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} }
\end{bmatrix}
\end{align}
</math>

Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the parameters of our model.  Notice that the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math> normalizes the distribution, so that it sums to one.

For convenience, we will also write <math>\theta</math> to denote all the parameters of our model.  When you implement softmax regression, it is usually convenient to represent <math>\theta</math> as a matrix obtained by stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that

<math>
\theta = \begin{bmatrix}
\mbox{--}\theta_1^T\mbox{--} \\
\mbox{--}\theta_2^T\mbox{--} \\
\vdots \\
\mbox{--}\theta_k^T\mbox{--}
\end{bmatrix}
</math>
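
To make this concrete, here is a minimal numpy sketch of the hypothesis above (the names <code>softmax_hypothesis</code>, <code>theta</code>, and <code>x</code> are only illustrative and are not part of these notes); it assumes <math>\theta</math> is stored as the <math>k \times (n+1)</math> matrix just described:

<pre>
import numpy as np

def softmax_hypothesis(theta, x):
    """Return the k estimated probabilities p(y = j | x; theta) for j = 1, ..., k.

    theta : (k, n+1) array whose j-th row holds theta_j
    x     : (n+1,) input vector (with the intercept term included)
    """
    scores = theta.dot(x)                   # theta_j^T x for each class j
    scores = scores - np.max(scores)        # shift by the max for numerical stability
    exp_scores = np.exp(scores)             # e^{theta_j^T x}
    return exp_scores / exp_scores.sum()    # normalize so the entries sum to 1

# Example with k = 3 classes and n = 2 features (plus an intercept)
theta = np.array([[ 0.1, -0.2,  0.3],
                  [ 0.0,  0.5, -0.1],
                  [ 0.2,  0.1,  0.0]])
x = np.array([1.0, 2.0, -1.0])              # the leading 1.0 is the intercept term
probs = softmax_hypothesis(theta, x)
print(probs, probs.sum())                   # a length-3 probability vector summing to 1
</pre>

Subtracting the maximum score before exponentiating does not change the result, since the common factor cancels in the normalization; it only guards against numerical overflow.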
 
== Cost Function ==

We now describe the cost function that we'll use for softmax regression.  In the equation below, <math>1\{\cdot\}</math> is the '''indicator function''', so that <math>1\{\hbox{A true statement}\}=1</math>, and <math>1\{\hbox{A false statement}\}=0</math>.  For example, <math>1\{2+2=4\}</math> evaluates to 1, whereas <math>1\{1+1=5\}</math> evaluates to 0.  Our cost function will be:
<math>
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{ \theta_j^T x^{(i)} }}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]
\end{align}
</math>

Notice that this generalizes the logistic regression cost function, which could also have been written:

<math>
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}
</math>
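
As a concrete illustration of the softmax cost above, the following numpy sketch evaluates <math>J(\theta)</math> on a tiny training set (again, <code>softmax_cost</code>, <code>X</code>, and <code>y</code> are illustrative names, not from these notes; the labels are assumed to follow the 1-to-<math>k</math> indexing convention used here):

<pre>
import numpy as np

def softmax_cost(theta, X, y):
    """Evaluate the softmax regression cost J(theta).

    theta : (k, n+1) parameter matrix whose j-th row holds theta_j
    X     : (m, n+1) matrix whose i-th row is x^{(i)} (intercept term included)
    y     : (m,) integer labels in {1, ..., k}
    """
    m = X.shape[0]
    scores = X.dot(theta.T)                          # scores[i, j] = theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)      # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The indicator 1{y^{(i)} = j} picks out the log-probability of the true class:
    picked = log_probs[np.arange(m), y - 1]          # shift labels from 1..k to 0..k-1
    return -picked.mean()                            # -(1/m) * sum over examples

# Tiny example: m = 4 examples, n = 2 features, k = 3 classes
X = np.array([[1.0,  0.5, -1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.1,  0.1],
              [1.0, -0.7, -0.4]])
y = np.array([1, 3, 2, 1])
theta = np.zeros((3, 3))
print(softmax_cost(theta, X, y))                     # log(3) ~= 1.0986 when theta is all zeros
</pre>

With <math>\theta = 0</math>, every class is assigned probability <math>1/k</math>, so the cost evaluates to <math>\log k</math>; this makes a handy sanity check when implementing the cost.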
