Softmax Regression

== Introduction ==
where we trained the logistic regression weights to optimize the log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In softmax regression, we are interested in multi-class problems where each example is assigned to one of <tt>K</tt> labels. One example of a multi-class classification problem is classifying digits in the MNIST dataset, where each example is assigned one of 10 possible labels (i.e., whether it is the digit 0, 1, ..., or 9).
To extend the logistic regression framework, which only outputs a single probability value, we consider a hypothesis that outputs <tt>K</tt> values (summing to one) that represent the predicted probability distribution. Formally, let us consider the classification problem where we have <math>m</math> inputs <math>x^{(1)}, x^{(2)}, \ldots, x^{(m)}</math>, each a <math>k</math>-dimensional vector, with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots, y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math> and <math>n</math> is the number of classes.

Our hypothesis <math>h_{\theta}(x)</math> returns a vector of probabilities, such that
<math>
h_{\theta}(x^{(i)}) =
\begin{bmatrix}
P(y^{(i)} = 1 | x^{(i)}; \theta) \\
P(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
P(y^{(i)} = n | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_n^T x^{(i)} }
\end{bmatrix}
</math>
where <math>\theta_1, \theta_2, \ldots, \theta_n</math> are each <math>k</math>-dimensional column vectors that constitute the parameters of our hypothesis. Notice that the factor <math>\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }</math> normalizes the distribution so that it sums to one.

''Strictly speaking, we only need <math>n - 1</math> parameters for <math>n</math> classes, but for convenience, we use <math>n</math> parameters in our derivation.''
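As a concrete illustration (a minimal sketch, not part of the original tutorial), the hypothesis can be computed with NumPy, assuming the <math>n</math> parameter vectors are stacked into a <tt>k x n</tt> matrix <tt>Theta</tt>, the <math>m</math> inputs into an <tt>m x k</tt> matrix <tt>X</tt>, and a function name of our own choosing. Subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick and does not change the resulting probabilities.

<pre>
import numpy as np

def softmax_hypothesis(Theta, X):
    """Return h_theta(x^(i)) for every example, as an (m, n) array of probabilities.

    Theta : (k, n) array whose column j is the parameter vector theta_j.
    X     : (m, k) array whose row i is the input x^(i).
    """
    scores = X @ Theta                              # entry (i, j) is theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
</pre>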
Now, this hypothesis defines a predicted probability distribution over the classes given an input <tt>x</tt>: <math>P(y | x^{(i)}) = h_{\theta}(x^{(i)})</math>. Thus, to train the model, a natural choice is to maximize the (conditional) log-likelihood of the data, <math>\ell(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }</math>.

== Optimizing Softmax Regression ==

Expanding the log-likelihood expression, we find that:
<math>
\ell(\theta) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)}) }
= \sum_{i=1}^{m} \ln \frac{ e^{ \theta_{y^{(i)}}^T x^{(i)} } }{ \sum_{j=1}^{n}{ e^{ \theta_j^T x^{(i)} } } }
= \sum_{i=1}^{m} \left( \theta_{y^{(i)}}^T x^{(i)} - \ln { \sum_{j=1}^{n}{ e^{ \theta_j^T x^{(i)} } } } \right)
</math>
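Continuing the sketch above (same assumed <tt>Theta</tt> and <tt>X</tt> layout, with the labels <tt>y</tt> recoded as integers <tt>0, ..., n-1</tt> for zero-based indexing), this expanded form can be computed directly:

<pre>
import numpy as np

def log_likelihood(Theta, X, y):
    """ell(theta) = sum_i ( theta_{y^(i)}^T x^(i) - ln sum_j e^{theta_j^T x^(i)} )."""
    scores = X @ Theta                                   # (m, n)
    shift = scores.max(axis=1)                           # for a stable log-sum-exp
    log_norm = np.log(np.exp(scores - shift[:, None]).sum(axis=1)) + shift
    return np.sum(scores[np.arange(len(y)), y] - log_norm)
</pre>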
Unfortunately, there is no closed-form solution to this optimization problem (although it is concave), so we usually use an off-the-shelf optimization method (e.g., L-BFGS, stochastic gradient descent) to find the optimal parameters. Using these optimization methods requires computing the gradient of <math>\ell(\theta)</math> with respect to <math>\theta_{k}</math>, which can be derived as follows:
<math>
\nabla_{\theta_k} \ell(\theta) = \sum_{i=1}^{m}{ x^{(i)} \left( 1\{ y^{(i)} = k \} - P(y^{(i)} = k | x^{(i)}; \theta) \right) }
</math>
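In the NumPy sketch, this gradient has a compact vectorized form: the indicator <math>1\{ y^{(i)} = k \}</math> becomes a one-hot matrix, and the gradients for all <math>n</math> parameter vectors come out of a single matrix product (again, the helper names are illustrative, not the tutorial's):

<pre>
import numpy as np

def gradient(Theta, X, y):
    """Gradient of ell(theta); column k is the gradient with respect to theta_k."""
    m, n = X.shape[0], Theta.shape[1]
    probs = softmax_hypothesis(Theta, X)       # (m, n), defined above
    one_hot = np.zeros((m, n))
    one_hot[np.arange(m), y] = 1.0             # one_hot[i, k] = 1{ y^(i) = k }
    return X.T @ (one_hot - probs)             # (k, n)
</pre>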
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.
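The tutorial's exercises use minFunc in MATLAB; purely as an illustrative Python analogue (an assumption on our part, not the tutorial's code), the same fit can be obtained by handing the negative log-likelihood and its gradient to SciPy's L-BFGS implementation:

<pre>
import numpy as np
from scipy.optimize import minimize

def fit_softmax(X, y, n_classes):
    """Fit Theta by maximizing ell(theta), i.e. minimizing its negative."""
    k = X.shape[1]

    def neg_ll_and_grad(theta_flat):
        Theta = theta_flat.reshape(k, n_classes)
        return -log_likelihood(Theta, X, y), -gradient(Theta, X, y).ravel()

    result = minimize(neg_ll_and_grad, np.zeros(k * n_classes),
                      jac=True, method="L-BFGS-B")
    return result.x.reshape(k, n_classes)
</pre>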
=== Weight decay ===
When using softmax in practice, you might find that the weights sometimes balloon up to very large numbers. This can create numerical difficulties and other issues during training or when the trained weights are used in other settings (as in a stacked autoencoder).  
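A common remedy, sketched here under the assumption of a standard L2 ("weight decay") penalty of the form <math>\frac{\lambda}{2} \sum_{j=1}^{n} \| \theta_j \|^2</math> (the exact form used in the tutorial may differ), is to add that penalty to the negative log-likelihood; in the NumPy sketch above only the objective and gradient change slightly:

<pre>
import numpy as np

def neg_ll_with_weight_decay(Theta, X, y, lam):
    """Negative log-likelihood plus (lam / 2) * ||Theta||^2, and its gradient."""
    penalty = 0.5 * lam * np.sum(Theta ** 2)
    cost = -log_likelihood(Theta, X, y) + penalty
    grad = -gradient(Theta, X, y) + lam * Theta   # penalty gradient is lam * theta_k
    return cost, grad
</pre>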
