Self-Taught Learning to Deep Networks

From Ufldl

@@ Line 1: / Line 1: @@
 == Overview ==
-In this section, we will improve upon the features learned from self-taught learning by ''finetuning'' them for our classification objective.
+In the previous section, you used an autoencoder to learn features that were then fed as input
+to a softmax or logistic regression classifier.  There, the features were learned using
+only unlabeled data.  In this section, we show how you can '''fine-tune''' (that is, further improve)
+the learned features using the labeled data.  When you have a large amount of labeled
+training data, this can significantly improve your classifier's performance.
-Recall that in self-taught learning, we first train a sparse autoencoder on our unlabeled data.  Then, given a new example <math>\textstyle x</math>, we can use the
+In self-taught learning, we first trained a sparse autoencoder on the unlabeled data.  Then,
-hidden layer to extract features <math>\textstyle a</math>.  This is shown as follows:
+given a new example <math>\textstyle x</math>, we used the hidden layer to extract
+features <math>\textstyle a</math>.  This is illustrated in the following diagram:
 [[File:STL_SparseAE_Features.png|200px]]
-Now, we are interested in solving a classification task, where our goal is to
+We are interested in solving a classification task, where our goal is to
 predict labels <math>\textstyle y</math>.  We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
-(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
+(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.
-Suppose we replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math>
+We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(i)}</math>
-computed by the sparse autoencoder.  This gives us a training set  <math>\textstyle \{(a^{(2)},
+computed by the sparse autoencoder (the "replacement" representation).  This gives us a training set <math>\textstyle \{(a^{(1)},
-y^{(2)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic
+y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic
-classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>.
+classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.
+To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:
-As before, we can draw our logistic unit (show in orange) as
-follows:
 [[File:STL_Logistic_Classifier.png|400px]]
-If we now look at the final classifier that we've learned, in terms
+Now, consider the overall classifier (i.e., the input-output mapping) that we have learned
-of what function it computes given a new test example <math>\textstyle x</math>, we
+using this method.
-see that it can be drawn by putting the two pictures above together.  In
+In particular, let us examine the function that our classifier uses to map from a new test example
-particular, the final classifier looks like this:
+<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>.
+We can draw a representation of this function by putting together the
+two pictures from above.  In particular, the final classifier looks like this:
 [[File:STL_CombinedAE.png|500px]]
-This model was trained in two stages.  The first layer of weights <math>\textstyle W^{(1)}</math>
+The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math>
 mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained
 as part of the sparse autoencoder training process.  The second layer
 of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was
-trained using logistic regression.
+trained using logistic regression (or softmax regression).
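The combined classifier described above can be sketched in a few lines of code. This is a minimal illustration, not the tutorial's own implementation: it assumes sigmoid hidden units and a single logistic output, uses plain Python lists to stay dependency-free, and the names <code>hidden_features</code>, <code>predict_proba</code>, <code>b1</code>, <code>b2</code> and all parameter values are hypothetical stand-ins for what the two training stages would produce.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def hidden_features(x, W1, b1):
    """Stage 1 (autoencoder encoder): a = sigmoid(W1 x + b1)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def predict_proba(x, W1, b1, w2, b2):
    """Stage 2 stacked on stage 1: p(y=1|x) = sigmoid(w2 . a + b2)."""
    a = hidden_features(x, W1, b1)
    return sigmoid(sum(w * ai for w, ai in zip(w2, a)) + b2)


# Hypothetical parameters: W1, b1 stand in for the autoencoder's encoder
# weights; w2, b2 stand in for the logistic regression weights.
W1 = [[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]]  # 3 hidden units, 2 inputs
b1 = [0.0, 0.0, 0.0]
w2 = [1.0, -1.0, 2.0]
b2 = 0.0

p = predict_proba([1.0, 2.0], W1, b1, w2, b2)
print(p)  # a probability strictly between 0 and 1
```

Written this way, the two separately trained stages are visibly a single feed-forward function of <code>x</code>.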
+But the form of our overall/final classifier is clearly just a whole big neural network.  So,
+having trained up an initial set of parameters for our model (training the first layer using an
+autoencoder, and the second layer
+via logistic/softmax regression), we can further modify all the parameters in our model to try to
+further reduce the training error.  In particular, we can '''fine-tune''' the parameters, meaning perform
+gradient descent (or use L-BFGS) from the current setting of the
+parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
+(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>.
+When fine-tuning is used, the original unsupervised feature learning steps
+(i.e., training the autoencoder and the logistic classifier) are sometimes called '''pre-training'''.
+The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as
+well, so that adjustments can be made to the features <math>a</math> extracted by the layer
+of hidden units.
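Fine-tuning as described above can be sketched as ordinary backpropagation that starts from the pre-trained parameters instead of a random initialization. The sketch below is illustrative only: it assumes sigmoid units, a cross-entropy loss and plain batch gradient descent (not L-BFGS), and the toy dataset, the starting values, and the names <code>fine_tune</code>, <code>mean_loss</code>, <code>lr</code> and <code>steps</code> are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Return hidden activations a and the prediction p(y=1|x)."""
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * ai for w, ai in zip(w2, a)) + b2)
    return a, p


def mean_loss(data, W1, b1, w2, b2):
    """Average cross-entropy over the labeled set."""
    total = 0.0
    for x, y in data:
        _, p = forward(x, W1, b1, w2, b2)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fine_tune(data, W1, b1, w2, b2, lr=0.5, steps=200):
    """Batch gradient descent on ALL parameters, starting from the given ones."""
    W1 = [row[:] for row in W1]  # copy so the pre-trained values are kept
    b1, w2 = b1[:], w2[:]
    m = len(data)
    for _ in range(steps):
        gW1 = [[0.0] * len(row) for row in W1]
        gb1 = [0.0] * len(b1)
        gw2 = [0.0] * len(w2)
        gb2 = 0.0
        for x, y in data:
            a, p = forward(x, W1, b1, w2, b2)
            d_out = p - y  # gradient at the output pre-activation
            gb2 += d_out
            for j, aj in enumerate(a):
                gw2[j] += d_out * aj
                d_hid = d_out * w2[j] * aj * (1 - aj)  # backprop to hidden unit j
                gb1[j] += d_hid
                for i, xi in enumerate(x):
                    gW1[j][i] += d_hid * xi
        for j in range(len(w2)):
            w2[j] -= lr * gw2[j] / m
            b1[j] -= lr * gb1[j] / m
            for i in range(len(W1[j])):
                W1[j][i] -= lr * gW1[j][i] / m
        b2 -= lr * gb2 / m
    return W1, b1, w2, b2


# Toy labeled set (hypothetical): the label equals the first input.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

# Hypothetical "pre-trained" starting point.
W1_init = [[0.1, -0.2], [0.3, 0.1]]
b1_init = [0.0, 0.0]
w2_init = [0.2, -0.1]
b2_init = 0.0

loss_before = mean_loss(data, W1_init, b1_init, w2_init, b2_init)
W1_ft, b1_ft, w2_ft, b2_ft = fine_tune(data, W1_init, b1_init, w2_init, b2_init)
loss_after = mean_loss(data, W1_ft, b1_ft, w2_ft, b2_ft)
print(loss_before, loss_after)
```

Note that the update touches <code>W1</code> and <code>b1</code> as well as the output weights, which is exactly what distinguishes fine-tuning from training only the classifier on fixed features.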
+So far, we have described this process assuming that you used the "replacement" representation, where
+the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,
+rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.
+It is also possible to perform fine-tuning using the "concatenation" representation; this corresponds
+to a neural network where the input units <math>x_i</math> also feed directly to the logistic
+classifier in the output layer.  (You can draw this using a slightly different type of neural network
+diagram than the ones we have seen so far; in particular, you would have edges that go directly
+from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.)
+However, so long as we are using fine-tuning, the "concatenation" representation usually
+has little advantage over the "replacement" representation.  Thus, if we are using fine-tuning
+in our unsupervised feature learning or self-taught learning application, we will usually do so
+with a network built using the replacement representation.
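To make the difference between the two representations concrete, the following sketch computes the prediction for the "concatenation" network, in which the output unit sees both the raw inputs and the hidden activations. It is a minimal illustration assuming sigmoid units; the function names and the split of the output weights into <code>w_x</code> (skip connections from the inputs) and <code>w_a</code> (connections from the hidden layer) are hypothetical naming choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_features(x, W1, b1):
    """a = sigmoid(W1 x + b1)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def predict_replacement(x, W1, b1, w_a, b2):
    """Output unit sees only the hidden activations a."""
    a = hidden_features(x, W1, b1)
    return sigmoid(sum(w * ai for w, ai in zip(w_a, a)) + b2)


def predict_concat(x, W1, b1, w_x, w_a, b2):
    """Output unit sees (x, a): w_x are the edges that skip the hidden layer."""
    a = hidden_features(x, W1, b1)
    z = (sum(w * xi for w, xi in zip(w_x, x))
         + sum(w * ai for w, ai in zip(w_a, a))
         + b2)
    return sigmoid(z)


# Hypothetical parameters: one hidden unit, two inputs.
W1 = [[1.0, -1.0]]
b1 = [0.0]
w_a = [2.0]
w_x = [0.5, -0.5]
b2 = -1.0

x = [2.0, 2.0]
p_concat = predict_concat(x, W1, b1, w_x, w_a, b2)
p_repl = predict_replacement(x, W1, b1, w_a, b2)
print(p_concat, p_repl)
```

Setting the skip weights <code>w_x</code> to zero recovers the replacement network, so the concatenation network can only represent more functions, never fewer.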
-But the final algorithm is clearly just a whole big neural network.  So,
+When should we use fine-tuning?  It is typically used only if you have a large labeled training set; in this
-we can also carrying out further '''fine-tuning''' of the weights to
+setting, fine-tuning can significantly improve the performance of your classifier.  If you
-improve the overall classifier's performance.  In particular,
+have a large unlabeled dataset (for unsupervised feature learning/pre-training) and
-having trained the first layer using an autoencoder and the second layer
+a relatively small labeled training set, then fine-tuning is less likely to help.
-via logistic regression (this process is sometimes called '''pre-training''',
-and sometimes more generally unsupervised feature learning),
-we can now further perform gradient descent from the current value of
-the weights to try to further drive down training error.

Revision as of 05:41, 13 May 2011
