# Self-Taught Learning to Deep Networks


## Overview

In the previous section, you used an autoencoder to learn features that were then fed as input to a softmax or logistic regression classifier. There, the features were learned using only unlabeled data. In this section, we show how you can **fine-tune** or further improve the learned features using the labeled data. When you have a large amount of labeled training data, this can significantly improve your classifier's performance.

In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, given a new example $\textstyle x$, we used the hidden layer to extract features $\textstyle a$. This is illustrated in the following diagram:

![Sparse autoencoder feature extraction](STL_SparseAE_Features.png)
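As a concrete sketch of this feature-extraction step (the dimensions and weight values below are made up for illustration; in practice `W1` and `b1` would come from training the sparse autoencoder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, x):
    # Hidden-layer activations a of a trained sparse autoencoder.
    return sigmoid(W1 @ x + b1)

# Toy dimensions: 5 input units, 3 hidden units (purely illustrative).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5))
b1 = np.zeros(3)
x = rng.standard_normal(5)
a = extract_features(W1, b1, x)   # the new feature vector for x
```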

We are interested in solving a classification task, where our goal is to predict labels $\textstyle y$. We have a labeled training set $\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}$ of $\textstyle m_l$ labeled examples. We showed previously that we can replace the original features $\textstyle x^{(i)}$ with features $\textstyle a^{(i)}$ computed by the sparse autoencoder (the "replacement" representation). This gives us a training set $\textstyle \{(a^{(1)}, y^{(1)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}$. Finally, we train a logistic classifier to map from the features $\textstyle a^{(i)}$ to the classification labels $\textstyle y^{(i)}$. To illustrate this step, similar to our earlier notes, we can draw our logistic regression unit (shown in orange) as follows:

![Logistic regression unit on autoencoder features](STL_Logistic_Classifier.png)

Now, consider the overall classifier (i.e., the input-output mapping) that we have learned using this method. In particular, let us examine the function that our classifier uses to map from a new test example $\textstyle x$ to a new prediction $p(y=1 \mid x)$. We can draw a representation of this function by putting together the two pictures from above. In particular, the final classifier looks like this:

![Combined autoencoder and logistic classifier](STL_CombinedAE.png)
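Writing $\sigma(\cdot)$ for the sigmoid and $b^{(1)}, b^{(2)}$ for the bias terms (not shown explicitly above), the composed classifier computes $p(y=1 \mid x) = \sigma\left(W^{(2)} \sigma(W^{(1)} x + b^{(1)}) + b^{(2)}\right)$. A minimal NumPy sketch of this composed function, with illustrative names and dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(W1, b1, W2, b2, x):
    # The stacked model: autoencoder features, then a logistic unit.
    a = sigmoid(W1 @ x + b1)      # first layer: feature extraction
    return sigmoid(W2 @ a + b2)   # second layer: logistic regression on a

# Toy parameters (illustrative): 5 inputs -> 3 hidden features -> 1 output.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5)); b1 = np.zeros(3)
W2 = rng.standard_normal((1, 3)); b2 = np.zeros(1)
x = rng.standard_normal(5)
p = predict_proba(W1, b1, W2, b2, x)   # p(y=1|x)
```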

The parameters of this model were trained in two stages: The first layer of weights $\textstyle W^{(1)}$, mapping from the input $\textstyle x$ to the hidden unit activations $\textstyle a$, was trained as part of the sparse autoencoder training process. The second layer of weights $\textstyle W^{(2)}$, mapping from the activations to the output $\textstyle y$, was trained using logistic regression (or softmax regression).

But the form of our overall/final classifier is clearly just a whole big neural network. So, having trained up an initial set of parameters for our model (training the first layer using an autoencoder, and the second layer via logistic/softmax regression), we can further modify all the parameters in our model to try to further reduce the training error. In particular, we can **fine-tune** the parameters, meaning perform gradient descent (or use L-BFGS) from the current setting of the parameters to try to reduce the training error on our labeled training set $\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}$.
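One fine-tuning step on a single labeled example can be sketched as follows. This assumes a sigmoid hidden layer and a logistic (cross-entropy) output; all names, dimensions, and the learning rate are illustrative rather than taken from the tutorial:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_step(W1, b1, W2, b2, x, y, alpha=0.1):
    # Backpropagate the cross-entropy loss through BOTH layers, so the
    # autoencoder weights W1, b1 are updated along with W2, b2.
    a = sigmoid(W1 @ x + b1)                 # forward: hidden features
    h = sigmoid(W2 @ a + b2)                 # forward: p(y=1|x)

    delta2 = h - y                           # output-layer error
    dW2 = np.outer(delta2, a); db2 = delta2
    delta1 = (W2.T @ delta2) * a * (1 - a)   # propagate into hidden layer
    dW1 = np.outer(delta1, x); db1 = delta1

    # Gradient-descent update on ALL parameters (this is the fine-tuning).
    return W1 - alpha * dW1, b1 - alpha * db1, W2 - alpha * dW2, b2 - alpha * db2

def loss(W1, b1, W2, b2, x, y):
    h = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
    return float(-(y * np.log(h) + (1 - y) * np.log(1 - h)))

# Toy setup (illustrative): pretend W1 came from autoencoder pre-training.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5)); b1 = np.zeros(3)
W2 = rng.standard_normal((1, 3)); b2 = np.zeros(1)
x, y = rng.standard_normal(5), 1.0

before = loss(W1, b1, W2, b2, x, y)
for _ in range(20):
    W1, b1, W2, b2 = finetune_step(W1, b1, W2, b2, x, y)
after = loss(W1, b1, W2, b2, x, y)
```

In practice one would fine-tune on the whole labeled training set (e.g., with batch gradient descent or L-BFGS on the summed loss) rather than on a single example.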

When fine-tuning is used, sometimes the original unsupervised feature learning steps (i.e., training the autoencoder and the logistic classifier) are also called **pre-training**. The effect of fine-tuning is that the labeled data can be used to modify the weights $\textstyle W^{(1)}$ as well, so that adjustments can be made to the features $\textstyle a$ extracted by the layer of hidden units.

So far, we have described this process assuming that you used the "replacement" representation, where the training examples seen by the logistic classifier are of the form $(a^{(i)}, y^{(i)})$, rather than the "concatenation" representation, where the examples are of the form $((x^{(i)}, a^{(i)}), y^{(i)})$. It is also possible to perform fine-tuning using the "concatenation" representation; this corresponds to a neural network where the input units $x_i$ also feed directly to the logistic classifier in the output layer. (You can draw this using a slightly different type of neural network diagram than the ones we have seen so far; in particular, you would have edges that go directly from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) However, so long as we are using fine-tuning, the "concatenation" representation usually has little advantage over the "replacement" representation. Thus, if we are using fine-tuning in our unsupervised feature learning or self-taught learning application, we will usually do so with a network built using the replacement representation.
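A sketch of the "concatenation" variant: the only change from the replacement version is that the output-layer weights also connect directly to the inputs, so the classifier sees $(x, a)$ rather than $a$ alone (all names and dimensions below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_concat(W1, b1, W2, b2, x):
    # "Concatenation" representation: the logistic classifier sees (x, a),
    # i.e. the input units skip over the hidden layer as well.
    a = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ np.concatenate([x, a]) + b2)

# Toy dimensions (illustrative): W2 now has input_size + hidden_size columns.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5)); b1 = np.zeros(3)
W2 = rng.standard_normal((1, 5 + 3)); b2 = np.zeros(1)  # 5 inputs + 3 features
x = rng.standard_normal(5)
p = predict_concat(W1, b1, W2, b2, x)
```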

When should we use fine-tuning? It is typically used only if you have a large labeled training set; in this setting, fine-tuning can significantly improve the performance of your classifier. If you have a large unlabeled dataset (for unsupervised feature learning/pre-training) and a relatively small labeled training set, then fine-tuning is less likely to help.