Self-Taught Learning to Deep Networks

== Overview ==

In this section, we will improve upon the features learned from self-taught learning by ''fine-tuning'' them for our classification objective.

Recall that in self-taught learning, we first train a sparse autoencoder on our unlabeled data.  Then, given a new example <math>\textstyle x</math>, we can use the
hidden layer to extract features <math>\textstyle a</math>.  This is shown as follows: 

[[File:STL_SparseAE_Features.png|200px]]
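The feature extraction step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a sigmoid hidden layer and examples stored as columns; the names <code>W1</code>, <code>b1</code> and <code>extract_features</code> are hypothetical, and the random weights merely stand in for those a trained sparse autoencoder would provide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, x):
    """Return hidden activations a = sigmoid(W1 x + b1), one column per example."""
    return sigmoid(W1 @ x + b1[:, None])

# Stand-in weights (assumption): in practice W1 and b1 come from the trained autoencoder.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 5))   # 3 hidden units, 5 input dimensions
b1 = np.zeros(3)
x = rng.normal(size=(5, 4))               # 4 examples stored as columns
a = extract_features(W1, b1, x)           # features of shape (3, 4)
```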

Now, we are interested in solving a classification task, where our goal is to
predict labels <math>\textstyle y</math>.  We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
Suppose we replace the original features <math>\textstyle x_l^{(i)}</math> with the features <math>\textstyle a^{(i)}</math>
computed by the sparse autoencoder.  This gives us a training set  <math>\textstyle \{(a^{(1)}, y^{(1)}), (a^{(2)},
y^{(2)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.
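Training the logistic classifier on the extracted features might look like the following. This is a rough sketch, assuming binary labels in {0, 1}, plain batch gradient descent and no regularization; <code>train_logistic</code> is a hypothetical helper, and the toy features and labels are synthetic stand-ins for autoencoder activations and real labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(a, y, lr=1.0, steps=1000):
    """Fit a logistic classifier on features a (n_hidden x m) by gradient descent."""
    n_hidden, m = a.shape
    w = np.zeros(n_hidden)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(w @ a + b)   # predicted P(y = 1 | a), shape (m,)
        err = p - y              # gradient of the log-loss with respect to the logit
        w -= lr * (a @ err) / m
        b -= lr * err.mean()
    return w, b

# Synthetic stand-ins (assumption): features in (0, 1), label is 1 when a[0] > a[1].
rng = np.random.default_rng(0)
a = rng.uniform(size=(2, 200))
y = (a[0] > a[1]).astype(float)
w, b = train_logistic(a, y)
accuracy = np.mean((sigmoid(w @ a + b) > 0.5) == (y == 1))
```

Note that only <code>w</code> and <code>b</code> are updated here; the weights that produced the features stay fixed during this stage.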

As before, we can draw our logistic unit (shown in orange) as
follows:

[[File:STL_Logistic_Classifier.png|400px]]

If we now consider the function that the final classifier computes
on a new test example <math>\textstyle x</math>, we
see that it can be drawn by putting the two pictures above together.  In
particular, the final classifier looks like this:

[[File:STL_CombinedAE.png|500px]]
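Written as code, the combined classifier is simply the composition of the two stages. The sketch below assumes sigmoid units throughout and a single logistic output; <code>predict</code> and the parameter names are illustrative, and the random values stand in for weights learned in the two training stages.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(W1, b1, W2, b2, x):
    """Forward pass: input x -> hidden features a -> predicted P(y = 1 | x)."""
    a = sigmoid(W1 @ x + b1[:, None])   # autoencoder hidden layer
    return sigmoid(W2 @ a + b2)         # logistic unit on top of the features

# Stand-in parameters (assumption): W1, b1 from the autoencoder, W2, b2 from logistic regression.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 5)), np.zeros(3)
W2, b2 = rng.normal(size=3), 0.0
x_test = rng.normal(size=(5, 6))        # 6 test examples stored as columns
p = predict(W1, b1, W2, b2, x_test)     # one probability per test example
```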

This model was trained in two stages.  The first layer of weights <math>\textstyle W^{(1)}</math>
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> was trained
as part of the sparse autoencoder training process.  The second layer
of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was
trained using logistic regression.

But the final algorithm is clearly just one large neural network.  So,
we can also carry out further '''fine-tuning''' of the weights to
improve the overall classifier's performance.  In particular,
having trained the first layer using an autoencoder and the second layer
via logistic regression (this process is sometimes called '''pre-training''',
and sometimes more generally unsupervised feature learning),
we can now perform gradient descent starting from the current values of
the weights to try to further drive down the training error.
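A fine-tuning loop of this kind could be sketched as follows. It is an illustration under stated assumptions, not a definitive implementation: sigmoid units, a cross-entropy loss, plain batch gradient descent and no weight decay. The function names are hypothetical, and the random initial parameters and synthetic data stand in for the pre-trained weights and a real labeled set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a = sigmoid(W1 @ x + b1[:, None])   # hidden features
    p = sigmoid(W2 @ a + b2)            # predicted P(y = 1 | x)
    return a, p

def cross_entropy(p, y, eps=1e-12):
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def finetune_step(params, x, y, lr):
    """One gradient descent step on both layers, using backpropagation."""
    W1, b1, W2, b2 = params
    m = x.shape[1]
    a, p = forward(params, x)
    delta_out = (p - y) / m                             # error at the output logit
    delta_hid = np.outer(W2, delta_out) * a * (1 - a)   # error propagated to the hidden layer
    return (W1 - lr * (delta_hid @ x.T),
            b1 - lr * delta_hid.sum(axis=1),
            W2 - lr * (a @ delta_out),
            b2 - lr * delta_out.sum())

# Stand-ins (assumption): random "pre-trained" weights and a synthetic labeled set.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50))
y = (x[0] + x[1] > 0).astype(float)
params = (rng.normal(scale=0.1, size=(3, 4)), np.zeros(3),
          rng.normal(scale=0.1, size=3), 0.0)

loss_before = cross_entropy(forward(params, x)[1], y)
for _ in range(200):
    params = finetune_step(params, x, y, lr=0.5)
loss_after = cross_entropy(forward(params, x)[1], y)
```

The key difference from the two-stage procedure is that the gradient now flows through the logistic unit into <code>W1</code> and <code>b1</code>, so the features themselves are adjusted for the classification objective.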