Self-Taught Learning

== Overview ==

In machine learning, sometimes it's not who has the best algorithm that wins.  It's who has the most data. 

While one can always try to get more labeled data, that's often expensive.  In
particular, researchers have already gone to extraordinary lengths to use tools
such as AMT (Amazon Mechanical Turk) to get large training sets.  While having
large numbers of people hand-label lots of data is probably a step forward
compared to having large numbers of researchers hand-engineer features, it
would be nice to do better.  In particular, the promise of '''self-taught learning'''
and '''unsupervised feature learning''' is that if we can get our algorithms to learn
from ''unlabeled'' data, then we can easily obtain and learn from massive
amounts of it.  Even though a single unlabeled example is less informative than
a single labeled example, if we can get tons of the former---for example, by
downloading random unlabeled images/audio clips/text documents off the
internet---and if our algorithms can exploit this unlabeled data effectively,
then we might be able to achieve better performance than the massive
hand-engineering and massive hand-labeling 
approaches.

In self-taught learning and unsupervised feature learning, we will give our
algorithms a large amount of unlabeled data with which to learn a good feature
representation of the input.  If we are trying to solve a specific
classification task, then we take this learned feature representation and
whatever labeled data we have for that classification task, and apply
supervised learning on that labeled data to solve the classification task.

These ideas are probably most powerful in settings where we have a lot of
unlabeled data, and a relatively smaller amount of labeled data.  However,
these models apply and have often given good results even if we have only
labeled data (in which case we usually perform the feature learning step using
the labeled data, but ignoring the labels). 

In terms of terminology, there are two common unsupervised feature learning
settings, depending on what type of unlabeled data you have.  Let's explain this
with an example.  Suppose your goal is a computer vision task where you'd like
to distinguish between images of cars and images of motorcycles.  Where can we
get lots of unlabeled data?  If you have lots of unlabeled images lying around
that are all images of ''either'' a car or a motorcycle, but where the data
is just missing its label (so you don't know which ones are cars, and which
ones are motorcycles), then you could use that data to learn the features.
This setting---where each unlabeled example is drawn from the same
distribution as your labeled examples (and thus can be labeled either "car"
or "motorcycle")---is usually called the '''semi-supervised''' setting;
unsupervised feature learning algorithms can be helpful for this.  In practice,
however, we often do not have this sort of unlabeled data.  (Where would you
get a database of images where every image is either a car or a motorcycle, but
it's just missing its label?)  Thus, we might instead learn our features using
a large collection of random images downloaded off the internet.  This latter
setting, in which the unlabeled data (random internet images) may be drawn from
a different distribution than the labeled data, is called the '''self-taught
learning''' setting.  In the self-taught learning setting, it is far easier to
obtain large amounts of unlabeled data, and thus leverage the potential of
learning from massive amounts of data.

== Learning features ==

We have already seen how an autoencoder can be used to learn features from
unlabeled data.  Concretely, suppose we have an unlabeled
training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math> 
with <math>\textstyle m_u</math> unlabeled examples.  (The subscript "u" stands for
"unlabeled.")  We can then train a sparse autoencoder on this data 
(perhaps with appropriate whitening or other pre-processing):

[[File:STL_SparseAE.png]]

Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
activations <math>\textstyle a</math> of the hidden units.  As we saw previously, this often gives a
better representation of the input than the original raw input <math>\textstyle x</math>.  We can also
visualize the algorithm for computing the features/activations <math>\textstyle a</math> as the following
neural network:

[[File:STL_SparseAE_Features.png]]

This is just the sparse autoencoder that we previously had, with the final
layer removed. 
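
The feature computation can be sketched in Python with NumPy as follows.  This is only an illustration under stated assumptions: the hidden units are taken to be sigmoid units, examples are stored as columns, and the function name <code>feedforward_features</code> as well as the randomly initialised <code>W1</code> and <code>b1</code> are hypothetical stand-ins for the parameters <math>\textstyle W^{(1)}, b^{(1)}</math> of an already trained autoencoder.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function (assumed hidden-unit activation)."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_features(W1, b1, X):
    """Compute hidden activations a = sigmoid(W1 x + b1) for each column of X.

    W1: (n_hidden, n_inputs), b1: (n_hidden,), X: (n_inputs, m).
    Returns an (n_hidden, m) array with one feature vector per column.
    """
    return sigmoid(W1 @ X + b1[:, None])

# Stand-in parameters; in practice these come from the trained autoencoder.
rng = np.random.default_rng(0)
n_inputs, n_hidden, m = 64, 25, 5
W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b1 = np.zeros(n_hidden)

X = rng.random((n_inputs, m))          # five example inputs as columns
A = feedforward_features(W1, b1, X)    # their feature vectors
print(A.shape)  # (25, 5)
```

Note that only <math>\textstyle W^{(1)}</math> and <math>\textstyle b^{(1)}</math> are needed here; the output-layer parameters are not used once the final layer has been removed.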

Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
We can now find a better representation for the inputs.  In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding
vector of activations <math>\textstyle a_l^{(1)}</math>.  To represent this example, we can
simply '''replace''' the original feature vector with <math>\textstyle a_l^{(1)}</math>.
Alternatively, we can '''concatenate''' the two feature vectors together,
getting a representation <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>. 

Thus, our training set now becomes 
<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})
\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the 
<math>\textstyle i</math>-th training example), or <math>\textstyle \{
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated
representation).  In practice, the concatenated representation often works
better, but for memory or computational reasons we will sometimes use
the replacement representation as well. 
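
The two representations can be illustrated with a short NumPy sketch.  The helper name <code>build_representation</code> and its <code>mode</code> argument are hypothetical, and the small arrays below are made-up placeholders for the raw inputs <math>\textstyle x_l^{(i)}</math> and activations <math>\textstyle a_l^{(i)}</math>, stored one example per column.

```python
import numpy as np

def build_representation(X, A, mode="concatenate"):
    """Return the inputs for supervised learning, one example per column.

    X: (n_inputs, m) raw inputs; A: (n_hidden, m) autoencoder activations.
    mode="replace" keeps only A; mode="concatenate" stacks X on top of A.
    """
    if mode == "replace":
        return A
    if mode == "concatenate":
        return np.vstack([X, A])
    raise ValueError(f"unknown mode: {mode!r}")

# Placeholder data: 3 examples, 4 raw inputs, 2 learned features each.
X_l = np.arange(12, dtype=float).reshape(4, 3)
A_l = np.full((2, 3), 0.5)

replaced = build_representation(X_l, A_l, mode="replace")
concatenated = build_representation(X_l, A_l, mode="concatenate")
print(replaced.shape)      # (2, 3)
print(concatenated.shape)  # (6, 3)
```

The concatenated representation has <math>\textstyle n_{\rm inputs} + n_{\rm hidden}</math> rows per example, which is where the extra memory and computation cost comes from.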

Finally, we can train a supervised learning algorithm such as an SVM, logistic
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values. 
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>.  Then, feed 
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction. 
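
The whole procedure (feature extraction, supervised training, and prediction on a test example) might look like the following sketch.  Everything here is an assumption made for illustration: the autoencoder parameters <code>W1</code>, <code>b1</code> are random stand-ins rather than trained values, the labeled data is synthetic, the hidden units are assumed to be sigmoid units, and the classifier is a plain logistic regression trained by batch gradient descent (any supervised learner could be substituted).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, X):
    """Hidden activations of the autoencoder for each column of X."""
    return sigmoid(W1 @ X + b1[:, None])

def train_logistic_regression(F, y, lr=0.5, n_iters=500):
    """Fit binary logistic regression by batch gradient descent.

    F: (d, m) feature matrix, one example per column; y: (m,) labels in {0, 1}.
    """
    d, m = F.shape
    theta, bias = np.zeros(d), 0.0
    for _ in range(n_iters):
        grad = sigmoid(theta @ F + bias) - y   # per-example prediction error
        theta -= lr * (F @ grad) / m
        bias -= lr * grad.mean()
    return theta, bias

def predict(theta, bias, F):
    """Predict a 0/1 label for each column of F."""
    return (sigmoid(theta @ F + bias) >= 0.5).astype(int)

rng = np.random.default_rng(0)
n_inputs, n_hidden, m_l = 8, 5, 200

# Stand-in for parameters learned from unlabeled data (random here).
W1 = rng.normal(scale=0.5, size=(n_hidden, n_inputs))
b1 = np.zeros(n_hidden)

# Synthetic labeled set: the label depends on the sign of the first input.
X_l = rng.normal(size=(n_inputs, m_l))
y_l = (X_l[0] > 0).astype(int)

# Training: build the concatenated representation (x, a), then fit the classifier.
A_l = extract_features(W1, b1, X_l)
F_l = np.vstack([X_l, A_l])
theta, bias = train_logistic_regression(F_l, y_l)
train_accuracy = np.mean(predict(theta, bias, F_l) == y_l)

# Test time: follow the same procedure with the same W1 and b1.
x_test = rng.normal(size=(n_inputs, 1))
a_test = extract_features(W1, b1, x_test)
y_pred = predict(theta, bias, np.vstack([x_test, a_test]))
print(train_accuracy, y_pred)
```

The key point is that the test example passes through exactly the same feature extractor and the same choice of representation (replacement or concatenation) that was used to build the labeled training set.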

{{Quote|
'''An important note about preprocessing.'''   During the feature learning
stage where we were learning from the unlabeled training set 
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters.  For example, one may have computed
a mean value of the data and subtracted off this mean to perform mean normalization,
or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or PCA 
whitening or ZCA whitening).  If this is the case, then it is important to
save away these preprocessing parameters, and to use the ''same'' parameters
during the labeled training phase and the test phase, so as to make sure
we are always transforming the data the same way to feed into the autoencoder. 
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data.  We should '''not''' re-estimate a
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the
labeled training set, since that might result in a dramatically different
pre-processing transformation, which would make the input distribution to
the autoencoder very different from what it was actually trained on. 
}}
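
The note above can be made concrete with a small NumPy sketch.  It assumes mean normalization followed by a PCA projection onto the top <math>\textstyle k</math> components; the function names <code>fit_preprocessing</code> and <code>apply_preprocessing</code>, the value of <code>k</code>, and the synthetic data are all hypothetical.  The parameters are estimated once, from the unlabeled data only, and then reused unchanged.

```python
import numpy as np

def fit_preprocessing(X_u, k):
    """Estimate preprocessing parameters from the unlabeled data only.

    X_u: (n, m_u), one example per column.
    Returns the data mean, shape (n, 1), and the top-k PCA basis U, shape (n, k).
    """
    mean = X_u.mean(axis=1, keepdims=True)
    Xc = X_u - mean
    sigma = Xc @ Xc.T / X_u.shape[1]       # covariance matrix of the data
    U, _, _ = np.linalg.svd(sigma)         # columns sorted by decreasing variance
    return mean, U[:, :k]

def apply_preprocessing(X, mean, U):
    """Apply saved parameters: subtract the saved mean, then project with U^T."""
    return U.T @ (X - mean)

rng = np.random.default_rng(0)
X_u = rng.normal(loc=3.0, size=(10, 500))   # unlabeled data
X_l = rng.normal(loc=1.0, size=(10, 40))    # labeled inputs (different distribution)
x_test = rng.normal(loc=1.0, size=(10, 1))  # a test input

# Fit once on the unlabeled data and save the parameters.
mean_u, U = fit_preprocessing(X_u, k=4)

# Reuse the SAME mean and U everywhere; do not refit on X_l or x_test.
Z_u = apply_preprocessing(X_u, mean_u, U)
Z_l = apply_preprocessing(X_l, mean_u, U)
z_test = apply_preprocessing(x_test, mean_u, U)
print(Z_u.shape, Z_l.shape, z_test.shape)  # (4, 500) (4, 40) (4, 1)
```

In this sketch the labeled data deliberately has a different mean from the unlabeled data, so <code>Z_l</code> is not zero-mean after preprocessing.  That is expected: the goal is to apply the same transformation the autoencoder was trained with, not to re-normalize each data set separately.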