Self-Taught Learning

From Ufldl

== Overview ==

Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data.  This has led to the aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."

One can always try to get more labeled data, but this can be expensive.  In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets.  While having
large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better.  In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn
from ''unlabeled'' data, then we can easily obtain and learn from massive
amounts of it.  Even though a single unlabeled example is less informative than
a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In Self-taught learning and Unsupervised feature learning, we will give our
algorithms a large amount of unlabeled data with which to learn a good feature
representation of the input.  If we are trying to solve a specific
classification task, then we take this learned feature representation and whatever (perhaps small amount of) labeled data we have for that classification task, and apply supervised learning on that labeled data to solve the classification task.

These ideas probably have the most powerful effects in problems where we have a lot of unlabeled data, and a smaller amount of labeled data.  However, they typically give good results even if we have only labeled data (in which case we usually perform the feature learning step using the labeled data, but ignoring the labels).

== Learning features ==

We have already seen how an autoencoder can be used to learn features from
unlabeled data.  Concretely, suppose we have an unlabeled training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math> of <math>\textstyle m_u</math> unlabeled examples.  (The subscript "u" stands for "unlabeled.")  We can then train a sparse autoencoder on this data
(perhaps with appropriate whitening or other pre-processing):

[[File:STL_SparseAE.png|350px]]
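
For reference, the training objective minimized here is the sparse autoencoder cost from the earlier notes (up to the exact notation used there): a reconstruction term plus a weight decay term and a sparsity penalty.  Writing <math>\textstyle \lambda</math> for the weight-decay parameter, <math>\textstyle \beta</math> for the sparsity weight, <math>\textstyle \rho</math> for the sparsity target, and <math>\textstyle \hat\rho_j</math> for the average activation of hidden unit <math>\textstyle j</math> over the unlabeled examples, this is

<math>
J_{\rm sparse}(W,b) \;=\; \frac{1}{m_u} \sum_{i=1}^{m_u} \frac{1}{2} \left\| h_{W,b}(x_u^{(i)}) - x_u^{(i)} \right\|^2
\;+\; \frac{\lambda}{2} \sum_{l,i,j} \left( W^{(l)}_{ji} \right)^2
\;+\; \beta \sum_{j} {\rm KL}\!\left(\rho \,\|\, \hat\rho_j\right).
</math>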

Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of
activations <math>\textstyle a</math> of the hidden units.  As we saw previously, this often gives a better representation of the input than the raw input <math>\textstyle x</math> itself.  We can also visualize the algorithm for computing the features as the following
neural network:

[[File:STL_SparseAE_Features.png|300px]]

This is just the sparse autoencoder that we previously had, with the final layer removed.
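
As a concrete illustration, a minimal NumPy sketch of this feature-extraction step might look as follows (the names W1 and b1 stand for the learned <math>\textstyle W^{(1)}</math> and <math>\textstyle b^{(1)}</math>, and a sigmoid activation is assumed; the sketch is illustrative rather than a prescribed implementation):

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(x, W1, b1):
    """Compute the hidden-unit activations a = f(W1 x + b1) of a trained
    sparse autoencoder; only the encoder half (layer 1) is needed."""
    return sigmoid(W1.dot(x) + b1)

# Example with placeholder parameters: 25 hidden units, 64-dimensional inputs.
# In practice W1, b1 come from the sparse autoencoder training step.
W1 = np.random.randn(25, 64) * 0.01   # placeholder for learned weights
b1 = np.zeros(25)                     # placeholder for learned biases
x = np.random.randn(64)               # a new input example
a = extract_features(x, W1, b1)       # the learned feature representation of x
</pre>
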
Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.  (The subscript "l" stands for "labeled.")  We can now find a better representation for the inputs.  In particular, rather than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed <math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding
vector of activations <math>\textstyle a_l^{(1)}</math>.  Thus, we can describe our training set either as <math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots, (a_l^{(m_l)}, y^{(m_l)})
\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the  
<math>\textstyle i</math>-th training example), or <math>\textstyle \{
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots,
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated
representation).  In practice, the concatenated representation often works
better.  Finally, we can train a supervised learning algorithm such as an SVM, logistic
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values.  
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure: first, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>.  Then, feed either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.
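
A minimal sketch of how one might assemble the replacement or concatenated representation and use it with an off-the-shelf classifier follows; here extract_features refers to the encoder sketched above, and classifier stands for any supervised learner (an SVM, logistic regression, etc.), so the surrounding names are illustrative assumptions:

<pre>
import numpy as np

def build_representation(X, feature_fn, concatenate=True):
    """Map each labeled input x to its learned features a = feature_fn(x),
    returning either the features alone (replacement representation) or
    [x, a] stacked together (concatenated representation)."""
    A = np.array([feature_fn(x) for x in X])
    return np.hstack([X, A]) if concatenate else A

# Training: transform the labeled inputs, then fit any off-the-shelf classifier.
#   X_new = build_representation(X_labeled, lambda x: extract_features(x, W1, b1))
#   classifier.fit(X_new, y_labeled)
#
# Test time: transform x_test with the *same* feature function before predicting.
#   x_test_new = build_representation(x_test[np.newaxis, :],
#                                     lambda x: extract_features(x, W1, b1))
#   prediction = classifier.predict(x_test_new)
</pre>
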

== On pre-processing the data ==

During the feature learning stage where we were learning from the unlabeled training set
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed
various pre-processing parameters.  For example, one may have computed a mean value of the data and subtracted off this mean to perform mean normalization, or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used PCA
whitening or ZCA whitening).  If this is the case, then it is important to
save away these preprocessing parameters, and to use the ''same'' parameters during the labeled training phase and the test phase, so as to make sure we are always transforming the data the same way to feed into the autoencoder.  In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the
labeled examples and the test data.  We should '''not''' re-estimate a different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the labeled training set, since that might result in a dramatically different pre-processing transformation, which would make the input distribution to the autoencoder very different from what it was actually trained on.
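
A minimal sketch of this "estimate once on the unlabeled data, reuse everywhere" discipline, assuming mean normalization followed by a PCA projection (the function and variable names are illustrative, and whitening would reuse the same saved quantities in the same way):

<pre>
import numpy as np

def fit_preprocessing(X_unlabeled):
    """Estimate the pre-processing parameters (mean and PCA matrix U)
    once, from the unlabeled data only."""
    mean = X_unlabeled.mean(axis=0)
    Xc = X_unlabeled - mean
    # Eigenvectors of the covariance matrix give the PCA basis U.
    U, _, _ = np.linalg.svd(Xc.T @ Xc / Xc.shape[0])
    return mean, U

def apply_preprocessing(X, mean, U):
    # Always subtract the *same* mean and project with the *same* U.
    return (X - mean) @ U

# mean, U = fit_preprocessing(X_unlabeled)                  # estimated once
# X_labeled_pp = apply_preprocessing(X_labeled, mean, U)    # reuse, do not refit
# X_test_pp    = apply_preprocessing(X_test, mean, U)       # reuse, do not refit
</pre>
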

== On the terminology of unsupervised feature learning ==

There are two common unsupervised feature learning settings, depending on what type of unlabeled data you have.  The more general and powerful setting is the '''self-taught learning''' setting, which does not assume that your unlabeled data <math>x_u</math> has to be drawn from the same distribution as your labeled data <math>x_l</math>.  The more restrictive setting where the unlabeled data comes from exactly the same distribution as the labeled data is sometimes called the '''semi-supervised learning''' setting.  This distinction is best explained with an example, which we now give.

Suppose your goal is a computer vision task where you'd like to distinguish between images of cars and images of motorcycles; so, each labeled example in your training set is either an image of a car or an image of a motorcycle.  Where can we get lots of unlabeled data?  The easiest way would be to obtain some random collection of images, perhaps downloaded off the internet.  We could then train the autoencoder on this large collection of images, and obtain useful features from them.  Because here the unlabeled data is drawn from a different distribution than the labeled data (i.e., perhaps some of our unlabeled images may contain cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we call this self-taught learning.

In contrast, if we happen to have lots of unlabeled images lying around that are all images of ''either'' a car or a motorcycle, but where the data is just missing its label (so you don't know which ones are cars, and which ones are motorcycles), then we could use this form of unlabeled data to learn the features.  This setting---where each unlabeled example is drawn from the same distribution as your labeled examples---is sometimes called the semi-supervised setting.  In practice, we often do not have this sort of unlabeled data (where would you get a database of images where every image is either a car or a motorcycle, but just missing its label?), and so in the context of learning features from unlabeled data, the self-taught learning setting is more broadly applicable.

== Fine-tuning ==

Suppose we are doing self-taught learning, and have trained a sparse autoencoder on our unlabeled data.  Given a new example <math>\textstyle x</math>, we can use the hidden layer to extract features <math>\textstyle a</math>.  This is shown as follows:

[[PICTURE]]

Now, we are interested in solving a classification task, where our goal is to predict labels <math>\textstyle y</math>.  We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.  Suppose we replace the original features <math>\textstyle x^{(i)}</math> with the features <math>\textstyle a^{(i)}</math> computed by the sparse autoencoder.  This gives us a training set <math>\textstyle \{ (a^{(1)}, y^{(1)}), (a^{(2)}, y^{(2)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>.  Finally, we train a logistic classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>.  As before, we can draw our logistic unit (shown in red for illustration) as follows:

[[PICTURE]]

If we now look at the final classifier that we've learned, in terms of what function it computes given a new test example <math>\textstyle x</math>, we see that it can be drawn by putting the two pictures above together.  In particular, the final classifier looks like this:

[[PICTURE]]

This model was trained in two stages.  The first layer of weights <math>\textstyle W^{(1)}</math> mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> was trained as part of the sparse autoencoder training process.  The second layer of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was trained using logistic regression.

But the final algorithm is clearly just a whole big neural network.  So, we can also carry out further '''fine-tuning''' of the weights to improve the overall classifier's performance.  In particular, having trained the first layer using an autoencoder and the second layer via logistic regression (this process is sometimes called '''pre-training''', and sometimes more generally unsupervised feature learning), we can now perform gradient descent from the current value of the weights to try to further drive down the training error.
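
To make the fine-tuning step concrete, here is a minimal sketch of one joint gradient-descent update on both layers, assuming a single binary logistic output unit and a cross-entropy loss; the names W1, b1 (autoencoder layer) and W2, b2 (logistic layer) mirror the notation above, and the exact loss and optimizer are implementation choices rather than anything prescribed by these notes:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_step(x, y, W1, b1, W2, b2, lr=0.01):
    """One gradient-descent step on the cross-entropy loss of the stacked
    network x -> a = sigmoid(W1 x + b1) -> p = sigmoid(W2.a + b2).
    Both layers are updated, not just the logistic layer."""
    # Forward pass
    a = sigmoid(W1.dot(x) + b1)          # hidden features (pre-trained layer)
    p = sigmoid(W2.dot(a) + b2)          # predicted probability of class y = 1
    # Backward pass
    delta2 = p - y                       # error at the output unit
    grad_W2 = delta2 * a                 # gradient for the logistic layer
    grad_b2 = delta2
    delta1 = delta2 * W2 * a * (1 - a)   # error propagated to the hidden layer
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1
    # Update all parameters (fine-tuning adjusts W1, b1 as well)
    return (W1 - lr * grad_W1, b1 - lr * grad_b1,
            W2 - lr * grad_W2, b2 - lr * grad_b2)
</pre>
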

== Discussion ==

Given that the whole algorithm is just a big neural network, why don't we just carry out the fine-tuning step, without doing any pre-training/unsupervised feature learning?  There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, and unlabeled data is cheap and plentiful.  The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models.  The fine-tuning step can be done only using labeled data.  In contrast, by using unlabeled data to learn a good initial value for the first layer of weights <math>\textstyle W^{(1)}</math>, we usually get much better classifiers after fine-tuning.</li>
<li> Second, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>).  The optimization problem can therefore be rife with local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) does not work well.  In contrast, by first initializing the parameters using an unsupervised feature learning/pre-training step, we can end up at much better solutions.  (Actually, pre-training has benefits beyond just helping to get out of local optima; in particular, it has been shown to also have a useful "regularization" effect (Erhan et al., 2010), but a full discussion is beyond the scope of these notes.)</li>
</ul>
{{STL}}
{{Languages|自我学习|中文}}
