Self-Taught Learning

== Overview ==

In machine learning, one of the most reliable ways to get better performance is to give your algorithms more data. This has led to the aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."

One can always try to get more labeled data, but this can be expensive. In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets. While having large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better. The promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from ''unlabeled'' data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In Self-taught learning and Unsupervised feature learning, we will give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input. If we are trying to solve a specific classification task, then we take this learned feature representation and whatever labeled data we have for that classification task (perhaps only a small amount), and apply supervised learning on that labeled data to solve the classification task. These ideas are probably most powerful in settings where we have a lot of unlabeled data and a relatively smaller amount of labeled data. However, these models often give good results even if we have only labeled data (in which case we usually perform the feature learning step using the labeled data, but ignoring the labels).
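
To make this pipeline concrete, here is a minimal sketch in Python. It assumes scikit-learn and uses made-up data arrays, with PCA standing in for the unsupervised feature learner (any feature learner, such as a sparse autoencoder, slots into the same place):

<pre>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical data: the unlabeled pool is large, the labeled set is small.
rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(10000, 64))  # e.g. flattened 8x8 image patches
X_labeled = rng.normal(size=(100, 64))
y_labeled = rng.integers(0, 2, size=100)

# Step 1: learn a feature representation from the unlabeled data alone.
feature_learner = PCA(n_components=16).fit(X_unlabeled)

# Step 2: re-represent the small labeled set with the learned features.
features = feature_learner.transform(X_labeled)

# Step 3: ordinary supervised learning on the learned representation.
clf = LogisticRegression(max_iter=1000).fit(features, y_labeled)

# At test time, apply the same transform before classifying.
X_test = rng.normal(size=(5, 64))
predictions = clf.predict(feature_learner.transform(X_test))
</pre>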

In terms of terminology, there are two common unsupervised feature learning settings, depending on what type of unlabeled data you have. Let's explain this with an example. Suppose your goal is a computer vision task where you would like to distinguish between images of cars and images of motorcycles, so that each labeled example in your training set is either an image of a car or an image of a motorcycle. If we had downloaded a large set of unlabeled images containing only cars and motorcycles, where each image is just missing its label (so you don't know which ones are cars, and which ones are motorcycles), then you could use that data to learn the features. This setting---where each unlabeled example is drawn from the same distribution as your labeled examples (and thus can be labeled either "car" or "motorcycle")---is usually called the '''semi-supervised''' setting. In contrast, if we had simply downloaded random images off the internet, the unlabeled examples might comprise cars, motorcycles, buildings, people, and so on, and thus be drawn from a different distribution than the labeled data. This setting is usually called the '''self-taught learning''' setting. In the self-taught learning setting, it is far easier to obtain large amounts of unlabeled data, and thus leverage the potential of learning from massive amounts of data.
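
The difference between the two settings lies only in where the unlabeled pool comes from; the feature learning step itself is unchanged. A toy illustration in Python (the image names and their true categories are made up):

<pre>
# Hypothetical downloaded images, tagged with their true (unknown to us) contents.
pool = [("img_000", "car"), ("img_001", "motorcycle"), ("img_002", "building"),
        ("img_003", "person"), ("img_004", "car"), ("img_005", "dog")]

# Semi-supervised setting: unlabeled examples come from the task's own
# classes; only the labels are missing.
semi_supervised_unlabeled = [name for name, contents in pool
                             if contents in ("car", "motorcycle")]

# Self-taught setting: the unlabeled pool is whatever we could download,
# regardless of its contents; this is why it is so much easier to collect.
self_taught_unlabeled = [name for name, _ in pool]

# Either pool feeds the same feature-learning step as above.
</pre>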
== Learning features ==
