== Overview ==

In machine learning, sometimes it's not who has the best algorithm that wins. It's who has the most data.

== Learning features ==

We have already seen how an autoencoder can be used to learn features from unlabeled data.

Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
We can now find a better representation for the inputs. In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder and obtain the corresponding
vector of hidden unit activations <math>\textstyle a_l^{(1)}</math>. We can then represent each
example either by its activations <math>\textstyle a_l^{(i)}</math> alone, or by the concatenation
<math>\textstyle (x_l^{(i)}, a_l^{(i)})</math>, and train a classifier on this new representation.

Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}^{(1)}</math>. Then, feed
either <math>\textstyle a_{\rm test}^{(1)}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test}^{(1)})</math> to the trained classifier to get a prediction.
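
To make the recipe concrete, here is a minimal sketch of the feature-extraction step in Python, assuming a sigmoid autoencoder whose first-layer parameters have already been trained on the unlabeled data. The variable names (<code>W1</code>, <code>b1</code>, <code>X_labeled</code>) are illustrative, not part of these notes:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, X):
    """Forward-propagate inputs through the autoencoder's hidden layer.
    X holds one example per column; returns the hidden activations a."""
    return sigmoid(W1 @ X + b1[:, np.newaxis])

# Stand-ins for a trained sparse autoencoder and a labeled training set
# (in practice, W1 and b1 come from autoencoder training on unlabeled data).
rng = np.random.default_rng(0)
d, k, m_l = 64, 25, 100                      # input dim, hidden units, examples
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
X_labeled = rng.normal(size=(d, m_l))

A = extract_features(W1, b1, X_labeled)      # "replacement" features a
XA = np.vstack([X_labeled, A])               # concatenation (x, a)
# Either A or XA can now be fed to any off-the-shelf classifier.
</pre>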

{{Quote|
'''An important note about preprocessing.''' During the feature learning
stage where we were learning from the ''unlabeled'' training set, we may have
computed various pre-processing parameters, such as a data mean for mean
normalization or a PCA/whitening matrix. It is important to save these
parameters and to apply the exact same transformation to the labeled training
examples and to test examples. In particular, we should not re-estimate the
parameters from the labeled training set, since that might result in a
dramatically different pre-processing transformation, which would make the
input distribution to the autoencoder very different from what it was
actually trained on.
}}
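
As a concrete illustration of this point, here is a small sketch of fitting pre-processing parameters on the unlabeled set only and reusing them everywhere else. Mean normalization is used for brevity; the same pattern applies to a PCA or whitening matrix. The helper names are hypothetical:

<pre>
import numpy as np

def fit_preprocessing(X_unlabeled):
    """Estimate pre-processing parameters from the UNLABELED data only."""
    return X_unlabeled.mean(axis=1, keepdims=True)

def apply_preprocessing(X, mu):
    """Apply the same saved transformation to any later data."""
    return X - mu

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(loc=3.0, size=(64, 10000))
X_labeled = rng.normal(loc=3.0, size=(64, 100))

mu = fit_preprocessing(X_unlabeled)            # computed once, then saved
X_labeled_pp = apply_preprocessing(X_labeled, mu)
# Do NOT re-estimate mu from X_labeled or from test data.
</pre>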

== Fine-tuning ==

Suppose we are doing self-taught learning, and have trained a sparse
autoencoder on our unlabeled data. Given a new example <math>\textstyle x</math>, we can use the
hidden layer to extract features <math>\textstyle a</math>. This is shown as follows:

[[PICTURE]]

Now, we are interested in solving a classification task, where our goal is to
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
Suppose we replace the original features <math>\textstyle x^{(i)}</math> with the features <math>\textstyle a^{(i)}</math>
computed by the sparse autoencoder. This gives us a training set <math>\textstyle \{(a^{(1)},
y^{(1)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>.
As before, we can draw our logistic unit (shown in red for illustration) as
follows:

[[PICTURE]]
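
Here is a minimal sketch of this second stage, fitting a logistic classifier on the autoencoder features by gradient descent; binary labels are assumed for simplicity, and the names (<code>train_logistic</code>, <code>A</code>, <code>y</code>) are illustrative:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(A, y, lr=0.5, iters=1000):
    """Fit a logistic classifier mapping features a to labels y.
    A is k x m (one example per column); y is a length-m 0/1 vector."""
    k, m = A.shape
    W2, b2 = np.zeros(k), 0.0
    for _ in range(iters):
        p = sigmoid(W2 @ A + b2)          # predicted P(y = 1 | a)
        W2 -= lr * (A @ (p - y)) / m      # gradient of the logistic loss
        b2 -= lr * (p - y).mean()
    return W2, b2

# Example with placeholder features and labels:
rng = np.random.default_rng(0)
A, y = rng.random((25, 100)), rng.integers(0, 2, 100)
W2, b2 = train_logistic(A, y)
</pre>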

If we now look at the final classifier that we've learned, in terms
of what function it computes given a new test example <math>\textstyle x</math>, we
see that it can be drawn by putting the two pictures above together. In
particular, the final classifier looks like this:

[[PICTURE]]

This model was trained in two stages. The first layer of weights <math>\textstyle W^{(1)}</math>,
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math>, was trained
as part of the sparse autoencoder training process. The second layer
of weights <math>\textstyle W^{(2)}</math>, mapping from the activations to the output <math>\textstyle y</math>, was
trained using logistic regression.
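
In code, the function computed by this combined classifier is just the composition of the two trained layers. A sketch, reusing the illustrative <code>sigmoid</code> helper and parameter names from the snippets above:

<pre>
def predict(W1, b1, W2, b2, x):
    """Compute the final classifier's output for a new example x."""
    a = sigmoid(W1 @ x + b1)      # first layer: autoencoder features
    return sigmoid(W2 @ a + b2)   # second layer: logistic classifier
</pre>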

But the final algorithm is clearly just one big neural network. So,
we can also carry out further '''fine-tuning''' of the weights to
improve the overall classifier's performance. In particular,
having trained the first layer using an autoencoder and the second layer
via logistic regression (this process is sometimes called '''pre-training''',
and sometimes, more generally, unsupervised feature learning),
we can now perform gradient descent from the current value of
the weights to try to further drive down the training error.
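
A sketch of one such fine-tuning step, backpropagating the squared-error gradient through both layers at once. Sigmoid activations and a single logistic output are assumed, and all names are illustrative; in practice one would typically use L-BFGS or conjugate gradient rather than plain gradient descent:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_step(W1, b1, W2, b2, X, Y, lr=0.1):
    """One gradient-descent step on the squared error
    (1/2m) sum_i ||h_W(x^(i)) - y^(i)||^2, updating BOTH layers."""
    m = X.shape[1]
    A = sigmoid(W1 @ X + b1[:, None])        # hidden activations a
    H = sigmoid(W2 @ A + b2[:, None])        # network output h_W(x)
    delta2 = (H - Y) * H * (1 - H)           # backprop through the output layer
    delta1 = (W2.T @ delta2) * A * (1 - A)   # backprop into the hidden layer
    W2 -= lr * (delta2 @ A.T) / m
    b2 -= lr * delta2.sum(axis=1) / m
    W1 -= lr * (delta1 @ X.T) / m
    b1 -= lr * delta1.sum(axis=1) / m
    return W1, b1, W2, b2

# Initial values: W1, b1 from the autoencoder; W2, b2 from logistic regression.
rng = np.random.default_rng(0)
d, k, m = 64, 25, 100
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
W2, b2 = rng.normal(size=(1, k)), np.zeros(1)
X, Y = rng.normal(size=(d, m)), rng.integers(0, 2, (1, m)).astype(float)
W1, b1, W2, b2 = finetune_step(W1, b1, W2, b2, X, Y)
</pre>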

== Discussion ==

Given that the whole algorithm is just a big neural network, why don't we just
carry out the fine-tuning step, without doing any pre-training/unsupervised
feature learning? There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, while unlabeled
data is cheap and plentiful. The promise of self-taught learning is that by
exploiting the massive amount of unlabeled data, we can learn much better
models. The fine-tuning step can be done using only labeled data. In
contrast, by using unlabeled data to learn a good initial value for the
first layer of weights <math>\textstyle W^{(1)}</math>, we usually get much better classifiers
after fine-tuning. </li>

<li> Second, training a neural network using supervised learning involves
solving a highly non-convex optimization problem (say, minimizing the training
error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters
<math>\textstyle W</math>).
The optimization problem can therefore be rife with local optima, and training
with gradient descent (or methods like conjugate gradient and L-BFGS) does not
work well. In contrast, by first initializing the parameters using an
unsupervised feature learning/pre-training step, we can end up at much better
solutions. (Actually, pre-training has benefits beyond just helping to escape
local optima. In particular, it has been shown to also have a useful
"regularization" effect (Erhan et al., 2010), but a full discussion is beyond
the scope of these notes.) </li>
</ul>
| + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier
| + | |
- | to get a prediction.
| + | |
- | | + | |
- | use just a two layer network (layers <math>\textstyle L_1</math> and <math>\textstyle L_2</math>) as described
| + | |
- | above to extract features from our input.
| + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |

== Working with large images ==