== Overview ==

In machine learning, sometimes it's not who has the best algorithm that wins. It's who has the most data.

== Learning features ==

We have already seen how an autoencoder can be used to learn features from unlabeled data.

Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
We can now find a better representation for the inputs. In particular, rather
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed
<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder and obtain the corresponding
vector of hidden unit activations <math>\textstyle a_l^{(1)}</math>. We can then represent each
example either by its activations <math>\textstyle a_l^{(i)}</math> alone, or by the concatenation
<math>\textstyle (x_l^{(i)}, a_l^{(i)})</math>, and train a classifier on this new representation.

Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}^{(1)}</math>. Then, feed
either <math>\textstyle a_{\rm test}^{(1)}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test}^{(1)})</math> to the trained classifier to get a prediction.
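
To make the recipe concrete, here is a minimal sketch of the feature-extraction step in Python, assuming a sigmoid autoencoder whose first-layer parameters have already been trained on the unlabeled data. The variable names (<code>W1</code>, <code>b1</code>, <code>X_labeled</code>) are illustrative, not part of these notes:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(W1, b1, X):
    """Forward-propagate inputs through the autoencoder's hidden layer.
    X holds one example per column; returns the hidden activations a."""
    return sigmoid(W1 @ X + b1[:, np.newaxis])

# Stand-ins for a trained sparse autoencoder and a labeled training set
# (in practice, W1 and b1 come from autoencoder training on unlabeled data).
rng = np.random.default_rng(0)
d, k, m_l = 64, 25, 100                      # input dim, hidden units, examples
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
X_labeled = rng.normal(size=(d, m_l))

A = extract_features(W1, b1, X_labeled)      # "replacement" features a
XA = np.vstack([X_labeled, A])               # concatenation (x, a)
# Either A or XA can now be fed to any off-the-shelf classifier.
</pre>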

{{Quote|
'''An important note about preprocessing.''' During the feature learning
stage where we were learning from the ''unlabeled'' training set, we may have
computed various pre-processing parameters, such as a data mean for mean
normalization or a PCA/whitening matrix. It is important to save these
parameters and to apply the exact same transformation to the labeled training
examples and to test examples. In particular, we should not re-estimate the
parameters from the labeled training set, since that might result in a
dramatically different pre-processing transformation, which would make the
input distribution to the autoencoder very different from what it was
actually trained on.
}}
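
As a concrete illustration of this point, here is a small sketch of fitting pre-processing parameters on the unlabeled set only and reusing them everywhere else. Mean normalization is used for brevity; the same pattern applies to a PCA or whitening matrix. The helper names are hypothetical:

<pre>
import numpy as np

def fit_preprocessing(X_unlabeled):
    """Estimate pre-processing parameters from the UNLABELED data only."""
    return X_unlabeled.mean(axis=1, keepdims=True)

def apply_preprocessing(X, mu):
    """Apply the same saved transformation to any later data."""
    return X - mu

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(loc=3.0, size=(64, 10000))
X_labeled = rng.normal(loc=3.0, size=(64, 100))

mu = fit_preprocessing(X_unlabeled)            # computed once, then saved
X_labeled_pp = apply_preprocessing(X_labeled, mu)
# Do NOT re-estimate mu from X_labeled or from test data.
</pre>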

== Fine-tuning ==

Suppose we are doing self-taught learning, and have trained a sparse
autoencoder on our unlabeled data. Given a new example <math>\textstyle x</math>, we can use the
hidden layer to extract features <math>\textstyle a</math>. This is shown as follows:

[[PICTURE]]

Now, we are interested in solving a classification task, where our goal is to
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),
(x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples.
Suppose we replace the original features <math>\textstyle x^{(i)}</math> with the features <math>\textstyle a^{(i)}</math>
computed by the sparse autoencoder. This gives us a training set <math>\textstyle \{(a^{(1)},
y^{(1)}), \ldots, (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y</math>.
As before, we can draw our logistic unit (shown in red for illustration) as
follows:

[[PICTURE]]
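
Here is a minimal sketch of this second stage, fitting a logistic classifier on the autoencoder features by gradient descent; binary labels are assumed for simplicity, and the names (<code>train_logistic</code>, <code>A</code>, <code>y</code>) are illustrative:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(A, y, lr=0.5, iters=1000):
    """Fit a logistic classifier mapping features a to labels y.
    A is k x m (one example per column); y is a length-m 0/1 vector."""
    k, m = A.shape
    W2, b2 = np.zeros(k), 0.0
    for _ in range(iters):
        p = sigmoid(W2 @ A + b2)          # predicted P(y = 1 | a)
        W2 -= lr * (A @ (p - y)) / m      # gradient of the logistic loss
        b2 -= lr * (p - y).mean()
    return W2, b2

# Example with placeholder features and labels:
rng = np.random.default_rng(0)
A, y = rng.random((25, 100)), rng.integers(0, 2, 100)
W2, b2 = train_logistic(A, y)
</pre>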

If we now look at the final classifier that we've learned, in terms
of what function it computes given a new test example <math>\textstyle x</math>, we
see that it can be drawn by putting the two pictures above together. In
particular, the final classifier looks like this:

[[PICTURE]]

This model was trained in two stages. The first layer of weights <math>\textstyle W^{(1)}</math>,
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math>, was trained
as part of the sparse autoencoder training process. The second layer
of weights <math>\textstyle W^{(2)}</math>, mapping from the activations to the output <math>\textstyle y</math>, was
trained using logistic regression.
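
In code, the function computed by this combined classifier is just the composition of the two trained layers. A sketch, reusing the illustrative <code>sigmoid</code> helper and parameter names from the snippets above:

<pre>
def predict(W1, b1, W2, b2, x):
    """Compute the final classifier's output for a new example x."""
    a = sigmoid(W1 @ x + b1)      # first layer: autoencoder features
    return sigmoid(W2 @ a + b2)   # second layer: logistic classifier
</pre>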

But the final algorithm is clearly just one big neural network. So,
we can also carry out further '''fine-tuning''' of the weights to
improve the overall classifier's performance. In particular,
having trained the first layer using an autoencoder and the second layer
via logistic regression (this process is sometimes called '''pre-training''',
and sometimes, more generally, unsupervised feature learning),
we can now perform gradient descent from the current value of
the weights to try to further drive down the training error.
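
A sketch of one such fine-tuning step, backpropagating the squared-error gradient through both layers at once. Sigmoid activations and a single logistic output are assumed, and all names are illustrative; in practice one would typically use L-BFGS or conjugate gradient rather than plain gradient descent:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_step(W1, b1, W2, b2, X, Y, lr=0.1):
    """One gradient-descent step on the squared error
    (1/2m) sum_i ||h_W(x^(i)) - y^(i)||^2, updating BOTH layers."""
    m = X.shape[1]
    A = sigmoid(W1 @ X + b1[:, None])        # hidden activations a
    H = sigmoid(W2 @ A + b2[:, None])        # network output h_W(x)
    delta2 = (H - Y) * H * (1 - H)           # backprop through the output layer
    delta1 = (W2.T @ delta2) * A * (1 - A)   # backprop into the hidden layer
    W2 -= lr * (delta2 @ A.T) / m
    b2 -= lr * delta2.sum(axis=1) / m
    W1 -= lr * (delta1 @ X.T) / m
    b1 -= lr * delta1.sum(axis=1) / m
    return W1, b1, W2, b2

# Initial values: W1, b1 from the autoencoder; W2, b2 from logistic regression.
rng = np.random.default_rng(0)
d, k, m = 64, 25, 100
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
W2, b2 = rng.normal(size=(1, k)), np.zeros(1)
X, Y = rng.normal(size=(d, m)), rng.integers(0, 2, (1, m)).astype(float)
W1, b1, W2, b2 = finetune_step(W1, b1, W2, b2, X, Y)
</pre>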

== Discussion ==

Given that the whole algorithm is just a big neural network, why don't we just
carry out the fine-tuning step, without doing any pre-training/unsupervised
feature learning? There are several reasons:

<ul>
<li> First and most important, labeled data is often scarce, while unlabeled
data is cheap and plentiful. The promise of self-taught learning is that by
exploiting the massive amount of unlabeled data, we can learn much better
models. The fine-tuning step can be done using only labeled data. In
contrast, by using unlabeled data to learn a good initial value for the
first layer of weights <math>\textstyle W^{(1)}</math>, we usually get much better classifiers
after fine-tuning. </li>

<li> Second, training a neural network using supervised learning involves
solving a highly non-convex optimization problem (say, minimizing the training
error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters
<math>\textstyle W</math>).
The optimization problem can therefore be rife with local optima, and training
with gradient descent (or methods like conjugate gradient and L-BFGS) does not
work well. In contrast, by first initializing the parameters using an
unsupervised feature learning/pre-training step, we can end up at much better
solutions. (Actually, pre-training has benefits beyond just helping to escape
local optima. In particular, it has been shown to also have a useful
"regularization" effect (Erhan et al., 2010), but a full discussion is beyond
the scope of these notes.) </li>
</ul>
| + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier
| + | |
- | to get a prediction.
| + | |
- | | + | |
- | use just a two layer network (layers <math>\textstyle L_1</math> and <math>\textstyle L_2</math>) as described
| + | |
- | above to extract features from our input.
| + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |
- | | + | |

== Working with large images ==