Deep Networks: Overview
== Greedy layer-wise training ==
How can we train a deep network? One method that has seen some
success is the '''greedy layer-wise training''' method. We describe this
method in detail in later sections, but briefly, the main idea is to train the
layers of the network one at a time: we first train a network with 1
hidden layer, and only after that is done, train a network with 2 hidden layers,
and so on. At each step, we take the old network with <math>k-1</math> hidden
layers, and add an additional <math>k</math>-th hidden layer (that takes as
input the hidden layer <math>k-1</math> that we had just
trained). Training can either be supervised (say, with classification error as
the objective function on each step), or, more frequently, unsupervised (as in
an autoencoder; details to be provided later). The weights from training the
layers individually are then used to initialize the weights
in the final/overall deep network, and only then is the entire architecture
"fine-tuned" (i.e., trained together to optimize the labeled training set error).
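
To make this concrete, here is a minimal sketch of greedy layer-wise
pretraining using one-hidden-layer autoencoders, written in Python with NumPy.
The layer sizes, learning rate, tied weights, and squared-error reconstruction
objective are illustrative assumptions of this sketch, not details fixed by
this overview:

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=100):
    """Train one autoencoder layer to reconstruct its input X."""
    n_visible = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)              # encoder bias
    c = np.zeros(n_visible)             # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encode
        X_hat = sigmoid(H @ W.T + c)    # decode, with tied weights
        # Backpropagate the squared reconstruction error.
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (d_out @ W) * H * (1 - H)
        W -= lr * (X.T @ d_hid + d_out.T @ H) / len(X)
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train hidden layers one at a time: layer k is trained on the
    activations produced by the already-trained layer k-1."""
    stack, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        stack.append((W, b))
        H = sigmoid(H @ W + b)          # these features feed the next layer
    return stack

# Unlabeled data (random numbers stand in for a real dataset here).
X = rng.random((500, 64))
stack = greedy_pretrain(X, layer_sizes=[32, 16])
</pre>

Note how each call to train_autoencoder sees only the activations produced by
the layers trained before it, which is exactly the "one layer at a time"
schedule described above.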
+ | |||
+ | The success of greedy | ||
layer-wise training has been attributed to a number of factors: | layer-wise training has been attributed to a number of factors: | ||
===Availability of data===
By using unlabeled data to learn a good initial value for the weights in all
the layers (except for the final
classification layer that maps to the outputs/predictions), our algorithm is
able to learn and discover patterns from far more data than
purely supervised approaches. This often results in much better classifiers
being learned.
===Better local optima===
After having trained the network
on the unlabeled data, the weights are now starting at a better location in
parameter space than if they had been randomly initialized. We can then
further fine-tune the weights starting from this location. Empirically,
gradient descent from this location is much more likely to
lead to a good local minimum, because the unlabeled data has already provided
a significant amount of "prior" information about what patterns there
are in the input data.
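
As a rough illustration of "starting at a better location", the following
sketch (reusing the hypothetical stack produced by the pretraining example
above; the classification layer and its sizes are again our assumptions)
builds the full network's initial parameters from the pretrained layers,
leaving only the new top layer randomly initialized:

<pre>
import numpy as np

rng = np.random.default_rng(1)

def init_deep_net(pretrained, n_classes):
    """Initialize a deep network from greedily pretrained layers; only
    the new top classification layer starts from random weights."""
    params = [(W.copy(), b.copy()) for W, b in pretrained]
    n_last = pretrained[-1][0].shape[1]   # size of the top hidden layer
    W_out = rng.normal(scale=0.1, size=(n_last, n_classes))
    params.append((W_out, np.zeros(n_classes)))
    # Every parameter here (pretrained and new alike) is subsequently
    # fine-tuned jointly by gradient descent on the labeled training error.
    return params

# e.g. params = init_deep_net(stack, n_classes=10), with stack taken from
# the pretraining sketch above.
</pre>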

In the next section, we will describe the specific details of how to go about
implementing greedy layer-wise training.