Deep Networks: Overview

== Difficulty of training deep architectures ==

The main learning algorithm that researchers were using was to randomly initialize
the weights of a deep network, and then train it using a labeled
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math>
using a supervised learning objective, for example by applying gradient descent to try to
drive down the training error.  However, this usually did not work well.
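
To make this baseline concrete, here is a minimal NumPy sketch (not part of the
original tutorial) of the approach described above: all weights of a small deep
network are randomly initialized and then trained with plain gradient descent on
a labeled set. The layer sizes, learning rate, number of epochs, and synthetic
data are illustrative assumptions only.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic labeled training set {(x_l^(i), y^(i))}: 100 examples, 20 features.
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]

# Randomly initialize a deep network with two hidden layers.
sizes = [20, 16, 8, 1]
W = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

alpha = 0.5  # learning rate (an illustrative choice)
for epoch in range(200):
    # Forward pass through every layer.
    acts = [X]
    for Wl, bl in zip(W, b):
        acts.append(sigmoid(acts[-1] @ Wl + bl))
    # Squared-error objective against the labels, backpropagated layer by layer.
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for l in reversed(range(len(W))):
        gW = acts[l].T @ delta / len(X)
        gb = delta.mean(axis=0)
        if l > 0:
            delta = (delta @ W[l].T) * acts[l] * (1 - acts[l])
        W[l] -= alpha * gW
        b[l] -= alpha * gb

print("approximate training error:", float(np.mean((acts[-1] > 0.5) != y)))
</pre>

On this toy problem the recipe works, but as noted above, applying the same
recipe to genuinely deep networks usually did not work well.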
== Greedy layer-wise training ==

How can we train a deep network?  One method that has seen some
success is the '''greedy layer-wise training''' method.  We describe this
method in detail in later sections, but briefly, the main idea is to train the
layers of the network one at a time, so that we first train a network with 1
hidden layer, and only after that is done, train a network with 2 hidden layers,
and so on.  At each step, we take the old network with <math>k-1</math> hidden
layers, and add an additional <math>k</math>-th hidden layer (that takes as
input the previous hidden layer <math>k-1</math> that we had just
trained).  Training can be supervised (say, with classification error as the
objective function on each step), but more frequently it is unsupervised (as in
an autoencoder; details to be provided later).
The weights from training the layers individually are then used to initialize the weights
in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e.,
trained together to optimize the labeled training set error).
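
As an illustration of the procedure just described, here is a small NumPy sketch
(again, not from the tutorial itself; the autoencoder formulation, layer sizes,
learning rates, and synthetic data are assumptions for this example). Each hidden
layer is trained as an autoencoder on the output of the previously trained layer,
the learned weights are stacked to initialize the full network, and the whole
architecture is then fine-tuned with supervised gradient descent.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(200, 20))                      # inputs (used as "unlabeled" data)
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]  # labels, used only for fine-tuning

def train_autoencoder(data, n_hidden, alpha=0.1, epochs=200):
    """Train one hidden layer to reconstruct its input; return the encoder weights."""
    n_in = data.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        h = sigmoid(data @ W1 + b1)           # encode
        r = sigmoid(h @ W2 + b2)              # decode (reconstruct the input)
        d_out = (r - data) * r * (1 - r)      # reconstruction-error gradient
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= alpha * h.T @ d_out / len(data);   b2 -= alpha * d_out.mean(axis=0)
        W1 -= alpha * data.T @ d_hid / len(data); b1 -= alpha * d_hid.mean(axis=0)
    return W1, b1

# Greedy layer-wise pretraining: the k-th layer is trained on the output of layer k-1.
hidden_sizes = [16, 8]
pretrained, inputs = [], X
for n_hidden in hidden_sizes:
    W1, b1 = train_autoencoder(inputs, n_hidden)
    pretrained.append((W1, b1))
    inputs = sigmoid(inputs @ W1 + b1)   # feed this layer's output to the next one

# Stack the pretrained layers, add a randomly initialized output layer,
# and fine-tune the whole network on the labeled training set.
W = [w for w, _ in pretrained] + [rng.normal(scale=0.1, size=(hidden_sizes[-1], 1))]
b = [bl for _, bl in pretrained] + [np.zeros(1)]
alpha = 0.5
for _ in range(200):
    acts = [X]
    for Wl, bl in zip(W, b):
        acts.append(sigmoid(acts[-1] @ Wl + bl))
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for l in reversed(range(len(W))):
        gW = acts[l].T @ delta / len(X)
        gb = delta.mean(axis=0)
        if l > 0:
            delta = (delta @ W[l].T) * acts[l] * (1 - acts[l])
        W[l] -= alpha * gW
        b[l] -= alpha * gb

print("fine-tuned training error:", float(np.mean((acts[-1] > 0.5) != y)))
</pre>

Note that only the encoder half of each autoencoder is kept; the pretrained
weights simply serve as the starting point for the supervised fine-tuning stage.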
The success of greedy layer-wise training has been attributed to a number of
factors:
===Availability of data===
By using the unlabeled data to learn a good initial value for the weights in
all the layers of the network (and not just for the weights of the final
classification layer that maps to the outputs/predictions), our algorithm is
able to learn and discover patterns from vastly more data than
purely supervised approaches.  This often results in much better classifiers
being learned.
===Better local optima===
After having trained the network
on the unlabeled data, the weights are now starting at a better location in
parameter space than if they had been randomly initialized.  We can then
further fine-tune the weights starting from this location.  Empirically, it
turns out that gradient descent from this location is much more likely to
lead to a good local minimum, because the unlabeled data has already provided
a significant amount of "prior" information about what patterns there
are in the input data.

In the next section, we will describe the specific details of how to go about
implementing greedy layer-wise training.
