Deep Networks: Overview
== Greedy layer-wise training ==
How can we train a deep network? One method that has seen some
success is the '''greedy layer-wise training''' method. We describe this
method in detail in later sections, but briefly, the main idea is to train the
layers of the network one at a time: we first train a network with 1
hidden layer, and only after that is done, train a network with 2 hidden layers,
and so on. At each step, we take the old network with <math>k-1</math> hidden
layers, and add an additional <math>k</math>-th hidden layer (that takes as
input the hidden layer <math>k-1</math> that we had just
trained). Training can either be supervised (say, with classification error as
the objective function on each step), or, more frequently, unsupervised (as in
an autoencoder; details to be provided later). The weights from training the
layers individually are then used to initialize the weights
in the final/overall deep network, and only then is the entire architecture
"fine-tuned" (i.e., trained together to optimize the labeled training set error).
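
To make this concrete, here is a minimal sketch of greedy layer-wise
pretraining using one-hidden-layer autoencoders, written in Python with NumPy.
The layer sizes, learning rate, tied weights, and squared-error reconstruction
objective are illustrative assumptions of this sketch, not details fixed by
this overview:

<pre>
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=100):
    """Train one autoencoder layer to reconstruct its input X."""
    n_visible = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)              # encoder bias
    c = np.zeros(n_visible)             # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)          # encode
        X_hat = sigmoid(H @ W.T + c)    # decode, with tied weights
        # Backpropagate the squared reconstruction error.
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (d_out @ W) * H * (1 - H)
        W -= lr * (X.T @ d_hid + d_out.T @ H) / len(X)
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Train hidden layers one at a time: layer k is trained on the
    activations produced by the already-trained layer k-1."""
    stack, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        stack.append((W, b))
        H = sigmoid(H @ W + b)          # these features feed the next layer
    return stack

# Unlabeled data (random numbers stand in for a real dataset here).
X = rng.random((500, 64))
stack = greedy_pretrain(X, layer_sizes=[32, 16])
</pre>

Note how each call to train_autoencoder sees only the activations produced by
the layers trained before it, which is exactly the "one layer at a time"
schedule described above.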
+ | |||
+ | The success of greedy | ||
layer-wise training has been attributed to a number of factors: | layer-wise training has been attributed to a number of factors: | ||
===Availability of data===
By using unlabeled data to learn a good initial value for the weights in all
the layers (except for the final
classification layer that maps to the outputs/predictions), our algorithm is
able to learn and discover patterns from far more data than
purely supervised approaches. This often results in much better classifiers
being learned.
===Better local optima===
After having trained the network
on the unlabeled data, the weights are now starting at a better location in
parameter space than if they had been randomly initialized. We can then
further fine-tune the weights starting from this location. Empirically,
gradient descent from this location is much more likely to
lead to a good local minimum, because the unlabeled data has already provided
a significant amount of "prior" information about what patterns there
are in the input data.
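
As a rough illustration of "starting at a better location", the following
sketch (reusing the hypothetical stack produced by the pretraining example
above; the classification layer and its sizes are again our assumptions)
builds the full network's initial parameters from the pretrained layers,
leaving only the new top layer randomly initialized:

<pre>
import numpy as np

rng = np.random.default_rng(1)

def init_deep_net(pretrained, n_classes):
    """Initialize a deep network from greedily pretrained layers; only
    the new top classification layer starts from random weights."""
    params = [(W.copy(), b.copy()) for W, b in pretrained]
    n_last = pretrained[-1][0].shape[1]   # size of the top hidden layer
    W_out = rng.normal(scale=0.1, size=(n_last, n_classes))
    params.append((W_out, np.zeros(n_classes)))
    # Every parameter here (pretrained and new alike) is subsequently
    # fine-tuned jointly by gradient descent on the labeled training error.
    return params

# e.g. params = init_deep_net(stack, n_classes=10), with stack taken from
# the pretraining sketch above.
</pre>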

In the next section, we will describe the specific details of how to go about
implementing greedy layer-wise training.