Deep Networks: Overview

To take a simple example, consider building a boolean circuit/network to compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in the network can compute either the logical OR of its inputs (or the OR of the negation of the inputs), or compute the logical AND. If we have a network with only one input, one hidden, and one output layer, the parity function would require a number of nodes that is exponential in the input size <math>n</math>. If however we are allowed a deeper network, then the network/circuit size can be only polynomial in <math>n</math>.
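
To make the size comparison concrete, here is a small illustrative Python sketch (not part of the tutorial itself): it assembles parity from a chain of two-input XOR gadgets, each built from the node types allowed above, so the deep circuit needs only about <math>3(n-1)</math> nodes, at the cost of a depth that grows with <math>n</math>.

<pre>
# Illustrative sketch: a deep, polynomial-size parity circuit built only from
# the node types described above (AND, OR, OR-of-negated-inputs).

def gate_or(a, b):
    return a | b

def gate_and(a, b):
    return a & b

def gate_or_of_negations(a, b):
    # OR of the negated inputs; by De Morgan's law this equals NOT(a AND b).
    return (1 - a) | (1 - b)

def xor_gadget(a, b):
    # XOR(a, b) = (a OR b) AND (NOT a OR NOT b): 3 nodes, depth 2.
    return gate_and(gate_or(a, b), gate_or_of_negations(a, b))

def deep_parity(bits):
    # Chain n-1 XOR gadgets: roughly 3*(n-1) nodes total, depth about 2*(n-1).
    # A circuit with a single hidden layer of such gates would instead need a
    # number of nodes exponential in n.
    result = bits[0]
    for b in bits[1:]:
        result = xor_gadget(result, b)
    return result

assert deep_parity([1, 0, 1]) == 0      # two 1s: even parity
assert deep_parity([1, 1, 1, 0]) == 1   # three 1s: odd parity
</pre>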

By using a deep network, in the case of images, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features.

The brain itself also appears to carry out its computations in multiple stages of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.

In the past, researchers had little success training deep architectures. The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it using a labeled training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math> using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus for many problems it is difficult to get enough examples to fit the parameters of a complex model. For example, given the high degree of expressive power of deep networks, training on insufficient data would also result in overfitting.

===Local optima===
Training a shallow network (with 1 hidden layer) using supervised learning usually resulted in the parameters converging to reasonable values; but when we are training a deep network, this works much less well. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.
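
As a concrete illustration (a sketch, not the tutorial's code), the following Python/numpy snippet randomly initializes a small network with several hidden layers and runs plain gradient descent on exactly this objective, <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math>. It uses numerical gradients to keep the code short; whether it ends up in a good minimum depends on the random starting point.

<pre>
# Sketch: gradient descent on the non-convex training error of a small,
# randomly initialized deep sigmoid network.
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 5, 5, 5, 1]                       # input, three hidden layers, output
W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((m, 1)) for m in sizes[1:]]
params = W + b

def h(X):                                     # forward pass with sigmoid units
    A = X
    for Wi, bi in zip(W, b):
        A = 1.0 / (1.0 + np.exp(-(Wi @ A + bi)))
    return A

# a tiny labeled training set (XOR, itself not linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
Y = np.array([[0, 1, 1, 0]], dtype=float)

def J():                                      # sum_i ||h_W(x_i) - y_i||^2
    return float(np.sum((h(X) - Y) ** 2))

alpha, eps = 0.5, 1e-5
for step in range(1000):                      # plain (batch) gradient descent
    grads = []
    for M in params:                          # central-difference numerical gradient
        G = np.zeros_like(M)
        for idx in np.ndindex(M.shape):
            old = M[idx]
            M[idx] = old + eps; Jp = J()
            M[idx] = old - eps; Jm = J()
            M[idx] = old
            G[idx] = (Jp - Jm) / (2 * eps)
        grads.append(G)
    for M, G in zip(params, grads):
        M -= alpha * G

print("training error after descent:", J())   # may be stuck in a poor local optimum
</pre>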

===Diffusion of gradients===
There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) do not work well on deep networks with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small.

A closely related problem is that training the whole network at once, with all the layers randomly initialized, ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).
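
This effect is easy to observe numerically. The short Python/numpy sketch below (an illustration, not part of the tutorial) backpropagates the squared-error gradient through a randomly initialized deep sigmoid network and prints the gradient magnitude at each layer; the norms typically shrink by orders of magnitude from the output layer back toward the first layer.

<pre>
# Sketch: diffusion of gradients in a randomly initialized deep sigmoid network.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 8, 20
Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]

x = rng.normal(size=(width, 1))               # a single random input
y = rng.normal(size=(width, 1))               # a random target

# forward pass, storing every activation
acts = [x]
for W in Ws:
    acts.append(1.0 / (1.0 + np.exp(-W @ acts[-1])))

# backpropagation of the squared error ||a_L - y||^2
delta = 2.0 * (acts[-1] - y) * acts[-1] * (1 - acts[-1])
for layer in range(depth, 0, -1):
    grad_W = delta @ acts[layer - 1].T        # gradient w.r.t. this layer's weights
    print(f"layer {layer}: ||dJ/dW|| = {np.linalg.norm(grad_W):.3e}")
    if layer > 1:                             # propagate the error one layer back
        delta = (Ws[layer - 1].T @ delta) * acts[layer - 1] * (1 - acts[layer - 1])
</pre>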

== Greedy layer-wise training ==
How can we train a deep network? One method that has seen some success is the '''greedy layer-wise training''' method. We describe this method in detail in later sections, but briefly, the main idea is to train the layers of the network one at a time, so that we first train a network with 1 hidden layer, and only after that is done, train a network with 2 hidden layers, and so on. At each step, we take the old network with <math>k-1</math> hidden layers, and add an additional <math>k</math>-th hidden layer (that takes as input the previous hidden layer <math>k-1</math> that we had just trained). Training can be supervised (say, with classification error as the objective function on each step), but more frequently it is unsupervised (as in an autoencoder; details to be provided later). The weights from training the layers individually are then used to initialize the weights in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error).
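
To make the procedure concrete, here is a schematic Python/numpy sketch under simplified assumptions (tied weights, no bias terms, plain gradient descent); it is an illustration rather than the tutorial's actual implementation. Each layer is pretrained as an autoencoder on the previous layer's output, and the resulting weights would then initialize the deep network before fine-tuning on labeled data.

<pre>
# Schematic sketch of greedy layer-wise pretraining with tied-weight
# sigmoid autoencoders (biases omitted for brevity).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=500, alpha=0.5):
    """Train one hidden layer to reconstruct its input X (features x examples)."""
    n_in, m = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(steps):
        H = sigmoid(W @ X)                    # encode
        R = sigmoid(W.T @ H)                  # decode with the tied (transposed) weights
        dR = 2.0 * (R - X) * R * (1 - R)      # backprop of the reconstruction error
        dH = (W @ dR) * H * (1 - H)
        grad = (dH @ X.T + H @ dR.T) / m      # tied weights: sum of both contributions
        W -= alpha * grad
    return W

# random toy data stands in for the large unlabeled training set
X_unlabeled = rng.random((30, 500))

# layer 1 is trained first; only then is layer 2 trained on layer 1's features
W1 = train_autoencoder(X_unlabeled, 20)
H1 = sigmoid(W1 @ X_unlabeled)
W2 = train_autoencoder(H1, 10)

# W1 and W2 would now initialize the hidden layers of the deep network; an output
# layer is added on top and the whole stack is fine-tuned on the labeled data.
print(W1.shape, W2.shape)
</pre>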

The success of greedy layer-wise training has been attributed to a number of factors:

Because the bulk of the learning is done on unlabeled data (with the labeled data needed mainly to fine-tune the network and the final classification layer that maps to the outputs/predictions), our algorithm is able to learn and discover patterns from vastly more data than purely supervised approaches. This often results in much better classifiers being learned.

===Better local optima===

After having trained the network on the unlabeled data, the weights are now starting at a better location in parameter space than if they had been randomly initialized. We can then further fine-tune the weights starting from this location. Empirically, it turns out that gradient descent from this location is much more likely to lead to a good local minimum, because the unlabeled data has already provided a significant amount of "prior" information about what patterns there are in the input data.

In the next section, we will describe the specific details of how to go about implementing greedy layer-wise training.

{{CNN}}

{{Languages|深度网络概览|中文}}