Deep Networks: Overview
==Difficulty of training deep architectures==
The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it using a labeled training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math> using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.
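To make this baseline concrete, here is a minimal NumPy sketch of the approach: randomly initialize the weights of a multi-layer network, then run gradient descent on the squared training error. It is not part of the original tutorial; the layer sizes, learning rate, and toy data are all illustrative assumptions.

<pre>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes for a small deep network (illustrative only).
sizes = [10, 8, 8, 8, 1]

# Randomly initialize the weights -- the baseline approach described above.
W = [rng.normal(0, 1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((m, 1)) for m in sizes[1:]]

# Toy labeled training set (random, purely for illustration).
m_l = 50
X = rng.normal(size=(10, m_l))
Y = rng.uniform(size=(1, m_l))

alpha = 0.1  # learning rate
for step in range(100):
    for i in range(m_l):
        # Forward pass: record every layer's activations.
        acts = [X[:, [i]]]
        for Wl, bl in zip(W, b):
            acts.append(sigmoid(Wl @ acts[-1] + bl))
        # Backward pass for the squared error ||h_W(x) - y||^2.
        delta = 2 * (acts[-1] - Y[:, [i]]) * acts[-1] * (1 - acts[-1])
        for l in reversed(range(len(W))):
            gW, gb = delta @ acts[l].T, delta
            if l > 0:  # propagate before updating this layer's weights
                delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])
            W[l] -= alpha * gW
            b[l] -= alpha * gb
</pre>

The sections below describe why this straightforward recipe tends to fail on deep networks.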
===Availability of data===

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus for many problems it is difficult to get enough examples to fit the parameters of a complex model. Moreover, given the high degree of expressive power of deep networks, training on insufficient data would also result in overfitting.
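A quick back-of-the-envelope calculation makes the mismatch vivid. The sketch below (with made-up layer sizes) counts the free parameters of a modest deep network; with only a few hundred labeled examples there would be many more parameters than data points to fit them.

<pre>
# Rough parameter count for a deep network (layer sizes are made up).
sizes = [100, 50, 50, 50, 10]
n_params = sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))
print(n_params)  # 10660 weights and biases, far more than a small labeled set
</pre>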
===Local optima===

Training a shallow network (with 1 hidden layer) using supervised learning usually resulted in the parameters converging to reasonable values; but when we are training a deep network, this works much less well. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.
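One simple way to see the non-convexity is to evaluate the training error along a straight line between two parameter settings: for a convex function, the error along the line can never exceed its value at the worse endpoint, but for a neural network it typically does. The following sketch (a tiny 2-layer network on random data, with every detail assumed for illustration) does exactly that.

<pre>
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def h(W1, W2, X):
    """Tiny 2-layer network h_W(x) with sigmoid hidden units."""
    return W2 @ sigmoid(W1 @ X)

def J(W1, W2, X, Y):
    """Training error sum_i ||h_W(x^(i)) - y^(i)||^2."""
    return np.sum((h(W1, W2, X) - Y) ** 2)

X = rng.normal(size=(5, 20))
Y = rng.normal(size=(1, 20))
A = [rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]  # one random W
B = [rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]  # another random W

# Evaluate J along the straight line from A to B in parameter space.
# A convex J could never rise above the larger endpoint value along this
# line; with a neural network it often does, exposing non-convexity.
for t in np.linspace(0, 1, 11):
    W1 = (1 - t) * A[0] + t * B[0]
    W2 = (1 - t) * A[1] + t * B[1]
    print(f"t={t:.1f}  J={J(W1, W2, X, Y):.3f}")
</pre>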
===Diffusion of gradients===

There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) do not work well on a deep network with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very
small; thus, when using gradient descent, the weights of the earlier layers change slowly, and these layers fail to learn much. This problem is often called the "diffusion of gradients."

A closely related problem is that if the last few layers in a neural network have a large enough number of neurons, they may be able to model the labeled data on their own, without help from the earlier layers. Hence, training the entire network with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).
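The diffusion effect is easy to observe numerically. The sketch below (an assumed 8-layer sigmoid network with random weights; none of the specifics come from the original text) backpropagates a gradient from the output and prints the gradient magnitude at each layer's weights, which typically shrinks by orders of magnitude from the output layer back toward the input.

<pre>
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A deep network of 8 sigmoid layers, 20 units each, randomly initialized.
L, n = 8, 20
W = [rng.normal(0, 1.0 / np.sqrt(n), (n, n)) for _ in range(L)]

# Forward pass on a random input, recording every layer's activations.
acts = [rng.normal(size=(n, 1))]
for Wl in W:
    acts.append(sigmoid(Wl @ acts[-1]))

# Backpropagate a unit loss gradient from the output and record the size
# of the gradient reaching each layer's weights.
delta = np.ones((n, 1)) * acts[-1] * (1 - acts[-1])
for l in reversed(range(L)):
    grad_W = delta @ acts[l].T
    print(f"layer {l}: ||dJ/dW|| = {np.linalg.norm(grad_W):.2e}")
    if l > 0:
        delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])
</pre>

Each backward step multiplies the gradient by a weight matrix and by the sigmoid derivative (at most 0.25), which is what drives the shrinkage as the signal travels toward the earlier layers.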