Deep Networks: Overview
== Overview ==

In the previous sections, you constructed a 3-layer neural network comprising an input, hidden and output layer. While fairly effective for MNIST, this 3-layer model is a '''shallow''' network; by this, we mean that the features (hidden layer activations <math>a^{(2)}</math>) are computed using only "one layer" of computation (the hidden layer).

In this section, we begin to discuss '''deep''' neural networks, meaning ones in which we have multiple hidden layers; this will allow us to compute much more complex features of the input. Because each hidden layer computes a non-linear transformation of the previous layer, a deep network can have significantly greater representational power (i.e., can learn significantly more complex functions) than a shallow one.

Note that when training a deep network, it is important to use a ''non-linear'' activation function <math>f(\cdot)</math> in each hidden layer. This is because multiple layers of linear functions would together compute only a linear function of the input (i.e., composing multiple linear functions results in just another linear function), and would thus be no more expressive than a single layer of hidden units.

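To make this concrete, here is a small numerical sketch (in Python with NumPy; the code and variable names are illustrative, not part of this tutorial) showing that stacking purely linear layers collapses into a single linear map:

<pre>
import numpy as np

# Two linear layers W2 (W1 x) are exactly the single linear layer (W2 W1) x,
# so without a non-linearity, extra depth adds no expressive power.
np.random.seed(0)
x = np.random.randn(4)        # an arbitrary input vector
W1 = np.random.randn(5, 4)    # first-layer weights
W2 = np.random.randn(3, 5)    # second-layer weights

deep_linear = W2 @ (W1 @ x)   # two linear layers, no activation function
collapsed = (W2 @ W1) @ x     # one equivalent linear layer

print(np.allclose(deep_linear, collapsed))   # True
</pre>

With a non-linearity such as the sigmoid between the layers, no single weight matrix reproduces <math>W^{(2)} f(W^{(1)} x)</math> for all inputs, which is what gives depth its extra representational power.
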
== Advantages of deep networks ==

Why do we want to use a deep network? The primary advantage is that it can compactly represent a significantly larger set of functions than a shallow network. Formally, one can show that there are functions which a <math>k</math>-layer network can represent compactly (with a number of hidden units that is ''polynomial'' in the number of inputs), but which a <math>(k-1)</math>-layer network cannot represent unless it has an exponentially large number of hidden units.

To take a simple example, consider building a boolean circuit/network to compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in the network can compute either the logical OR of its inputs (or the OR of the negation of the inputs), or the logical AND. If we have a network with only one input, one hidden, and one output layer, the parity function would require a number of nodes that is exponential in the input size <math>n</math>. If, however, we are allowed a deeper network, then the network/circuit size can be only polynomial in <math>n</math>.

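The following sketch (illustrative Python, not from the original text) shows the deep construction: pairing up values with 2-input XOR gates computes the parity of <math>n</math> bits with only <math>n-1</math> gates arranged in about <math>\log_2 n</math> layers. (Each 2-input XOR can itself be built from a constant number of the AND/OR-style nodes described above, so the total circuit size stays linear in <math>n</math>.)

<pre>
def deep_parity(bits):
    """Compute the parity of a list of bits with a log-depth tree of
    2-input XOR gates; each pass over `layer` is one layer of gates."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [a ^ b for a, b in zip(layer[0::2], layer[1::2])]
        if len(layer) % 2:          # an odd element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1, 0, 1]
print(deep_parity(bits))            # 1
print(sum(bits) % 2)                # 1, the same answer
</pre>
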
In the case of images, a deep network also lets one start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps simple "parts of objects." An even deeper layer might then group together these contours, or detect even more complex features.

Finally, cortical computations (in the brain) also have multiple layers of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.

<!--
Informally, one way a deep network helps in representing functions compactly is through ''factorization''. Factorization, as the name suggests, occurs when the network represents at lower layers functions of the input that are then reused multiple times at higher layers. To gain some intuition for this, consider an arithmetic network for computing the values of polynomials, in which alternate layers implement addition and multiplication. In this network, an intermediate layer could compute the values of terms which are then used repeatedly in the next higher layer, the results of which are used repeatedly in the layer above that, and so on.
-->

== Difficulty of training deep architectures ==

While the theoretical benefits of deep networks in terms of their compactness and expressive power have been appreciated for many decades, until recently researchers had little success training deep architectures.

The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it on a labeled training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math> using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.

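For concreteness, here is a runnable miniature of that baseline recipe (an illustrative NumPy sketch with made-up sizes, not this tutorial's code): randomly initialize every layer, then run gradient descent to reduce the squared training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math>.

<pre>
import numpy as np

np.random.seed(0)
sizes = [8, 16, 16, 16, 1]            # a small "deep" sigmoid architecture
W = [0.1 * np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
X = np.random.randn(50, 8)            # toy labeled training set
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    acts = [X]                        # forward pass, keeping activations
    for Wl in W:
        acts.append(sigmoid(acts[-1] @ Wl.T))
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for l in reversed(range(len(W))): # backward pass and gradient step
        grad = (delta.T @ acts[l]) / len(X)
        delta = (delta @ W[l]) * acts[l] * (1 - acts[l])
        W[l] -= 0.5 * grad

print("training error:", float(((acts[-1] - y) ** 2).sum()))
</pre>

As the rest of this section explains, with more layers this straightforward recipe tends to stall.
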
===Availability of data===

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus for many problems it is difficult to get enough examples to fit the parameters of a complex model. Moreover, given the high degree of expressive power of deep networks, training on insufficient data would also result in overfitting.

===Local optima===

Training a shallow network (with 1 hidden layer) using supervised learning usually resulted in the parameters converging to reasonable values; but when we are training a deep network, this works much less well. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.

===Diffusion of gradients===

There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) do not work well on deep networks with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small. Thus, when using gradient descent, the weights of the earlier layers change slowly, and the earlier layers fail to learn much. This problem is often called the "diffusion of gradients."

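The effect is easy to see numerically. The sketch below (illustrative Python/NumPy, not from the original text) runs one forward and one backward pass through a deep sigmoid network with small random weights and prints the average gradient magnitude at each layer; the magnitudes shrink rapidly as we move from the output back toward the input.

<pre>
import numpy as np

np.random.seed(0)
n, depth = 30, 8                      # 30 units per layer, 8 layers deep
W = [0.1 * np.random.randn(n, n) for _ in range(depth)]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(n)
acts = [x]
for Wl in W:                          # forward pass
    acts.append(sigmoid(Wl @ acts[-1]))

delta = np.ones(n) * acts[-1] * (1 - acts[-1])   # unit error signal at output
for l in reversed(range(depth)):      # backward pass, output to input
    grad = np.outer(delta, acts[l])   # derivative of cost w.r.t. W[l]
    print(f"layer {l + 1}: mean |gradient| = {np.abs(grad).mean():.2e}")
    delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])
</pre>

Each backward step multiplies the error signal both by the (small) weights and by the sigmoid derivative <math>f'(z) \le 1/4</math>, so the signal, and hence the gradient, decays roughly geometrically with depth.
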
+ | |||
+ | A closely related problem to the diffusion of gradients is that if the last few | ||
+ | layers in a neural network have a large enough number of neurons, it may be | ||
+ | possible for them to model the labeled data alone without the help of the | ||
+ | earlier layers. Hence, training the entire network at once with all the layers | ||
+ | randomly initialized ends up giving similar performance to training a | ||
+ | shallow network (the last few layers) on corrupted input (the result of | ||
+ | the processing done by the earlier layers). | ||
+ | |||
+ | <!-- | ||
+ | When the last layer is used | ||
+ | for classification, often training a network like this results in low | ||
+ | training error, but high error is low, but training error is high, suggesting that | ||
+ | the last few layers are over-fitting the training data. | ||
+ | !--> | ||
+ | |||
+ | == Greedy layer-wise training == | ||
+ | |||
+ | How can we train a deep network? One method that has seen some | ||
+ | success is the '''greedy layer-wise training''' method. We describe this | ||
+ | method in detail in later sections, but briefly, the main idea is to train the | ||
+ | layers of the network one at a time, so that we first train a network with 1 | ||
+ | hidden layer, and only after that is done, train a network with 2 hidden layers, | ||
+ | and so on. At each step, we take the old network with <math>k-1</math> hidden | ||
+ | layers, and add an additional <math>k</math>-th hidden layer (that takes as | ||
+ | input the previous hidden layer <math>k-1</math> that we had just | ||
+ | trained). Training can either be | ||
+ | supervised (say, with classification error as the objective function on each | ||
+ | step), but more frequently it is | ||
+ | unsupervised (as in an autoencoder; details to provided later). | ||
+ | The weights from training the layers individually are then used to initialize the weights | ||
+ | in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., | ||
+ | trained together to optimize the labeled training set error). | ||
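To give a flavor of the greedy stage, here is a schematic, runnable sketch (illustrative Python/NumPy with made-up sizes; the actual implementation, including biases and sparsity, comes in later sections). Each new hidden layer is pre-trained as a simple autoencoder on the activations produced by the layers trained so far:

<pre>
import numpy as np

np.random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, steps=200, lr=0.5):
    """Train a 1-hidden-layer autoencoder to reconstruct `data`;
    return the learned encoder weights."""
    n_in = data.shape[1]
    W1 = 0.1 * np.random.randn(n_hidden, n_in)  # encoder weights
    W2 = 0.1 * np.random.randn(n_in, n_hidden)  # decoder weights
    for _ in range(steps):
        h = sigmoid(data @ W1.T)                # encode
        r = sigmoid(h @ W2.T)                   # decode / reconstruct
        d_r = (r - data) * r * (1 - r)          # output error signal
        d_h = (d_r @ W2) * h * (1 - h)          # backpropagated signal
        W2 -= lr * (d_r.T @ h) / len(data)
        W1 -= lr * (d_h.T @ data) / len(data)
    return W1

X = np.random.rand(100, 20)          # stand-in for (unlabeled) training data
layer_sizes = [20, 15, 10, 5]        # input size plus three hidden layers

# Greedy stage: train hidden layer k on the activations of layers 1..k-1.
weights, inputs = [], X
for n_hidden in layer_sizes[1:]:
    Wk = train_autoencoder(inputs, n_hidden)
    weights.append(Wk)
    inputs = sigmoid(inputs @ Wk.T)  # feed activations to the next stage

# `weights` would now initialize the deep network, which is then
# fine-tuned as a whole on the labeled training data.
print([Wk.shape for Wk in weights])  # [(15, 20), (10, 15), (5, 10)]
</pre>

Note that in this sketch only the encoder weights are kept; each autoencoder's decoder is discarded once its layer has been pre-trained.
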
+ | |||
+ | The success of greedy | ||
+ | layer-wise training has been attributed to a number of factors: | ||
+ | |||
+ | ===Availability of data=== | ||
+ | |||
+ | While labeled data can be expensive to obtain, | ||
+ | unlabeled data is cheap and plentiful. The promise of self-taught learning is | ||
+ | that by exploiting the massive amount of unlabeled data, we can learn much | ||
+ | better models. By using unlabeled data to learn a good initial value for the | ||
+ | weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final | ||
+ | classification layer that maps to the outputs/predictions), our algorithm is | ||
+ | able to learn and discover patterns from massively more amounts of data than | ||
+ | purely supervised approaches. This often results in much better classifiers | ||
+ | being learned. | ||
+ | |||
+ | ===Better local optima=== | ||
+ | |||
+ | After having trained the network | ||
+ | on the unlabeled data, the weights are now starting at a better location in | ||
+ | parameter space than if they had been randomly initialized. We can then | ||
+ | further fine-tune the weights starting from this location. Empirically, it | ||
+ | turns out that gradient descent from this location is much more likely to | ||
+ | lead to a good local minimum, because the unlabeled data has already provided | ||
+ | a significant amount of "prior" information about what patterns there | ||
+ | are in the input data. | ||
+ | |||
+ | |||
+ | In the next section, we will describe the specific details of how to go about | ||
+ | implementing greedy layer-wise training. | ||
+ | |||
+ | |||
+ | {{CNN}} | ||
+ | |||
+ | <!-- | ||
+ | Specifically, | ||
+ | since the weights of the layers have already been initialized to reasonable | ||
+ | values, the final solution tends to be near the good initial solution, forming | ||
+ | a useful "regularization" effect. (more details in Erhan et al., 2010). | ||
+ | !--> | ||
+ | |||
+ | <!-- | ||
== References == | == References == | ||
- | + | Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?. | |
- | + | AISTATS 2010. | |
- | http:// | + | [http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf] |
+ | !--> | ||
+ | |||
+ | |||
+ | {{Languages|深度网络概览|中文}} |