Deep Networks: Overview

From Ufldl

== What is a deep architecture? ==

In brief, a deep architecture is a multi-layer architecture comprising non-linear functions at each level.
== Why use deep architectures? ==
===Expressiveness and compactness===

Deep architectures can model a greater range of functions than shallow architectures. Further, with deep architectures, these functions can be modelled with fewer components (neurons, in the case of neural networks) than in the equivalent shallow architectures. In fact, there are functions that a k-layer architecture can represent compactly (with the number of components ''polynomial'' in the number of inputs), but that a (k-1)-layer architecture cannot (with the number of components ''exponential'' in the number of inputs).
For example, in a boolean network in which alternate layers implement the logical OR and logical AND of the preceding layers, the parity function requires an exponential number of components to be represented in a 2-layer network, but only a polynomial number of components in a network of sufficient depth.
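The gap can be made concrete with a small counting sketch (the helper names here are ours, and we assume the standard 2-layer OR-of-ANDs construction, with each two-input XOR in the deep chain counted as one small constant-size AND/OR subnetwork):

```python
from itertools import product

def parity_dnf_terms(n):
    """AND terms needed by a 2-layer OR-of-ANDs network for n-bit parity:
    one term per odd-parity input assignment, i.e. 2**(n-1) of them."""
    return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

def parity_chain_gates(n):
    """Two-input XOR gates in a deep chain computing n-bit parity: n - 1,
    each XOR itself being a small constant-size AND/OR subnetwork."""
    return n - 1

for n in [4, 8, 16]:
    print(n, parity_dnf_terms(n), parity_chain_gates(n))
# the 2-layer count doubles with every extra input; the deep count grows by 1
```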
Informally, one way a deep architecture helps in representing functions compactly is through ''factorisation''. Factorisation, as the name suggests, occurs when the network represents at lower layers functions of the input that are then reused multiple times at higher layers. To gain some intuition for this, consider an arithmetic network for computing the values of polynomials, in which alternate layers implement addition and multiplication. In this network, an intermediate layer could compute the values of terms that are then used repeatedly in the next higher layer, whose results are in turn used repeatedly in the layer above, and so on.
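A toy illustration of factorisation (the function names and the particular polynomial are our own choice, not from the text): the deep evaluation computes each shared term once, while the flat expansion repeats it.

```python
def deep_eval(x, y):
    """Evaluate (x + y)**4 the way a deep arithmetic network would:
    each layer computes one value that the next layer reuses."""
    t = x + y        # lower layer: the shared term, computed once
    u = t * t        # middle layer reuses t twice: (x + y)**2
    return u * u     # top layer reuses u twice: (x + y)**4

def flat_eval(x, y):
    """The shallow, fully expanded form repeats the factor four times."""
    return (x + y) * (x + y) * (x + y) * (x + y)

print(deep_eval(2, 3), flat_eval(2, 3))  # same value, but deep_eval
                                         # performs fewer multiplications
```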
===Statistical efficiency===

Another upshot of the compact representation that deep architectures afford is statistical efficiency: less training data is needed to tune the comparatively smaller number of parameters of a compact representation.
== How should we train deep architectures? ==
While the benefits of deep architectures in terms of their compactness and expressive power have been appreciated for many decades, before 2006, researchers had little success in training deep architectures. Training a randomly initialised deep architecture often led to poor results. But why should this be the case?
===Why random initialisation fails===

====Local optima====

The optimisation landscape for the objective function used in deep architectures is likely to be non-convex and filled with local optima. A gradient-based training method starting from a random location is hence likely to end up at an undesirable local optimum in the neighbourhood of the starting location.
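This dependence on the starting point is easy to see even in one dimension. In the sketch below (the function f(x) = x⁴ − 3x² + x is our own choice for illustration), plain gradient descent lands in a different optimum depending on where it begins:

```python
def grad_descent(x, lr=0.01, steps=2000):
    """Gradient descent on f(x) = x**4 - 3*x**2 + x, a non-convex
    function with a poor local minimum at x = 1 and a better one
    near x = -1.366."""
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 2)   # f'(x)
    return x

print(grad_descent(2.0))    # ends near  1.0   (the worse local minimum)
print(grad_descent(-2.0))   # ends near -1.366 (the better minimum)
```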
====Diffusion of gradients====
In the case of deep neural networks using backpropagation for training, the gradient propagated backwards to earlier layers rapidly diminishes as the depth of the network increases. As a result, the weights of the earlier layers change slowly, and the earlier layers fail to learn much.
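One way to see why: with sigmoid units, each layer scales the backpropagated signal by the local derivative σ'(z) = σ(z)(1 − σ(z)) ≤ 0.25, so the gradient reaching the first layer shrinks geometrically with depth. A minimal sketch (assuming, for simplicity, a chain of single sigmoid units with unit weights):

```python
import math

def gradient_scale(depth):
    """Factor by which the backpropagated gradient has shrunk by the time
    it reaches the first layer of a sigmoid chain of the given depth with
    unit weights: each layer multiplies it by sigma'(z) <= 0.25."""
    x, scale = 0.5, 1.0
    for _ in range(depth):
        s = 1.0 / (1.0 + math.exp(-x))   # forward pass: sigmoid activation
        scale *= s * (1.0 - s)           # backward pass: local derivative
        x = s
    return scale

for depth in [1, 5, 10, 20]:
    print(depth, gradient_scale(depth))  # shrinks geometrically with depth
```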
====No learning in earlier layers====
Related to the problem of diffusion of gradients, if the last few layers in a neural network have a large enough number of neurons, it may be possible for them to model the target function alone, without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialised would be similar to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers). When the last layer is used for classification, one then finds low training error but high test error, suggesting that the last few layers are overfitting the training data.
===Greedy layer-wise training===
How should deep architectures be trained then? One method that has seen some success is the '''greedy layer-wise training''' method. In this method, the layers of the architecture are trained one at a time, with the input being the output of the previous layer (which has been trained). Training can either be supervised (say, with classification error as the objective function), or unsupervised (say, with the error of the layer in reconstructing its input as the objective function, as in an autoencoder). The weights from training the layers individually are then used to initialise the weights in the deep architecture, and only then is the entire architecture '''fine-tuned''', that is, trained together. The success of greedy layer-wise training has been attributed to a number of factors:
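The unsupervised variant of this procedure can be sketched in a few lines of NumPy (a minimal sketch with hypothetical function names, tiny autoencoders, and batch gradient descent; it is not the full pretraining recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, hidden, lr=0.1, epochs=200):
    """Train one layer as an autoencoder: learn weights W so that the
    layer's output can reconstruct its own input. Returns W; the decoder
    weights V are discarded after training."""
    n = data.shape[1]
    W = rng.normal(0, 0.1, (n, hidden))    # encoder weights
    V = rng.normal(0, 0.1, (hidden, n))    # decoder weights
    for _ in range(epochs):
        h = sigmoid(data @ W)              # encode
        r = sigmoid(h @ V)                 # reconstruct the input
        err = r - data                     # reconstruction error
        d_out = err * r * (1 - r)          # backprop through decoder sigmoid
        dV = h.T @ d_out
        dW = data.T @ ((d_out @ V.T) * h * (1 - h))
        V -= lr * dV / len(data)
        W -= lr * dW / len(data)
    return W

def greedy_pretrain(data, layer_sizes):
    """Train layers one at a time; each layer's input is the output of the
    previously trained layer. The resulting weights would then initialise
    the deep network before fine-tuning."""
    weights, x = [], data
    for size in layer_sizes:
        W = train_autoencoder(x, size)
        weights.append(W)
        x = sigmoid(x @ W)                 # feed forward to the next layer
    return weights

X = rng.random((100, 8))
stack = greedy_pretrain(X, [6, 4])
print([W.shape for W in stack])            # [(8, 6), (6, 4)]
```

Fine-tuning would then treat `stack` as the initial weights of the full network and train all layers together on the supervised objective.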
====Regularisation and better optima====

Because the weights of the layers have already been initialised to reasonable values, the final solution is somewhat constrained to lie near the good initial solution (which may be seen as a prior on the parameters). Furthermore, training starts at a better location than with randomly initialised weights, greatly increasing the likelihood of reaching a better local optimum.
====Feature learning====

Training the earlier layers individually allows them to learn useful representations of the input, capturing regularities that help both the later layers and the overall task (typically classification).

Revision as of 02:53, 21 April 2011
