# Deep Networks: Overview

Revision as of 19:39, 13 May 2011, by Ang (previous revision: 00:21, 12 May 2011, by Jngiam).

## Overview

In the previous sections, you constructed a 3-layer neural network comprising an input, a hidden, and an output layer. While fairly effective for MNIST, this 3-layer network is shallow: its features (the hidden layer activations $a^{(2)}$) are computed using only "one layer" of computation (the hidden layer).

In this section, we begin to discuss deep neural networks, meaning ones with multiple hidden layers, so that multiple layers of computation are used to compute increasingly complex features from the input. Each hidden layer computes a non-linear transformation of the previous layer. By using more hidden layers, deep networks can have significantly greater expressive power (i.e., can learn significantly more complex functions) than shallow ones.

When training a deep network, it is important to use a non-linear activation function $f(\cdot)$ in each hidden layer. This is because a stack of linear layers computes only a linear function of the input (composing multiple linear functions results in just another linear function), and is thus no more expressive than a single layer of hidden units.
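The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch, assuming an identity activation $f(z) = z$, no bias terms, and arbitrary small layer sizes; the helper names `matvec` and `matmul` are illustrative and not part of the tutorial's code.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 4 inputs -> 3 hidden units
W2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 3 hidden units -> 2 outputs
x = [rng.uniform(-1, 1) for _ in range(4)]

# Two linear layers applied in sequence (identity activation).
two_layer = matvec(W2, matvec(W1, x))

# A single linear layer whose weight matrix is the product W2 * W1.
one_layer = matvec(matmul(W2, W1), x)

# The two outputs agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(two_layer, one_layer))
print(max_gap)
```

Adding bias terms would not change the conclusion: the composition would then be a single affine map rather than a single linear one.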

## Advantages of deep networks

Why do we want to use a deep network? The primary advantage is that it can compactly represent a significantly larger set of functions than shallow networks. Formally, one can show that there are functions which a k-layer network can represent compactly (with a number of hidden units that is polynomial in the number of inputs), but which a (k − 1)-layer network cannot represent unless it has an exponentially large number of hidden units.

To take a simple example, consider building a boolean network/circuit to compute the parity (or XOR) of n input bits. Suppose each node in the network can compute the logical OR, the logical AND, or the logical negation of its inputs. If we have a network with only 1 hidden layer, the parity function would require a number of nodes that is exponential in the input size n. If, however, we are allowed a deeper network, then the network/circuit size can be only polynomial in n.
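The gap in circuit size can be made concrete with a small sketch. It assumes one particular pair of constructions: the shallow circuit is the standard two-level OR-of-ANDs form with one AND term per odd-parity input pattern, and the deep circuit is a chain of 2-input XORs, each built from four AND/OR/NOT gates. The function names are hypothetical.

```python
from itertools import product

def parity_deep(bits):
    """Parity via a chain of 2-input XORs built from AND/OR/NOT gates.

    Returns (parity, number_of_gates_used).
    """
    gates = 0
    acc = bits[0]
    for b in bits[1:]:
        # XOR(a, b) = (a OR b) AND NOT (a AND b): one OR, two ANDs, one NOT.
        acc = (acc or b) and not (acc and b)
        gates += 4
    return bool(acc), gates

def parity_shallow_terms(n):
    """AND terms of the two-level circuit: one per input pattern with odd parity."""
    return [p for p in product([False, True], repeat=n) if sum(p) % 2 == 1]

def parity_shallow(bits, terms):
    """Evaluate the OR-of-ANDs circuit: true if some AND term matches the input exactly."""
    return any(all(b == t for b, t in zip(bits, term)) for term in terms)

n = 8
terms = parity_shallow_terms(n)
_, deep_gates = parity_deep([False] * n)
print(len(terms), "AND terms in the two-level circuit")
print(deep_gates, "gates in the deep circuit")
```

For n = 8 the two-level circuit needs 2^(n−1) = 128 AND terms, while the chained circuit uses 4(n − 1) = 28 gates. The chain has depth proportional to n; arranging the same XOR gates as a balanced tree would give the same gate count with depth proportional to log n.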

By using a deep network, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges. The second layer might then group together edges to detect longer contours, or perhaps simple "object parts." An even deeper layer might then group together these contours or detect even more complex features.

Finally, cortical computations (in the brain) also have multiple layers of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.

## Difficulty of training deep architectures

While the theoretical benefits of deep networks in terms of their compactness and expressive power have been appreciated for many decades, until recently researchers had little success training deep architectures.

The main method researchers used was to randomly initialize the weights of the deep network and then train it on a labeled training set $\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}$ with a supervised learning objective, using gradient descent to try to drive down the training error. However, this usually did not work well, for several reasons.

### Availability of data

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus it is easy to overfit the training data and obtain a model which does not generalize well.

### Local optima

Training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error $\textstyle \sum_i \|h_W(x^{(i)}) - y^{(i)}\|^2$ as a function of the network parameters $\textstyle W$). When the network is deep, this optimization problem is rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) does not work well.

### Diffusion of gradients

There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) does not work well on a deep network with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small. Thus, when using gradient descent, the weights of the earlier layers change slowly, and the earlier layers fail to learn much. This problem is often called the "diffusion of gradients."
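The shrinking of backpropagated gradients can be observed in a small experiment. The sketch below assumes sigmoid activations, a squared-error cost on a single example, no bias terms, and weights drawn uniformly from [-0.5, 0.5]; the width and depth are arbitrary choices for illustration. It records the norm of the error term $\delta$ (the derivative of the cost with respect to a layer's pre-activations) at every layer.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def norm(v):
    return math.sqrt(sum(t * t for t in v))

rng = random.Random(0)
width, depth = 10, 8  # hypothetical sizes: 8 weight layers with 10 units each

# Random initialization: W[l][j][i] connects unit i of layer l to unit j of layer l + 1.
W = [[[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(width)]
     for _ in range(depth)]
x = [rng.random() for _ in range(width)]  # one made-up input
y = [rng.random() for _ in range(width)]  # one made-up target

# Forward pass, storing the activations of every layer.
activations = [x]
for Wl in W:
    a_prev = activations[-1]
    activations.append([sigmoid(sum(w * a for w, a in zip(row, a_prev))) for row in Wl])

# Backward pass for the cost 0.5 * ||a_out - y||^2.
# For a sigmoid unit with activation a, the derivative f'(z) equals a * (1 - a).
out = activations[-1]
delta = [(a - t) * a * (1 - a) for a, t in zip(out, y)]
delta_norms = [norm(delta)]
for l in range(depth - 1, 0, -1):
    a = activations[l]
    # Propagate the error back through W[l], then scale by f'(z) of layer l.
    back = [sum(W[l][j][i] * delta[j] for j in range(width)) for i in range(width)]
    delta = [b * ai * (1 - ai) for b, ai in zip(back, a)]
    delta_norms.append(norm(delta))
delta_norms.reverse()  # delta_norms[0] now belongs to the first hidden layer

for layer, value in enumerate(delta_norms, start=1):
    print(f"layer {layer}: ||delta|| = {value:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, each step backwards multiplies the error term by a small factor under this initialization, so the norms printed for the early layers are much smaller than the one at the output. The gradient with respect to a layer's weights is the product of that layer's error term and its input activations, so it shrinks in the same way.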

A problem closely related to the diffusion of gradients is that if the last few layers in a neural network have a large enough number of neurons, they may be able to model the labeled data alone, without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).

## Greedy layer-wise training

How should deep architectures be trained then? One method that has seen some success is the greedy layer-wise training method. We describe this method in detail in later sections, but briefly, the main idea is to train the layers of the network one at a time, with the input of each layer being the output of the previous layer (which has been trained). Training can either be supervised (say, with classification error as the objective function), or unsupervised (say, with the error of the layer in reconstructing its input as the objective function, as in an autoencoder). The weights from training the layers individually are then used to initialize the weights in the deep architecture, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the training set error). The success of greedy layer-wise training has been attributed to a number of factors:

### Availability of data

While labeled data can be expensive to obtain, unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. By using unlabeled data to learn a good initial value for the weights in all the layers $\textstyle W^{(l)}$ (except for the final classification layer that maps to the outputs/predictions), our algorithm can learn and discover patterns from vastly more data than purely supervised approaches, and thus often produces much better hypotheses.

### Regularization and better local optima

After having trained the network on the unlabeled data, the weights are now starting at a better location in parameter space than if they had been randomly initialized. We usually then further fine-tune the weights starting from this location. Empirically, it turns out that gradient descent from this location is also much more likely to lead to a good local minimum, because the unlabeled data has already provided a significant amount of "prior" information about what patterns there are in the input data.
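The unsupervised variant of greedy layer-wise training described in this section can be sketched as follows. This is a minimal illustration under several assumptions, not the tutorial's implementation: each layer is trained as a plain sigmoid autoencoder with per-example gradient descent on squared reconstruction error, without sparsity or weight-decay penalties; the layer sizes, learning rate, epoch count, and random binary "unlabeled" data are made up; and the function names are hypothetical. The supervised fine-tuning step is omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(W, b, x):
    """Activations of one layer: sigmoid(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W, b)]

def train_autoencoder(data, n_hidden, rng, epochs=100, lr=0.5):
    """Train a one-hidden-layer autoencoder on `data`; return the encoder (W1, b1)."""
    n_in = len(data[0])
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    b2 = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode(W1, b1, x)   # hidden features
            r = encode(W2, b2, h)   # reconstruction of the input
            # Error terms for the cost 0.5 * ||r - x||^2 with sigmoid units.
            d_out = [(ri - xi) * ri * (1 - ri) for ri, xi in zip(r, x)]
            d_hid = [sum(W2[k][j] * d_out[k] for k in range(n_in)) * h[j] * (1 - h[j])
                     for j in range(n_hidden)]
            # Gradient descent step on the decoder, then on the encoder.
            for k in range(n_in):
                for j in range(n_hidden):
                    W2[k][j] -= lr * d_out[k] * h[j]
                b2[k] -= lr * d_out[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
                b1[j] -= lr * d_hid[j]
    return W1, b1

def greedy_pretrain(data, layer_sizes, rng):
    """Train one autoencoder per hidden layer; each one sees the previous layer's output."""
    layers, features = [], data
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(features, n_hidden, rng)
        layers.append((W, b))
        features = [encode(W, b, x) for x in features]  # input for the next layer
    return layers, features

rng = random.Random(0)
unlabeled = [[float(rng.random() < 0.5) for _ in range(6)] for _ in range(20)]
layers, top_features = greedy_pretrain(unlabeled, [4, 3], rng)
```

The returned `layers` hold the pretrained encoder weights for each hidden layer. In the full procedure these would initialize the corresponding layers of the deep network, a classification layer would be added on top, and the whole network would then be fine-tuned on the labeled data.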

In the next section, we will describe the specific details of how to go about implementing greedy layer-wise training.