Deep Networks: Overview

To take a simple example, consider building a boolean circuit/network to compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in the network can compute either the logical OR of its inputs (or the OR of the negation of the inputs), or compute the logical AND. If we have a network with only one input, one hidden, and one output layer, the parity function would require a number of nodes that is exponential in the input size <math>n</math>. If however we are allowed a deeper network, then the network/circuit size can be only polynomial in <math>n</math>.
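
To make the size comparison concrete, here is a small illustrative Python sketch (not part of the tutorial itself): it assembles parity from a chain of two-input XOR gadgets, each built from the node types allowed above, so the deep circuit needs only about <math>3(n-1)</math> nodes, at the cost of a depth that grows with <math>n</math>.

<pre>
# Illustrative sketch: a deep, polynomial-size parity circuit built only from
# the node types described above (AND, OR, OR-of-negated-inputs).

def gate_or(a, b):
    return a | b

def gate_and(a, b):
    return a & b

def gate_or_of_negations(a, b):
    # OR of the negated inputs; by De Morgan's law this equals NOT(a AND b).
    return (1 - a) | (1 - b)

def xor_gadget(a, b):
    # XOR(a, b) = (a OR b) AND (NOT a OR NOT b): 3 nodes, depth 2.
    return gate_and(gate_or(a, b), gate_or_of_negations(a, b))

def deep_parity(bits):
    # Chain n-1 XOR gadgets: roughly 3*(n-1) nodes total, depth about 2*(n-1).
    # A circuit with a single hidden layer of such gates would instead need a
    # number of nodes exponential in n.
    result = bits[0]
    for b in bits[1:]:
        result = xor_gadget(result, b)
    return result

assert deep_parity([1, 0, 1]) == 0      # two 1s: even parity
assert deep_parity([1, 1, 1, 0]) == 1   # three 1s: odd parity
</pre>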

By using a deep network, in the case of images, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features.

The brain itself also appears to carry out its computations in multiple stages of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.

In the past, researchers had little success training deep architectures. The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it using a labeled training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math> using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus for many problems it is difficult to get enough examples to fit the parameters of a complex model. For example, given the high degree of expressive power of deep networks, training on insufficient data would also result in overfitting.

===Local optima===
Training a shallow network (with 1 hidden layer) using supervised learning usually resulted in the parameters converging to reasonable values; but when we are training a deep network, this works much less well. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.
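
As a concrete illustration (a sketch, not the tutorial's code), the following Python/numpy snippet randomly initializes a small network with several hidden layers and runs plain gradient descent on exactly this objective, <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math>. It uses numerical gradients to keep the code short; whether it ends up in a good minimum depends on the random starting point.

<pre>
# Sketch: gradient descent on the non-convex training error of a small,
# randomly initialized deep sigmoid network.
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 5, 5, 5, 1]                       # input, three hidden layers, output
W = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((m, 1)) for m in sizes[1:]]
params = W + b

def h(X):                                     # forward pass with sigmoid units
    A = X
    for Wi, bi in zip(W, b):
        A = 1.0 / (1.0 + np.exp(-(Wi @ A + bi)))
    return A

# a tiny labeled training set (XOR, itself not linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
Y = np.array([[0, 1, 1, 0]], dtype=float)

def J():                                      # sum_i ||h_W(x_i) - y_i||^2
    return float(np.sum((h(X) - Y) ** 2))

alpha, eps = 0.5, 1e-5
for step in range(1000):                      # plain (batch) gradient descent
    grads = []
    for M in params:                          # central-difference numerical gradient
        G = np.zeros_like(M)
        for idx in np.ndindex(M.shape):
            old = M[idx]
            M[idx] = old + eps; Jp = J()
            M[idx] = old - eps; Jm = J()
            M[idx] = old
            G[idx] = (Jp - Jm) / (2 * eps)
        grads.append(G)
    for M, G in zip(params, grads):
        M -= alpha * G

print("training error after descent:", J())   # may be stuck in a poor local optimum
</pre>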

===Diffusion of gradients===
There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) do not work well on deep networks with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small.

A closely related problem is that training the whole network at once, with all the layers randomly initialized, ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).
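
This effect is easy to observe numerically. The short Python/numpy sketch below (an illustration, not part of the tutorial) backpropagates the squared-error gradient through a randomly initialized deep sigmoid network and prints the gradient magnitude at each layer; the norms typically shrink by orders of magnitude from the output layer back toward the first layer.

<pre>
# Sketch: diffusion of gradients in a randomly initialized deep sigmoid network.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 8, 20
Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]

x = rng.normal(size=(width, 1))               # a single random input
y = rng.normal(size=(width, 1))               # a random target

# forward pass, storing every activation
acts = [x]
for W in Ws:
    acts.append(1.0 / (1.0 + np.exp(-W @ acts[-1])))

# backpropagation of the squared error ||a_L - y||^2
delta = 2.0 * (acts[-1] - y) * acts[-1] * (1 - acts[-1])
for layer in range(depth, 0, -1):
    grad_W = delta @ acts[layer - 1].T        # gradient w.r.t. this layer's weights
    print(f"layer {layer}: ||dJ/dW|| = {np.linalg.norm(grad_W):.3e}")
    if layer > 1:                             # propagate the error one layer back
        delta = (Ws[layer - 1].T @ delta) * acts[layer - 1] * (1 - acts[layer - 1])
</pre>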

== Greedy layer-wise training ==
How can we train a deep network? One method that has seen some success is the '''greedy layer-wise training''' method. We describe this method in detail in later sections, but briefly, the main idea is to train the layers of the network one at a time, so that we first train a network with 1 hidden layer, and only after that is done, train a network with 2 hidden layers, and so on. At each step, we take the old network with <math>k-1</math> hidden layers, and add an additional <math>k</math>-th hidden layer (that takes as input the previous hidden layer <math>k-1</math> that we had just trained). Training can be supervised (say, with classification error as the objective function on each step), but more frequently it is unsupervised (as in an autoencoder; details to be provided later). The weights from training the layers individually are then used to initialize the weights in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error).
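
To make the procedure concrete, here is a schematic Python/numpy sketch under simplified assumptions (tied weights, no bias terms, plain gradient descent); it is an illustration rather than the tutorial's actual implementation. Each layer is pretrained as an autoencoder on the previous layer's output, and the resulting weights would then initialize the deep network before fine-tuning on labeled data.

<pre>
# Schematic sketch of greedy layer-wise pretraining with tied-weight
# sigmoid autoencoders (biases omitted for brevity).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=500, alpha=0.5):
    """Train one hidden layer to reconstruct its input X (features x examples)."""
    n_in, m = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(steps):
        H = sigmoid(W @ X)                    # encode
        R = sigmoid(W.T @ H)                  # decode with the tied (transposed) weights
        dR = 2.0 * (R - X) * R * (1 - R)      # backprop of the reconstruction error
        dH = (W @ dR) * H * (1 - H)
        grad = (dH @ X.T + H @ dR.T) / m      # tied weights: sum of both contributions
        W -= alpha * grad
    return W

# random toy data stands in for the large unlabeled training set
X_unlabeled = rng.random((30, 500))

# layer 1 is trained first; only then is layer 2 trained on layer 1's features
W1 = train_autoencoder(X_unlabeled, 20)
H1 = sigmoid(W1 @ X_unlabeled)
W2 = train_autoencoder(H1, 10)

# W1 and W2 would now initialize the hidden layers of the deep network; an output
# layer is added on top and the whole stack is fine-tuned on the labeled data.
print(W1.shape, W2.shape)
</pre>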

The success of greedy layer-wise training has been attributed to a number of factors:

Because the bulk of the learning is done on unlabeled data (with the labeled data needed mainly to fine-tune the network and the final classification layer that maps to the outputs/predictions), our algorithm is able to learn and discover patterns from vastly more data than purely supervised approaches. This often results in much better classifiers being learned.

===Better local optima===

After having trained the network on the unlabeled data, the weights are now starting at a better location in parameter space than if they had been randomly initialized. We can then further fine-tune the weights starting from this location. Empirically, it turns out that gradient descent from this location is much more likely to lead to a good local minimum, because the unlabeled data has already provided a significant amount of "prior" information about what patterns there are in the input data.

In the next section, we will describe the specific details of how to go about implementing greedy layer-wise training.

{{CNN}}

{{Languages|深度网络概览|中文}}