Deep Networks: Overview
==Difficulty of training deep architectures==
The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it using a labeled training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math> using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.
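To make this baseline concrete, here is a minimal NumPy sketch of the approach: randomly initialize the weights of a multi-layer network, then run gradient descent on the squared training error. It is not part of the original tutorial; the layer sizes, learning rate, and toy data are all illustrative assumptions.

<pre>
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer sizes for a small deep network (illustrative only).
sizes = [10, 8, 8, 8, 1]

# Randomly initialize the weights -- the baseline approach described above.
W = [rng.normal(0, 1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros((m, 1)) for m in sizes[1:]]

# Toy labeled training set (random, purely for illustration).
m_l = 50
X = rng.normal(size=(10, m_l))
Y = rng.uniform(size=(1, m_l))

alpha = 0.1  # learning rate
for step in range(100):
    for i in range(m_l):
        # Forward pass: record every layer's activations.
        acts = [X[:, [i]]]
        for Wl, bl in zip(W, b):
            acts.append(sigmoid(Wl @ acts[-1] + bl))
        # Backward pass for the squared error ||h_W(x) - y||^2.
        delta = 2 * (acts[-1] - Y[:, [i]]) * acts[-1] * (1 - acts[-1])
        for l in reversed(range(len(W))):
            gW, gb = delta @ acts[l].T, delta
            if l > 0:  # propagate before updating this layer's weights
                delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])
            W[l] -= alpha * gW
            b[l] -= alpha * gb
</pre>

The sections below describe why this straightforward recipe tends to fail on deep networks.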
===Availability of data===

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and thus for many problems it is difficult to get enough examples to fit the parameters of a complex model. Moreover, given the high degree of expressive power of deep networks, training on insufficient data would also result in overfitting.
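A quick back-of-the-envelope calculation makes the mismatch vivid. The sketch below (with made-up layer sizes) counts the free parameters of a modest deep network; with only a few hundred labeled examples there would be many more parameters than data points to fit them.

<pre>
# Rough parameter count for a deep network (layer sizes are made up).
sizes = [100, 50, 50, 50, 10]
n_params = sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))
print(n_params)  # 10660 weights and biases, far more than a small labeled set
</pre>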
===Local optima===

Training a shallow network (with 1 hidden layer) using supervised learning usually resulted in the parameters converging to reasonable values; but when we are training a deep network, this works much less well. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a function of the network parameters <math>\textstyle W</math>). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.
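One simple way to see the non-convexity is to evaluate the training error along a straight line between two parameter settings: for a convex function, the error along the line can never exceed its value at the worse endpoint, but for a neural network it typically does. The following sketch (a tiny 2-layer network on random data, with every detail assumed for illustration) does exactly that.

<pre>
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def h(W1, W2, X):
    """Tiny 2-layer network h_W(x) with sigmoid hidden units."""
    return W2 @ sigmoid(W1 @ X)

def J(W1, W2, X, Y):
    """Training error sum_i ||h_W(x^(i)) - y^(i)||^2."""
    return np.sum((h(W1, W2, X) - Y) ** 2)

X = rng.normal(size=(5, 20))
Y = rng.normal(size=(1, 20))
A = [rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]  # one random W
B = [rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]  # another random W

# Evaluate J along the straight line from A to B in parameter space.
# A convex J could never rise above the larger endpoint value along this
# line; with a neural network it often does, exposing non-convexity.
for t in np.linspace(0, 1, 11):
    W1 = (1 - t) * A[0] + t * B[0]
    W2 = (1 - t) * A[1] + t * B[1]
    print(f"t={t:.1f}  J={J(W1, W2, X, Y):.3f}")
</pre>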
===Diffusion of gradients===

There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related algorithms like L-BFGS) do not work well on a deep network with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very
small; thus, when using gradient descent, the weights of the earlier layers change slowly, and these layers fail to learn much. This problem is often called the "diffusion of gradients."

A closely related problem is that if the last few layers in a neural network have a large enough number of neurons, they may be able to model the labeled data on their own, without help from the earlier layers. Hence, training the entire network with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).
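The diffusion effect is easy to observe numerically. The sketch below (an assumed 8-layer sigmoid network with random weights; none of the specifics come from the original text) backpropagates a gradient from the output and prints the gradient magnitude at each layer's weights, which typically shrinks by orders of magnitude from the output layer back toward the input.

<pre>
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A deep network of 8 sigmoid layers, 20 units each, randomly initialized.
L, n = 8, 20
W = [rng.normal(0, 1.0 / np.sqrt(n), (n, n)) for _ in range(L)]

# Forward pass on a random input, recording every layer's activations.
acts = [rng.normal(size=(n, 1))]
for Wl in W:
    acts.append(sigmoid(Wl @ acts[-1]))

# Backpropagate a unit loss gradient from the output and record the size
# of the gradient reaching each layer's weights.
delta = np.ones((n, 1)) * acts[-1] * (1 - acts[-1])
for l in reversed(range(L)):
    grad_W = delta @ acts[l].T
    print(f"layer {l}: ||dJ/dW|| = {np.linalg.norm(grad_W):.2e}")
    if l > 0:
        delta = (W[l].T @ delta) * acts[l] * (1 - acts[l])
</pre>

Each backward step multiplies the gradient by a weight matrix and by the sigmoid derivative (at most 0.25), which is what drives the shrinkage as the signal travels toward the earlier layers.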