Gradient checking and advanced optimization
Backpropagation is a notoriously difficult algorithm to debug and get right,
especially since many subtly buggy implementations of it—for example, one
that has an off-by-one error in the indices and that thus only trains some
of the layers of weights, or an implementation that omits the bias term—will
manage to learn something that can look surprisingly reasonable
(while performing less well than a correct implementation). Thus, even with a
buggy implementation, it may not at all be apparent that anything is amiss.
In this section, we describe a method for numerically checking the derivatives
computed by your code to make sure that your implementation is correct. Carrying out the
derivative checking procedure described here will significantly increase
your confidence in the correctness of your code.
Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.
For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.
In this 1-dimensional case, one iteration of gradient descent is given by
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
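As a concrete illustration (not part of the original notes), here is a minimal Python sketch of this 1-dimensional update; the objective <math>\textstyle J(\theta) = \theta^2</math> and its derivative are placeholders chosen only for the example.

<pre>
# Minimal 1-D gradient descent sketch; J and dJ are illustrative placeholders.
def J(theta):
    return theta ** 2        # example objective: J(theta) = theta^2

def dJ(theta):
    return 2.0 * theta       # its derivative: dJ/dtheta = 2*theta

alpha = 0.1                  # learning rate
theta = 5.0                  # initial parameter value
for _ in range(100):
    theta = theta - alpha * dJ(theta)   # theta := theta - alpha * dJ/dtheta
print(theta)                 # converges towards the minimizer theta = 0
</pre>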
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly
computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent
using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>. How can we check if our implementation of
<math>\textstyle g</math> is correct?
Recall the mathematical definition of the derivative as
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative
as follows:
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,
as that would lead to numerical roundoff errors.)
Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing
<math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness
by checking that
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
The degree to which these two values should approximate each other will depend on
the details of <math>\textstyle J</math>. But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>,
you'll usually find that the left- and right-hand sides of the above will agree
to at least 4 significant digits (and often many more).
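In code, the 1-dimensional check might look like the following sketch (assuming Python; <code>g</code> stands in for your candidate derivative function):

<pre>
EPSILON = 1e-4

def numerical_derivative(J, theta):
    # Centered-difference approximation to dJ/dtheta.
    return (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON)

def check_derivative(J, g, theta):
    numgrad = numerical_derivative(J, theta)
    analytic = g(theta)
    # Relative difference; a small value indicates the two sides agree.
    rel_diff = abs(numgrad - analytic) / max(abs(numgrad), abs(analytic), 1e-12)
    return numgrad, analytic, rel_diff

# Example with J(theta) = theta^2 and a correct g(theta) = 2*theta:
print(check_derivative(lambda t: t ** 2, lambda t: 2 * t, 3.0))
</pre>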
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real
number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>. In
our neural network example we used "<math>\textstyle J(W,b)</math>", but one can imagine "unrolling"
the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>. We now generalize our derivative
checking procedure to the case where <math>\textstyle \theta</math> may be a vector.
Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes
<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math>
is outputting correct derivative values. Let <math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>, where
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a vector of the same dimension as <math>\textstyle \theta</math>,
with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else). So,
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element
has been incremented by <math>\textstyle {\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta -
{\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased
by <math>\textstyle {\rm EPSILON}</math>. We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness
by checking, for each <math>\textstyle i</math>, that:
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
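The following Python sketch implements this componentwise check with NumPy; <math>\textstyle J</math> here is any function taking the unrolled parameter vector:

<pre>
import numpy as np

EPSILON = 1e-4

def numerical_gradient(J, theta):
    # Approximate each partial derivative of J at theta by a centered difference.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                           # i-th basis vector
        numgrad[i] = (J(theta + EPSILON * e_i)
                      - J(theta - EPSILON * e_i)) / (2 * EPSILON)
    return numgrad

# Example: J(theta) = theta . theta, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda t: t @ t, theta))   # approx [ 2., -4.,  1.]
</pre>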
When implementing backpropagation to train a neural network, in a correct implementation
we will have that
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed
implementing gradient descent. To make sure your implementation of gradient descent is correct, it is
usually very helpful to use the method described above to numerically compute the derivatives of
<math>\textstyle J(W,b)</math>, and thereby verify that
your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are
indeed giving the derivatives you want.
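One way to wire this up, sketched in Python: unroll the backpropagation gradients into a vector and compare them against <code>numerical_gradient</code> from the sketch above. The functions <code>cost</code> and <code>backprop_gradient</code> are hypothetical stand-ins for your own implementation.

<pre>
import numpy as np

def check_backprop(cost, backprop_gradient, theta):
    # 'cost' maps the unrolled parameter vector theta to J(W,b);
    # 'backprop_gradient' returns the unrolled gradients
    # (1/m) Delta W^(l) + lambda W^(l) and (1/m) Delta b^(l).
    # Both are hypothetical stand-ins for your own code.
    numgrad = numerical_gradient(cost, theta)
    grad = backprop_gradient(theta)
    # Normalized difference; a correct implementation typically gives a
    # very small value (far below the EPSILON = 1e-4 used in the check).
    return np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
</pre>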
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>. If you have
implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more
sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>. For example, one can envision
an algorithm that uses gradient descent, but can also automatically tune the learning rate <math>\textstyle \alpha</math> so as to try to use a
step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible. There are other algorithms
that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the
Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of
these algorithms is beyond the scope of these notes, but one important point is this: these algorithms require that you be
able to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math> for any value of <math>\textstyle \theta</math>. These
optimization algorithms will then do their own internal tuning of the learning rate/step-size <math>\textstyle \alpha</math>
(and compute their own approximation to the Hessian, etc.)
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
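For instance, if you work in Python, <code>scipy.optimize.minimize</code> provides off-the-shelf L-BFGS and conjugate gradient; you only supply a function returning <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. The quadratic objective below is purely illustrative.

<pre>
import numpy as np
from scipy.optimize import minimize

def J_and_grad(theta):
    # Illustrative objective J(theta) = ||theta||^2 with gradient 2*theta;
    # replace with your own cost and unrolled gradient.
    return theta @ theta, 2 * theta

theta0 = np.array([1.0, -2.0, 0.5])              # initial parameters
result = minimize(J_and_grad, theta0, jac=True, method='L-BFGS-B')
print(result.x)                                  # approximately [0., 0., 0.]
</pre>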
{{Sparse_Autoencoder}}