Gradient checking and advanced optimization
Backpropagation is a notoriously difficult algorithm to debug and get right,
especially since many subtly buggy implementations of it—for example, one
that has an off-by-one error in the indices and that thus only trains some
of the layers of weights, or an implementation that omits the bias term—will
manage to learn something that can look surprisingly reasonable
(while performing less well than a correct implementation). Thus, even with a
buggy implementation, it may not at all be apparent that anything is amiss.
In this section, we describe a method for numerically checking the derivatives
computed by your code to make sure that your implementation is correct. Carrying out the
derivative checking procedure described here will significantly increase
your confidence in the correctness of your code.
Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.
For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.
In this 1-dimensional case, one iteration of gradient descent is given by
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
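As a concrete illustration (not part of the original notes), here is a minimal Python sketch of this 1-dimensional update; the objective <math>\textstyle J(\theta) = \theta^2</math> and its derivative are placeholders chosen only for the example.

<pre>
# Minimal 1-D gradient descent sketch; J and dJ are illustrative placeholders.
def J(theta):
    return theta ** 2        # example objective: J(theta) = theta^2

def dJ(theta):
    return 2.0 * theta       # its derivative: dJ/dtheta = 2*theta

alpha = 0.1                  # learning rate
theta = 5.0                  # initial parameter value
for _ in range(100):
    theta = theta - alpha * dJ(theta)   # theta := theta - alpha * dJ/dtheta
print(theta)                 # converges towards the minimizer theta = 0
</pre>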
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly
computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent
using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>. How can we check if our implementation of
<math>\textstyle g</math> is correct?
Recall the mathematical definition of the derivative as
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative
as follows:
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,
as that would lead to numerical roundoff errors.)
Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing
<math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness
by checking that
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
The degree to which these two values should approximate each other will depend on
the details of <math>\textstyle J</math>. But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>,
you'll usually find that the left- and right-hand sides of the above will agree
to at least 4 significant digits (and often many more).
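In code, the 1-dimensional check might look like the following sketch (assuming Python; <code>g</code> stands in for your candidate derivative function):

<pre>
EPSILON = 1e-4

def numerical_derivative(J, theta):
    # Centered-difference approximation to dJ/dtheta.
    return (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON)

def check_derivative(J, g, theta):
    numgrad = numerical_derivative(J, theta)
    analytic = g(theta)
    # Relative difference; a small value indicates the two sides agree.
    rel_diff = abs(numgrad - analytic) / max(abs(numgrad), abs(analytic), 1e-12)
    return numgrad, analytic, rel_diff

# Example with J(theta) = theta^2 and a correct g(theta) = 2*theta:
print(check_derivative(lambda t: t ** 2, lambda t: 2 * t, 3.0))
</pre>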
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real
number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>. In
our neural network example we used "<math>\textstyle J(W,b)</math>", but one can imagine "unrolling"
the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>. We now generalize our derivative
checking procedure to the case where <math>\textstyle \theta</math> may be a vector.
Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes
<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math>
is outputting correct derivative values. Let <math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>, where
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a vector of the same dimension as <math>\textstyle \theta</math>,
with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else). So,
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element
has been incremented by <math>\textstyle {\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta -
{\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased
by <math>\textstyle {\rm EPSILON}</math>. We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness
by checking, for each <math>\textstyle i</math>, that:
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
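The following Python sketch implements this componentwise check with NumPy; <math>\textstyle J</math> here is any function taking the unrolled parameter vector:

<pre>
import numpy as np

EPSILON = 1e-4

def numerical_gradient(J, theta):
    # Approximate each partial derivative of J at theta by a centered difference.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                           # i-th basis vector
        numgrad[i] = (J(theta + EPSILON * e_i)
                      - J(theta - EPSILON * e_i)) / (2 * EPSILON)
    return numgrad

# Example: J(theta) = theta . theta, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda t: t @ t, theta))   # approx [ 2., -4.,  1.]
</pre>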
When implementing backpropagation to train a neural network, in a correct implementation
we will have that
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed
implementing gradient descent. To make sure your implementation of gradient descent is correct, it is
usually very helpful to use the method described above to numerically compute the derivatives of
<math>\textstyle J(W,b)</math>, and thereby verify that
your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are
indeed giving the derivatives you want.
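One way to wire this up, sketched in Python: unroll the backpropagation gradients into a vector and compare them against <code>numerical_gradient</code> from the sketch above. The functions <code>cost</code> and <code>backprop_gradient</code> are hypothetical stand-ins for your own implementation.

<pre>
import numpy as np

def check_backprop(cost, backprop_gradient, theta):
    # 'cost' maps the unrolled parameter vector theta to J(W,b);
    # 'backprop_gradient' returns the unrolled gradients
    # (1/m) Delta W^(l) + lambda W^(l) and (1/m) Delta b^(l).
    # Both are hypothetical stand-ins for your own code.
    numgrad = numerical_gradient(cost, theta)
    grad = backprop_gradient(theta)
    # Normalized difference; a correct implementation typically gives a
    # very small value (far below the EPSILON = 1e-4 used in the check).
    return np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
</pre>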
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>. If you have
implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more
sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>. For example, one can envision
an algorithm that uses gradient descent, but can also automatically tune the learning rate <math>\textstyle \alpha</math> so as to try to use a
step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible. There are other algorithms
that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the
Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of
these algorithms is beyond the scope of these notes, but one important point is this: these algorithms require that you be
able to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math> for any value of <math>\textstyle \theta</math>. These
optimization algorithms will then do their own internal tuning of the learning rate/step-size <math>\textstyle \alpha</math>
(and compute their own approximation to the Hessian, etc.)
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
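For instance, if you work in Python, <code>scipy.optimize.minimize</code> provides off-the-shelf L-BFGS and conjugate gradient; you only supply a function returning <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. The quadratic objective below is purely illustrative.

<pre>
import numpy as np
from scipy.optimize import minimize

def J_and_grad(theta):
    # Illustrative objective J(theta) = ||theta||^2 with gradient 2*theta;
    # replace with your own cost and unrolled gradient.
    return theta @ theta, 2 * theta

theta0 = np.array([1.0, -2.0, 0.5])              # initial parameters
result = minimize(J_and_grad, theta0, jac=True, method='L-BFGS-B')
print(result.x)                                  # approximately [0., 0., 0.]
</pre>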
{{Sparse_Autoencoder}}