Gradient checking and advanced optimization
Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and that thus only trains some of the layers of weights, or an implementation that omits the bias term) will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>. For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>. In this 1-dimensional case, one iteration of gradient descent is given by
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>. How can we check if our implementation of <math>\textstyle g</math> is correct?
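
To make this concrete, here is a minimal Python sketch of the 1-dimensional update above. The objective <math>\textstyle J(\theta) = \theta^2</math> and its hand-coded derivative <code>g</code> are toy stand-ins chosen for illustration, not functions defined by the tutorial:

<pre>
def J(theta):
    return theta ** 2          # toy objective J(theta) = theta^2

def g(theta):
    return 2.0 * theta         # purportedly computes dJ/dtheta

theta = 5.0                    # initial parameter value
alpha = 0.1                    # learning rate
for _ in range(100):
    theta = theta - alpha * g(theta)   # the update theta := theta - alpha * g(theta)

print(theta)                   # converges towards the minimizer theta = 0
</pre>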
Recall the mathematical definition of the derivative as
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative as follows:
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>. (There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>, as that would lead to numerical roundoff errors.)

Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness by checking that
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
The degree to which these two values should approximate each other will depend on the details of <math>\textstyle J</math>. But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
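
In code, this check is only a few lines. A minimal Python sketch, reusing the toy <code>J</code> and <code>g</code> from the earlier illustration (again assumptions, not part of the tutorial), might look like:

<pre>
EPSILON = 1e-4                 # perturbation size for the centered difference

def J(theta):
    return theta ** 2          # toy objective

def g(theta):
    return 2.0 * theta         # the derivative implementation we want to verify

theta = 1.5
numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
analytic = g(theta)
print(analytic, numeric)       # should agree to at least 4 significant digits
</pre>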
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>. In our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling" the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>. We now generalize our derivative checking procedure to the case where <math>\textstyle \theta</math> may be a vector.

Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes <math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math> is outputting correct derivative values. Let <math>\textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i</math>, where
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else). So, <math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented by <math>{\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>. We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>, that:
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
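
A possible NumPy sketch of this coordinate-wise check is shown below; the function names <code>compute_numerical_gradient</code>, <code>J</code>, and <code>grad_J</code> are illustrative choices, not names defined by the tutorial:

<pre>
import numpy as np

EPSILON = 1e-4

def compute_numerical_gradient(J, theta):
    # Centered-difference approximation to the gradient of J at theta.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                          # i-th basis vector
        theta_plus = theta + EPSILON * e_i    # theta^(i+)
        theta_minus = theta - EPSILON * e_i   # theta^(i-)
        numgrad[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * EPSILON)
    return numgrad

# Toy objective J(theta) = theta_1^2 + 3*theta_1*theta_2 and its analytic gradient.
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

def grad_J(theta):
    return np.array([2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]])

theta = np.array([4.0, 10.0])
print(grad_J(theta))                          # analytic gradient
print(compute_numerical_gradient(J, theta))   # should be nearly identical
</pre>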
When implementing backpropagation to train a neural network, in a correct implementation we will have that
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are indeed giving the derivatives you want.
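
One convenient way to apply this to a network is to unroll the backpropagation gradient into a vector and summarize its agreement with the numerical gradient as a single relative error. The helpers referenced in the comments below (<code>cost_and_grad</code>, <code>compute_numerical_gradient</code>) are hypothetical stand-ins for your own code:

<pre>
import numpy as np

def relative_error(grad, numgrad):
    # Scalar summary of how closely two gradient vectors agree.
    return np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)

# Usage sketch (assumes theta is the unrolled vector of all W, b parameters):
#   cost, grad = cost_and_grad(theta)        # J(W,b) and the backprop gradient, unrolled
#   numgrad = compute_numerical_gradient(lambda t: cost_and_grad(t)[0], theta)
#   print(relative_error(grad, numgrad))     # expect a very small value, e.g. < 1e-8
</pre>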
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>. If you have implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>. For example, one can envision an algorithm that uses gradient descent, but automatically tunes the learning rate <math>\textstyle \alpha</math> so as to try to use a step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible. There are other algorithms that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these algorithms is beyond the scope of these notes, but one example is the '''L-BFGS''' algorithm. (Another example is the '''conjugate gradient''' algorithm.) You will use one of these algorithms in the programming exercise. The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>, to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. These optimization algorithms will then do their own internal tuning of the learning rate/step-size <math>\textstyle \alpha</math> (and compute their own approximation to the Hessian, etc.) to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
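
As one concrete illustration (not part of the tutorial itself), SciPy's general-purpose minimizer can run L-BFGS for you once you can evaluate <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>; the quadratic objective below is just a placeholder to show the calling pattern:

<pre>
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Return J(theta) and its gradient; here J is a toy quadratic.
    J = 0.5 * np.sum(theta ** 2)
    grad = theta
    return J, grad

theta0 = np.array([3.0, -2.0, 1.0])      # initial parameters
result = minimize(cost_and_grad, theta0, jac=True, method='L-BFGS-B',
                  options={'maxiter': 400})
print(result.x)                          # approximately the zero vector
</pre>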
{{Sparse_Autoencoder}}