Gradient checking and advanced optimization

From Ufldl

Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and thus trains only some of the layers of weights, or an implementation that omits the bias term) will manage to learn something that can look surprisingly reasonable, while performing less well than a correct implementation.  Thus, even with a buggy implementation, it may not be at all apparent that anything is amiss.  In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct.  Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.  For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.  In this 1-dimensional case, one iteration of gradient descent is given by

:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>

Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>.  How can we check if our implementation of <math>\textstyle g</math> is correct?

Recall the mathematical definition of the derivative as

:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>

Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative as follows:

:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>

In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.  (There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>, as that would lead to numerical roundoff errors.)

Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness by checking that

:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>

The degree to which these two values should approximate each other will depend on the details of <math>\textstyle J</math>.  But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).

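As a concrete illustration, this check can be sketched in a few lines of Python.  The objective <math>\textstyle J</math> and the hand-coded derivative <math>\textstyle g</math> below are arbitrary stand-ins chosen for the example, not functions from these notes:

```python
EPSILON = 1e-4  # small, but not so small that roundoff error dominates

def J(theta):
    # Arbitrary example objective: J(theta) = theta^3 + 2*theta
    return theta ** 3 + 2 * theta

def g(theta):
    # Hand-coded derivative we want to verify: dJ/dtheta = 3*theta^2 + 2
    return 3 * theta ** 2 + 2

def numerical_derivative(J, theta, eps=EPSILON):
    # Central difference: (J(theta + eps) - J(theta - eps)) / (2 * eps)
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

theta = 1.5
print(g(theta), numerical_derivative(J, theta))
# The two printed values should agree to at least 4 significant digits.
```

Had <math>\textstyle g</math> contained a bug (say, returning only <math>\textstyle 3\theta^2</math>), the two printed values would disagree in the leading digits, immediately exposing the error.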
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>.  In our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling" the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>.  We now generalize our derivative checking procedure to the case where <math>\textstyle \theta</math> may be a vector.

Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes <math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math> is outputting correct derivative values.  Let <math>\textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i</math>, where

:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>

is the <math>\textstyle i</math>-th basis vector (a vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else).  So, <math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented by <math>{\rm EPSILON}</math>.  Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>.  We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>, that:

:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
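This coordinate-wise check translates directly into code.  The sketch below (Python with NumPy, purely as an illustration) perturbs one component of <math>\textstyle \theta</math> at a time; the quadratic objective used to exercise it is an arbitrary example whose exact gradient happens to be <math>\textstyle \theta</math> itself:

```python
import numpy as np

EPSILON = 1e-4

def numerical_gradient(J, theta, eps=EPSILON):
    # Approximate each partial derivative of J at theta by a central
    # difference along the i-th basis vector e_i.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0  # "1" in the i-th position, "0"s everywhere else
        grad[i] = (J(theta + eps * e_i) - J(theta - eps * e_i)) / (2 * eps)
    return grad

# Example objective (an assumption for illustration): J(theta) = 0.5*||theta||^2,
# whose exact gradient is theta.
J = lambda t: 0.5 * np.dot(t, t)
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(J, theta))  # ≈ [ 1. -2.  3.]
```

In practice one would compare this numerical gradient, entry by entry, against the output of the gradient code being debugged.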
 
When implementing backpropagation to train a neural network, in a correct implementation we will have that

:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>

This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed implementing gradient descent.
To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are indeed giving the derivatives you want.

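The same verification can be carried out end to end after unrolling the parameters into a single vector.  The sketch below uses a deliberately tiny stand-in model (a single linear unit with squared-error loss and weight decay; my example, not a network from these notes) so that an analytic gradient of the familiar <math>\textstyle \frac{1}{m}\Delta W + \lambda W</math> shape can be compared against the numerical one:

```python
import numpy as np

EPS = 1e-4

# Tiny illustrative model (an assumption): one linear unit, squared-error
# loss with weight decay, m = 3 training examples.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
lam = 0.01  # weight-decay coefficient lambda

def unpack(theta):
    return theta[:2], theta[2]          # W (2,), b (scalar)

def J(theta):
    W, b = unpack(theta)
    residual = X @ W + b - y
    return 0.5 * np.mean(residual ** 2) + 0.5 * lam * np.dot(W, W)

def analytic_grad(theta):
    # Gradient in the same form as the backprop result:
    # (1/m) * accumulated gradient + lambda * W for the weights,
    # (1/m) * accumulated gradient for the bias.
    W, b = unpack(theta)
    m = X.shape[0]
    residual = X @ W + b - y
    grad_W = X.T @ residual / m + lam * W
    grad_b = residual.mean()
    return np.concatenate([grad_W, [grad_b]])

def numerical_grad(theta):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = EPS
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * EPS)
    return grad

theta = np.array([0.5, -0.3, 0.1])   # unrolled (W, b)
print(np.max(np.abs(analytic_grad(theta) - numerical_grad(theta))))
# Maximum discrepancy should be tiny (near machine precision for this J).
```

A large entry in this discrepancy vector points directly at the parameter whose derivative is being computed incorrectly.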
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>.  If you have implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>.  For example, one can envision an algorithm that uses gradient descent but automatically tunes the learning rate <math>\textstyle \alpha</math> so as to use a step size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible.  There are other algorithms that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method).  A full discussion of these algorithms is beyond the scope of these notes, but one example is the '''L-BFGS''' algorithm.  (Another example is the '''conjugate gradient''' algorithm.)  You will use one of these algorithms in the programming exercise.

The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>, to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>.  These optimization algorithms will then do their own internal tuning of the learning rate/step size <math>\textstyle \alpha</math> (and compute their own approximation to the Hessian, etc.) to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>.  Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.

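As an illustration of this interface, here is how one might hand <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math> to an off-the-shelf L-BFGS implementation.  This sketch assumes SciPy is available and uses the Rosenbrock function, a standard optimization test problem, as an arbitrary example objective:

```python
import numpy as np
from scipy.optimize import minimize

# Example objective (an assumption for illustration): the Rosenbrock
# function, minimized at theta = (1, 1).
def J(theta):
    return (1 - theta[0]) ** 2 + 100 * (theta[1] - theta[0] ** 2) ** 2

def grad_J(theta):
    return np.array([
        -2 * (1 - theta[0]) - 400 * theta[0] * (theta[1] - theta[0] ** 2),
        200 * (theta[1] - theta[0] ** 2),
    ])

# L-BFGS only needs J and its gradient; it tunes step sizes and builds
# its own approximation to the Hessian internally.
result = minimize(J, x0=np.zeros(2), jac=grad_J, method="L-BFGS-B")
print(result.x)  # ≈ [1. 1.], found without ever choosing a learning rate
```

Note that no learning rate appears anywhere in the call: the optimizer's internal line search takes over the role that <math>\textstyle \alpha</math> played in plain gradient descent.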
{{Sparse_Autoencoder}}

Revision as of 10:39, 9 March 2013
