Gradient checking and advanced optimization
Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and that thus only trains some of the layers of weights, or an implementation that omits the bias term) will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>. For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>. In this 1-dimensional case, one iteration of gradient descent is given by
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>. How can we check if our implementation of <math>\textstyle g</math> is correct?
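
To make this concrete, here is a minimal Python sketch of the 1-dimensional update above. The objective <math>\textstyle J(\theta) = \theta^2</math> and its hand-coded derivative <code>g</code> are toy stand-ins chosen for illustration, not functions defined by the tutorial:

<pre>
def J(theta):
    return theta ** 2          # toy objective J(theta) = theta^2

def g(theta):
    return 2.0 * theta         # purportedly computes dJ/dtheta

theta = 5.0                    # initial parameter value
alpha = 0.1                    # learning rate
for _ in range(100):
    theta = theta - alpha * g(theta)   # the update theta := theta - alpha * g(theta)

print(theta)                   # converges towards the minimizer theta = 0
</pre>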
Recall the mathematical definition of the derivative as
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative as follows:
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>. (There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>, as that would lead to numerical roundoff errors.)

Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness by checking that
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
The degree to which these two values should approximate each other will depend on the details of <math>\textstyle J</math>. But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
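
In code, this check is only a few lines. A minimal Python sketch, reusing the toy <code>J</code> and <code>g</code> from the earlier illustration (again assumptions, not part of the tutorial), might look like:

<pre>
EPSILON = 1e-4                 # perturbation size for the centered difference

def J(theta):
    return theta ** 2          # toy objective

def g(theta):
    return 2.0 * theta         # the derivative implementation we want to verify

theta = 1.5
numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
analytic = g(theta)
print(analytic, numeric)       # should agree to at least 4 significant digits
</pre>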
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>. In our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling" the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>. We now generalize our derivative checking procedure to the case where <math>\textstyle \theta</math> may be a vector.

Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes <math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math> is outputting correct derivative values. Let <math>\textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i</math>, where
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position and "0"s everywhere else). So, <math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented by <math>{\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>. We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>, that:
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
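
A possible NumPy sketch of this coordinate-wise check is shown below; the function names <code>compute_numerical_gradient</code>, <code>J</code>, and <code>grad_J</code> are illustrative choices, not names defined by the tutorial:

<pre>
import numpy as np

EPSILON = 1e-4

def compute_numerical_gradient(J, theta):
    # Centered-difference approximation to the gradient of J at theta.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                          # i-th basis vector
        theta_plus = theta + EPSILON * e_i    # theta^(i+)
        theta_minus = theta - EPSILON * e_i   # theta^(i-)
        numgrad[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * EPSILON)
    return numgrad

# Toy objective J(theta) = theta_1^2 + 3*theta_1*theta_2 and its analytic gradient.
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

def grad_J(theta):
    return np.array([2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]])

theta = np.array([4.0, 10.0])
print(grad_J(theta))                          # analytic gradient
print(compute_numerical_gradient(J, theta))   # should be nearly identical
</pre>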
When implementing backpropagation to train a neural network, in a correct implementation we will have that
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are indeed giving the derivatives you want.
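
One convenient way to apply this to a network is to unroll the backpropagation gradient into a vector and summarize its agreement with the numerical gradient as a single relative error. The helpers referenced in the comments below (<code>cost_and_grad</code>, <code>compute_numerical_gradient</code>) are hypothetical stand-ins for your own code:

<pre>
import numpy as np

def relative_error(grad, numgrad):
    # Scalar summary of how closely two gradient vectors agree.
    return np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)

# Usage sketch (assumes theta is the unrolled vector of all W, b parameters):
#   cost, grad = cost_and_grad(theta)        # J(W,b) and the backprop gradient, unrolled
#   numgrad = compute_numerical_gradient(lambda t: cost_and_grad(t)[0], theta)
#   print(relative_error(grad, numgrad))     # expect a very small value, e.g. < 1e-8
</pre>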
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>. If you have implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>. For example, one can envision an algorithm that uses gradient descent, but automatically tunes the learning rate <math>\textstyle \alpha</math> so as to try to use a step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible. There are other algorithms that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these algorithms is beyond the scope of these notes, but one example is the '''L-BFGS''' algorithm. (Another example is the '''conjugate gradient''' algorithm.) You will use one of these algorithms in the programming exercise. The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>, to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. These optimization algorithms will then do their own internal tuning of the learning rate/step-size <math>\textstyle \alpha</math> (and compute their own approximation to the Hessian, etc.) to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
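
As one concrete illustration (not part of the tutorial itself), SciPy's general-purpose minimizer can run L-BFGS for you once you can evaluate <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>; the quadratic objective below is just a placeholder to show the calling pattern:

<pre>
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Return J(theta) and its gradient; here J is a toy quadratic.
    J = 0.5 * np.sum(theta ** 2)
    grad = theta
    return J, grad

theta0 = np.array([3.0, -2.0, 1.0])      # initial parameters
result = minimize(cost_and_grad, theta0, jac=True, method='L-BFGS-B',
                  options={'maxiter': 400})
print(result.x)                          # approximately the zero vector
</pre>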
{{Sparse_Autoencoder}}