Gradient checking and advanced optimization

Backpropagation is a notoriously difficult algorithm to debug and get right,
especially since many subtly buggy implementations of it—for example, one
that has an off-by-one error in the indices and that thus only trains some of
the layers of weights, or an implementation that omits the bias term—will
manage to learn something that can look surprisingly reasonable
(while performing less well than a correct implementation).  Thus, even with a
<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,
as that would lead to numerical roundoff errors.)
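For concreteness, here is a minimal sketch (in Python, with a toy objective of our own choosing) of this two-sided numerical check for a one-dimensional <math>\textstyle \theta</math>:

<pre>
EPSILON = 1e-4  # small, but not so small that floating-point roundoff dominates

def J(theta):
    # Toy objective standing in for your cost function: J(theta) = theta^2 + 3*theta
    return theta ** 2 + 3.0 * theta

def g(theta):
    # Analytic derivative we claim equals dJ/dtheta = 2*theta + 3; this is what we verify
    return 2.0 * theta + 3.0

theta = 1.5
numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
print(numeric, g(theta))  # the two values should agree to several significant digits
</pre>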
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real
number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>.  In
our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling"
the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>.  We now generalize our derivative
checking procedure to the case where <math>\textstyle \theta</math> may be a vector.
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a
vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position
and "0"s everywhere else).  So,
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented
by <math>{\rm EPSILON}</math>.  Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the
corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>.
We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>,
that:
<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
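As a sketch of this procedure (Python again, with illustrative names; J is assumed to be a function taking the full parameter vector), the vector case simply repeats the scalar check once per coordinate:

<pre>
import numpy as np

EPSILON = 1e-4

def numerical_gradient(J, theta):
    # Approximate each partial derivative of J at theta with a two-sided difference.
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                          # i-th basis vector
        theta_plus = theta + EPSILON * e_i    # theta^(i+)
        theta_minus = theta - EPSILON * e_i   # theta^(i-)
        numgrad[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * EPSILON)
    return numgrad

# Example: J(theta) = 0.5*||theta||^2, whose gradient is theta itself.
J = lambda theta: 0.5 * np.dot(theta, theta)
theta = np.array([0.3, -1.2, 2.0])
print(np.linalg.norm(numerical_gradient(J, theta) - theta))  # should be very small
</pre>

Each <math>\textstyle g_i(\theta)</math> returned by your code can then be compared against the corresponding entry of this numerical gradient.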
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed
implementing gradient descent.
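As an illustration (not part of the notes' pseudo-code), a single such gradient-descent step for the bias terms of one layer might look like the following sketch, where Delta_b is assumed to already hold the summed partial derivatives accumulated by backpropagation over m training examples and alpha is a learning rate chosen only for the example:

<pre>
import numpy as np

m = 100         # number of training examples (illustrative)
alpha = 0.1     # learning rate (illustrative)

b = np.zeros(25)               # bias vector for one layer, made-up size
Delta_b = np.random.randn(25)  # summed dJ/db accumulated by backpropagation (stand-in values)

# Gradient-descent update using the averaged gradient (1/m) * Delta_b;
# the weight matrices are updated analogously from their own accumulators.
b = b - alpha * (Delta_b / m)
</pre>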
To make sure your implementation of gradient descent is correct, it is
Hessian matrix, so that it can take more rapid steps towards a local optimum (similar to Newton's method).  A full discussion of these
algorithms is beyond the scope of these notes, but one example is
the '''L-BFGS''' algorithm.  (Another example is the '''conjugate gradient''' algorithm.)  You will use one of
these algorithms in the programming exercise.
The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>, to compute
<math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>.  The optimization algorithm then uses these
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>.  Algorithms
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
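To illustrate what providing <math>\textstyle J(\theta)</math> and its gradient to such an off-the-shelf optimizer can look like, here is a minimal sketch using SciPy's L-BFGS implementation on a toy quadratic objective (the specific library and objective are our choice of example, not something prescribed by these notes):

<pre>
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Toy objective standing in for your unrolled cost: J(theta) = 0.5*||theta - 3||^2
    J = 0.5 * np.sum((theta - 3.0) ** 2)
    grad = theta - 3.0
    return J, grad

theta0 = np.zeros(5)   # initial (unrolled) parameter vector
result = minimize(cost_and_grad, theta0, method='L-BFGS-B',
                  jac=True)   # jac=True means cost_and_grad returns (J, gradient)
print(result.x)        # should be close to a vector of all 3's
</pre>

The optimizer handles step sizes and search directions internally; all it needs from the caller is the value and gradient of the objective at any requested <math>\textstyle \theta</math>.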
{{Sparse_Autoencoder}}

{{Languages|梯度检验与高级优化|中文}}
