梯度检验与高级优化

初译: @pocketwalker

一审：王方，email：fangkey@gmail.com，新浪微博：@GuitarFang

二审：@大黄蜂的思索

Wiki上传者：王方，email：fangkey@gmail.com，新浪微博：@GuitarFang


:【原文】：
Backpropagation is a notoriously difficult algorithm to debug and get right,
especially since many subtly buggy implementations of it&mdash;for example, one
that has an off-by-one error in the indices and that thus only trains some of
the layers of weights, or an implementation that omits the bias term&mdash;will
manage to learn something that can look surprisingly reasonable
(while performing less well than a correct implementation).  Thus, even with a
buggy implementation, it may not at all be apparent that anything is amiss.
In this section, we describe a method for numerically checking the derivatives computed
by your code to make sure that your implementation is correct.  Carrying out the
derivative checking procedure described here will significantly increase
your confidence in the correctness of your code.
:【初译】：
众所周知，反向传播算法很难调试，尤其是由于它的很多有细小问题的实现。举例来说，在指标上有一位误差的实现而只训练其中一些权重层，或者一个忽略偏离项的实现，都会导致学习的材料显示地惊人一致（然而实际的表现却比正确的实现差）。因此，即使是一个有问题的实现也可能在表面上没有任何征兆。在这一章，我们将描述一个用以数学上检查由你的代码计算得出的导数从而保证实现的正确性的方法。使用这里描述的导数检验方法将很大程度地提高你对代码正确性的信心。
:【一审】：
众所周知，反向传播算法很难调试以使之正确，尤其是当存在很多难于发现的Bug的时候。举例来说，当矩阵下标存在一位偏差时，或者只有一部分权重得到训练时，再或者忘记计算偏置项时，你都会得到一个看上去十分合理的结果（这句的逻辑这样才是正确的）（然而实际的性能却比正确的实现差）。因此，即使是一个有问题的实现也可能在表面上没有任何征兆。在这一章，我们将描述一种数学上的方法，用它来检查由你的代码计算得出的导数，从而保证你的实现是正确的。使用这里描述的导数检验方法将很大程度地提高你对代码正确性的信心。
:【二审】：
众所周知，反向传播算法（我之前都译为“传导”，大家基本都译为“传播”就先传播）是很难调试成功的，特别是，很多实现中存在不易察觉的细微错误，比如说，索引的缺位错误（off-by-one error，二审注：如果仅是为了翻译，只要选个合适的词翻译就可，但为了让读者理解，我觉得有必要加上具体说明），导致只训练了一部分层的权重参数；或者代码中忽略了偏置项。这样的实现得到的学习结果很可能既让人吃惊又看似合理（事实上比正确的实现差）。因此，即使代码是错误的，我们也很难发现有什么地方不对劲。本节中，我们将讨论一种对导数（导数的运算由代码实现）进行数值检验的方法，以确定你的代码是否正确。使用本节所述的导数检验方法，将非常有助于提升你对所写代码正确性的信心。

缺位错误（Off-by-one error）说明：指的是，比如for 循环中循环m次，则应该是for(i=1; i<=m ;i++)，但有时程序员疏忽，会写成for(i=1;i<m;i++)，这就是缺位错误。

:【原文】：
Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.
For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.
In this 1-dimensional case, one iteration of gradient descent is given by
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
:【初译】：
假设我们想要最小化变量<math>\textstyle \theta</math>的目标函数<math>\textstyle J(\theta)</math>。假设<math>\textstyle J : \Re \mapsto \Re</math>，则<math>\textstyle \theta \in \Re</math>。在一维的情况下，一次梯度下降的迭代便是
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
:【一审】：
假设我们想要最小化以<math>\textstyle \theta</math>为自变量的目标函数<math>\textstyle J(\theta)</math>。假设<math>\textstyle J : \Re \mapsto \Re</math>，则<math>\textstyle \theta \in \Re</math>。在一维的情况下，一次梯度下降的迭代便是
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
:【二审】：
假设我们想要最小化以<math>\textstyle \theta</math>为自变量的目标函数<math>\textstyle J(\theta)</math>。这里，<math>\textstyle J : \Re \mapsto \Re</math>，则<math>\textstyle \theta \in \Re</math>。在一维的情况下，一次梯度下降的迭代便是
:<math>\begin{align}
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).
\end{align}</math>
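
The one-dimensional update rule above can be sketched in a few lines of Python. This is only an illustration: the objective <math>\textstyle J(\theta) = (\theta - 3)^2</math>, the helper name <code>gradient_descent_1d</code>, the learning rate and the iteration count are all assumptions chosen for the example and do not come from the original notes.

```python
def gradient_descent_1d(grad, theta0, alpha=0.1, num_iters=100):
    """Repeat theta := theta - alpha * dJ/dtheta for a fixed number of steps."""
    theta = theta0
    for _ in range(num_iters):
        theta = theta - alpha * grad(theta)
    return theta


# Hypothetical objective J(theta) = (theta - 3)^2; its derivative is 2 * (theta - 3).
def dJ(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent_1d(dJ, theta0=0.0)
print(theta_min)  # should be very close to the minimizer theta = 3
```

With this step size the distance to the minimizer shrinks by a constant factor on every iteration, so a fixed iteration count is enough for the toy problem; a real objective would normally need a convergence test instead.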
:【原文】：
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly
computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent
using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>.  How can we check if our implementation of
<math>\textstyle g</math> is correct?
:【初译】：
再假设我们已经实现了某个计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，于是我们使用<math>\textstyle \theta := \theta - \alpha g(\theta)</math>对<math>\textstyle \theta</math>更新而实现了梯度下降。那么我们如何检验<math>\textstyle g</math>的实现是否正确呢？
:【一审】：
再假设我们已经实现了某个计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，于是我们使用<math>\textstyle \theta := \theta - \alpha g(\theta)</math>对<math>\textstyle \theta</math>更新，从而实现了梯度下降。那么我们如何检验<math>\textstyle g</math>的实现是否正确呢？
:【二审】：
再假设我们已经实现了某个计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，于是我们使用<math>\textstyle \theta := \theta - \alpha g(\theta)</math>对<math>\textstyle \theta</math>进行更新，从而实现梯度下降。那么我们如何检验<math>\textstyle g</math>的实现是否正确呢？

:【原文】：
Recall the mathematical definition of the derivative as
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative
as follows:
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
:【初译】：
回忆导数的数学定义：
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
因此对于任何<math>\textstyle \theta</math>值，我们都可以在数学上用：
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
近似。
:【一审】：
回忆导数的数学定义：
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
因此对于任何<math>\textstyle \theta</math>值，我们都可以在数学上用：
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
近似。
:【二审】：
回忆导数的数学定义：
:<math>\begin{align}
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.
\end{align}</math>
因此，对于任意给定的<math>\textstyle \theta</math>值，我们都可以在数值上将该导数用：
:<math>\begin{align}
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}
\end{align}</math>
近似。
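
As a rough illustration of the two-sided approximation above, the sketch below evaluates it for <math>\textstyle J(\theta) = \sin\theta</math>, whose exact derivative <math>\textstyle \cos\theta</math> is known. The function name <code>numerical_derivative</code> and the choice of test function are assumptions made for this example.

```python
import math


def numerical_derivative(J, theta, epsilon=1e-4):
    """Two-sided (central) difference approximation of dJ/dtheta."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2.0 * epsilon)


# Hypothetical test function: J(theta) = sin(theta), so dJ/dtheta = cos(theta).
approx = numerical_derivative(math.sin, 1.0)
exact = math.cos(1.0)
print(approx, exact, abs(approx - exact))
```

The two printed derivative values should agree to many significant digits, because the error of the two-sided formula shrinks with the square of EPSILON.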
:【原文】：
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,
as that would lead to numerical roundoff errors.)
:【初译】：
实际应用中，我们常将<math>{\rm EPSILON}</math>设为在<math>\textstyle 10^{-4}</math>数量级的略小常量。虽然<math>{\rm EPSILON}</math>在很大范围里都能工作得很好，但是我们并不会将它设得太小，比如 <math>\textstyle 10^{-20}</math>，因为那将导致数值舍入误差。
:【一审】：
实际应用中，我们常将<math>{\rm EPSILON}</math>设为一个很小的常量，比如在<math>\textstyle 10^{-4}</math>数量级。虽然<math>{\rm EPSILON}</math>在很大范围里都能工作得很好，但是我们并不会将它设得太小，比如<math>\textstyle 10^{-20}</math>，因为那将导致数值舍入误差。
:【二审】：
实际应用中，我们常将<math>{\rm EPSILON}</math>设为一个很小的常量，比如在<math>\textstyle 10^{-4}</math>数量级。（<math>{\rm EPSILON}</math>可以在很大的取值范围内正常工作，但我们不会将它设得“极”小，比如<math>\textstyle 10^{-20}</math>，因为那将导致数值舍入误差。）
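
The roundoff problem can be seen directly in double-precision arithmetic. In the sketch below (the cubic objective and the helper name <code>central_difference</code> are assumptions for illustration), a step of <math>\textstyle 10^{-20}</math> is far below the spacing between adjacent floating-point numbers near <math>\textstyle \theta = 2</math>, so <math>\textstyle \theta \pm {\rm EPSILON}</math> both round back to <math>\textstyle \theta</math> and the estimate collapses to zero.

```python
def central_difference(J, theta, epsilon):
    """Two-sided difference quotient with an explicit step size."""
    return (J(theta + epsilon) - J(theta - epsilon)) / (2.0 * epsilon)


# Hypothetical objective J(theta) = theta^3; the true derivative at theta = 2 is 12.
def J(theta):
    return theta ** 3


good = central_difference(J, 2.0, 1e-4)   # close to 12
bad = central_difference(J, 2.0, 1e-20)   # 2.0 + 1e-20 == 2.0 in floating point
print(good, bad)
```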

:【原文】：
Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing
<math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness
by checking that
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>

:【初译】：
因此给定一个被认为能计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，现在我们可以通过检查
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
是否成立来验证它的正确性。
:【一审】：
因此给定一个被认为能计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，现在我们可以通过检查
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
是否成立来验证它的正确性。
:【二审】：
因此给定一个被认为能计算<math>\textstyle \frac{d}{d\theta}J(\theta)</math>的函数<math>\textstyle g(\theta)</math>，现在我们可以通过检查
:<math>\begin{align}
g(\theta) \approx
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.
\end{align}</math>
是否成立来验证它的正确性。
:【原文】：
The degree to which these two values should approximate each other
will depend on the details of <math>\textstyle J</math>.  But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>,
you'll usually find that the left- and right-hand sides of the above will agree
to at least 4 significant digits (and often many more).
:【初译】：
“≈”两者有多接近取决于<math>\textstyle J</math>。但是在假定<math>\textstyle {\rm EPSILON} = 10^{-4}</math>的情况下，你通常会发现上式“≈”两边至少有4个有效数字一致（常会更多）。
:【一审】：
上式两端的值的接近程度取决于<math>\textstyle J</math>的具体形式。但是在假定<math>\textstyle {\rm EPSILON} = 10^{-4}</math>的情况下，你通常会发现上式左右两边的数值至少有4个有效数字一致（常会更多）。
:【二审】：
上式两端值的接近程度取决于<math>\textstyle J</math>的具体形式。但是在假定<math>\textstyle {\rm EPSILON} = 10^{-4}</math>的情况下，你通常会发现上式左右两边的数值至少会精确到4位有效数字（通常会更多）。
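
One way to turn the check above into code is to compare the analytic and numerical values through a relative error. The sketch below is an assumption-laden example rather than the notes' own procedure: the objective, the two candidate derivative functions (one deliberately buggy) and the relative-error formula are all chosen here for illustration.

```python
import math

EPSILON = 1e-4


# Hypothetical objective J(theta) = exp(theta) + theta^2.
def J(theta):
    return math.exp(theta) + theta ** 2


def g_correct(theta):
    return math.exp(theta) + 2.0 * theta


def g_buggy(theta):
    # Deliberate bug for the demonstration: the factor 2 is missing.
    return math.exp(theta) + theta


def relative_error(g, theta):
    """Relative gap between g(theta) and the two-sided numerical estimate."""
    numeric = (J(theta + EPSILON) - J(theta - EPSILON)) / (2.0 * EPSILON)
    analytic = g(theta)
    return abs(analytic - numeric) / max(abs(analytic), abs(numeric), 1e-12)


err_ok = relative_error(g_correct, 1.5)
err_bad = relative_error(g_buggy, 1.5)
print(err_ok, err_bad)
```

A correct derivative gives a tiny relative error, while the buggy one is off in the first significant digit, which is the kind of gap the check is meant to expose.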

:【原文】：
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real
number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>.  In
our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling"
the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>.  We now generalize our derivative
checking procedure to the case where <math>\textstyle \theta</math> may be a vector.
:【初译】：
现在，考虑<math>\textstyle \theta \in \Re^n</math>是一个向量而非单个实数（我们有<math>\textstyle n</math>个参数要学习），并且<math>\textstyle J: \Re^n \mapsto \Re</math>。在神经网络的例子里我们使用<math>\textstyle J(W,b)</math>，可以想象把参数<math>\textstyle W,b</math>扩展到一个长向量<math>\textstyle \theta</math>。现在我们将求导检验方法推广到一般化，即<math>\textstyle \theta</math>可能是一个向量的情况。
:【一审】：
现在，考虑<math>\textstyle \theta \in \Re^n</math>是一个向量而非单个实数（因为我们有<math>\textstyle n</math>个参数要学习），并且<math>\textstyle J: \Re^n \mapsto \Re</math>。在神经网络的例子里我们使用<math>\textstyle J(W,b)</math>，可以想象把参数<math>\textstyle W,b</math>扩展到一个长向量<math>\textstyle \theta</math>。现在我们将求导检验方法推广到一般化，即<math>\textstyle \theta</math>可能是一个向量的情况。
:【二审】：
现在，考虑<math>\textstyle \theta \in \Re^n</math>是一个向量而非单个实数的情况（此时我们要学习<math>\textstyle n</math>个参数），并且<math>\textstyle J: \Re^n \mapsto \Re</math>。在神经网络的例子里我们使用的是“<math>\textstyle J(W,b)</math>”，但可以想象把参数<math>\textstyle W,b</math>“展开”成一个长向量<math>\textstyle \theta</math>。现在我们将导数检验方法推广到<math>\textstyle \theta</math>为向量的一般情况。



:【原文】：
Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes
<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math>
is outputting correct derivative values.  Let <math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>, where
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
is the <math>\textstyle i</math>-th basis vector (a
vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position
and "0"s everywhere else).  So,
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented
by <math>{\rm EPSILON}</math>.  Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the
corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>.
We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>,
that:
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
:【初译】：
假设我们有一个被认为计算了<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>的函数<math>\textstyle g_i(\theta)</math>；我们想要检验<math>\textstyle g_i</math>是否输出正确的求导结果。定义<math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>，其中
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
是第<math>\textstyle i</math>个基向量（和<math>\textstyle \theta</math>大小相同，在第<math>\textstyle i</math>行是“1”而其他位是“0”）。所以，<math>\textstyle \theta^{(i+)}</math>和<math>\textstyle \theta</math>相同，除非第<math>\textstyle i</math>位元素由<math>{\rm EPSILON}</math>增加。类似地，使被<math>{\rm EPSILON}</math>减小了第<math>\textstyle i</math>位的相应向量是<math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math>。现在我们可以通过对于每个<math>\textstyle i</math>检查
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
是否成立来数学上验证<math>\textstyle g_i(\theta)</math>的正确性。
:【一审】：
假设我们有一个用于计算<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>的函数<math>\textstyle g_i(\theta)</math>；我们想要检验<math>\textstyle g_i</math>是否输出正确的求导结果。定义<math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>，其中
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
是第<math>\textstyle i</math>个基向量（维度和<math>\textstyle \theta</math>相同，在第<math>\textstyle i</math>行是“1”而其他行是“0”）。所以，除非第<math>\textstyle i</math>行元素增加了<math>{\rm EPSILON}</math>，否则<math>\textstyle \theta^{(i+)}</math>和<math>\textstyle \theta</math>相同。类似地，第<math>\textstyle i</math>行减小了<math>{\rm EPSILON}</math>的相应向量是<math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math>。现在我们可以通过对于每个<math>\textstyle i</math>检查下式的数学计算结果是否成立来验证<math>\textstyle g_i(\theta)</math>的正确性：
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
:【二审】：
假设我们有一个用于计算<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>的函数<math>\textstyle g_i(\theta)</math>；我们想要检验<math>\textstyle g_i</math>是否输出正确的求导结果。定义<math>\textstyle \theta^{(i+)} = \theta +
{\rm EPSILON} \times \vec{e}_i</math>，其中
:<math>\begin{align}
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}
\end{align}</math>
是第<math>\textstyle i</math>个基向量（维度和<math>\textstyle \theta</math>相同，在第<math>\textstyle i</math>行是“1”而其他行是“0”）。所以，<math>\textstyle \theta^{(i+)}</math>和<math>\textstyle \theta</math>几乎相同，只是第<math>\textstyle i</math>行元素增加了<math>{\rm EPSILON}</math>。类似地，令<math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math>为第<math>\textstyle i</math>行元素减小了<math>{\rm EPSILON}</math>的相应向量。现在，我们可以对每个<math>\textstyle i</math>用数值方法检查下式是否成立，从而验证<math>\textstyle g_i(\theta)</math>的正确性：
:<math>\begin{align}
g_i(\theta) \approx
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
\end{align}</math>
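
The per-coordinate check can be sketched as a loop over <math>\textstyle i</math> that perturbs one element of <math>\textstyle \theta</math> at a time. The three-parameter objective, its hand-derived gradient <code>g</code> and the helper name <code>numerical_gradient</code> below are assumptions introduced for this example only.

```python
import math


def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate each partial derivative of J with a two-sided difference."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)    # theta^{(i+)}: i-th element increased
        theta_minus = list(theta)   # theta^{(i-)}: i-th element decreased
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad.append((J(theta_plus) - J(theta_minus)) / (2.0 * epsilon))
    return grad


# Hypothetical objective: J = theta_0^2 + 3*theta_0*theta_1 + sin(theta_2).
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1] + math.sin(theta[2])


# Hand-derived gradient that we want to verify.
def g(theta):
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0], math.cos(theta[2])]


theta = [1.0, -2.0, 0.5]
numeric = numerical_gradient(J, theta)
analytic = g(theta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

Note that the loop evaluates <math>\textstyle J</math> twice per parameter, so this kind of check is usually run on a small model and then switched off during actual training.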

:【原文】：
When implementing backpropagation to train a neural network, in a correct implementation
we will have that
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
:【初译】：
在一个正确通过反向传播算法训练神经网络的实现中，我们将得到：
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
:【一审】：
在一个正确通过反向传播算法训练神经网络的实现中，我们将得到：
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>
:【二审】：
当用反向传播算法训练神经网络时，正确的实现应当满足：
:<math>\begin{align}
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.
\end{align}</math>

:【原文】：
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed
implementing gradient descent.
To make sure your implementation of gradient descent is correct, it is
usually very helpful to use the method described above to
numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that
your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are
indeed giving the derivatives you want.
:【初译】：
这个结果说明反向传播算法的伪代码最后一块的确实现了梯度下降。为了检验你的梯度下降实现的正确性，通过使用上述方法会带来很大帮助，即计算<math>\textstyle J(W,b)</math>的导数从而验证你对<math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W</math>和<math>\textstyle \frac{1}{m}\Delta b^{(l)}</math>的计算确实给出了你要的求导结果。
:【一审】：
这个结果说明反向传播算法的伪代码实际上是在最后的代码段实现了梯度下降。为了检验你的梯度下降实现的正确性，通过使用上述方法会带来很大帮助，即计算<math>\textstyle J(W,b)</math>的近似导数从而验证你对<math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W</math>和<math>\textstyle \frac{1}{m}\Delta b^{(l)}</math>的计算确实给出了你要的求导结果。
:【二审】：
以上结果表明，在反向传播算法一课中，最后一段伪代码的确执行了梯度下降。为验证梯度下降代码的正确性，使用上述方法对<math>\textstyle J(W,b)</math>的导数进行数值计算通常非常有用，由此可以确认你计算出的<math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math>与<math>\textstyle \frac{1}{m}\Delta b^{(l)}</math>确实是你想要的导数。
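
To show how the check applies to gradients of the form <math>\textstyle \frac{1}{m}\Delta W + \lambda W</math> and <math>\textstyle \frac{1}{m}\Delta b</math>, the sketch below uses a deliberately tiny stand-in model: a single layer with linear output units, a squared-error cost and weight decay. This is an assumption made to keep the example short, not the sigmoid network of the notes, and the data, parameter values and function names are hypothetical. The parameters <math>\textstyle W, b</math> are unrolled into one list <math>\textstyle \theta</math> before checking.

```python
def unroll(W, b):
    """Flatten the weight matrix (row by row) and the bias into one list."""
    return [w for row in W for w in row] + list(b)


def reshape(theta, n_out, n_in):
    W = [theta[r * n_in:(r + 1) * n_in] for r in range(n_out)]
    return W, theta[n_out * n_in:]


def cost_and_grad(theta, X, Y, lam, n_out, n_in):
    """Squared-error cost with weight decay for a single linear layer."""
    W, b = reshape(theta, n_out, n_in)
    m = len(X)
    cost = 0.0
    dW = [[0.0] * n_in for _ in range(n_out)]   # accumulates Delta W
    db = [0.0] * n_out                          # accumulates Delta b
    for x, y in zip(X, Y):
        for r in range(n_out):
            delta = sum(W[r][c] * x[c] for c in range(n_in)) + b[r] - y[r]
            cost += 0.5 * delta ** 2
            db[r] += delta
            for c in range(n_in):
                dW[r][c] += delta * x[c]
    cost = cost / m + 0.5 * lam * sum(w * w for row in W for w in row)
    grad_W = [[dW[r][c] / m + lam * W[r][c] for c in range(n_in)]
              for r in range(n_out)]
    grad_b = [d / m for d in db]
    return cost, unroll(grad_W, grad_b)


# Hypothetical data and parameters: 3 inputs, 2 outputs, 2 training examples.
X = [[1.0, 2.0, -1.0], [0.5, -0.3, 0.8]]
Y = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.1, -0.2, 0.3, 0.4, 0.0, -0.1, 0.05, -0.05]
lam, eps = 0.01, 1e-4

_, analytic = cost_and_grad(theta, X, Y, lam, 2, 3)
numeric = []
for i in range(len(theta)):
    plus, minus = list(theta), list(theta)
    plus[i] += eps
    minus[i] -= eps
    j_plus = cost_and_grad(plus, X, Y, lam, 2, 3)[0]
    j_minus = cost_and_grad(minus, X, Y, lam, 2, 3)[0]
    numeric.append((j_plus - j_minus) / (2 * eps))
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

Dropping the <code>lam * W[r][c]</code> term or the division by <code>m</code> in the gradient makes <code>max_diff</code> jump by several orders of magnitude, which is how the check flags such mistakes.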

:【原文】：
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>.  If you have
implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more
sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>.  For example, one can envision
an algorithm that uses gradient descent, but automatically tunes the learning rate <math>\textstyle \alpha</math> so as to try to
use a step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible.
There are other algorithms that are even more
sophisticated than this; for example, there are algorithms that try to find an approximation to the
Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method).  A full discussion of these
algorithms is beyond the scope of these notes, but one example is
the '''L-BFGS''' algorithm.  (Another example is the '''conjugate gradient''' algorithm.)  You will use one of
these algorithms in the programming exercise.
The main thing these advanced optimization algorithms require is that, for any <math>\textstyle \theta</math>, you are able
to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>.  These optimization algorithms will then do their own
internal tuning of the learning rate/step-size <math>\textstyle \alpha</math> (and compute their own approximations to the Hessian, etc.)
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>.  Algorithms
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.
:【初译】：
最后，迄今为止我们的讨论都集中在使用梯度下降来最小化<math>\textstyle J(\theta)</math>。如果你已经实现了一个函数来计算<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>，会发现其实有更复杂的算法来尝试最小化<math>\textstyle J(\theta)</math>。举例来说，可以想象这样的一个算法：使用梯度下降，但使之自动调整学习率<math>\textstyle \alpha</math>以致尝试一步步地让<math>\textstyle \theta</math>尽快到达一个局部最优。还有其他算法比这更复杂；比如寻找一个Hessian矩阵的近似，以便它能以更快的步伐到达一个局部最优（和牛顿方法类似）。此类算法的详细讨论超出了这份讲义的范围，但是一个例子是L-BFGS算法（另一个例子是共轭梯度算法）。你将在编程练习里使用这些算法中的一个。对于任意一个<math>\textstyle \theta</math>，你需要提供给这些高级优化算法的东西主要是<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>。然后这些优化算法会内部调整学习率/步伐大小<math>\textstyle \alpha</math>（和对Hessian的近似等等）来自动寻找一个最小化<math>\textstyle J(\theta)</math>的<math>\textstyle \theta</math>值。诸如L-BFGS和共轭梯度的算法通常比梯度下降更快。
:【一审】：
最后，迄今为止我们的讨论都集中在使用梯度下降来最小化<math>\textstyle J(\theta)</math>。如果你已经实现了一个函数来计算<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>，会发现其实有更复杂的算法来尝试最小化<math>\textstyle J(\theta)</math>。举例来说，可以想象这样的一个算法：它使用梯度下降，但可以自动调整学习率<math>\textstyle \alpha</math>，以便尝试使用新的步长值，使<math>\textstyle \theta</math>尽快到达一个局部最优。还有其他算法比这更复杂；比如寻找一个Hessian矩阵的近似，以便它能以更快的步伐到达一个局部最优（和牛顿方法类似）。此类算法的详细讨论超出了这份讲义的范围，但是一个例子是L-BFGS算法（另一个例子是共轭梯度算法）。你将在编程练习里使用这些算法中的一个。对于任意一个<math>\textstyle \theta</math>，你需要提供给这些高级优化算法的东西主要是<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>。然后这些优化算法会内部调整学习率/步伐大小<math>\textstyle \alpha</math>（来计算它自己的近似Hessian矩阵等等）来自动寻找一个最小化<math>\textstyle J(\theta)</math>的<math>\textstyle \theta</math>值。诸如L-BFGS和共轭梯度的算法通常比梯度下降快很多。
:【二审】：
最后，迄今为止我们的讨论都集中在使用梯度下降来最小化<math>\textstyle J(\theta)</math>。如果你已经实现了一个计算<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>的函数，那么其实还有比梯度下降更精妙的算法可以用来最小化<math>\textstyle J(\theta)</math>。举例来说，可以想象这样的一个算法：它使用梯度下降，但能够自动调整学习率<math>\textstyle \alpha</math>，以选取合适的步长，使<math>\textstyle \theta</math>尽快逼近一个局部最优解。还有一些算法比这更精妙，比如寻找Hessian矩阵的一个近似，以便能以更快的步伐逼近局部最优解（和牛顿法类似）。此类算法的详细讨论超出了这份讲义的范围，其中一个例子是'''L-BFGS'''算法（另一个例子是'''共轭梯度'''算法）。你将在编程练习里使用这些算法中的一个。使用这些高级优化算法时，你最需要提供的是：对于任意一个<math>\textstyle \theta</math>，都能计算出<math>\textstyle J(\theta)</math>和<math>\textstyle \nabla_\theta J(\theta)</math>。之后，这些优化算法会在内部自行调整学习率/步长<math>\textstyle \alpha</math>的大小（并计算各自对Hessian矩阵的近似等），自动寻找一个最小化<math>\textstyle J(\theta)</math>的<math>\textstyle \theta</math>值。诸如L-BFGS和共轭梯度的算法通常比梯度下降快很多。
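
The idea of an optimizer that only needs <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math> and tunes its own step size can be illustrated with gradient descent plus a backtracking line search. To be clear, this is not L-BFGS or conjugate gradient; it is a much simpler technique swapped in to keep the sketch short. The function name <code>minimize_backtracking</code>, its default constants and the quadratic test objective are all assumptions.

```python
def minimize_backtracking(cost_and_grad, theta0, alpha0=1.0, shrink=0.5,
                          c=0.3, tol=1e-8, max_iters=1000):
    """Gradient descent whose step size is chosen by backtracking line search.

    cost_and_grad(theta) must return the pair (J(theta), gradient as a list).
    """
    theta = list(theta0)
    for _ in range(max_iters):
        cost, grad = cost_and_grad(theta)
        grad_sq = sum(g * g for g in grad)
        if grad_sq < tol ** 2:          # gradient is essentially zero: stop
            break
        alpha = alpha0
        while alpha > 1e-12:
            candidate = [t - alpha * g for t, g in zip(theta, grad)]
            # Sufficient-decrease test: accept once J has dropped enough.
            if cost_and_grad(candidate)[0] <= cost - c * alpha * grad_sq:
                break
            alpha *= shrink             # otherwise try a smaller step
        theta = candidate
    return theta


# Hypothetical objective with minimizer (1, -2): J = (x - 1)^2 + 10 * (y + 2)^2.
def quadratic(theta):
    x, y = theta
    cost = (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2
    grad = [2.0 * (x - 1.0), 20.0 * (y + 2.0)]
    return cost, grad


theta_star = minimize_backtracking(quadratic, [0.0, 0.0])
print(theta_star)
```

The caller never chooses a learning rate here; it supplies only the cost and gradient, which is the same interface the more advanced optimizers mentioned above expect.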



{{Sparse_Autoencoder}}