# Backpropagation Algorithm (反向传导算法)


## Revision as of 16:29, 7 March 2013

【原文】：

Suppose we have a fixed training set $\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$ of m training examples. We can train our neural network using batch gradient descent. In detail, for a single training example (x,y), we define the cost function with respect to that single example to be:

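【初译】：

假设我们有一个固定的训练集$\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$，它包含$m$个训练样本。我们可以用批量梯度下降法训练我们的神经网络。具体来说，针对单个训练样本$(x,y)$，我们定义关于它的代价函数为：
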
【一校】：
\begin{align} J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2. \end{align}
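As a minimal sketch of the formula above, the per-example cost can be computed directly from a prediction vector and a target vector. The function name `single_example_cost` and the use of plain Python lists are assumptions made for illustration only.

```python
def single_example_cost(h, y):
    """Return 0.5 * ||h - y||^2 for a prediction h and a target y (equal-length lists)."""
    if len(h) != len(y):
        raise ValueError("prediction and target must have the same length")
    return 0.5 * sum((hk - yk) ** 2 for hk, yk in zip(h, y))


# A two-output network that predicted [0.8, 0.1] for the target [1.0, 0.0].
print(single_example_cost([0.8, 0.1], [1.0, 0.0]))
```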
【原文】：

This is a (one-half) squared-error cost function. Given a training set of m examples, we then define the overall cost function to be:

【初译】：

这是一个（二分之一的）平方误差代价函数。给定一个包含$m$个训练样本的训练集，我们可以定义整体代价函数为：

【一校】：
\begin{align} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \\ &= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \end{align}
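The overall cost can be sketched the same way: average the per-example costs, then add the weight decay term summed over every entry of every weight matrix. The function name `overall_cost` and the nested-list layout for the weights are assumptions for this example; the bias terms are deliberately left out of the decay term, as in the definition above.

```python
def overall_cost(predictions, targets, weights, lam):
    """J(W,b): mean half squared error plus (lam / 2) * sum of squared weights.

    predictions, targets: lists of m output vectors (lists of floats).
    weights: list of weight matrices W^(l), each a list of rows; biases are excluded.
    """
    m = len(predictions)
    data_term = sum(
        0.5 * sum((h - y) ** 2 for h, y in zip(h_vec, y_vec))
        for h_vec, y_vec in zip(predictions, targets)
    ) / m
    decay_term = 0.5 * lam * sum(w ** 2 for W in weights for row in W for w in row)
    return data_term + decay_term
```

Setting `lam` to 0 recovers the plain average sum-of-squares error, which makes it easy to see how much of the cost comes from the weights alone.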
【原文】：

The first term in the definition of J(W,b) is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.

【初译】：

$J(W,b)$定义中的第一项是一个均方差项。第二项是一个规则化项（也叫权重衰减项），其目的是减小权重的幅度，防止过度拟合。

【一校】：
【原文】：

[Note: Usually weight decay is not applied to the bias terms $b^{(l)}_i$, as reflected in our definition for J(W,b). Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]

【初译】：

[注：通常权重衰减不作用于偏置项$b^{(l)}_i$，正如我们在$J(W,b)$的定义中所体现的那样。不过，将权重衰减也作用于偏置单元通常只会对最终的神经网络产生很小的影响。如果你在斯坦福选修过CS229（机器学习）课程，或者在YouTube上看过课程视频，你会发现这个权重衰减实际上是课上提到的贝叶斯规则化方法的一个变种：在该方法中，我们对参数引入高斯先验，并进行MAP（最大后验）估计，而不是极大似然估计。]

【一校】：
【原文】：

The weight decay parameter λ controls the relative importance of the two terms. Note also the slightly overloaded notation: J(W,b;x,y) is the squared error cost with respect to a single example; J(W,b) is the overall cost function, which includes the weight decay term.

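【初译】：

权重衰减参数$\lambda$用于控制公式中两项的相对重要性。另请注意这里的记号略有重载：$J(W,b;x,y)$是针对单个样本计算的平方误差代价；$J(W,b)$是整体代价函数，它包含权重衰减项。
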
【一校】：
【原文】：

The cost function above is often used both for classification and for regression problems. For classification, we let $y = 0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0,1]$; if we were using a tanh activation function, we would instead use $-1$ and $+1$ to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0,1]$ range (or if we were using a tanh activation function, then the $[-1,1]$ range).
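The text does not say how regression targets should be scaled. One common choice, shown here purely as an assumption, is min-max scaling; the helper names `scale_to_unit_interval` and `unscale` are hypothetical.

```python
def scale_to_unit_interval(values):
    """Min-max scale a list of floats into [0, 1]; returns (scaled, lo, hi)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant targets carry no range information; use the midpoint.
        return [0.5 for _ in values], lo, hi
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def unscale(scaled, lo, hi):
    """Map network outputs in [0, 1] back to the original target range."""
    return [lo + s * (hi - lo) for s in scaled]


scaled, lo, hi = scale_to_unit_interval([10.0, 20.0, 30.0])
print(scaled, unscale(scaled, lo, hi))
```

For a tanh output layer, the scaled value `s` could be mapped to `2 * s - 1` so that it lies in $[-1,1]$ instead.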

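【初译】：

以上的代价函数经常被用于分类和回归问题。在分类问题中，我们用$y = 0$或$1$来代表两种类型的标签（回忆一下，这是因为S型（sigmoid）激励函数的值域为$[0,1]$；如果我们使用双曲正切型激励函数，则应该选用$-1$和$+1$作为标签）。对于回归问题，我们首先要对输出值进行缩放，以保证其范围为$[0,1]$（同样地，如果我们使用双曲正切型激励函数，则要使其范围为$[-1,1]$）。
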
【一校】：
【原文】：

Our goal is to minimize $J(W,b)$ as a function of $W$ and $b$. To train our neural network, we will initialize each parameter $W^{(l)}_{ij}$ and each $b^{(l)}_i$ to a small random value near zero (say according to a $\mathrm{Normal}(0,\epsilon^2)$ distribution for some small $\epsilon$, say $0.01$), and then apply an optimization algorithm such as batch gradient descent. Since $J(W,b)$ is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, $W^{(1)}_{ij}$ will be the same for all values of $i$, so that $a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots$ for any input $x$). The random initialization serves the purpose of symmetry breaking.
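The initialization described above can be sketched with the standard library's Gaussian generator. The function name `init_params`, the `layer_sizes` argument, and the fixed `seed` are assumptions added for reproducibility of the example; only the $\mathrm{Normal}(0,\epsilon^2)$ draw with $\epsilon = 0.01$ comes from the text.

```python
import random


def init_params(layer_sizes, epsilon=0.01, seed=0):
    """Draw every W^(l)_ij and b^(l)_i from Normal(0, epsilon^2).

    layer_sizes: e.g. [3, 4, 2] for 3 inputs, 4 hidden units and 2 outputs.
    W[l] has s_{l+1} rows and s_l columns, so W[l][i][j] connects unit j to unit i.
    """
    rng = random.Random(seed)
    W, b = [], []
    for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W.append([[rng.gauss(0.0, epsilon) for _ in range(s_in)] for _ in range(s_out)])
        b.append([rng.gauss(0.0, epsilon) for _ in range(s_out)])
    return W, b


W, b = init_params([3, 4, 2])
# Rows of W[0] differ, so the hidden units do not start out computing the same function.
print(W[0][0] != W[0][1])
```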

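【初译】：

我们的目标是求得函数$J(W,b)$关于$W$和$b$的最小值。为了训练我们的神经网络，我们需要将每一个参数$W^{(l)}_{ij}$和$b^{(l)}_i$初始化为一个很小的、接近零的随机值（比如说，使用正态分布$\mathrm{Normal}(0,\epsilon^2)$生成随机值，其中$\epsilon$设置为$0.01$），之后对目标函数使用诸如批量梯度下降法的最优化算法。因为$J(W,b)$是一个非凸函数，梯度下降法容易陷入局部最优解；但是，在实际应用中，梯度下降法通常都会工作得很好。最后需要注意的是，要将参数进行随机初始化，而不是全部设置为0。如果所有参数都用相同的值作为初始值，那么所有隐藏层单元最终会学到关于输入的相同函数（更正式地说，$W^{(1)}_{ij}$对于所有$i$都会取相同的值，于是对于任何输入$x$都会有：$a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots$）。随机初始化的目的是实现对称破缺（symmetry breaking）。
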
【一校】：
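【原文】：

One iteration of gradient descent updates the parameters $W,b$ as follows:

【初译】：

每一次梯度下降法迭代都会按如下方式对参数$W$和$b$进行更新：
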
【一校】：
\begin{align} W_{ij}^{(l)} &= W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \\ b_{i}^{(l)} &= b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) \end{align}
【原文】：

where α is the learning rate. The key step is computing the partial derivatives above. We will now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.

【初译】：

其中$\alpha$是学习速率。关键步骤是计算上面的偏导数。我们现在来描述反向传播算法，它是计算这些偏导数的一种高效方法。

【一校】：
【原文】：

We will first describe how backpropagation can be used to compute $\textstyle \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y)$ and $\textstyle \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y)$, the partial derivatives of the cost function J(W,b;x,y) defined with respect to a single example (x,y). Once we can compute these, we see that the derivative of the overall cost function J(W,b) can be computed as:

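【初译】：

我们首先来描述如何使用反向传播算法来计算$\textstyle \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y)$和$\textstyle \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y)$，这两项是针对单个样本$(x,y)$的代价函数$J(W,b;x,y)$的偏导数。一旦我们求得了这些偏导数，就可以按下式计算整体代价函数$J(W,b)$的偏导数：
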
【一校】：
\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)} \\ \frac{\partial}{\partial b_{i}^{(l)}} J(W,b) &= \frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \end{align}
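The extra $\lambda W_{ij}^{(l)}$ term in the first line can be checked by differentiating the weight decay term on its own. In the sketch below the summation indices are renamed to $l'$, $p$, $q$ (an assumption made only to avoid clashing with the fixed indices $l$, $i$, $j$); exactly one term of the triple sum depends on $W_{ij}^{(l)}$:

\begin{align}
\frac{\partial}{\partial W_{ij}^{(l)}} \left[ \frac{\lambda}{2} \sum_{l'=1}^{n_l-1} \; \sum_{p=1}^{s_{l'}} \; \sum_{q=1}^{s_{l'+1}} \left( W^{(l')}_{qp} \right)^2 \right]
= \frac{\lambda}{2} \cdot 2 \, W_{ij}^{(l)}
= \lambda W_{ij}^{(l)}
\end{align}

The decay term contains no $b_i^{(l)}$, so its derivative with respect to any bias is zero and the second line has no corresponding term.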
【原文】：

The two lines above differ slightly because weight decay is applied to W but not b.

【初译】：

以上两行公式稍有不同，第一行比第二行多出一项，是因为权重衰减作用于$W$而不作用于$b$。

【一校】：
【原文】：

The intuition behind the backpropagation algorithm is as follows. Given a training example $(x,y)$, we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_{W,b}(x)$. Then, for each node $i$ in layer $l$, we would like to compute an "error term" $\delta^{(l)}_i$ that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). How about hidden units? For those, we will compute $\delta^{(l)}_i$ based on a weighted average of the error terms of the nodes that use $a^{(l)}_i$ as an input. In detail, here is the backpropagation algorithm:

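【初译】：

反向传播算法的思路如下。给定一个训练样本$(x,y)$，我们首先进行“前向传导”运算，计算出网络中所有的激励值，包括假设函数的输出值$h_{W,b}(x)$。之后，针对第$l$层的每一个节点$i$，我们计算出其“残差项”$\delta^{(l)}_i$，该残差项表明了该节点对最终输出值的误差产生了多少影响。对于输出节点，我们可以直接算出网络产生的激励值与真实目标值之间的差距，并用它来定义$\delta^{(n_l)}_i$（第$n_l$层表示输出层）。对于隐藏单元我们如何处理呢？我们将基于以$a^{(l)}_i$作为输入的各节点的残差项的加权平均值来计算$\delta^{(l)}_i$。下面给出反向传播算法的细节：
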
【一校】：
【原文】：
1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, and so on up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $n_l$ (the output layer), set
\begin{align} \delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \;\; \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i) \end{align}
3. For $l = n_l-1, n_l-2, n_l-3, \ldots, 2$
For each node i in layer l, set
$\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)$
4. Compute the desired partial derivatives, which are given as:
\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\ \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}. \end{align}
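The four steps above can be sketched element by element in plain Python. This is an illustration under stated assumptions, not the tutorial's own code: the activation is taken to be the sigmoid, the forward pass is assumed to compute $z^{(l+1)}_i = \sum_j W^{(l)}_{ij} a^{(l)}_j + b^{(l)}_i$, and the name `backprop_single` and the nested-list layout are hypothetical. Python index `l = 0` corresponds to $W^{(1)}$ in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_single(W, b, x, y):
    """Gradients of J(W,b;x,y) for one example, following steps 1 to 4.

    W[l][i][j] is the weight from unit j in one layer to unit i in the next.
    """
    # Step 1: forward pass, storing the activations of every layer.
    a = [x]
    for Wl, bl in zip(W, b):
        z = [sum(w * aj for w, aj in zip(row, a[-1])) + bi for row, bi in zip(Wl, bl)]
        a.append([sigmoid(zi) for zi in z])
    # Step 2: output-layer error, using f'(z) = a * (1 - a) for the sigmoid.
    delta = [-(yi - ai) * ai * (1.0 - ai) for yi, ai in zip(y, a[-1])]
    grad_W, grad_b = [None] * len(W), [None] * len(W)
    for l in range(len(W) - 1, -1, -1):
        # Step 4: dJ/dW_ij = a_j * delta_i and dJ/db_i = delta_i,
        # where delta belongs to the layer that W[l] feeds into.
        grad_W[l] = [[di * aj for aj in a[l]] for di in delta]
        grad_b[l] = list(delta)
        if l > 0:
            # Step 3: delta_i = (sum_j W_ji * delta_j) * f'(z_i) for the layer below.
            delta = [
                sum(W[l][j][i] * delta[j] for j in range(len(delta)))
                * a[l][i] * (1.0 - a[l][i])
                for i in range(len(a[l]))
            ]
    return grad_W, grad_b


W = [[[0.1, -0.2], [0.3, 0.4]], [[0.5, -0.6]]]
b = [[0.01, -0.02], [0.03]]
grad_W, grad_b = backprop_single(W, b, [1.0, 0.5], [1.0])
print(grad_b[-1])
```

Steps 3 and 4 are interleaved in one loop here because each $\delta^{(l+1)}$ is only needed for the gradients of layer $l$ and for computing $\delta^{(l)}$.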
【初译】：
1. 进行前馈传导计算，得到$L_2$、$L_3$…直到输出层$L_{n_l}$的激励值。
2. 针对第$n_l$层（输出层）的每个输出单元$i$，我们根据以下公式计算残差项：
\begin{align} \delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \;\; \frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i) \end{align}
3. 对$l = n_l-1, n_l-2, n_l-3, \ldots, 2$的各个层，第$l$层的第$i$个节点的残差项计算方法如下：
$\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)$
[译者注：由于原作者简化了推导过程，使我本人看着十分费解，于是就自己推导了一遍，将过程写在这里：（公式） 根据递推过程，将$n_l-1$与$n_l$的关系替换为$l$与$l+1$的关系，可以得到原作者的结果：（公式） 我认为以上的逐步向前递推求导的过程就是“反向传播”算法的本意所在，推导结束，欢迎指正。]
4. 计算我们需要的偏导数，计算方法如下：
\begin{align} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\ \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}. \end{align}
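The recursion in step 3 can be motivated with the chain rule. The following is an independent sketch rather than the translator's own derivation, and it rests on two assumptions that this section does not state explicitly: that $\delta^{(l)}_i$ denotes $\frac{\partial}{\partial z^{(l)}_i} J(W,b;x,y)$ for every layer (as step 2 does for the output layer), and that the forward pass computes $z^{(l+1)}_j = \sum_{k} W^{(l)}_{jk} f(z^{(l)}_k) + b^{(l)}_j$.

\begin{align}
\delta^{(l)}_i
&= \frac{\partial}{\partial z^{(l)}_i} J(W,b;x,y)
 = \sum_{j=1}^{s_{l+1}} \frac{\partial J(W,b;x,y)}{\partial z^{(l+1)}_j} \cdot \frac{\partial z^{(l+1)}_j}{\partial z^{(l)}_i} \\
&= \sum_{j=1}^{s_{l+1}} \delta^{(l+1)}_j \cdot W^{(l)}_{ji} f'(z^{(l)}_i)
 = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)
\end{align}

Under the same assumptions, step 4 follows from $\frac{\partial z^{(l+1)}_i}{\partial W^{(l)}_{ij}} = a^{(l)}_j$ and $\frac{\partial z^{(l+1)}_i}{\partial b^{(l)}_i} = 1$.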
【一校】：
【原文】：
Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "$\textstyle \bullet$" to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if $\textstyle a = b \bullet c$, then $\textstyle a_i = b_ic_i$. Similar to how we extended the definition of $\textstyle f(\cdot)$ to apply element-wise to vectors, we also do the same for $\textstyle f'(\cdot)$ (so that $\textstyle f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)]$).
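A small sketch of the two element-wise conventions just described, written with plain Python lists. The helper names `hadamard` and `apply_elementwise` are assumptions for illustration.

```python
def hadamard(b, c):
    """Element-wise (Hadamard) product: a_i = b_i * c_i."""
    if len(b) != len(c):
        raise ValueError("vectors must have the same length")
    return [bi * ci for bi, ci in zip(b, c)]


def apply_elementwise(f, z):
    """Extend a scalar function f to vectors: f([z_1, z_2, ...]) = [f(z_1), f(z_2), ...]."""
    return [f(zi) for zi in z]


print(hadamard([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(apply_elementwise(lambda t: t * t, [1, 2, 3]))
```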
【初译】：
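最后，我们用矩阵-向量表示法重写以上算法。我们使用“$\textstyle \bullet$”表示逐个元素相乘运算（在Matlab或Octave里用“.*”表示，也称作阿达马乘积），例如若$\textstyle a = b \bullet c$，则对每个元素有$\textstyle a_i = b_ic_i$。与我们之前将$\textstyle f(\cdot)$的定义扩展为对向量逐元素运算类似，我们对其导数$\textstyle f'(\cdot)$也做相同的处理（于是有$\textstyle f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)]$）。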
【一校】：
【原文】：
The algorithm can then be written:
1. Perform a feedforward pass, computing the activations for layers $\textstyle L_2$, $\textstyle L_3$, up to the output layer $\textstyle L_{n_l}$, using the equations defining the forward propagation steps
2. For the output layer (layer $\textstyle n_l$), set
\begin{align} \delta^{(n_l)} = - (y - a^{(n_l)}) \bullet f'(z^{(n_l)}) \end{align}
3. For $\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2$
Set
\begin{align} \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)}) \end{align}
4. Compute the desired partial derivatives:
\begin{align} \nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\ \nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}. \end{align}
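The matrix-vector form maps onto a handful of small helpers. The sketch below stays in plain Python so that it runs without extra packages; with a numerical library each helper would typically be a single call. The sigmoid activation, the forward-pass formula $z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$, and all function names are assumptions made for this example.

```python
import math


def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def matT_vec(M, v):
    """Compute M^T v without forming the transpose."""
    return [sum(M[j][i] * v[j] for j in range(len(M))) for i in range(len(M[0]))]


def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def hadamard(u, v):
    return [ui * vi for ui, vi in zip(u, v)]


def sigmoid_prime_from_activation(act):
    """f'(z) for the sigmoid, written in terms of a = f(z)."""
    return [ai * (1.0 - ai) for ai in act]


def backprop_vectorized(W, b, x, y):
    # Step 1: forward pass, keeping every activation vector.
    a = [x]
    for Wl, bl in zip(W, b):
        z = [zi + bi for zi, bi in zip(matvec(Wl, a[-1]), bl)]
        a.append([1.0 / (1.0 + math.exp(-zi)) for zi in z])
    # Step 2: delta for the output layer.
    delta = hadamard([-(yi - ai) for yi, ai in zip(y, a[-1])],
                     sigmoid_prime_from_activation(a[-1]))
    grad_W, grad_b = [], []
    for l in range(len(W) - 1, -1, -1):
        # Step 4: gradient of W is delta times the transposed input activation.
        grad_W.insert(0, outer(delta, a[l]))
        grad_b.insert(0, delta)
        if l > 0:
            # Step 3: propagate delta back through W[l].
            delta = hadamard(matT_vec(W[l], delta), sigmoid_prime_from_activation(a[l]))
    return grad_W, grad_b
```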
【初译】：

反向传播算法的向量表示法为：

1. 进行前馈传导计算，利用前向传播的定义公式，得到$\textstyle L_2$、$\textstyle L_3$…直到输出层$\textstyle L_{n_l}$的激励值。
2. 对输出层（第$\textstyle n_l$层），计算：
\begin{align} \delta^{(n_l)} = - (y - a^{(n_l)}) \bullet f'(z^{(n_l)}) \end{align}
3. 对$\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2$的各层，计算：
\begin{align} \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)}) \end{align}
4. 计算最终需要的偏导数值：
\begin{align} \nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\ \nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}. \end{align}
【一校】：

【原文】：

Implementation note: In steps 2 and 3 above, we need to compute $\textstyle f'(z^{(l)}_i)$ for each value of $\textstyle i$. Assuming $\textstyle f(z)$ is the sigmoid activation function, we would already have $\textstyle a^{(l)}_i$ stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for $\textstyle f'(z)$, we can compute this as $\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)$.
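The shortcut in this note can be checked numerically. The sketch below compares $a(1-a)$, computed from a stored activation, against a central finite difference of the sigmoid; the finite-difference comparison and the function names are assumptions added for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_activation(a):
    """f'(z) expressed through the stored activation a = f(z)."""
    return a * (1.0 - a)


z = 0.3
a = sigmoid(z)  # this value would already be stored from the forward pass
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(numeric, sigmoid_prime_from_activation(a))
```

Reusing the stored activation avoids a second call to the exponential for every unit during the backward pass.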

【初译】：

实现中应注意：在以上的第2步和第3步中，我们需要为每一个$\textstyle i$值计算$\textstyle f'(z^{(l)}_i)$。假设$\textstyle f(z)$是S型激励函数，并且我们已经在神经网络的前向传播运算中得到了$\textstyle a^{(l)}_i$，并将其存储了起来。于是，使用我们早先推导出的S型函数的导数$\textstyle f'(z)$的表达式，我们可以计算得到$\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)$。

【一校】：
【原文】：

Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, $\textstyle \Delta W^{(l)}$ is a matrix (of the same dimension as $\textstyle W^{(l)}$), and $\textstyle \Delta b^{(l)}$ is a vector (of the same dimension as $\textstyle b^{(l)}$). Note that in this notation, "$\textstyle \Delta W^{(l)}$" is a matrix, and in particular it isn't "$\textstyle \Delta$ times $\textstyle W^{(l)}$." We implement one iteration of batch gradient descent as follows:

【初译】：
【一校】：
【原文】：
1. Set $\textstyle \Delta W^{(l)} := 0$, $\textstyle \Delta b^{(l)} := 0$ (matrix/vector of zeros) for all $\textstyle l$.
2. For $\textstyle i = 1$ to $\textstyle m$,
1. Use backpropagation to compute $\textstyle \nabla_{W^{(l)}} J(W,b;x,y)$ and $\textstyle \nabla_{b^{(l)}} J(W,b;x,y)$.
2. Set $\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)$.
3. Set $\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)$.
3. Update the parameters:
\begin{align} W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\ b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right] \end{align}
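One iteration of the procedure above can be sketched as follows. A compact sigmoid backpropagation routine is repeated inside the block only so that it runs on its own; the function names, the nested-list parameter layout, and the choice to return updated copies rather than modify `W` and `b` in place are all assumptions for this example.

```python
import math


def backprop(W, b, x, y):
    """Per-example gradients of J(W,b;x,y) for a sigmoid network."""
    a = [x]
    for Wl, bl in zip(W, b):
        a.append([1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, a[-1])) + bi)))
                  for row, bi in zip(Wl, bl)])
    delta = [-(yi - ai) * ai * (1.0 - ai) for yi, ai in zip(y, a[-1])]
    gW, gb = [None] * len(W), [None] * len(W)
    for l in range(len(W) - 1, -1, -1):
        gW[l] = [[d * v for v in a[l]] for d in delta]
        gb[l] = list(delta)
        if l > 0:
            delta = [sum(W[l][j][i] * delta[j] for j in range(len(delta)))
                     * a[l][i] * (1.0 - a[l][i]) for i in range(len(a[l]))]
    return gW, gb


def batch_gradient_step(W, b, data, alpha, lam):
    """One iteration of batch gradient descent; returns updated copies of W and b.

    data: list of (x, y) pairs; alpha: learning rate; lam: weight decay parameter.
    """
    m = len(data)
    # Step 1: zero accumulators with the same shapes as W and b.
    dW = [[[0.0] * len(row) for row in Wl] for Wl in W]
    db = [[0.0] * len(bl) for bl in b]
    # Step 2: accumulate the per-example gradients.
    for x, y in data:
        gW, gb = backprop(W, b, x, y)
        for l in range(len(W)):
            for i in range(len(W[l])):
                db[l][i] += gb[l][i]
                for j in range(len(W[l][i])):
                    dW[l][i][j] += gW[l][i][j]
    # Step 3: update the parameters; weight decay applies to W only.
    new_W = [[[W[l][i][j] - alpha * (dW[l][i][j] / m + lam * W[l][i][j])
               for j in range(len(W[l][i]))]
              for i in range(len(W[l]))]
             for l in range(len(W))]
    new_b = [[b[l][i] - alpha * db[l][i] / m for i in range(len(b[l]))]
             for l in range(len(b))]
    return new_W, new_b
```

Calling `batch_gradient_step` repeatedly, feeding each result back in as the next `W` and `b`, gives the training loop.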
【初译】：
【一校】：
【原文】：

To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function $\textstyle J(W,b)$.

【初译】：
【一校】：