Backpropagation Algorithm

In detail, here is the backpropagation algorithm:

:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and so on up to the output layer <math>L_{n_l}</math>.

:2. For each output unit <math>i</math> in layer <math>n_l</math> (the output layer), set
::<math>\begin{align}
\delta^{(n_l)}_i = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)
\end{align}</math>

:3. For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>
::For each node <math>i</math> in layer <math>l</math>, set
:::<math>\begin{align}
\delta^{(l)}_i = \left( \sum_{j} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)
\end{align}</math>

:4. Compute the desired partial derivatives:
::<math>\begin{align}
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.
\end{align}</math>
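For concreteness, here is a minimal Octave/Matlab sketch of steps 2&ndash;4 for a single training example, written with explicit loops to mirror the per-unit equations above. The names are illustrative rather than part of the tutorial's code: it assumes a forward pass has already stored the activations <tt>a{l}</tt> and weighted inputs <tt>z{l}</tt> in cell arrays, that <tt>W{l}(j,i)</tt> holds the weight <math>W^{(l)}_{ji}</math>, and that the activation function is the sigmoid.

<pre>
% Illustrative sketch only: assumes a forward pass has filled the cell arrays
% a{1..nl} (activations) and z{2..nl} (weighted inputs), and that W{l}(j,i)
% stores the weight W^{(l)}_{ji}; nl is the number of layers, f is the sigmoid.
fprime = @(z) 1 ./ (1 + exp(-z)) .* (1 - 1 ./ (1 + exp(-z)));  % f'(z) for sigmoid

% Step 2: error term delta_i^{(nl)} for each output unit i
delta{nl} = zeros(numel(a{nl}), 1);
for i = 1:numel(a{nl})
  delta{nl}(i) = -(y(i) - a{nl}(i)) * fprime(z{nl}(i));
end

% Step 3: propagate the error terms backwards through layers nl-1, ..., 2
for l = (nl-1):-1:2
  delta{l} = zeros(numel(a{l}), 1);
  for i = 1:numel(a{l})
    delta{l}(i) = (W{l}(:,i)' * delta{l+1}) * fprime(z{l}(i));
  end
end

% Step 4: partial derivatives of J(W,b;x,y) for this single example
for l = 1:(nl-1)
  gradW{l} = delta{l+1} * a{l}';   % gradW{l}(i,j) = a_j^{(l)} * delta_i^{(l+1)}
  gradb{l} = delta{l+1};
end
</pre>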
Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "<math>\textstyle \bullet</math>" to denote the element-wise product operator (denoted "<tt>.*</tt>" in Matlab or Octave, and also called the Hadamard product), so that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_i c_i</math>. Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) = [\frac{\partial}{\partial z_1} f(z_1), \frac{\partial}{\partial z_2} f(z_2), \frac{\partial}{\partial z_3} f(z_3)]</math>).

The algorithm can then be written:
: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using the equations defining the forward propagation steps.

: 2. For the output layer (layer <math>\textstyle n_l</math>), set
::<math>\begin{align}
\delta^{(n_l)}
= - (y - a^{(n_l)}) \bullet f'(z^{(n_l)})
\end{align}</math>
: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>
::Set
:::<math>\begin{align}
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})
\end{align}</math>

: 4. Compute the desired partial derivatives:
::<math>\begin{align}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.
\end{align}</math>
<br>
'''Implementation note:''' In steps 2 and 3 above, we need to compute <math>\textstyle f'(z^{(l)}_i)</math> for each value of <math>\textstyle i</math>. Assuming <math>\textstyle f(z)</math> is the sigmoid activation function, we would already have <math>\textstyle a^{(l)}_i</math> stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for <math>\textstyle f'(z)</math>, we can compute this as <math>\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)</math>.
<br>
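To make the matrix-vectorial form and the implementation note above concrete, here is a minimal Octave/Matlab sketch of steps 2&ndash;4 for a single example, using the sigmoid shortcut <math>\textstyle f'(z^{(l)}) = a^{(l)} \bullet (1-a^{(l)})</math>. The function name <tt>backprop_single</tt> and the cell-array layout are assumptions made for this illustration, not code from the tutorial.

<pre>
% Illustrative sketch: matrix-vector backpropagation for one example (x, y),
% given activations a{1..nl} from a forward pass with sigmoid units.
% backprop_single is a hypothetical name, not a function from the tutorial.
function [gradW, gradb] = backprop_single(W, a, y, nl)
  % Step 2: delta for the output layer, using f'(z) = a .* (1 - a)
  delta{nl} = -(y - a{nl}) .* (a{nl} .* (1 - a{nl}));
  % Step 3: propagate delta backwards through layers nl-1, ..., 2
  for l = (nl-1):-1:2
    delta{l} = (W{l}' * delta{l+1}) .* (a{l} .* (1 - a{l}));
  end
  % Step 4: gradients of J(W,b;x,y) with respect to W{l} and b{l}
  for l = 1:(nl-1)
    gradW{l} = delta{l+1} * a{l}';
    gradb{l} = delta{l+1};
  end
end
</pre>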
Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, <math>\textstyle \Delta W^{(l)}</math> is a matrix (of the same dimension as <math>\textstyle W^{(l)}</math>), and <math>\textstyle \Delta b^{(l)}</math> is a vector (of the same dimension as <math>\textstyle b^{(l)}</math>). Note that in this notation, "<math>\textstyle \Delta W^{(l)}</math>" is a matrix, and in particular it isn't "<math>\textstyle \Delta</math> times <math>\textstyle W^{(l)}</math>". We implement one iteration of batch gradient descent as follows:
: 1. Set <math>\textstyle \Delta W^{(l)} := 0</math>, <math>\textstyle \Delta b^{(l)} := 0</math> (matrix/vector of zeros) for all <math>\textstyle l</math>.

: 2. For <math>\textstyle i = 1</math> to <math>\textstyle m</math>,
:: 2a. Use backpropagation to compute <math>\textstyle \nabla_{W^{(l)}} J(W,b;x,y)</math> and <math>\textstyle \nabla_{b^{(l)}} J(W,b;x,y)</math>.
:: 2b. Set <math>\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)</math>.
:: 2c. Set <math>\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)</math>.

: 3. Update the parameters:
:<math>\begin{align}
W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\
b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
\end{align}</math>
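As a sketch of what one such iteration might look like in Octave/Matlab, the loop below accumulates the per-example gradients and then applies the update above. The names <tt>feedforward</tt>, <tt>backprop_single</tt> (the sketch given earlier), and the cell arrays <tt>x{i}</tt>, <tt>y{i}</tt> are hypothetical; <tt>alpha</tt> and <tt>lambda</tt> stand for the learning rate <math>\textstyle \alpha</math> and the weight decay parameter <math>\textstyle \lambda</math>.

<pre>
% Illustrative sketch of one batch gradient descent iteration; feedforward is a
% hypothetical forward-pass routine and x{i}, y{i} hold the m training examples.
for l = 1:(nl-1)                               % Step 1: zero the accumulators
  DeltaW{l} = zeros(size(W{l}));
  Deltab{l} = zeros(size(b{l}));
end
for i = 1:m                                    % Step 2: sum gradients over examples
  a = feedforward(W, b, x{i});                 % hypothetical forward pass (step 2a)
  [gradW, gradb] = backprop_single(W, a, y{i}, nl);
  for l = 1:(nl-1)
    DeltaW{l} = DeltaW{l} + gradW{l};          % step 2b
    Deltab{l} = Deltab{l} + gradb{l};          % step 2c
  end
end
for l = 1:(nl-1)                               % Step 3: update the parameters
  W{l} = W{l} - alpha * ((1/m) * DeltaW{l} + lambda * W{l});
  b{l} = b{l} - alpha * ((1/m) * Deltab{l});
end
</pre>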
To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function <math>\textstyle J(W,b)</math>.
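For instance, if the iteration above were wrapped in a hypothetical helper <tt>gradient_descent_step</tt>, training might look like the following sketch (the fixed number of epochs is just one simple stopping rule):

<pre>
% Illustrative usage: repeat batch gradient descent to reduce J(W,b);
% gradient_descent_step is a hypothetical wrapper around the iteration above.
num_epochs = 200;                        % illustrative stopping rule
for epoch = 1:num_epochs
  [W, b] = gradient_descent_step(W, b, x, y, alpha, lambda, nl);
end
</pre>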
