Backpropagation Algorithm

Revision as of 23:14, 26 February 2011 (view source)

(Created page with "Suppose we have a fixed training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples. We can train our neural network using ...")

← Older edit

Revision as of 23:26, 26 February 2011 (view source)

Ang (Talk | contribs)

Newer edit →

Line 95:

here is the backpropagation algorithm:

-

+

:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and so on up to the output layer <math>L_{n_l}</math>.

-

:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and

+

-

so on up to the output layer <math>L_{n_l}</math>.

+

:2. For each output unit <math>i</math> in layer <math>n_l</math> (the output layer), set

::<math>\begin{align}

Line 114:

Line 112:

\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.

\end{align}</math>

+

Finally, we can also re-write

+

the algorithm using matrix-vectorial notation.

+

We will use "<math>\textstyle \bullet</math>" to denote the element-wise product

+

operator (denoted ``{\tt .*}'' in Matlab or Octave, and also called the Hadamard product),

+

so

+

that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_ic_i</math>.

+

Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also

+

do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) =

+

[\frac{\partial}{\partial z_1} f(z_1),

+

\frac{\partial}{\partial z_2} f(z_2),

+

\frac{\partial}{\partial z_3} f(z_3)]</math>).

+

The algorithm can then be written:

+

: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using Equations~(\ref{eqn-forwardprop1}-\ref{eqn-forwardprop2}).

+

: 2. For the output layer (layer <math>\textstyle n_l</math>), set

+

::<math>\begin{align}

+

\delta^{(n_l)}

+

= - (y - a^{(n_l)}) \bullet f'(z^{(n)})

+

\end{align}</math>

+

: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>

+

::Set

+

:::<math>\begin{align}

+

\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})

+

\end{align}</math>

+

: 4. Compute the desired partial derivatives:

+

::<math>\begin{align}

+

\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\

+

\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.

+

\end{align}</math>

+

<br>

+

'''Implementation note:''' In steps 2 and 3 above, we need to compute <math>\textstyle f'(z^{(l)}_i)</math> for each value of <math>\textstyle i</math>.

+

Assuming <math>\textstyle f(z)</math> is the sigmoid activation function, we would already have <math>\textstyle a^{(l)}_i</math> stored away from the

+

forward pass through the network. Thus, using the expression that we worked out earlier for <math>\textstyle f'(z)</math>,

+

we can compute this as <math>\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)</math>.

+

<br>

+

Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code

+

below, <math>\textstyle \Delta W^{(l)}</math> is a matrix (of the same dimension as <math>\textstyle W^{(l)}</math>), and

+

<math>\textstyle \Delta b^{(l)}</math> is a vector (of the same dimension as <math>\textstyle b^{(l)}</math>). Note that in this notation,

+

``<math>\textstyle \Delta W^{(l)}</math>'' is a matrix, and in particular it isn't ``<math>\textstyle \Delta</math> times <math>\textstyle W^{(l)}</math>.''

+

We implement one iteration of batch gradient descent as follows:

+

: 1. Set <math>\textstyle \Delta W^{(l)} := 0</math>, <math>\textstyle \Delta b^{(l)} := 0</math> (matrix/vector of zeros) for all <math>\textstyle l</math>.

+

: 2. For <math>\textstyle i = 1</math> to <math>\textstyle m</math>,

+

:: 2a. Use backpropagation to compute <math>\textstyle \nabla_{W^{(l)}} J(W,b;x,y)</math> and \\

+

<math>\textstyle \nabla_{b^{(l)}} J(W,b;x,y)</math>.

+

:: 2b. Set <math>\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)</math>.

+

:: 2c. Set <math>\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)</math>.

+

: 3. Update the parameters:

+

:<math>\begin{align}

+

W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\

+

b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]

+

\end{align}</math>

+

To train our neural network, we can now repeatedly take steps of gradient

+

descent to reduce our cost function <math>\textstyle J(W,b)</math>.

Backpropagation Algorithm

From Ufldl

Revision as of 23:26, 26 February 2011

Views

Personal tools

ufldl resources

wiki

Search

Toolbox

@@ Line 95: / Line 95: @@
 here is the backpropagation algorithm:
+:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and so on up to the output layer <math>L_{n_l}</math>.
-:1. Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and
-so on up to the output layer <math>L_{n_l}</math>.
 :2. For each output unit <math>i</math> in layer <math>n_l</math> (the output layer), set
 ::<math>\begin{align}
@@ Line 114: / Line 112: @@
 \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.
 \end{align}</math>
+Finally, we can also re-write
+the algorithm using matrix-vectorial notation.
+We will use "<math>\textstyle \bullet</math>" to denote the element-wise product
+operator (denoted ``{\tt .*}'' in Matlab or Octave, and also called the Hadamard product),
+so
+that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_ic_i</math>.
+Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also
+do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) =
+[\frac{\partial}{\partial z_1} f(z_1),
+\frac{\partial}{\partial z_2} f(z_2),
+\frac{\partial}{\partial z_3} f(z_3)]</math>).
+The algorithm can then be written:
+: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using Equations~(\ref{eqn-forwardprop1}-\ref{eqn-forwardprop2}).
+: 2. For the output layer (layer <math>\textstyle n_l</math>), set
+::<math>\begin{align}
+\delta^{(n_l)}
+= - (y - a^{(n_l)}) \bullet f'(z^{(n)})
+\end{align}</math>
+: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math>
+::Set
+:::<math>\begin{align}
+                 \delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})
+                 \end{align}</math>
+: 4. Compute the desired partial derivatives:
+::<math>\begin{align}
+\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\
+\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.
+\end{align}</math>
+<br>
+'''Implementation note:''' In steps 2 and 3 above, we need to compute <math>\textstyle f'(z^{(l)}_i)</math> for each value of <math>\textstyle i</math>.
+Assuming <math>\textstyle f(z)</math> is the sigmoid activation function, we would already have <math>\textstyle a^{(l)}_i</math> stored away from the
+forward pass through the network.  Thus, using the expression that we worked out earlier for <math>\textstyle f'(z)</math>,
+we can compute this as <math>\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)</math>.
+<br>
+Finally, we are ready to describe the full gradient descent algorithm.  In the pseudo-code
+below, <math>\textstyle \Delta W^{(l)}</math> is a matrix (of the same dimension as <math>\textstyle W^{(l)}</math>), and
+<math>\textstyle \Delta b^{(l)}</math> is a vector (of the same dimension as <math>\textstyle b^{(l)}</math>).  Note that in this notation,
+``<math>\textstyle \Delta W^{(l)}</math>'' is a matrix, and in particular it isn't ``<math>\textstyle \Delta</math> times <math>\textstyle W^{(l)}</math>.''
+We implement one iteration of batch gradient descent as follows:
+: 1. Set <math>\textstyle \Delta W^{(l)} := 0</math>, <math>\textstyle \Delta b^{(l)} := 0</math> (matrix/vector of zeros) for all <math>\textstyle l</math>.
+: 2. For <math>\textstyle i = 1</math> to <math>\textstyle m</math>,
+:: 2a. Use backpropagation to compute <math>\textstyle \nabla_{W^{(l)}} J(W,b;x,y)</math> and \\
+<math>\textstyle \nabla_{b^{(l)}} J(W,b;x,y)</math>.
+:: 2b. Set <math>\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)</math>.
+:: 2c. Set <math>\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)</math>.
+: 3. Update the parameters:
+:<math>\begin{align}
+W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\
+b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]
+\end{align}</math>
+To train our neural network, we can now repeatedly take steps of gradient
+descent to reduce our cost function <math>\textstyle J(W,b)</math>.