Backpropagation Algorithm

Revision as of 15:07, 25 April 2011

Latest revision as of 12:50, 7 April 2013


From Ufldl

@@ Line 19: / Line 19: @@
 The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
-[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small different to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
+[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small difference to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
 The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.
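The relationship between the two cost functions in this hunk can be illustrated with a short Python sketch. The formula itself is not part of this diff, so the code assumes the usual form: <math>J(W,b;x,y)</math> is half the squared error for one example, and <math>J(W,b)</math> is the mean of those costs plus <math>\lambda/2</math> times the sum of squared weights, with the biases excluded from the decay term. The function names and sample values are hypothetical.

```python
def single_example_cost(h, y):
    """J(W,b;x,y): half the squared error for one example (assumed form)."""
    return 0.5 * sum((hi - yi) ** 2 for hi, yi in zip(h, y))


def overall_cost(predictions, targets, weights, lam):
    """J(W,b): mean per-example cost plus weight decay on the weights only."""
    m = len(predictions)
    data_term = sum(single_example_cost(h, y)
                    for h, y in zip(predictions, targets)) / m
    # Weight decay sums the squared entries of every weight matrix.
    # Bias terms are deliberately left out, as the note in the text describes.
    decay_term = 0.5 * lam * sum(w ** 2
                                 for W in weights for row in W for w in row)
    return data_term + decay_term


predictions = [[0.8], [0.3]]         # hypothetical network outputs h(x)
targets = [[1.0], [0.0]]             # corresponding labels y
weights = [[[0.5, -0.5]], [[1.0]]]   # two hypothetical weight matrices

cost_no_decay = overall_cost(predictions, targets, weights, lam=0.0)
cost_with_decay = overall_cost(predictions, targets, weights, lam=0.1)
```

With <tt>lam=0.0</tt> only the error term remains; raising <tt>lam</tt> increases the share of the cost that comes from the magnitude of the weights.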
@@ Line 80: / Line 80: @@
 Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "<math>\textstyle \bullet</math>" to denote the element-wise product operator (denoted "<tt>.*</tt>" in Matlab or Octave, and also called the Hadamard product), so that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_ic_i</math>. Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) =
-[\frac{\partial}{\partial z_1} f(z_1),
+[f'(z_1),
-\frac{\partial}{\partial z_2} f(z_2),
+f'(z_2),
-\frac{\partial}{\partial z_3} f(z_3)]</math>).
+f'(z_3)]</math>).
 The algorithm can then be written:
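The element-wise conventions described in this hunk can be sketched in plain Python. This is only an illustration: the text does not fix the activation function here, so the sigmoid is an assumption, and the helper names <tt>hadamard</tt> and <tt>f_prime</tt> are hypothetical.

```python
import math


def sigmoid(z):
    """Scalar sigmoid, used here as an assumed choice of f."""
    return 1.0 / (1.0 + math.exp(-z))


def f(zs):
    """f applied element-wise to a vector."""
    return [sigmoid(z) for z in zs]


def f_prime(zs):
    """f' applied element-wise: [f'(z_1), f'(z_2), ...]."""
    return [sigmoid(z) * (1.0 - sigmoid(z)) for z in zs]


def hadamard(b, c):
    """Element-wise (Hadamard) product, so that a_i = b_i * c_i."""
    if len(b) != len(c):
        raise ValueError("vectors must have the same length")
    return [bi * ci for bi, ci in zip(b, c)]


a = hadamard([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
d = f_prime([0.0, 1.0, -1.0])
```

Here <tt>a</tt> is <tt>[4.0, 10.0, 18.0]</tt>, matching what <tt>.*</tt> would produce in Matlab or Octave for the same vectors.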
@@ Line 131: / Line 131: @@
 To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function <math>\textstyle J(W,b)</math>.
+{{Sparse_Autoencoder}}
+{{Languages|反向传导算法|中文}}
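The gradient descent step mentioned in the last hunk can be sketched as follows. The update rule is not shown in this diff, so the code assumes the common form in which the accumulated gradients <tt>delta_W</tt> and <tt>delta_b</tt> are averaged over <tt>m</tt> examples and weight decay is added for the weights only. All names and values are hypothetical.

```python
def gradient_descent_step(W, b, delta_W, delta_b, m, alpha, lam):
    """One gradient descent update for a single layer (assumed rule).

    W := W - alpha * (delta_W / m + lam * W)
    b := b - alpha * (delta_b / m)
    """
    new_W = [[w - alpha * (dw / m + lam * w) for w, dw in zip(row, drow)]
             for row, drow in zip(W, delta_W)]
    # The bias update carries no weight decay term.
    new_b = [bi - alpha * (dbi / m) for bi, dbi in zip(b, delta_b)]
    return new_W, new_b


W = [[1.0, -2.0]]        # hypothetical 1x2 weight matrix
b = [0.5]                # hypothetical bias vector
delta_W = [[4.0, 0.0]]   # gradients summed over m examples
delta_b = [2.0]

new_W, new_b = gradient_descent_step(W, b, delta_W, delta_b,
                                     m=2, alpha=0.1, lam=0.5)
```

Repeating this step, with the gradients recomputed by backpropagation each time, is what reduces <math>\textstyle J(W,b)</math> over the course of training.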