Backpropagation Algorithm

From Ufldl

Jump to: navigation, search
m (formatting)
m
Line 19: Line 19:
The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
-
[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small different to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize weight decay this as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
+
[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>.  Applying weight decay to the bias units usually makes only a small different to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]
The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.
The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.

Revision as of 15:07, 25 April 2011

Personal tools