# Backpropagation Algorithm

 Revision as of 10:49, 26 May 2011 (view source)Watsuen (Talk | contribs)← Older edit Latest revision as of 12:50, 7 April 2013 (view source)Kandeng (Talk | contribs) Line 19: Line 19: The first term in the definition of $J(W,b)$ is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting. The first term in the definition of $J(W,b)$ is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting. - [Note: Usually weight decay is not applied to the bias terms $b^{(l)}_i$, as reflected in our definition for $J(W, b)$.  Applying weight decay to the bias units usually makes only a small different to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.] + [Note: Usually weight decay is not applied to the bias terms $b^{(l)}_i$, as reflected in our definition for $J(W, b)$.  Applying weight decay to the bias units usually makes only a small difference to the final network, however.  If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.] The '''weight decay parameter''' $\lambda$ controls the relative importance of the two terms. Note also the slightly overloaded notation: $J(W,b;x,y)$ is the squared error cost with respect to a single example; $J(W,b)$ is the overall cost function, which includes the weight decay term. The '''weight decay parameter''' $\lambda$ controls the relative importance of the two terms. Note also the slightly overloaded notation: $J(W,b;x,y)$ is the squared error cost with respect to a single example; $J(W,b)$ is the overall cost function, which includes the weight decay term. Line 134: Line 134: {{Sparse_Autoencoder}} {{Sparse_Autoencoder}} + + + {{Languages|反向传导算法|中文}}