Deriving gradients using the backpropagation idea
From Ufldl
== Introduction ==

In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that, together with matrix calculus, this provides a powerful method and intuition for deriving gradients of more complex matrix functions (functions from matrices to the reals, or symbolically, <math>\mathbb{R}^{r \times c} \rightarrow \mathbb{R}</math>).

First, recall the backpropagation idea, which we present below in a modified form appropriate for our purposes:
<ol>
<li>For <math>l = n_l, n_l-1, n_l-2, \ldots, 2</math>
:For each node <math>i</math> in layer <math>l</math>, set
::<math> \delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i) </math>
<li>Compute the desired partial derivatives,
:<math> \begin{align} \nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\ \end{align} </math>
</ol>

Quick notation recap:
<ul>
<li><math>n_l</math> is the number of layers in the neural network
<li><math>s_l</math> is the number of units in the <math>l</math>th layer
<li><math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l + 1)</math>th layer
<li><math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer
<li><math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer
<li><math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>r \times c</math> matrices <math>A</math> and <math>B</math> yields the <math>r \times c</math> matrix <math>C = A \bullet B</math> such that <math>C_{ij} = A_{ij} \cdot B_{ij}</math>
<li><math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer
</ul>

Notice that we don't consider an
objective function in this case, and we allow each layer to have a different activation function <math>f^{(l)}</math>. This will be useful in allowing us to compute the gradients of functions of matrices.

== The method ==

To compute the gradient of a complicated function of matrices with respect to some matrix <math>X</math>, it can help to view the function, where possible, as a multi-layer neural network. We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate this.

=== Example 1: Objective for weight matrix in sparse coding ===

Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:
:<math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math>

We would like to find the gradient of <math>J</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A J(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of the gradients of the individual terms. The gradient of the second term is straightforward (it is <math>2 \gamma A</math>, reading the norm as the sum of squared entries), so we will consider the gradient of the first term instead.

The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of a neural network taking <math>s</math> as an input and proceeding in four steps, as listed and illustrated below:
<ol>
<li>Apply <math>A</math> as the weights from the first layer to the second layer.
<li>Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.
<li>Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.
<li>Sum all the activations of the third layer.
</ol>

[[File:Backpropagation Method Example 1.png]]

=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===

Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in sparse coding:
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math>

We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As above, let's see this term as an instantiation of a neural network:

[[File:Backpropagation Method Example 2.png]]
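
Returning to Example 1, the four steps can be checked numerically. The following is a minimal sketch, not part of the tutorial itself: it reads the first term <math>\lVert As - x \rVert_2^2</math> as the sum of squared entries of <math>As - x</math>, assumes the final summation passes a delta of 1 to every third-layer unit, and compares the resulting backpropagated gradient with a central finite difference. It uses plain Python with matrices stored as lists of rows; the helper names (<code>matmul</code>, <code>first_term</code>, <code>backprop_gradient</code>, <code>numerical_gradient</code>, and so on) and the matrix sizes are made up for illustration.

```python
import random


def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*P)]


def first_term(A, s, x):
    """Sum of squared entries of As - x (assumed reading of the norm)."""
    z = matmul(A, s)
    return sum((z[i][j] - x[i][j]) ** 2
               for i in range(len(z)) for j in range(len(z[0])))


def backprop_gradient(A, s, x):
    """Gradient of first_term with respect to A, layer by layer."""
    # Forward pass. Layer 1 -> 2: apply A, then subtract x.
    z2 = matmul(A, s)
    a2 = [[z2[i][j] - x[i][j] for j in range(len(z2[0]))]
          for i in range(len(z2))]
    # Layer 2 -> 3 uses identity weights, so z3 equals a2.
    # The final sum passes a delta of 1 to each layer-3 unit, and the
    # derivative of the square activation is 2 * z3.
    delta3 = [[2.0 * v for v in row] for row in a2]
    # Back through identity weights; d(z - x)/dz = 1, so delta2 = delta3.
    delta2 = delta3
    # Gradient for the layer-1 weights: delta2 times the transposed input.
    return matmul(delta2, transpose(s))


def numerical_gradient(A, s, x, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            original = A[i][j]
            A[i][j] = original + h
            up = first_term(A, s, x)
            A[i][j] = original - h
            down = first_term(A, s, x)
            A[i][j] = original  # restore the perturbed entry
            grad[i][j] = (up - down) / (2.0 * h)
    return grad


def max_abs_diff(P, Q):
    """Largest entry-wise absolute difference between two matrices."""
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


def random_matrix(rows, cols):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]


random.seed(0)
A = random_matrix(3, 4)   # hypothetical sizes, chosen only for the check
s = random_matrix(4, 5)
x = random_matrix(3, 5)
diff = max_abs_diff(backprop_gradient(A, s, x), numerical_gradient(A, s, x))
print("largest disagreement:", diff)
```

Under these assumptions the backpropagated quantity is <math>2(As - x)s^T</math>, so a small disagreement with the finite-difference estimate suggests that the layer-by-layer reading of the four steps is consistent.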