Deriving gradients using the backpropagation idea
From Ufldl
(→Example 2: Smoothed topographic L1 sparsity penalty in sparse coding) |
|||
Line 5: | Line 5: | ||
First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below: | First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below: | ||
<ol> | <ol> | ||
- | <li>For <math> | + | <li>For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set |
+ | :<math> | ||
+ | \delta^{(n_l)}_i | ||
+ | = \frac{\partial}{\partial z^{(n_l)}_i} \;\; | ||
+ | J(z^{(n_l)}) | ||
+ | </math> | ||
+ | where <math>J(z)</math> is our "objective function" (explained below). | ||
+ | <li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> | ||
:For each node <math>i</math> in layer <math>l</math>, set | :For each node <math>i</math> in layer <math>l</math>, set | ||
::<math> | ::<math> | ||
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i) | \delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) \bullet \frac{\partial}{\partial z^{(l)}_i} f^{(l)} (z^{(l)}_i) | ||
- | + | </math> | |
<li>Compute the desired partial derivatives, | <li>Compute the desired partial derivatives, | ||
:<math> | :<math> | ||
Line 29: | Line 36: | ||
</ul> | </ul> | ||
- | + | Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea. | |
- | + | To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(x)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end. | |
- | + | == Examples == | |
+ | |||
+ | We will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] to illustrate the method of computing gradients of functions on matrices using the backpropagation idea. | ||
=== Example 1: Objective for weight matrix in sparse coding === | === Example 1: Objective for weight matrix in sparse coding === | ||
Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>: | Recall the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>: | ||
- | :<math> | + | :<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> |
- | We would like to find the gradient of <math> | + | We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead. |
The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below: | The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of neural network taking <math>s</math> as an input, and proceeding in four steps, as described and illustrated in the paragraph and diagram below: | ||
Line 52: | Line 61: | ||
[[File:Backpropagation Method Example 1.png | 400px]] | [[File:Backpropagation Method Example 1.png | 400px]] | ||
+ | |||
+ | The weights and activation functions of this network are as follows: | ||
+ | <table align="center"> | ||
+ | <tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr> | ||
+ | <tr> | ||
+ | <td>1</td> | ||
+ | <td><math>A</math></td> | ||
+ | <td><math>f(z_i) = z_i (identity)</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2</td> | ||
+ | <td><math>I</math> (identity)</td> | ||
+ | <td><math>f(z_i) = z_i - x_i</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3</td> | ||
+ | <td>N/A</td> | ||
+ | <td><math>f(z_i) = z_i^2</math></td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k J(z^{(3)}_k)</math>. | ||
+ | |||
+ | Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields: | ||
+ | <table align="center"> | ||
+ | <tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr> | ||
+ | <tr> | ||
+ | <td>3</td> | ||
+ | <td><math>f'(z_i) = 2z_i</math></td> | ||
+ | <td><math>f'(z_i) = 2z_i</math></td> | ||
+ | <td><math>As - x</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2</td> | ||
+ | <td><math>f'(z_i) = 1</math></td> | ||
+ | <td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td> | ||
+ | <td><math>As</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>1</td> | ||
+ | <td><math>f'(z_i) = 1</math></td> | ||
+ | <td><math>\left( A^T \delta^{(2)} \right) \bullet 1</math></td> | ||
+ | <td><math>s</math></td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | Hence, | ||
+ | :<math> | ||
+ | \begin{align} | ||
+ | \nabla_X F & = A^T I^T 2(As - x) \\ | ||
+ | & = A^T 2(As - x) | ||
+ | \end{align} | ||
+ | </math> | ||
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding === | === Example 2: Smoothed topographic L1 sparsity penalty in sparse coding === | ||
Line 62: | Line 123: | ||
[[File:Backpropagation Method Example 2.png | 600px]] | [[File:Backpropagation Method Example 2.png | 600px]] | ||
+ | |||
+ | The weights and activation functions of this network are as follows: | ||
+ | <table align="center"> | ||
+ | <tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr> | ||
+ | <tr> | ||
+ | <td>1</td> | ||
+ | <td><math>I</math></td> | ||
+ | <td><math>f(z_i) = z_i^2</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2</td> | ||
+ | <td><math>V</math></td> | ||
+ | <td><math>f(z_i) = z_i</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3</td> | ||
+ | <td><math>I</math></td> | ||
+ | <td><math>f(z_i) = z_i + \epsilon</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>4</td> | ||
+ | <td>N/A</td> | ||
+ | <td><math>f(z_i) = z_i^{\frac{1}{2}}</math></td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k J(z^{(4)}_k)</math>. | ||
+ | |||
+ | Once we see <math>F</math> as a neural network, the gradient <math>\nabla_X F</math> becomes easy to compute - applying backpropagation yields: | ||
+ | <table align="center"> | ||
+ | <tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math> | ||
+ | </th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr> | ||
+ | <tr> | ||
+ | <td>4</td> | ||
+ | <td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td> | ||
+ | <td><math>f'(z_i) = \frac{1}{2} z_i^{-\frac{1}{2}}</math></td> | ||
+ | <td><math>(Vss^T + \epsilon)</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>3</td> | ||
+ | <td><math>f'(z_i) = 1</math></td> | ||
+ | <td><math>\left( I^T \delta^{(3)} \right) \bullet 1</math></td> | ||
+ | <td><math>Vss^T</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>2</td> | ||
+ | <td><math>f'(z_i) = 1</math></td> | ||
+ | <td><math>\left( V^T \delta^{(3)} \right) \bullet 1</math></td> | ||
+ | <td><math>ss^T</math></td> | ||
+ | </tr> | ||
+ | <tr> | ||
+ | <td>1</td> | ||
+ | <td><math>f'(z_i) = 2z_i</math></td> | ||
+ | <td><math>\left( I^T \delta^{(2)} \right) \bullet 2s</math></td> | ||
+ | <td><math>s</math></td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | |||
+ | Hence, | ||
+ | :<math> | ||
+ | \begin{align} | ||
+ | \nabla_X F & = I^T V^T I^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\ | ||
+ | & = V^T \frac{1}{2}(Vss^T + \epsilon)^{-\frac{1}{2}} \bullet 2s \\ | ||
+ | & = V^T (Vss^T + \epsilon)^{-\frac{1}{2}} \bullet s | ||
+ | \end{align} | ||
+ | </math> |