Deriving gradients using the backpropagation idea

Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient of <math>F</math> with respect to <math>X</math>, that is, <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea.

To do this, we will set our "objective function" to be the function <math>J(z)</math> that, when applied to the outputs of the neurons in the last layer, yields the value <math>F(X)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.

Using this method, we can easily compute derivatives with respect to the inputs <math>X</math>, as well as derivatives with respect to any of the weights in the network, as we shall see later.
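
Before working through the examples, it is handy to have a way of checking a derived gradient numerically. The sketch below is a minimal central-difference gradient checker in Python/numpy; it is an illustration added here rather than code from the tutorial, and the helper name <code>numerical_gradient</code> is our own. The examples that follow reuse it to verify their closed-form gradients.

<pre>
import numpy as np

def numerical_gradient(F, X, eps=1e-5):
    """Central-difference approximation of the gradient of a
    scalar-valued function F at the matrix (or vector) X."""
    grad = np.zeros_like(X)
    it = np.nditer(X, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = X[idx]
        X[idx] = orig + eps
        f_plus = F(X)
        X[idx] = orig - eps
        f_minus = F(X)
        X[idx] = orig                      # restore the perturbed entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad
</pre>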
== Examples ==
To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]] in examples 1 and 2. In example 3, we use a function from [[Independent Component Analysis | independent component analysis]] to illustrate the use of this idea to compute derivatives with respect to the weights, and, in particular, what to do in the case of tied or repeated weights.
=== Example 1: Objective for weight matrix in sparse coding ===
Recall from [[Sparse Coding: Autoencoder Interpretation | sparse coding]] the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:
:<math>F(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math>
We would like to find the gradient <math>\nabla_A F</math>. The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of a neural network taking <math>s</math> as an input, while the <math>\gamma \lVert A \rVert_2^2</math> term can be differentiated directly. The weights and activation functions of this network are as follows:

<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr>
<tr>
<td>1</td>
<td><math>A</math></td>
<td><math>f(z_i) = z_i</math> (identity)</td>
</tr>
<tr>
<td>2</td>
<td><math>I</math></td>
<td><math>f(z_i) = z_i - x_i</math></td>
</tr>
<tr>
<td>3</td>
<td>N/A</td>
<td><math>f(z_i) = z_i^2</math></td>
</tr>
</table>
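
Carrying the backpropagation steps through for this network (with the objective summing the outputs of the last layer) gives <math>\nabla_A F = 2(As - x)s^T + 2\gamma A</math>, where the second term comes from differentiating <math>\gamma \lVert A \rVert_2^2</math> directly. The sketch below is a Python/numpy illustration added here (reusing the <code>numerical_gradient</code> helper from the first sketch; it is not code from the tutorial) that checks this expression against finite differences on random data.

<pre>
import numpy as np

np.random.seed(0)
n, k, m = 8, 5, 10            # data dimension, number of features, number of examples
A = np.random.randn(n, k)     # weight (basis) matrix
s = np.random.randn(k, m)     # feature matrix
x = np.random.randn(n, m)     # data
gamma = 0.1

def F(A):
    # Objective from above: ||As - x||^2 + gamma * ||A||^2
    return np.sum((A @ s - x) ** 2) + gamma * np.sum(A ** 2)

# Gradient from the backpropagation construction, plus the directly
# differentiated gamma * ||A||^2 term.
grad = 2 * (A @ s - x) @ s.T + 2 * gamma * A

# Difference from the finite-difference gradient: should be close to 0.
print(np.max(np.abs(grad - numerical_gradient(F, A))))
</pre>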
=== Example 2: Smoothed topographic L1 sparsity penalty in sparse coding ===
Recall the smoothed topographic L1 sparsity penalty on <math>s</math> in [[Sparse Coding: Autoencoder Interpretation | sparse coding]]:
:<math>\sum{ \sqrt{Vss^T + \epsilon} }</math>
where <math>V</math> is the grouping matrix, <math>s</math> is the feature matrix and <math>\epsilon</math> is a constant.
We would like to find <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} }</math>. As before, we can see this term as an instantiation of a neural network and work out the deltas layer by layer. For example, for layer 3 of that network:

<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr>
<tr>
<td>3</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td>
<td><math>Vss^T</math></td>
</tr>
</table>
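
As a concrete check for this example, the sketch below (a Python/numpy illustration added here, reusing the <code>numerical_gradient</code> helper from the first sketch) compares a candidate closed form for the gradient of the penalty with finite differences. The closed form used, <math>\nabla_s \sum{ \sqrt{Vss^T + \epsilon} } = V^T D s + D^T V s</math> with <math>D = \frac{1}{2\sqrt{Vss^T + \epsilon}}</math> taken elementwise, is a derived assumption rather than a formula quoted from the tutorial, so the finite-difference comparison is the thing to trust.

<pre>
import numpy as np

np.random.seed(1)
k, m, g = 6, 9, 4                     # feature dimension, number of examples, number of groups
V = np.random.randint(0, 2, size=(g, k)).astype(float)   # grouping matrix (0/1 entries)
s = np.random.rand(k, m) + 0.5        # positive features keep V s s^T nonnegative in this toy check
eps = 1e-2

def F(s):
    # Smoothed topographic L1 penalty: sum over all entries of sqrt(V s s^T + eps)
    return np.sum(np.sqrt(V @ s @ s.T + eps))

# Candidate closed form (derived by hand; stated here as an assumption):
#   D = 1 / (2 * sqrt(V s s^T + eps)),   grad = V^T D s + D^T V s
D = 1.0 / (2.0 * np.sqrt(V @ s @ s.T + eps))
grad = V.T @ D @ s + D.T @ (V @ s)

# Difference from the finite-difference gradient: should be close to 0.
print(np.max(np.abs(grad - numerical_gradient(F, s))))
</pre>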
=== Example 3: ICA reconstruction cost ===

Recall the [[Independent Component Analysis | independent component analysis (ICA)]] reconstruction cost term:

:<math>\lVert W^TWx - x \rVert_2^2</math>

where <math>W</math> is the weight matrix and <math>x</math> is the input.

We would like to find <math>\nabla_W \lVert W^TWx - x \rVert_2^2</math>, the derivative of the term with respect to the '''weight matrix''' rather than the '''input''' as in the earlier two examples. We will still proceed similarly, seeing this term as an instantiation of a neural network:

[[File:Backpropagation Method Example 3.png | 400px]]

The weights and activation functions of this network are as follows:

<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Weight</th><th width="200px">Activation function <math>f</math></th></tr>
<tr>
<td>1</td>
<td><math>W</math></td>
<td><math>f(z_i) = z_i</math></td>
</tr>
<tr>
<td>2</td>
<td><math>W^T</math></td>
<td><math>f(z_i) = z_i</math></td>
</tr>
<tr>
<td>3</td>
<td><math>I</math></td>
<td><math>f(z_i) = z_i - x_i</math></td>
</tr>
<tr>
<td>4</td>
<td>N/A</td>
<td><math>f(z_i) = z_i^2</math></td>
</tr>
</table>
To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k \left(z^{(4)}_k\right)^2</math>, i.e. the sum of the outputs of the last layer.
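
As a quick sanity check on this construction, the sketch below (a Python/numpy illustration added here, not code from the tutorial) runs the four layers exactly as described in the table above and confirms that summing the outputs of the last layer reproduces <math>\lVert W^TWx - x \rVert_2^2</math>.

<pre>
import numpy as np

np.random.seed(2)
n, k = 8, 5                          # input dimension, number of features
W = np.random.randn(k, n)            # weight matrix
x = np.random.randn(n)               # input

a1 = x                               # a^(1): the input
z2 = W @ a1                          # layer 1 weight: W
a2 = z2                              # layer 2 activation: identity
z3 = W.T @ a2                        # layer 2 weight: W^T
a3 = z3 - x                          # layer 3 activation: z_i - x_i
z4 = np.eye(n) @ a3                  # layer 3 weight: I
a4 = z4 ** 2                         # layer 4 activation: z_i^2
J = np.sum(a4)                       # objective: sum of the last layer's outputs

F = np.sum((W.T @ W @ x - x) ** 2)   # the reconstruction cost, computed directly
print(np.isclose(J, F))              # True
</pre>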
Now that we can see <math>F</math> as a neural network, we can try to compute the gradient <math>\nabla_W F</math>. However, we now face the difficulty that <math>W</math> appears twice in the network. Fortunately, it turns out that if <math>W</math> appears multiple times in the network, the gradient with respect to <math>W</math> is simply the sum of gradients for each instance of <math>W</math> in the network (you may wish to work out a formal proof of this fact to convince yourself). With this in mind, we will proceed to work out the deltas first:
<table align="center">
<tr><th width="50px">Layer</th><th width="200px">Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr>
<tr>
<td>4</td>
<td><math>f'(z_i) = 2z_i</math></td>
<td><math>f'(z_i) = 2z_i</math></td>
<td><math>(W^TWx - x)</math></td>
</tr>
<tr>
<td>3</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td>
<td><math>W^TWx</math></td>
</tr>
<tr>
<td>2</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( (W^T)^T \delta^{(3)} \right) \bullet 1</math></td>
<td><math>Wx</math></td>
</tr>
<tr>
<td>1</td>
<td><math>f'(z_i) = 1</math></td>
<td><math>\left( W^T \delta^{(2)} \right) \bullet 1</math></td>
<td><math>x</math></td>
</tr>
</table>
To find the gradients with respect to <math>W</math>, first we find the gradients with respect to each instance of <math>W</math> in the network.

With respect to <math>W^T</math>:

:<math>
\begin{align}
\nabla_{W^T} F & = \delta^{(3)} a^{(2)T} \\
& = 2(W^TWx - x) (Wx)^T
\end{align}
</math>

With respect to <math>W</math>:

:<math>
\begin{align}
\nabla_{W} F & = \delta^{(2)} a^{(1)T} \\
& = 2W(W^TWx - x) x^T
\end{align}
</math>
Taking sums, noting that we need to transpose the gradient with respect to <math>W^T</math> to get the gradient with respect to <math>W</math>, yields the final gradient with respect to <math>W</math> (pardon the slight abuse of notation here):
:<math>
\begin{align}
\nabla_{W} F & = \nabla_{W} F + (\nabla_{W^T} F)^T \\
& = 2W(W^TWx - x) x^T + 2(Wx)(W^TWx - x)^T
\end{align}
</math>
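
This final expression can be checked numerically. The sketch below (a Python/numpy illustration added here, reusing the <code>numerical_gradient</code> helper from the first sketch) computes the two per-instance gradients separately, combines them exactly as above, and compares the result with finite differences; this also serves as a numerical demonstration of the rule that the gradient with respect to a repeated weight is the sum of the gradients for each instance.

<pre>
import numpy as np

np.random.seed(3)
n, k = 8, 5
W = np.random.randn(k, n)            # weight matrix
x = np.random.randn(n)               # input

def F(W):
    # ICA reconstruction cost: ||W^T W x - x||^2
    return np.sum((W.T @ W @ x - x) ** 2)

r = W.T @ W @ x - x                  # residual W^T W x - x

grad_W_instance = 2 * W @ np.outer(r, x)      # delta^(2) a^(1)T = 2 W (W^T W x - x) x^T
grad_WT_instance_T = 2 * np.outer(W @ x, r)   # (delta^(3) a^(2)T)^T = 2 (W x)(W^T W x - x)^T
grad_total = grad_W_instance + grad_WT_instance_T

# Difference from the finite-difference gradient: should be close to 0.
print(np.max(np.abs(grad_total - numerical_gradient(F, W))))
</pre>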
{{Languages|用反向传导思想求导|中文}}
