Neural Network Vectorization

In this section, we derive a vectorized version of our neural network.  In our earlier description of Neural Networks, we had already given a partially vectorized implementation.

Concretely, following the [[Logistic Regression Vectorization Example]], let the Matlab/Octave variable <tt>x</tt> be a matrix containing the training inputs, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example.  We can then implement forward propagation as:
<syntaxhighlight>
% Unvectorized implementation
for i=1:m,
  z2 = W1 * x(:,i) + b1;
  a2 = f(z2);
  z3 = W2 * a2 + b2;
  h(:,i) = f(z3);
end;
</syntaxhighlight>
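The <tt>for</tt> loop can be eliminated by computing the activations for all <math>m</math> training examples at once.  As a first step (a sketch which, for the moment, ignores the bias terms <tt>b1</tt> and <tt>b2</tt>; we incorporate them back in below), we can write:
<syntaxhighlight>
% Vectorized implementation (ignoring the bias terms b1 and b2 for now)
z2 = W1 * x;
a2 = f(z2);
z3 = W2 * a2;
h = f(z3);
</syntaxhighlight>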
In this implementation, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> are all matrices, with one column per training example.  A common design pattern in vectorizing across training examples is that whereas previously we had a column vector (such as <tt>z2</tt>) per training example, we can often instead compute a single matrix in which all of these column vectors are stacked together.  Concretely, in this example, <tt>a2</tt> becomes an <math>s_2</math> by <math>m</math> matrix (where <math>s_2</math> is the number of units in layer 2 of the network, and <math>m</math> is the number of training examples).  And, the <math>i</math>-th column of <tt>a2</tt> contains the activations of the hidden units (layer 2 of the network) when the <math>i</math>-th training example <tt>x(:,i)</tt> is input to the network.
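For example, one quick way to check this correspondence (a sketch; the index <tt>i</tt> below is arbitrary, and the bias terms are still omitted as in the code above) is:
<syntaxhighlight>
% Column i of the vectorized a2 should match the per-example computation
i = 1;
a2_single = f(W1 * x(:,i));      % unvectorized computation for example i
max(abs(a2(:,i) - a2_single))    % should be (numerically) zero
</syntaxhighlight>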
In the implementation above, we have assumed that the activation function <tt>f(z)</tt> takes as input a matrix <tt>z</tt>, and applies the activation function component-wise to the input.  Note that your implementation of <tt>f(z)</tt> should use Matlab/Octave's matrix operations as much as possible, and avoid <tt>for</tt> loops as well.  We illustrate this below, assuming that <tt>f(z)</tt> is the sigmoid activation function:
<syntaxhighlight>
% Inefficient, unvectorized implementation of the activation function
function output = unvectorized_f(z)
output = zeros(size(z))
for i=1:size(z,1),
  for j=1:size(z,2),
    output(i,j) = 1/(1+exp(-z(i,j)));
  end;
end;
end
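% Efficient, vectorized implementation of the activation function
% (a sketch; the name vectorized_f is ours).  exp and "./" operate
% element-wise on the matrix z, so no for loops are needed.
function output = vectorized_f(z)
output = 1./(1+exp(-z));
end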
</syntaxhighlight>
Finally, our vectorized implementation of forward propagation above had ignored <tt>b1</tt> and <tt>b2</tt>.  To incorporate those back in, we will use Matlab/Octave's built-in <tt>repmat</tt> function.  We have:
<syntaxhighlight>
% Vectorized implementation of forward propagation
z2 = W1 * x + repmat(b1,1,m);
a2 = f(z2);
z3 = W2 * a2 + repmat(b2,1,m);
h = f(z3);
</syntaxhighlight>
The result of <tt>repmat(b1,1,m)</tt> is a matrix formed by taking the column vector <tt>b1</tt> and stacking <math>m</math> copies of it in columns, as follows:
::<math>
\begin{bmatrix}
| & | &  & |  \\
{\rm b1} & {\rm b1} & \cdots & {\rm b1} \\
| & | &  & |
\end{bmatrix}.
</math>
This forms an <math>s_2</math> by <math>m</math> matrix.
Thus, the result of adding this to <tt>W1 * x</tt> is that each column of the matrix gets <tt>b1</tt> added to it, as desired.
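As a concrete illustration of what <tt>repmat</tt> does (a toy sketch; the vector <tt>v</tt> and matrix <tt>B</tt> below are made up for this example):
<syntaxhighlight>
% Toy illustration of repmat: stack 4 copies of a 3-by-1 column vector
v = [1; 2; 3];
B = repmat(v, 1, 4);
% B is now the 3-by-4 matrix
%   [ 1 1 1 1
%     2 2 2 2
%     3 3 3 3 ]
</syntaxhighlight>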
See Matlab/Octave's documentation (type "<tt>help repmat</tt>") for more information.  As a Matlab/Octave built-in function, <tt>repmat</tt> is very efficient as well, and runs much faster than if you were to implement the same thing yourself using a <tt>for</tt> loop.
== Backpropagation ==

We now describe the main ideas behind vectorizing backpropagation.  Before reading this section, we strongly encourage you to carefully step through all of the forward propagation code examples above to make sure you fully understand them.  In this text, we will only sketch the main ideas of how to vectorize backpropagation, and leave you to derive the details in the [[Exercise:Vectorization|Vectorization exercise]].

We are in a supervised learning setting, so that we have a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples.  (For the autoencoder, we simply set <math>y^{(i)} = x^{(i)}</math>, but our derivation here will consider this more general setting.)

Suppose we have <math>s_3</math>-dimensional outputs, so that our target labels are <math>y^{(i)} \in \Re^{s_3}</math>.  In our Matlab/Octave data structure, we will stack these in columns to form a Matlab/Octave variable <tt>y</tt>, so that the <math>i</math>-th column <tt>y(:,i)</tt> is <math>y^{(i)}</math>.
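For instance (a toy sketch with made-up numbers, taking <math>s_3 = 2</math> and <math>m = 3</math>):
<syntaxhighlight>
% Toy sketch: three 2-dimensional target labels stacked in columns
y = [1 0 1;
     0 1 0];
% y(:,2) is the second label y^{(2)}, i.e., the column vector [0; 1]
</syntaxhighlight>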
We now want to compute the gradient terms <math>\nabla_{W^{(l)}} J(W,b)</math> and <math>\nabla_{b^{(l)}} J(W,b)</math>.  Consider the first of these terms.  Following our earlier description of the [[Backpropagation Algorithm]], for a single training example <math>(x,y)</math> we can compute the derivatives as
::<math>
\begin{align}
\delta^{(3)} &= - (y - a^{(3)}) \bullet f'(z^{(3)}), \\
\delta^{(2)} &= ((W^{(2)})^T\delta^{(3)}) \bullet f'(z^{(2)}), \\
\nabla_{W^{(2)}} J(W,b;x,y) &= \delta^{(3)} (a^{(2)})^T, \\
\nabla_{W^{(1)}} J(W,b;x,y) &= \delta^{(2)} (a^{(1)})^T.
\end{align}
</math>
Here, <math>\bullet</math> denotes the element-wise product.  For simplicity, our description here will ignore the derivatives with respect to <math>b^{(l)}</math>, though your implementation of backpropagation will have to compute those derivatives too.

Suppose we have already implemented the vectorized forward propagation method, so that the matrix-valued <tt>z2</tt>, <tt>a2</tt>, <tt>z3</tt> and <tt>h</tt> are computed as described above.  We can then implement an ''unvectorized'' version of backpropagation as follows:
<syntaxhighlight>
gradW1 = zeros(size(W1));
gradW2 = zeros(size(W2));
for i=1:m,
  delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i));
  delta2 = W2'*delta3 .* fprime(z2(:,i));

  gradW2 = gradW2 + delta3*a2(:,i)';
  gradW1 = gradW1 + delta2*a1(:,i)';   % here a1 = x, the matrix of inputs
end;
</syntaxhighlight>
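In this code, <tt>fprime</tt> is assumed to be a function that computes the derivative <math>f'(z)</math> element-wise.  For the sigmoid activation used above, a minimal sketch can use the identity <math>f'(z) = f(z)(1-f(z))</math>:
<syntaxhighlight>
% Sketch of fprime for the sigmoid activation, using f'(z) = f(z).*(1 - f(z))
function output = fprime(z)
a = 1./(1+exp(-z));      % sigmoid of z, computed element-wise
output = a .* (1 - a);
end
</syntaxhighlight>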
This implementation has a <tt>for</tt> loop.  We would like to come up with an implementation that simultaneously performs backpropagation on all the examples, and eliminates this <tt>for</tt> loop.
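One way to do this (a sketch only; the [[Exercise:Vectorization|Vectorization exercise]] asks you to work through the details yourself) is to stack the <math>\delta</math> terms for all the examples in columns, so that the sum over examples in the <tt>for</tt> loop becomes a single matrix product:
<syntaxhighlight>
% Sketch of a vectorized implementation (still ignoring the derivatives for b1, b2)
delta3 = -(y - h) .* fprime(z3);       % column i is delta3 for example i
delta2 = (W2'*delta3) .* fprime(z2);   % column i is delta2 for example i
gradW2 = delta3 * a2';                 % equals the sum over i of delta3(:,i)*a2(:,i)'
gradW1 = delta2 * x';                  % a1 = x, so this sums delta2(:,i)*x(:,i)'
</syntaxhighlight>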
