Neural Network Vectorization

In this section, we derive a vectorized version of our neural network.  In our earlier description of Neural Networks, we had already given a partially vectorized implementation, which is quite efficient when we are working with only a single training example at a time.  Here we describe how to vectorize the computation across multiple training examples, so that we can avoid an explicit <tt>for</tt> loop over the examples.
 
== Forward propagation ==

Concretely, following the [[Logistic Regression Vectorization Example]], let the Matlab/Octave variable <tt>x</tt> be a matrix containing the training inputs, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example.  We can then implement forward propagation as:
<syntaxhighlight>
% Unvectorized implementation
for i=1:m,
  z2 = W1 * x(:,i) + b1;
  a2 = f(z2);
  z3 = W2 * a2 + b2;
  h(:,i) = f(z3);
end;
</syntaxhighlight>
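Can we get rid of this <tt>for</tt> loop?  To take advantage of Matlab/Octave's efficient matrix operations, we would like our code to operate on many training examples at once.  As a first step, temporarily ignore the bias terms <tt>b1</tt> and <tt>b2</tt> (say, set them to zero for now).  A minimal sketch of the resulting vectorized forward propagation, using the same variable names as above, is then:

<syntaxhighlight>
% Vectorized implementation (ignoring b1, b2)
z2 = W1 * x;
a2 = f(z2);
z3 = W2 * a2;
h = f(z3);
</syntaxhighlight>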
In this implementation, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> are all matrices, with one column per training example.  A common design pattern when vectorizing across training examples is that whereas previously we had a separate column vector (such as <tt>z2</tt>) for each training example, we instead compute a single matrix whose columns are all of these vectors stacked together.  Concretely, in this example, <tt>a2</tt> becomes an <math>s_2</math> by <math>m</math> matrix (where <math>s_2</math> is the number of units in layer 2 of the network, and <math>m</math> is the number of training examples), and the <math>i</math>-th column of <tt>a2</tt> contains the activations of the hidden units (layer 2 of the network) when the <math>i</math>-th training example <tt>x(:,i)</tt> is input to the network.
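As a quick sanity check (a sketch, using the variables above and still ignoring the bias terms), each column of the matrix-valued result should match the corresponding single-example computation:

<syntaxhighlight>
% Column i of the vectorized a2 equals the hidden activations for example i
i = 1;                              % any example index between 1 and m
a2_single = f(W1 * x(:,i));         % per-example computation (biases ignored)
disp(norm(a2(:,i) - a2_single));    % should print (approximately) zero
</syntaxhighlight>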
In the implementation above, we have assumed that the activation function <tt>f(z)</tt> takes as input a matrix <tt>z</tt>, and applies the activation function component-wise to the input.  Note that your implementation of <tt>f(z)</tt> should use Matlab/Octave's matrix operations as much as possible, and avoid <tt>for</tt> loops as well.  We illustrate this below, assuming that <tt>f(z)</tt> is the sigmoid activation function:
<syntaxhighlight>
% Inefficient, unvectorized implementation of the activation function
function output = unvectorized_f(z)
output = zeros(size(z));
for i=1:size(z,1),
  for j=1:size(z,2),
    output(i,j) = 1/(1+exp(-z(i,j)));
  end;
end;
end
</syntaxhighlight>
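By contrast, a vectorized implementation computes the sigmoid of every entry of the matrix <tt>z</tt> at once, with no <tt>for</tt> loops at all; a minimal sketch:

<syntaxhighlight>
% Efficient, vectorized implementation of the activation function
function output = vectorized_f(z)
output = 1./(1+exp(-z));     % "./" is Matlab/Octave's element-wise division operator
end
</syntaxhighlight>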
Finally, our vectorized implementation of forward propagation above had ignored <tt>b1</tt> and <tt>b2</tt>.  To incorporate those back in, we will use Matlab/Octave's built-in <tt>repmat</tt> function.  We have:
<syntaxhighlight>
% Vectorized implementation of forward propagation
z2 = W1 * x + repmat(b1,1,m);
a2 = f(z2);
z3 = W2 * a2 + repmat(b2,1,m);
h = f(z3);
</syntaxhighlight>
The result of <tt>repmat(b1,1,m)</tt> is a matrix formed by taking the column vector <tt>b1</tt> and stacking <math>m</math> copies of it in columns, as follows:

::<math>
\begin{bmatrix}
| & | &  & |  \\
{\rm b1} & {\rm b1} & \cdots & {\rm b1} \\
| & | &  & |
\end{bmatrix}.
</math>

This forms an <math>s_2</math> by <math>m</math> matrix.
Thus, the result of adding this to <tt>W1 * x</tt> is that each column of the matrix gets <tt>b1</tt> added to it, as desired.
See Matlab/Octave's documentation (type "<tt>help repmat</tt>") for more information.  As a Matlab/Octave built-in function, <tt>repmat</tt> is very efficient, and runs much faster than if you were to implement the same thing yourself using a <tt>for</tt> loop.
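For a concrete illustration of what <tt>repmat</tt> produces, here is a tiny example with made-up values:

<syntaxhighlight>
% Tile a 2-by-1 column vector across 3 columns
b = [1; 2];
repmat(b, 1, 3)     % returns the 2-by-3 matrix [1 1 1; 2 2 2]
</syntaxhighlight>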
== Backpropagation ==
We now describe the main ideas behind vectorizing backpropagation.  Before reading this section, we strongly encourage you to carefully step through all of the forward propagation code examples above to make sure you fully understand them.  In this text, we'll only sketch the main ideas of how to vectorize backpropagation, and leave you to derive the details in the [[Exercise:Vectorization|Vectorization exercise]].

We are in a supervised learning setting, so that we have a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples.  (For the autoencoder, we simply set <math>y^{(i)} = x^{(i)}</math>, but our derivation here will consider this more general setting.)

Suppose we have <math>s_3</math>-dimensional outputs, so that our target labels are <math>y^{(i)} \in \Re^{s_3}</math>.  In our Matlab/Octave data structure, we will stack these in columns to form a Matlab/Octave variable <tt>y</tt>, so that the <math>i</math>-th column <tt>y(:,i)</tt> is <math>y^{(i)}</math>.
We now want to compute the gradient terms <math>\nabla_{W^{(l)}} J(W,b)</math> and <math>\nabla_{b^{(l)}} J(W,b)</math>.  Consider the first of these terms.  Following our earlier description of the [[Backpropagation Algorithm]], we had that for a single training example <math>(x,y)</math>, we can compute the derivatives as

::<math>
\begin{align}
\delta^{(3)} &= - (y - a^{(3)}) \bullet f'(z^{(3)}), \\
\delta^{(2)} &= ((W^{(2)})^T\delta^{(3)}) \bullet f'(z^{(2)}), \\
\nabla_{W^{(2)}} J(W,b;x,y) &= \delta^{(3)} (a^{(2)})^T, \\
\nabla_{W^{(1)}} J(W,b;x,y) &= \delta^{(2)} (a^{(1)})^T.
\end{align}
</math>
Here, <math>\bullet</math> denotes the element-wise product.  For simplicity, our description here will ignore the derivatives with respect to <math>b^{(l)}</math>, though your implementation of backpropagation will have to compute those derivatives too.

Suppose we have already implemented the vectorized forward propagation method, so that the matrix-valued <tt>z2</tt>, <tt>a2</tt>, <tt>z3</tt> and <tt>h</tt> are computed as described above.  We can then implement an ''unvectorized'' version of backpropagation as follows:
<syntaxhighlight>
gradW1 = zeros(size(W1));
gradW2 = zeros(size(W2));
for i=1:m,
  delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i));
  delta2 = (W2'*delta3) .* fprime(z2(:,i));

  gradW2 = gradW2 + delta3*a2(:,i)';
  gradW1 = gradW1 + delta2*a1(:,i)';    % here a1 = x (the input-layer activations)
end;
</syntaxhighlight>
This implementation has a <tt>for</tt> loop.  We would like to come up with an implementation that simultaneously performs backpropagation on all the examples, and eliminates this <tt>for</tt> loop.

To do so, we will replace the vectors <tt>delta3</tt> and <tt>delta2</tt> with matrices, where the <math>i</math>-th column of each matrix corresponds to the <math>i</math>-th training example.  We will also implement a function <tt>fprime(z)</tt> that takes as input a matrix <tt>z</tt>, and applies <math>f'(\cdot)</math> element-wise.  Each of the four lines of Matlab in the <tt>for</tt> loop above can then be vectorized and replaced with a single line of Matlab code (without a surrounding <tt>for</tt> loop).
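For example, since we assumed above that <tt>f(z)</tt> is the sigmoid, we have <math>f'(z) = f(z)(1-f(z))</math>, so a matrix-valued <tt>fprime</tt> can be written using only element-wise operations.  A minimal sketch (assuming the sigmoid activation; the function name matches the one used in the code above):

<syntaxhighlight>
% Vectorized f'(z) for the sigmoid, applied element-wise to a matrix z
function output = fprime(z)
s = 1./(1+exp(-z));      % sigmoid of every entry of z
output = s .* (1-s);     % f'(z) = f(z).*(1-f(z)), element-wise
end
</syntaxhighlight>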
In the [[Exercise:Vectorization|Vectorization exercise]], we ask you to derive the vectorized version of this algorithm by yourself.  If you are able to do it from this description, we strongly encourage you to do so.  Here also are some [[Backpropagation vectorization hints]]; however, we encourage you to try to carry out the vectorization yourself without looking at the hints.

== Sparse autoencoder ==

The [[Autoencoders_and_Sparsity|sparse autoencoder]] neural network has an additional sparsity penalty that constrains neurons' average firing rate to be close to some target activation <math>\rho</math>.  When performing backpropagation on a single training example, we had taken into account the sparsity penalty by computing the following:
:<math>\begin{align}
\delta^{(2)}_i =
  \left( \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right)
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .
\end{align}</math>
In the ''unvectorized'' case, this was computed as:

<syntaxhighlight>
% Sparsity Penalty Delta
sparsity_delta = - rho ./ rho_hat + (1 - rho) ./ (1 - rho_hat);
for i=1:m,
  ...
  delta2 = (W2'*delta3(:,i) + beta*sparsity_delta) .* fprime(z2(:,i));
  ...
end;
</syntaxhighlight>
The code above still had a <tt>for</tt> loop over the training set, and <tt>delta2</tt> was a column vector.

In contrast, recall that in the vectorized case, <tt>delta2</tt> is now a matrix with <math>m</math> columns, one per training example.  Now, notice that the <tt>sparsity_delta</tt> term is the same regardless of which training example we are processing.  This suggests that we can vectorize the computation above by simply adding the same value to each column when constructing the <tt>delta2</tt> matrix.  That is, to vectorize the above computation, we simply add <tt>sparsity_delta</tt> (e.g., using <tt>repmat</tt>) to each column of <tt>delta2</tt>.
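Concretely, the matrix that gets added has <tt>beta*sparsity_delta</tt> in every one of its <math>m</math> columns, and can be formed with <tt>repmat</tt> just as we did for the bias terms above.  A minimal sketch (the variable name <tt>sparsity_term</tt> is only for illustration; how this term enters the full vectorized <tt>delta2</tt> computation is left to the exercise):

<syntaxhighlight>
% An s2-by-m matrix whose every column is beta*sparsity_delta
sparsity_term = repmat(beta * sparsity_delta, 1, m);
</syntaxhighlight>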
{{Vectorized Implementation}}
{{Languages|神经网络向量化|中文}}