Logistic Regression Vectorization Example

Consider training a logistic regression model using batch gradient ascent.  Suppose our hypothesis is

:<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>

where (following the notational convention from the OpenClassroom videos and from CS229) we let <math>\textstyle x_0=1</math>, so that <math>\textstyle x \in \Re^{n+1}</math>
and <math>\textstyle \theta \in \Re^{n+1}</math>, and <math>\textstyle \theta_0</math> is our intercept term.  We have a training set
<math>\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}</math> of <math>\textstyle m</math> examples, and the batch gradient
ascent update rule is <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>, where <math>\textstyle \ell(\theta)</math>
is the log likelihood and <math>\textstyle \nabla_\theta \ell(\theta)</math> is its derivative.

[Note: Most of the notation below follows that defined in the OpenClassroom videos or in the class
CS229: Machine Learning.  For details, see either the [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning OpenClassroom videos] or Lecture Notes #1 of http://cs229.stanford.edu/ .]
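
For reference, the log likelihood being maximized here is the standard logistic regression log likelihood for labels <math>\textstyle y^{(i)} \in \{0,1\}</math>:

:<math>\begin{align}
\ell(\theta) = \sum_{i=1}^m \left( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log\left(1-h_\theta(x^{(i)})\right) \right).
\end{align}</math>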

We thus need to compute the gradient:

:<math>\begin{align}
\nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}.
\end{align}</math>
Suppose that the Matlab/Octave variable <tt>x</tt> is a matrix containing the training inputs, so that
<tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example <math>\textstyle x^{(i)}</math>, and <tt>x(j,i)</tt> is <math>\textstyle x^{(i)}_j</math>.
Further, suppose the Matlab/Octave variable <tt>y</tt> is a ''row'' vector of the labels in the
training set, so that the variable <tt>y(i)</tt> is <math>\textstyle y^{(i)} \in \{0,1\}</math>.  (Here we differ from the
OpenClassroom/CS229 notation.  Specifically, in the matrix-valued <tt>x</tt> we stack the training inputs in columns rather than in rows;
and <tt>y</tt><math>\in \Re^{1\times m}</math> is a row vector rather than a column vector.)
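
For concreteness, the shapes involved might look as follows; the toy values of <tt>n</tt>, <tt>m</tt>, and the data below are chosen purely for illustration and are not part of the original example:

<syntaxhighlight lang="matlab">
% Toy illustration of the assumed layout (values chosen arbitrarily).
n = 2;  m = 4;                 % 2 features plus the intercept term, 4 training examples
x = [ones(1,m); rand(n,m)];    % (n+1) x m: each column is one example, first row is x_0 = 1
y = [0 1 1 0];                 % 1 x m row vector of labels y^(i) in {0,1}
theta = zeros(n+1,1);          % (n+1) x 1 parameter vector
</syntaxhighlight>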

Here's a truly horrible, extremely slow implementation of the gradient computation:
<syntaxhighlight lang="matlab">
% Implementation 1
grad = zeros(n+1,1);
for i=1:m,
  h = sigmoid(theta'*x(:,i));
  temp = y(i) - h;
  for j=1:n+1,
    grad(j) = grad(j) + temp * x(j,i);
  end;
end;
</syntaxhighlight>

Here's a more typical implementation that partially vectorizes the algorithm and gets better performance:
<syntaxhighlight lang="matlab">
% Implementation 2
grad = zeros(n+1,1);
for i=1:m,
  grad = grad + (y(i) - sigmoid(theta'*x(:,i))) * x(:,i);
end;
</syntaxhighlight>

However, it turns out to be possible to vectorize this even further.  If we can get rid of the for-loop, we can significantly speed up the implementation.  In particular, suppose <tt>b</tt> is a column vector and <tt>A</tt> is a matrix.  Consider the following two ways of computing <tt>A * b</tt>:
<syntaxhighlight lang="matlab">
% Slow implementation of matrix-vector multiply
grad = zeros(n+1,1);
for i=1:m,
  grad = grad + b(i) * A(:,i);  % more commonly written A(:,i)*b(i)
end;

% Fast implementation of matrix-vector multiply
grad = A*b;
</syntaxhighlight>

We recognize that Implementation 2 of our gradient ascent calculation above is using the slow version with a for-loop, with
<tt>b(i)</tt> playing the role of <tt>(y(i) - sigmoid(theta'*x(:,i)))</tt>, and <tt>A</tt> playing the role of <tt>x</tt>.  We can derive a fast implementation as follows:
<syntaxhighlight lang="matlab">
% Implementation 3
grad = x * (y - sigmoid(theta'*x))';
</syntaxhighlight>

Here, we assume that the Matlab/Octave function <tt>sigmoid(z)</tt> takes as input a vector <tt>z</tt>, applies the sigmoid function component-wise to the input, and returns the result.  The output of <tt>sigmoid(z)</tt> is therefore itself also a vector, of the same dimension as the input <tt>z</tt>.
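
For concreteness, one way such a helper could be defined (for example, in its own file <tt>sigmoid.m</tt>) is sketched below; this particular definition is an illustrative assumption rather than code taken from the tutorial:

<syntaxhighlight lang="matlab">
% sigmoid.m -- illustrative helper, assumed rather than prescribed by the tutorial.
% Applies the logistic function element-wise, so a vector input gives a vector output.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
</syntaxhighlight>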

When the training set is large, this final implementation takes the greatest advantage of Matlab/Octave's highly optimized numerical linear algebra libraries to carry out the matrix-vector operations, and so it is far more efficient than the earlier implementations.
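
To see how the vectorized gradient plugs into the update rule <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>, here is a sketch of a complete batch gradient ascent loop.  The step size <tt>alpha</tt>, the iteration count <tt>numIterations</tt>, and the zero initialization of <tt>theta</tt> are illustrative assumptions, not values prescribed by the tutorial:

<syntaxhighlight lang="matlab">
% Illustrative batch gradient ascent loop built around Implementation 3.
% alpha and numIterations are assumed values chosen only for this sketch.
alpha = 0.1;
numIterations = 500;
theta = zeros(n+1, 1);
for iter = 1:numIterations,
  grad = x * (y - sigmoid(theta'*x))';  % vectorized gradient of the log likelihood
  theta = theta + alpha * grad;         % gradient ascent step
end;
</syntaxhighlight>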

Coming up with vectorized implementations isn't always easy, and sometimes requires careful thought.  But as you gain familiarity with vectorized operations, you'll find that there are design patterns (i.e., a small number of ways of vectorizing) that apply to many different pieces of code.

{{Vectorized Implementation}}

{{Languages|逻辑回归的向量化实现样例|中文}}
