Logistic Regression Vectorization Example
Consider training a logistic regression model using batch gradient ascent. Suppose our hypothesis is

:<math>\begin{align} h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, \end{align}</math>

where (following CS229 notational convention) we let <math>\textstyle x_0=1</math>, so that <math>\textstyle x \in \Re^{n+1}</math> and <math>\textstyle \theta \in \Re^{n+1}</math>, and <math>\textstyle \theta_0</math> is our intercept term. We have a training set <math>\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}</math> of <math>\textstyle m</math> examples, and the batch gradient ascent update rule is <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>, where <math>\textstyle \ell(\theta)</math> is the log likelihood and <math>\textstyle \nabla_\theta \ell(\theta)</math> is its derivative. [Note: Most of the notation below follows that defined in the class CS229: Machine Learning. Please see Lecture Notes #1 from http://cs229.stanford.edu/ for details.]

We thus need to compute the gradient:

:<math>\begin{align} \nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}. \end{align}</math>

Suppose that the Matlab/Octave variable <tt>x</tt> is the design matrix, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example <math>\textstyle x^{(i)}</math> and <tt>x(j,i)</tt> is <math>\textstyle x^{(i)}_j</math>. Further, suppose the Matlab/Octave variable <tt>y</tt> is a ''row'' vector of the labels in the training set, so that <tt>y(i)</tt> is <math>\textstyle y^{(i)} \in \{0,1\}</math>. (Here we differ from the CS229 notation, because in <tt>x</tt> we stack the training inputs in columns rather than in rows, and <tt>y</tt> <math>\in \Re^{1\times m}</math> is a row rather than a column vector.)

Here's a truly horrible, extremely slow implementation:

<syntaxhighlight lang="matlab">
% Implementation 1
grad = zeros(n+1,1);
for i=1:m,
  h = sigmoid(theta'*x(:,i));
  temp = y(i) - h;
  for j=1:n+1,
    grad(j) = grad(j) + temp * x(j,i);
  end;
end;
</syntaxhighlight>

The two nested for-loops make this very slow. Here's a more typical implementation that partially vectorizes the algorithm and gets better performance:

<syntaxhighlight lang="matlab">
% Implementation 2
grad = zeros(n+1,1);
for i=1:m,
  grad = grad + (y(i) - sigmoid(theta'*x(:,i))) * x(:,i);
end;
</syntaxhighlight>

However, it turns out to be possible to vectorize this even further. In Matlab/Octave, we can eliminate the remaining for-loop as well, and doing so speeds up the algorithm. In particular, we can write:

<syntaxhighlight lang="matlab">
% Implementation 3
grad = x * (y - sigmoid(theta'*x))';
</syntaxhighlight>

Here, we assume that the Matlab/Octave function <tt>sigmoid(z)</tt> takes as input a vector <tt>z</tt>, applies the sigmoid function component-wise to the input, and returns the result. The output of <tt>sigmoid(z)</tt> is therefore itself a vector of the same dimension as the input <tt>z</tt>. When the training set is large, this final implementation takes the greatest advantage of Matlab/Octave's highly optimized numerical linear algebra libraries to carry out the matrix-vector operations, and so it is far more efficient than the earlier implementations.
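As a quick sanity check, one can define <tt>sigmoid</tt> as a component-wise anonymous function, confirm on random data that the double-loop and fully vectorized gradients agree, and then run a few batch gradient ascent updates <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>. This is a minimal illustrative sketch: the sizes <tt>n</tt> and <tt>m</tt>, the random data, the number of iterations, and the learning rate <tt>alpha</tt> are all assumptions chosen for the example, not values from the tutorial.

<syntaxhighlight lang="matlab">
% Illustrative sketch with assumed sizes, data, and learning rate.
n = 5; m = 100;                       % n features plus intercept; m examples
x = [ones(1,m); rand(n,m)];           % design matrix; first row is x_0 = 1
y = double(rand(1,m) > 0.5);          % random labels in {0,1}, as a row vector
theta = zeros(n+1,1);
alpha = 0.1;                          % assumed learning rate

% Component-wise sigmoid, as assumed in the text.
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Implementation 1 (double loop), for comparison.
grad1 = zeros(n+1,1);
for i=1:m,
  h = sigmoid(theta'*x(:,i));
  for j=1:n+1,
    grad1(j) = grad1(j) + (y(i) - h) * x(j,i);
  end;
end;

% Implementation 3 (fully vectorized).
grad3 = x * (y - sigmoid(theta'*x))';

disp(max(abs(grad1 - grad3)));        % should print (nearly) 0

% A few batch gradient ascent updates using the vectorized gradient.
for iter = 1:100,
  theta = theta + alpha * (x * (y - sigmoid(theta'*x))');
end;
</syntaxhighlight>

On large training sets, virtually all of the speedup comes from the vectorized update inside the loop: each iteration is a single matrix-vector product handled by the optimized linear algebra libraries, rather than <math>\textstyle m</math> separate per-example computations.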