# Logistic Regression Vectorization Example

(Difference between revisions)
 Revision as of 00:53, 23 April 2011 (view source)Maiyifan (Talk | contribs)m (Added semicolons, deleted stray quote)← Older edit Revision as of 18:32, 29 April 2011 (view source)Ang (Talk | contribs) Newer edit → Line 4: Line 4: h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, \end{align}[/itex] \end{align}[/itex] - where (following CS229 notational convention) we let $\textstyle x_0=1$, so that $\textstyle x \in \Re^{n+1}$ + where (following the notational convention from the OpenClassroom videos and from CS229) we let $\textstyle x_0=1$, so that $\textstyle x \in \Re^{n+1}$ and $\textstyle \theta \in \Re^{n+1}$, and $\textstyle \theta_0$ is our intercept term.  We have a training set and $\textstyle \theta \in \Re^{n+1}$, and $\textstyle \theta_0$ is our intercept term.  We have a training set $\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $\textstyle m$ examples, and the batch gradient $\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $\textstyle m$ examples, and the batch gradient Line 10: Line 10: is the log likelihood and $\textstyle \nabla_\theta \ell(\theta)$ is its derivative. is the log likelihood and $\textstyle \nabla_\theta \ell(\theta)$ is its derivative. - [Note: Most of the notation below follows that defined in the class + [Note: Most of the notation below follows that defined in the OpenClassroom videos or in the class - CS229: Machine Learning.  Please see Lecture notes #1 from http://cs229.stanford.edu/ for details.] + CS229: Machine Learning.  For details, see either the [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning OpenClassroom videos] or Lecture Notes #1 of http://cs229.stanford.edu/ .] We thus need to compute the gradient: We thus need to compute the gradient: Line 22: Line 22: Further, suppose the Matlab/Octave variable y is a ''row'' vector of the labels in the Further, suppose the Matlab/Octave variable y is a ''row'' vector of the labels in the training set, so that the variable y(i) is $\textstyle y^{(i)} \in \{0,1\}$.  (Here we differ from the training set, so that the variable y(i) is $\textstyle y^{(i)} \in \{0,1\}$.  (Here we differ from the - CS229 notation. Specifically, in the matrix-valued x we stack the training inputs in columns rather than in rows; + OpenClassroom/CS229 notation. Specifically, in the matrix-valued x we stack the training inputs in columns rather than in rows; and y$\in \Re^{1\times m}$ is a row vector rather than a column vector.) and y$\in \Re^{1\times m}$ is a row vector rather than a column vector.) Line 55: Line 55: end; end; - % Fast implementation of matrix-vector multiple + % Fast implementation of matrix-vector multiply grad = A*b; grad = A*b;

## Revision as of 18:32, 29 April 2011

Consider training a logistic regression model using batch gradient ascent. Suppose our hypothesis is

\begin{align} h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)}, \end{align}

where (following the notational convention from the OpenClassroom videos and from CS229) we let $\textstyle x_0=1$, so that $\textstyle x \in \Re^{n+1}$ and $\textstyle \theta \in \Re^{n+1}$, and $\textstyle \theta_0$ is our intercept term. We have a training set $\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $\textstyle m$ examples, and the batch gradient ascent update rule is $\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)$, where $\textstyle \ell(\theta)$ is the log likelihood and $\textstyle \nabla_\theta \ell(\theta)$ is its derivative.

[Note: Most of the notation below follows that defined in the OpenClassroom videos or in the class CS229: Machine Learning. For details, see either the OpenClassroom videos or Lecture Notes #1 of http://cs229.stanford.edu/ .]

We thus need to compute the gradient:

\begin{align} \nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}_j. \end{align}

Suppose that the Matlab/Octave variable x is a matrix containing the training inputs, so that x(:,i) is the $\textstyle i$-th training example $\textstyle x^{(i)}$, and x(j,i) is $\textstyle x^{(i)}_j$. Further, suppose the Matlab/Octave variable y is a row vector of the labels in the training set, so that the variable y(i) is $\textstyle y^{(i)} \in \{0,1\}$. (Here we differ from the OpenClassroom/CS229 notation. Specifically, in the matrix-valued x we stack the training inputs in columns rather than in rows; and y$\in \Re^{1\times m}$ is a row vector rather than a column vector.)

Here's truly horrible, extremely slow, implementation of the gradient computation:

% Implementation 1
for i=1:m,
h = sigmoid(theta'*x(:,i));
temp = y(i) - h;
for j=1:n+1,
end;
end;

The two nested for-loops makes this very slow. Here's a more typical implementation, that partially vectorizes the algorithm and gets better performance:

% Implementation 2
for i=1:m,
end;

However, it turns out to be possible to even further vectorize this. If we can get rid of the for-loop, we can significantly speed up the implementation. In particular, suppose b is a column vector, and A is a matrix. Consider the following ways of computing A * b:

% Slow implementation of matrix-vector multiply
for i=1:m,
grad = grad + b(i) * A(:,i);  % more commonly written A(:,i)*b(i)
end;

% Fast implementation of matrix-vector multiply
grad = A*b;

We recognize that Implementation 2 of our gradient descent calculation above is using the slow version with a for-loop, with b(i) playing the role of (y(i) - sigmoid(theta'*x(:,i))), and A playing the role of x. We can derive a fast implementation as follows:

% Implementation 3
grad = x * (y- sigmoid(theta'*x))';

Here, we assume that the Matlab/Octave sigmoid(z) takes as input a vector z, applies the sigmoid function component-wise to the input, and returns the result. The output of sigmoid(z) is therefore itself also a vector, of the same dimension as the input z

When the training set is large, this final implementation takes the greatest advantage of Matlab/Octave's highly optimized numerical linear algebra libraries to carry out the matrix-vector operations, and so this is far more efficient than the earlier implementations.

Coming up with vectorized implementations isn't always easy, and sometimes requires careful thought. But as you gain familiarity with vectorized operations, you'll find that there are design patterns (i.e., a small number of ways of vectorizing) that apply to many different pieces of code.