Logistic Regression Vectorization Example
Consider training a logistic regression model using batch gradient ascent. Suppose our hypothesis is

:<math>\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},
\end{align}</math>
where (following the notational convention from the OpenClassroom videos and from CS229) we let <math>\textstyle x_0=1</math>, so that <math>\textstyle x \in \Re^{n+1}</math> and <math>\textstyle \theta \in \Re^{n+1}</math>, and <math>\textstyle \theta_0</math> is our intercept term. We have a training set <math>\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}</math> of <math>\textstyle m</math> examples, and the batch gradient ascent update rule is <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>, where <math>\textstyle \ell(\theta)</math> is the log likelihood and <math>\textstyle \nabla_\theta \ell(\theta)</math> is its derivative.

[Note: Most of the notation below follows that defined in the OpenClassroom videos or in the class CS229: Machine Learning. For details, see either the [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning OpenClassroom videos] or Lecture Notes #1 of http://cs229.stanford.edu/ .]
We thus need to compute the gradient:
:<math>\begin{align}
\nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}.
\end{align}</math>
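
(For reference, the log likelihood being ascended here is the standard logistic regression objective from the CS229 notes,

:<math>\begin{align}
\ell(\theta) = \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right) \log\left(1-h_\theta(x^{(i)})\right),
\end{align}</math>

and differentiating this with respect to <math>\textstyle \theta</math> yields the expression above.)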
- | |||
- | |||
- | |||
- | |||
- | Here's truly horrible, extremely slow, implementation: | + | Suppose that the Matlab/Octave variable <tt>x</tt> is a matrix containing the training inputs, so that |
+ | <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example <math>\textstyle x^{(i)}</math>, and <tt>x(j,i)</tt> is <math>\textstyle x^{(i)}_j</math>. | ||
+ | Further, suppose the Matlab/Octave variable <tt>y</tt> is a ''row'' vector of the labels in the | ||
+ | training set, so that the variable <tt>y(i)</tt> is <math>\textstyle y^{(i)} \in \{0,1\}</math>. (Here we differ from the | ||
+ | OpenClassroom/CS229 notation. Specifically, in the matrix-valued <tt>x</tt> we stack the training inputs in columns rather than in rows; | ||
+ | and <tt>y</tt><math>\in \Re^{1\times m}</math> is a row vector rather than a column vector.) | ||
+ | |||
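
For concreteness, here is a minimal sketch of this data layout using hypothetical toy values (the dimensions <tt>n = 2</tt>, <tt>m = 4</tt> and the data below are illustrative assumptions, not from the original text):
<syntaxhighlight lang="matlab">
% Hypothetical toy data illustrating the layout described above.
n = 2; m = 4;
x = [ones(1,m); rand(n,m)];   % x is (n+1) x m; row 1 holds the intercept x_0 = 1
y = [0 1 1 0];                % y is 1 x m, a row vector of labels in {0,1}
theta = zeros(n+1,1);         % theta is (n+1) x 1, including the intercept term
</syntaxhighlight>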

Here's a truly horrible, extremely slow implementation of the gradient computation:
<syntaxhighlight lang="matlab">
% Implementation 1
grad = zeros(n+1,1);
for i=1:m,
  h = sigmoid(theta'*x(:,i));            % hypothesis for the i-th example
  temp = y(i) - h;                       % prediction error for the i-th example
  for j=1:n+1,
    grad(j) = grad(j) + temp * x(j,i);   % accumulate each component separately
  end;
end;
</syntaxhighlight>

Here's a more typical implementation that partially vectorizes the algorithm and gets better performance:
<syntaxhighlight lang="matlab">
% Implementation 2
grad = zeros(n+1,1);
for i=1:m,
  grad = grad + (y(i) - sigmoid(theta'*x(:,i))) * x(:,i);
end;
</syntaxhighlight>

However, it turns out to be possible to vectorize this even further. If we can get rid of the for-loop, we can significantly speed up the implementation. In particular, suppose <tt>b</tt> is a column vector and <tt>A</tt> is a matrix. Consider the following two ways of computing <tt>A * b</tt>:
<syntaxhighlight lang="matlab">
% Slow implementation of matrix-vector multiply
grad = zeros(n+1,1);
for i=1:m,
  grad = grad + b(i) * A(:,i);   % more commonly written A(:,i)*b(i)
end;

% Fast implementation of matrix-vector multiply
grad = A*b;
</syntaxhighlight>
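
You can see the speed difference empirically with a quick timing experiment (a hypothetical sketch; the sizes chosen here are arbitrary, and actual timings depend on your machine):
<syntaxhighlight lang="matlab">
% Hypothetical timing comparison of the two approaches above.
m = 100000; n = 100;
A = rand(n+1,m); b = rand(m,1);

tic;
g1 = zeros(n+1,1);
for i=1:m, g1 = g1 + b(i) * A(:,i); end;
toc;          % slow: accumulates one column at a time

tic;
g2 = A*b;     % fast: a single optimized matrix-vector multiply
toc;
</syntaxhighlight>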
+ | |||
+ | We recognize that Implementation 2 of our gradient descent calculation above is using the slow version with a for-loop, with | ||
+ | <tt>b(i)</tt> playing the role of <tt>(y(i) - sigmoid(theta'*x(:,i)))</tt>, and <tt>A</tt> playing the role of <tt>x</tt>. We can derive a fast implementation as follows: | ||
<syntaxhighlight lang="matlab">
% Implementation 3
grad = x * (y - sigmoid(theta'*x))';
</syntaxhighlight>

Here, we assume that the Matlab/Octave function <tt>sigmoid(z)</tt> takes as input a vector <tt>z</tt>, applies the sigmoid function component-wise to its input, and returns the result. The output of <tt>sigmoid(z)</tt> is therefore itself also a vector, of the same dimension as the input <tt>z</tt>.
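
The text doesn't give the body of <tt>sigmoid</tt>; one straightforward vectorized definition consistent with the assumption above is:
<syntaxhighlight lang="matlab">
% A possible vectorized sigmoid (an assumed implementation, not from the
% original text; e.g., saved as sigmoid.m). exp() and ./ operate
% element-wise in Matlab/Octave, so g has the same dimensions as z.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
</syntaxhighlight>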
+ | |||
+ | When the training set is large, this final implementation takes the greatest advantage of Matlab/Octave's highly optimized numerical linear algebra libraries to carry out the matrix-vector operations, and so this is far more efficient than the earlier implementations. | ||
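
Putting the pieces together, a complete batch gradient ascent loop built on Implementation 3 might look like the following sketch (the learning rate <tt>alpha</tt> and the iteration count are illustrative assumptions; the text does not specify them):
<syntaxhighlight lang="matlab">
% Hypothetical full training loop using the vectorized gradient.
alpha = 0.1;                               % learning rate (assumed)
num_iters = 1000;                          % number of ascent steps (assumed)
theta = zeros(n+1,1);
for iter = 1:num_iters,
  grad = x * (y - sigmoid(theta'*x))';     % Implementation 3
  theta = theta + alpha * grad;            % ascent: step *up* the gradient
end;
</syntaxhighlight>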
+ | |||
+ | Coming up with vectorized implementations isn't always easy, and sometimes requires careful thought. But as you gain familiarity with vectorized operations, you'll find that there are design patterns (i.e., a small number of ways of vectorizing) that apply to many different pieces of code. | ||
+ | |||
+ | |||
+ | {{Vectorized Implementation}} | ||
+ | |||
+ | |||
+ | {{Languages|逻辑回归的向量化实现样例|中文}} |