PCA

From Ufldl

\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T.  
\end{align}</math>
If <math>\textstyle x</math> has zero mean, then <math>\textstyle \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>.  (The symbol "<math>\textstyle \Sigma</math>", pronounced "Sigma", is the standard notation for denoting the covariance matrix.  Unfortunately it looks just like the summation symbol, as in <math>\sum_{i=1}^n i</math>; but these are two different things.)
It can then be shown that <math>\textstyle u_1</math>---the principal direction of variation of the data---is  
the second eigenvector.
Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at the bottom of this page).  You won't need to do so to follow this course, however.
You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
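As a concrete illustration of this step, here is a short NumPy sketch (NumPy is an assumption; the text does not prescribe a library) that forms <math>\textstyle \Sigma</math> from zero-mean synthetic data and extracts its eigenvectors, with the principal direction <math>\textstyle u_1</math> first:

```python
import numpy as np

# Hypothetical data: 500 zero-mean 2-D examples, one example per column,
# matching the x^{(i)} in R^n convention used in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 500))
X[1] += 0.8 * X[0]                       # correlate the two coordinates
X -= X.mean(axis=1, keepdims=True)       # enforce zero mean

m = X.shape[1]
Sigma = (X @ X.T) / m                    # Sigma = (1/m) sum_i x^(i) (x^(i))^T

# Sigma is symmetric, so np.linalg.eigh applies; it returns eigenvalues in
# ascending order, so flip to put the principal direction u_1 first.
eigvals, U = np.linalg.eigh(Sigma)
eigvals, U = eigvals[::-1], U[:, ::-1]
u1 = U[:, 0]                             # top principal direction of variation
```

Any standard eigensolver works here; `eigh` is used only because <math>\textstyle \Sigma</math> is symmetric.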
This is the training set rotated into the <math>\textstyle u_1</math>,<math>\textstyle u_2</math> basis. In the general
case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis 
<math>\textstyle u_1</math>, <math>\textstyle u_2</math>, ..., <math>\textstyle u_n</math>. 
One of the properties of <math>\textstyle U</math> is that it is an "orthogonal" matrix, which means
that it satisfies <math>\textstyle U^TU = UU^T = I</math>. 
So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the 
original data <math>\textstyle x</math>, you can compute  
:<math>\begin{align}
x = U x_{\rm rot}
\end{align}</math>
because <math>\textstyle U x_{\rm rot} =  UU^T x = x</math>.
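The rotation and its inverse can be sketched in a few lines of NumPy (a hypothetical example, not part of the original text): since <math>\textstyle U</math> is orthogonal, applying <math>\textstyle U</math> to <math>\textstyle x_{\rm rot}</math> recovers <math>\textstyle x</math> exactly.

```python
import numpy as np

# Sketch with hypothetical random data: rotate by U^T, then invert with U.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / X.shape[1]
_, U = np.linalg.eigh(Sigma)             # columns form an orthonormal basis

X_rot = U.T @ X                          # training set in the u_1,...,u_n basis
X_back = U @ X_rot                       # undo the rotation: U U^T x = x
assert np.allclose(X_back, X)            # recovered exactly
```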
== Reducing the Data Dimension ==
We see that the principal direction of variation of the data is the first
Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation
of the original <math>\textstyle x \in \Re^n</math>.  Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to 
the original value of <math>\textstyle x</math>?  From an [[#Rotating the Data|earlier section]], we know that <math>\textstyle x = U x_{\rm rot}</math>.  Further, 
we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have
set the last <math>\textstyle n-k</math> components to zeros.  Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can 
= \sum_{i=1}^k u_i \tilde{x}_i.
\end{align}</math>
The final equality above comes from the definition of <math>\textstyle U</math> [[#Example and Mathematical Background|given earlier]].
(In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply
(In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply
by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just  
by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just  
Line 201: Line 202:
approximation to the data.  
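A minimal NumPy sketch of the compression and reconstruction steps (hypothetical data; the variable names are illustrative, not from the text) computes <math>\textstyle \hat{x} = \sum_{i=1}^k u_i \tilde{x}_i</math> directly, without the zero-padding detour:

```python
import numpy as np

# Hypothetical sketch: keep only the top k components, then reconstruct.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 300))
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / X.shape[1]
_, U = np.linalg.eigh(Sigma)
U = U[:, ::-1]                           # u_1 (largest variance) first

k = 2
X_tilde = U[:, :k].T @ X                 # compressed representation in R^k
X_hat = U[:, :k] @ X_tilde               # x_hat = sum_{i=1}^k u_i x_tilde_i
```

Multiplying by the first <math>\textstyle k</math> columns of <math>\textstyle U</math> is exactly the "don't multiply by the zeros" shortcut described above.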
To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained''' 
for different values of <math>\textstyle k</math>.  Concretely, if <math>\textstyle k=n</math>, then we have
an exact approximation to the data, and we say that 100% of the variance is
an exact approximation to the data, and we say that 100% of the variance is
retained.  I.e., all of the variation of the original data is retained.  
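Assuming the standard PCA convention (not restated in this excerpt) that the variance retained by <math>\textstyle k</math> components is the ratio of the top <math>\textstyle k</math> eigenvalues of <math>\textstyle \Sigma</math> to their total, the computation is a one-liner:

```python
import numpy as np

# Assumed convention: with eigenvalues lambda_1 >= ... >= lambda_n of Sigma,
# the fraction of variance retained by keeping k components is
# (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_n).
def variance_retained(eigvals, k):
    return eigvals[:k].sum() / eigvals.sum()

eigvals = np.array([4.0, 2.0, 1.0, 0.5])                 # hypothetical spectrum
assert variance_retained(eigvals, len(eigvals)) == 1.0   # k = n retains 100%
```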
a typical animal or person might see over their lifetime.
Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm.  But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry or have strange artifacts, should work.
When training on natural images, it makes little sense to estimate a separate mean and
variance for each pixel, because the statistics in one part
of the image should (theoretically) be the same as any other.  
This property of images is called '''stationarity'''. 
In detail, in order for PCA to work well, informally we require that (i) The
this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.
If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent. But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.
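Per-image mean normalization can be sketched as follows (hypothetical patch data; NumPy is an assumption): each patch has its own scalar mean subtracted, which is different from subtracting a per-pixel mean computed across the training set.

```python
import numpy as np

# Per-image mean subtraction: each column is one flattened (hypothetical)
# 16x16 patch x^{(i)}; we subtract that patch's own mean intensity rather
# than estimating a separate mean for every pixel x_j.
rng = np.random.default_rng(3)
patches = rng.uniform(size=(16 * 16, 1000))

patches -= patches.mean(axis=0, keepdims=True)   # one scalar mean per patch

assert np.allclose(patches.mean(axis=0), 0.0)    # every patch now zero-mean
```

(Subtracting the per-pixel mean instead would be `patches.mean(axis=1, keepdims=True)`, i.e., averaging over examples rather than over pixels.)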
== References ==
http://cs229.stanford.edu

Latest revision as of 13:18, 7 April 2013