Principal Components Analysis
First translation: @交大基层代表, @Emma_lzhang (Sina Weibo IDs); first review: @Dr金峰; second review: @破破的桥; transcription: @Emma_lzhang; email: emma.lzhang@gmail.com

== Introduction ==

Principal Components Analysis (PCA) is a dimensionality reduction algorithm that can be used to significantly speed up your unsupervised feature learning algorithm. More importantly, understanding PCA will enable us to later implement '''whitening''', which is an important pre-processing step for many algorithms.

Suppose you are training your algorithm on images. Then the input will be somewhat redundant, because the values of adjacent pixels in an image are highly correlated. Concretely, suppose we are training on 16x16 grayscale image patches. Then <math>\textstyle x \in \Re^{256}</math> are 256 dimensional vectors, with one feature <math>\textstyle x_j</math> corresponding to the intensity of each pixel. Because of the correlation between adjacent pixels, PCA will allow us to approximate the input with a much lower dimensional one, while incurring very little error.

== Example and Mathematical Background ==

For our running example, we will use a dataset <math>\textstyle \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}</math> with <math>\textstyle n=2</math> dimensional inputs, so that <math>\textstyle x^{(i)} \in \Re^2</math>. Suppose we want to reduce the data from 2 dimensions to 1. (In practice, we might want to reduce data from 256 to 50 dimensions, say; but using lower dimensional data in our example allows us to visualize the algorithms better.) Here is our dataset:

[[File:PCA-rawdata.png|600px]]

This data has already been pre-processed so that each of the features <math>\textstyle x_1</math> and <math>\textstyle x_2</math> have about the same mean (zero) and variance. For the purpose of illustration, we have also colored each of the points one of three colors, depending on their <math>\textstyle x_1</math> value; these colors are not used by the algorithm, and are for illustration only.

PCA will find a lower-dimensional subspace onto which to project our data.
From visually examining the data, it appears that <math>\textstyle u_1</math> is the principal direction of variation of the data, and <math>\textstyle u_2</math> the secondary direction of variation:

[[File:PCA-u1.png | 600px]]

I.e., the data varies much more in the direction <math>\textstyle u_1</math> than <math>\textstyle u_2</math>. To more formally find the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, we first compute the matrix <math>\textstyle \Sigma</math> as follows:

:<math>\begin{align} \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T. \end{align}</math>

If <math>\textstyle x</math> has zero mean, then <math>\textstyle \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>. (The symbol "<math>\textstyle \Sigma</math>", pronounced "Sigma", is the standard notation for denoting the covariance matrix. Unfortunately it looks just like the summation symbol, as in <math>\sum_{i=1}^n i</math>; but these are two different things.)

It can then be shown that <math>\textstyle u_1</math>---the principal direction of variation of the data---is the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is the second eigenvector.

Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at bottom of this page). You won't need to do so to follow along this course, however.

You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
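As a concrete sketch of the two steps above (computing <math>\textstyle \Sigma</math> and its eigenvectors), here is a minimal NumPy version. It assumes the data is already zero-mean and stored with one example per column; the function name is ours, not part of any library:

```python
import numpy as np

def pca_eigenvectors(X):
    """Compute the PCA basis for zero-mean data X of shape (n, m),
    where each column X[:, i] is one training example x^(i)."""
    n, m = X.shape
    Sigma = (X @ X.T) / m              # Sigma = (1/m) * sum_i x^(i) (x^(i))^T
    # Sigma is symmetric, so eigh returns real eigenvalues and
    # orthonormal eigenvectors (in ascending eigenvalue order).
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]  # re-sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]  # lambda_1..n, U = [u_1 ... u_n]
```

The returned matrix has <math>\textstyle u_1</math> (the principal eigenvector) in its first column, matching the definition of <math>\textstyle U</math> below.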
Concretely, let us compute the eigenvectors of <math>\textstyle \Sigma</math>, and stack the eigenvectors in columns to form the matrix <math>\textstyle U</math>:

:<math>\begin{align} U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} \end{align}</math>

Here, <math>\textstyle u_1</math> is the principal eigenvector (corresponding to the largest eigenvalue), <math>\textstyle u_2</math> is the second eigenvector, and so on. Also, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the corresponding eigenvalues.

The vectors <math>\textstyle u_1</math> and <math>\textstyle u_2</math> in our example form a new basis in which we can represent the data. Concretely, let <math>\textstyle x \in \Re^2</math> be some training example. Then <math>\textstyle u_1^Tx</math> is the length (magnitude) of the projection of <math>\textstyle x</math> onto the vector <math>\textstyle u_1</math>. Similarly, <math>\textstyle u_2^Tx</math> is the magnitude of <math>\textstyle x</math> projected onto the vector <math>\textstyle u_2</math>.

== Rotating the Data ==

Thus, we can represent <math>\textstyle x</math> in the <math>\textstyle (u_1, u_2)</math>-basis by computing

:<math>\begin{align} x_{\rm rot} = U^Tx = \begin{bmatrix} u_1^Tx \\ u_2^Tx \end{bmatrix} \end{align}</math>

(The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly reflection) of the original data.) Let's take the entire training set, and compute <math>\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}</math> for every <math>\textstyle i</math>. Plotting this transformed data <math>\textstyle x_{\rm rot}</math>, we get:

[[File:PCA-rotated.png|600px]]

This is the training set rotated into the <math>\textstyle u_1</math>,<math>\textstyle u_2</math> basis.
In the general case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis <math>\textstyle u_1</math>, <math>\textstyle u_2</math>, ..., <math>\textstyle u_n</math>. One of the properties of <math>\textstyle U</math> is that it is an "orthogonal" matrix, which means that it satisfies <math>\textstyle U^TU = UU^T = I</math>. So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the original data <math>\textstyle x</math>, you can compute

:<math>\begin{align} x = U x_{\rm rot}, \end{align}</math>

because <math>\textstyle U x_{\rm rot} = UU^T x = x</math>.

== Reducing the Data Dimension ==

We see that the principal direction of variation of the data is the first dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data. Thus, if we want to reduce this data to one dimension, we can set

:<math>\begin{align} \tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re. \end{align}</math>

More generally, if <math>\textstyle x \in \Re^n</math> and we want to reduce it to a <math>\textstyle k</math> dimensional representation <math>\textstyle \tilde{x} \in \Re^k</math> (where <math>\textstyle k < n</math>), we would take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to the top <math>\textstyle k</math> directions of variation.

Another way of explaining PCA is that <math>\textstyle x_{\rm rot}</math> is an <math>\textstyle n</math> dimensional vector, where the first few components are likely to be large (e.g., in our example, we saw that <math>\textstyle x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)}</math> takes reasonably large values for most examples <math>\textstyle i</math>), and the later components are likely to be small (e.g., in our example, <math>\textstyle x_{{\rm rot},2}^{(i)} = u_2^Tx^{(i)}</math> was more likely to be small).
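The rotation, its inverse, and the truncation to the top <math>\textstyle k</math> components can be sketched as follows (a minimal illustration, assuming `U` holds the eigenvectors of <math>\textstyle \Sigma</math> in its columns, sorted by decreasing eigenvalue, and each column of `X` is one example):

```python
import numpy as np

def rotate(U, X):
    """Rotate each column x of X into the u_1..u_n basis: x_rot = U^T x."""
    return U.T @ X

def unrotate(U, X_rot):
    """Invert the rotation using orthogonality (U U^T = I): x = U x_rot."""
    return U @ X_rot

def reduce_dim(U, X, k):
    """Keep only the top k components of x_rot: x_tilde = (U^T x)[:k]."""
    return (U.T @ X)[:k, :]
```

Because <math>\textstyle U</math> is orthogonal, `unrotate(U, rotate(U, X))` returns `X` exactly (up to floating-point error).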
What PCA does is drop the later (smaller) components of <math>\textstyle x_{\rm rot}</math>, and just approximate them with 0's. Concretely, our definition of <math>\textstyle \tilde{x}</math> can also be arrived at by using an approximation to <math>\textstyle x_{{\rm rot}}</math> where all but the first <math>\textstyle k</math> components are zeros. In other words, we have:

:<math>\begin{align} \tilde{x} = \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix} \approx \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ x_{{\rm rot},k+1} \\ \vdots \\ x_{{\rm rot},n} \end{bmatrix} = x_{\rm rot} \end{align}</math>

In our example, this gives us the following plot of <math>\textstyle \tilde{x}</math> (using <math>\textstyle n=2, k=1</math>):

[[File:PCA-xtilde.png | 600px]]

However, since the final <math>\textstyle n-k</math> components of <math>\textstyle \tilde{x}</math> as defined above would always be zero, there is no need to keep these zeros around, and so we define <math>\textstyle \tilde{x}</math> as a <math>\textstyle k</math>-dimensional vector with just the first <math>\textstyle k</math> (non-zero) components.

This also explains why we wanted to express our data in the <math>\textstyle u_1, u_2, \ldots, u_n</math> basis: Deciding which components to keep becomes just keeping the top <math>\textstyle k</math> components. When we do this, we also say that we are "retaining the top <math>\textstyle k</math> PCA (or principal) components."

== Recovering an Approximation of the Data ==

Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation of the original <math>\textstyle x \in \Re^n</math>. Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to the original value of <math>\textstyle x</math>?
From an [[#Rotating the Data|earlier section]], we know that <math>\textstyle x = U x_{\rm rot}</math>. Further, we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have set the last <math>\textstyle n-k</math> components to zeros. Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>. Finally, we pre-multiply by <math>\textstyle U</math> to get our approximation to <math>\textstyle x</math>. Concretely, we get

:<math>\begin{align} \hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^k u_i \tilde{x}_i. \end{align}</math>

The final equality above comes from the definition of <math>\textstyle U</math> [[#Example and Mathematical Background|given earlier]]. (In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just multiply <math>\textstyle \tilde{x} \in \Re^k</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math> as in the final expression above.) Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:

[[File:PCA-xhat.png | 600px]]

We are thus using a 1 dimensional approximation to the original dataset. If you are training an autoencoder or other unsupervised feature learning algorithm, the running time of your algorithm will depend on the dimension of the input. If you feed <math>\textstyle \tilde{x} \in \Re^k</math> into your learning algorithm instead of <math>\textstyle x</math>, then you'll be training on a lower-dimensional input, and thus your algorithm might run significantly faster.
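The practical recovery step described above (multiplying <math>\textstyle \tilde{x}</math> by the first <math>\textstyle k</math> columns of <math>\textstyle U</math>, rather than zero-padding) can be sketched as:

```python
import numpy as np

def recover(U, X_tilde):
    """Approximate reconstruction x_hat = sum_{i=1..k} u_i * x_tilde_i.
    X_tilde has shape (k, m); we multiply by the first k columns of U,
    which is equivalent to zero-padding x_tilde and multiplying by U."""
    k = X_tilde.shape[0]
    return U[:, :k] @ X_tilde
```

When <math>\textstyle k = n</math>, this recovers the original data exactly, since then it is just the inverse rotation <math>\textstyle x = U x_{\rm rot}</math>.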
For many datasets, the lower dimensional <math>\textstyle \tilde{x}</math> representation can be an extremely good approximation to the original, and using PCA this way can significantly speed up your algorithm while introducing very little approximation error.

== Number of components to retain ==

How do we set <math>\textstyle k</math>; i.e., how many PCA components should we retain? In our simple 2 dimensional example, it seemed natural to retain 1 out of the 2 components, but for higher dimensional data, this decision is less trivial. If <math>\textstyle k</math> is too large, then we won't be compressing the data much; in the limit of <math>\textstyle k=n</math>, then we're just using the original data (but rotated into a different basis). Conversely, if <math>\textstyle k</math> is too small, then we might be using a very bad approximation to the data.

To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained''' for different values of <math>\textstyle k</math>. Concretely, if <math>\textstyle k=n</math>, then we have an exact approximation to the data, and we say that 100% of the variance is retained. I.e., all of the variation of the original data is retained. Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector, and thus 0% of the variance is retained.

More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue corresponding to the eigenvector <math>\textstyle u_j</math>. Then if we retain <math>\textstyle k</math> principal components, the percentage of variance retained is given by:

:<math>\begin{align} \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}. \end{align}</math>

In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math>, and <math>\textstyle \lambda_2 = 0.69</math>. Thus, by keeping only <math>\textstyle k=1</math> principal components, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>, or 91.3% of the variance.

A more formal definition of percentage of variance retained is beyond the scope of these notes. However, it is possible to show that <math>\textstyle \lambda_j = \frac{1}{m} \sum_{i=1}^m \left( x_{{\rm rot},j}^{(i)} \right)^2</math>. Thus, if <math>\textstyle \lambda_j \approx 0</math>, that shows that <math>\textstyle x_{{\rm rot},j}</math> is usually near 0 anyway, and we lose relatively little by approximating it with a constant 0. This also explains why we retain the top principal components (corresponding to the larger values of <math>\textstyle \lambda_j</math>) instead of the bottom ones. The top principal components <math>\textstyle x_{{\rm rot},j}</math> are the ones that are more variable and that take on larger values, and for which we would incur a greater approximation error if we were to set them to zero.

In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of the variance. In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies

:<math>\begin{align} \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99. \end{align}</math>

Depending on the application, if you are willing to incur some additional error, values in the 90-98% range are also sometimes used. When you describe to others how you applied PCA, saying that you chose <math>\textstyle k</math> to retain 95% of the variance will also be a much more easily interpretable description than saying that you retained 120 (or whatever other number of) components.
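The variance-retention heuristic above translates directly into a few lines of code; a sketch, assuming `eigvals` is already sorted in decreasing order:

```python
import numpy as np

def choose_k(eigvals, target=0.99):
    """Smallest k such that the top-k eigenvalues retain at least
    `target` fraction of the variance. Assumes eigvals is sorted
    in decreasing order."""
    frac = np.cumsum(eigvals) / np.sum(eigvals)   # variance retained for each k
    return int(np.searchsorted(frac, target) + 1)
```

On the 2D example above (<math>\textstyle \lambda_1 = 7.29</math>, <math>\textstyle \lambda_2 = 0.69</math>), a 90% target gives <math>\textstyle k=1</math> (retaining 91.3%), while a 99% target forces <math>\textstyle k=2</math>.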
== PCA on Images ==

For PCA to work, usually we want each of the features <math>\textstyle x_1, x_2, \ldots, x_n</math> to have a similar range of values to the others (and to have a mean close to zero). If you've used PCA on other applications before, you may therefore have separately pre-processed each feature to have zero mean and unit variance, by separately estimating the mean and variance of each feature <math>\textstyle x_j</math>. However, this isn't the pre-processing that we will apply to most types of images. Specifically, suppose we are training our algorithm on '''natural images''', so that <math>\textstyle x_j</math> is the value of pixel <math>\textstyle j</math>. By "natural images," we informally mean the type of image that a typical animal or person might see over their lifetime.

Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm. But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry and don't have strange artifacts, should work.

When training on natural images, it makes little sense to estimate a separate mean and variance for each pixel, because the statistics in one part of the image should (theoretically) be the same as any other. This property of images is called '''stationarity'''.

In detail, in order for PCA to work well, informally we require that (i) the features have approximately zero mean, and (ii) the different features have similar variances to each other. With natural images, (ii) is already satisfied even without variance normalization, and so we won't perform any variance normalization. (If you are training on audio data---say, on spectrograms---or on text data---say, bag-of-words vectors---we will usually not perform variance normalization either.)
In fact, PCA is invariant to the scaling of the data, and will return the same eigenvectors regardless of the scaling of the input. More formally, if you multiply each feature vector <math>\textstyle x</math> by some positive number (thus scaling every feature in every training example by the same number), PCA's output eigenvectors will not change.

So, we won't use variance normalization. The only normalization we need to perform then is mean normalization, to ensure that the features have a mean around zero. Depending on the application, very often we are not interested in how bright the overall input image is. For example, in object recognition tasks, the overall brightness of the image doesn't affect what objects there are in the image. More formally, we are not interested in the mean intensity value of an image patch; thus, we can subtract out this value, as a form of mean normalization.

Concretely, if <math>\textstyle x^{(i)} \in \Re^{n}</math> are the (grayscale) intensity values of a 16x16 image patch (<math>\textstyle n=256</math>), we might normalize the intensity of each image <math>\textstyle x^{(i)}</math> as follows:

:<math>\mu^{(i)} := \frac{1}{n} \sum_{j=1}^n x^{(i)}_j</math>

:<math>x^{(i)}_j := x^{(i)}_j - \mu^{(i)}</math>, for all <math>\textstyle j</math>

Note that the two steps above are done separately for each image <math>\textstyle x^{(i)}</math>, and that <math>\textstyle \mu^{(i)}</math> here is the mean intensity of the image <math>\textstyle x^{(i)}</math>. In particular, this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.

If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent.
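The two per-image normalization steps above can be sketched as follows (each column of `X` is one flattened patch <math>\textstyle x^{(i)}</math>; the function name is ours):

```python
import numpy as np

def mean_normalize_per_image(X):
    """Subtract each patch's own mean intensity mu^(i) from that patch.
    X has shape (n, m): column i is the flattened patch x^(i).
    Note: one mean per image (axis=0), NOT one mean per pixel."""
    mu = X.mean(axis=0, keepdims=True)  # mu^(i) for each column, shape (1, m)
    return X - mu
```

Averaging over `axis=0` (the pixels of each patch) is exactly what distinguishes this from per-pixel mean estimation, which would average over `axis=1` instead.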
But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.

== References ==

http://cs229.stanford.edu