PCA
== Introduction ==

Principal Components Analysis (PCA) is a dimensionality reduction algorithm that can be used to significantly speed up your unsupervised feature learning algorithm. More importantly, understanding PCA will enable us to later implement '''whitening''', which is an important pre-processing step for many algorithms.

Suppose you are training your algorithm on images. Then the input will be somewhat redundant, because the values of adjacent pixels in an image are highly correlated. Concretely, suppose we are training on 16x16 grayscale image patches. Then <math>\textstyle x \in \Re^{256}</math> are 256 dimensional vectors, with one feature <math>\textstyle x_j</math> corresponding to the intensity of each pixel. Because of the correlation between adjacent pixels, PCA will allow us to approximate the input with a much lower dimensional one, while incurring very little error.

== Example and Mathematical Background ==

For our running example, we will use a dataset <math>\textstyle \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}</math> with <math>\textstyle n=2</math> dimensional inputs, so that <math>\textstyle x^{(i)} \in \Re^2</math>. Suppose we want to reduce the data from 2 dimensions to 1. (In practice, we might want to reduce data from 256 to 50 dimensions, say; but using lower dimensional data in our example allows us to visualize the algorithms better.) Here is our dataset:

[Figure: scatter plot of the 2D example dataset]

This data has already been pre-processed so that each of the features <math>\textstyle x_1</math> and <math>\textstyle x_2</math> has about the same mean (zero) and variance. For the purpose of illustration, we have also colored each of the points one of three colors, depending on its <math>\textstyle x_1</math> value; these colors are not used by the algorithm, and are for illustration only.

PCA will find a lower-dimensional subspace onto which to project our data. From visually examining the data, it appears that <math>\textstyle u_1</math> is the principal direction of variation of the data, and <math>\textstyle u_2</math> the secondary direction of variation:

[Figure: the dataset with the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math> overlaid]

I.e., the data varies much more in the direction <math>\textstyle u_1</math> than <math>\textstyle u_2</math>. To more formally find the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, we first compute the matrix <math>\textstyle \Sigma</math> as follows:

:<math>\begin{align} \Sigma = \sum_{i=1}^m (x^{(i)})(x^{(i)})^T. \end{align}</math>

If <math>\textstyle x</math> has zero mean, then <math>\textstyle \frac{1}{m} \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>. (Rescaling <math>\textstyle \Sigma</math> by <math>\textstyle \frac{1}{m}</math> changes only the eigenvalues, not the eigenvectors, so either matrix yields the same directions.) It can then be shown that <math>\textstyle u_1</math>, the principal direction of variation of the data, is the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is the second eigenvector. '''(Important: For a mathematical derivation/formal justification of this, see the CS229 lecture notes on PCA, linked below.)''' You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
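To make this concrete, here is a minimal NumPy sketch of the eigendecomposition step. The data matrix <tt>X</tt> below is hypothetical (random 2D points, invented purely for illustration); the only assumptions are that examples are stored as rows and have already been made zero-mean, as the derivation requires.

<pre>
import numpy as np

# Hypothetical zero-mean 2D data for illustration: rows are examples x^{(i)}.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 1.2],
                                              [0.0, 0.4]])
X = X - X.mean(axis=0)          # ensure zero mean, as the derivation assumes
m = X.shape[0]

Sigma = X.T @ X                 # Sigma = sum_i x^{(i)} (x^{(i)})^T

# Sigma is symmetric, so eigh applies; it returns eigenvalues in ascending
# order, and we re-sort so that column j of U is the (j+1)-th eigenvector.
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

u1 = U[:, 0]                    # principal direction of variation
u2 = U[:, 1]                    # secondary direction of variation
</pre>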
Concretely, let us compute the eigenvectors of <math>\textstyle \Sigma</math>, and stack the eigenvectors in columns to form the matrix <math>\textstyle U</math>:

:<math>\begin{align} U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} \end{align}</math>

Here, <math>\textstyle u_1</math> is the principal eigenvector (corresponding to the largest eigenvalue), <math>\textstyle u_2</math> is the second eigenvector, and so on. Also, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the corresponding eigenvalues.

The vectors <math>\textstyle u_1</math> and <math>\textstyle u_2</math> in our example form a new basis in which we can represent the data. Concretely, let <math>\textstyle x \in \Re^2</math> be some training example. Then <math>\textstyle u_1^Tx</math> is the length (magnitude) of the projection of <math>\textstyle x</math> onto the vector <math>\textstyle u_1</math>; in other words, it is the amount that <math>\textstyle x</math> extends from the origin <math>\textstyle [0, 0]^T</math> in the <math>\textstyle u_1</math> direction. Similarly, <math>\textstyle u_2^Tx</math> is the magnitude of <math>\textstyle x</math> projected onto the vector <math>\textstyle u_2</math>.

== Rotating the Data ==

Thus, we can represent <math>\textstyle x</math> in the <math>\textstyle (u_1, u_2)</math>-basis by computing

:<math>\begin{align} x_{\rm rot} = U^Tx = \begin{bmatrix} u_1^Tx \\ u_2^Tx \end{bmatrix} \end{align}</math>

(The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly reflection) of the original data.) Let's take the entire training set, and compute <math>\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}</math> for every <math>\textstyle i</math>. Plotting this transformed data <math>\textstyle x_{\rm rot}</math>, we get:

[Figure: PCA-rotated.png, the training set rotated into the <math>\textstyle u_1</math>, <math>\textstyle u_2</math> basis]

This is the training set rotated into the <math>\textstyle u_1</math>, <math>\textstyle u_2</math> basis. In the general case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis <math>\textstyle u_1, u_2, \ldots, u_n</math>.

One of the properties of <math>\textstyle U</math> is that it is an orthogonal matrix (its columns are orthonormal), and thus <math>\textstyle U^TU = UU^T = I</math>. So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the original data <math>\textstyle x</math>, you can compute

:<math>\begin{align} x = U x_{\rm rot} , \end{align}</math>

because <math>\textstyle U x_{\rm rot} = UU^T x = x</math>.

== Reducing the Data ==

We see that the principal direction of variation of the data is the first dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data. Thus, if we want to reduce this data to one dimension, we can set

:<math>\begin{align} \tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re. \end{align}</math>

More generally, if <math>\textstyle x \in \Re^n</math> and we want to reduce it to a <math>\textstyle k</math> dimensional representation <math>\textstyle \tilde{x} \in \Re^k</math> (where <math>\textstyle k < n</math>), we would take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to the top <math>\textstyle k</math> directions of variation.
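Continuing the sketch above (same hypothetical <tt>X</tt> and <tt>U</tt>), the rotation, its inverse, and the reduction to <math>\textstyle k</math> dimensions are each one line of NumPy:

<pre>
# Rotate every example into the eigenvector basis: row i of X_rot is
# x_rot^{(i)} = U^T x^{(i)} (examples are rows, so we right-multiply by U).
X_rot = X @ U

# Rotation is lossless because U is orthogonal (U^T U = U U^T = I):
X_back = X_rot @ U.T
assert np.allclose(X, X_back)   # recovers X up to floating-point error

# Reduce to k dimensions by keeping the top k rotated components.
k = 1
X_tilde = X_rot[:, :k]          # row i is (u_1^T x^{(i)}, ..., u_k^T x^{(i)})
</pre>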
Another way of explaining PCA is that <math>\textstyle x_{\rm rot}</math> is an <math>\textstyle n</math> dimensional vector, where the first few components are likely to be large (e.g., in our example, we saw that <math>\textstyle x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)}</math> takes reasonably large values for most examples <math>\textstyle i</math>), and the later components are likely to be small (e.g., in our example, <math>\textstyle x_{{\rm rot},2}^{(i)} = u_2^Tx^{(i)}</math> was more likely to be small). What PCA does is drop the later (smaller) components of <math>\textstyle x_{\rm rot}</math>, and just approximate them with 0's. Concretely, our definition of <math>\textstyle \tilde{x}</math> can also be arrived at by using an approximation to <math>\textstyle x_{{\rm rot}}</math> where all but the first <math>\textstyle k</math> components are zeros. In other words, we have:

:<math>\begin{align} \tilde{x} = \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix} \approx \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ x_{{\rm rot},k+1} \\ \vdots \\ x_{{\rm rot},n} \end{bmatrix} = x_{\rm rot} \end{align}</math>

In our example, this gives us the following plot of <math>\textstyle \tilde{x}</math> (using <math>\textstyle n=2, k=1</math>):

[Figure: the dataset projected onto its first principal component, with the second component zeroed out]

However, since the final <math>\textstyle n-k</math> components of <math>\textstyle \tilde{x}</math> as defined above would always be zero, there is no need to keep these zeros around, and so we define <math>\textstyle \tilde{x}</math> as a <math>\textstyle k</math>-dimensional vector with just the first <math>\textstyle k</math> (non-zero) components. This also explains why we wanted to express our data in the <math>\textstyle u_1, u_2, \ldots, u_n</math> basis: deciding which components to keep becomes just keeping the top <math>\textstyle k</math> components. When we do this, we also say that we are "retaining the top <math>\textstyle k</math> PCA (or principal) components."

== Recovering an Approximation of the Data ==

Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation of the original <math>\textstyle x \in \Re^n</math>. Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to the original value of <math>\textstyle x</math>? From the section on rotating the data, we know that <math>\textstyle x = U x_{\rm rot}</math>. Further, we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have set the last <math>\textstyle n-k</math> components to zeros. Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>. Finally, we pre-multiply by <math>\textstyle U</math> to get our approximation to <math>\textstyle x</math>. Concretely, we get

:<math>\begin{align} \hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^k u_i \tilde{x}_i. \end{align}</math>

The final equality above comes from the definition of <math>\textstyle U</math> given earlier: since the columns of <math>\textstyle U</math> are <math>\textstyle u_1, \ldots, u_n</math>, multiplying <math>\textstyle U</math> by the zero-padded vector yields a weighted sum of its first <math>\textstyle k</math> columns.
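As a sketch of the recovery step (again reusing the hypothetical <tt>X_rot</tt>, <tt>X_tilde</tt>, <tt>U</tt>, and <tt>k</tt> from above), zero-padding <math>\textstyle \tilde{x}</math> and rotating back gives <math>\textstyle \hat{x}</math>, and agrees with multiplying by just the first <math>\textstyle k</math> columns of <math>\textstyle U</math>:

<pre>
# Recover x_hat by padding x_tilde with n - k zeros, then rotating back.
X_rot_approx = np.zeros_like(X_rot)
X_rot_approx[:, :k] = X_tilde
X_hat = X_rot_approx @ U.T      # row i is U [x_tilde^{(i)}; 0; ...; 0]

# Equivalent shortcut: x_hat = sum_{i=1}^k u_i * x_tilde_i, i.e. multiply
# x_tilde by the first k columns of U (no multiplications by zero).
X_hat_fast = X_tilde @ U[:, :k].T
assert np.allclose(X_hat, X_hat_fast)
</pre>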
(In a practical implementation, we wouldn't actually zero-pad <math>\textstyle \tilde{x}</math> and then multiply by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just multiply <math>\textstyle \tilde{x} \in \Re^k</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math>, as in the final expression above.) Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:

[Figure: PCA-xhat.png, the recovered approximation <math>\textstyle \hat{x}</math> of the dataset]

We are thus using a 1-dimensional approximation to the original dataset.

If you are training an autoencoder or other unsupervised feature learning algorithm, the running time of your algorithm will depend on the dimension of the input. If you feed <math>\textstyle \tilde{x} \in \Re^k</math> into your learning algorithm instead of <math>\textstyle x</math>, then you'll be training on a lower-dimensional input, and thus your algorithm might run significantly faster. For many datasets, the lower dimensional <math>\textstyle \tilde{x}</math> representation can be an extremely good approximation to the original, and using PCA this way can significantly speed up your algorithm while introducing very little approximation error.

== References ==

* CS229 Lecture Notes on PCA: http://cs229.stanford.edu