PCA
== Introduction ==

Principal Components Analysis (PCA) is a dimensionality reduction algorithm that can be used to significantly speed up your unsupervised feature learning algorithm. More importantly, understanding PCA will enable us to later implement '''whitening''', which is an important pre-processing step for many algorithms.

Suppose you are training your algorithm on images. Then the input will be somewhat redundant, because the values of adjacent pixels in an image are highly correlated. Concretely, suppose we are training on 16x16 grayscale image patches. Then <math>\textstyle x \in \Re^{256}</math> are 256 dimensional vectors, with one feature <math>\textstyle x_j</math> corresponding to the intensity of each pixel. Because of the correlation between adjacent pixels, PCA will allow us to approximate the input with a much lower dimensional one, while incurring very little error.

== Example and Mathematical Background ==

For our running example, we will use a dataset <math>\textstyle \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}</math> with <math>\textstyle n=2</math> dimensional inputs, so that <math>\textstyle x^{(i)} \in \Re^2</math>. Suppose we want to reduce the data from 2 dimensions to 1. (In practice, we might want to reduce data from 256 to 50 dimensions, say; but using lower dimensional data in our example allows us to visualize the algorithms better.) Here is our dataset:

[Figure: scatter plot of the 2D example dataset]

This data has already been pre-processed so that each of the features <math>\textstyle x_1</math> and <math>\textstyle x_2</math> has about the same mean (zero) and variance. For the purpose of illustration, we have also colored each of the points one of three colors, depending on its <math>\textstyle x_1</math> value; these colors are not used by the algorithm, and are for illustration only.

PCA will find a lower-dimensional subspace onto which to project our data. From visually examining the data, it appears that <math>\textstyle u_1</math> is the principal direction of variation of the data, and <math>\textstyle u_2</math> the secondary direction of variation:

[Figure: the dataset with the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math> overlaid]

I.e., the data varies much more in the direction <math>\textstyle u_1</math> than <math>\textstyle u_2</math>. To more formally find the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, we first compute the matrix <math>\textstyle \Sigma</math> as follows:

:<math>\begin{align} \Sigma = \sum_{i=1}^m (x^{(i)})(x^{(i)})^T. \end{align}</math>

If <math>\textstyle x</math> has zero mean, then <math>\textstyle \frac{1}{m} \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>. (Rescaling <math>\textstyle \Sigma</math> by <math>\textstyle \frac{1}{m}</math> changes only the eigenvalues, not the eigenvectors, so either matrix yields the same directions.) It can then be shown that <math>\textstyle u_1</math>, the principal direction of variation of the data, is the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is the second eigenvector. '''(Important: For a mathematical derivation/formal justification of this, see the CS229 lecture notes on PCA, linked below.)''' You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).
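To make this concrete, here is a minimal NumPy sketch of the eigendecomposition step. The data matrix <tt>X</tt> below is hypothetical (random 2D points, invented purely for illustration); the only assumptions are that examples are stored as rows and have already been made zero-mean, as the derivation requires.

<pre>
import numpy as np

# Hypothetical zero-mean 2D data for illustration: rows are examples x^{(i)}.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 1.2],
                                              [0.0, 0.4]])
X = X - X.mean(axis=0)          # ensure zero mean, as the derivation assumes
m = X.shape[0]

Sigma = X.T @ X                 # Sigma = sum_i x^{(i)} (x^{(i)})^T

# Sigma is symmetric, so eigh applies; it returns eigenvalues in ascending
# order, and we re-sort so that column j of U is the (j+1)-th eigenvector.
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

u1 = U[:, 0]                    # principal direction of variation
u2 = U[:, 1]                    # secondary direction of variation
</pre>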
Concretely, let us compute the eigenvectors of <math>\textstyle \Sigma</math>, and stack the eigenvectors in columns to form the matrix <math>\textstyle U</math>:

:<math>\begin{align} U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix} \end{align}</math>

Here, <math>\textstyle u_1</math> is the principal eigenvector (corresponding to the largest eigenvalue), <math>\textstyle u_2</math> is the second eigenvector, and so on. Also, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the corresponding eigenvalues.

The vectors <math>\textstyle u_1</math> and <math>\textstyle u_2</math> in our example form a new basis in which we can represent the data. Concretely, let <math>\textstyle x \in \Re^2</math> be some training example. Then <math>\textstyle u_1^Tx</math> is the length (magnitude) of the projection of <math>\textstyle x</math> onto the vector <math>\textstyle u_1</math>; in other words, it is the amount that <math>\textstyle x</math> extends from the origin <math>\textstyle [0, 0]^T</math> in the <math>\textstyle u_1</math> direction. Similarly, <math>\textstyle u_2^Tx</math> is the magnitude of <math>\textstyle x</math> projected onto the vector <math>\textstyle u_2</math>.

== Rotating the Data ==

Thus, we can represent <math>\textstyle x</math> in the <math>\textstyle (u_1, u_2)</math>-basis by computing

:<math>\begin{align} x_{\rm rot} = U^Tx = \begin{bmatrix} u_1^Tx \\ u_2^Tx \end{bmatrix} \end{align}</math>

(The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly reflection) of the original data.) Let's take the entire training set, and compute <math>\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}</math> for every <math>\textstyle i</math>. Plotting this transformed data <math>\textstyle x_{\rm rot}</math>, we get:

[Figure: PCA-rotated.png, the training set rotated into the <math>\textstyle u_1</math>, <math>\textstyle u_2</math> basis]

This is the training set rotated into the <math>\textstyle u_1</math>, <math>\textstyle u_2</math> basis. In the general case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis <math>\textstyle u_1, u_2, \ldots, u_n</math>.

One of the properties of <math>\textstyle U</math> is that it is an orthogonal matrix (its columns are orthonormal), and thus <math>\textstyle U^TU = UU^T = I</math>. So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the original data <math>\textstyle x</math>, you can compute

:<math>\begin{align} x = U x_{\rm rot} , \end{align}</math>

because <math>\textstyle U x_{\rm rot} = UU^T x = x</math>.

== Reducing the Data ==

We see that the principal direction of variation of the data is the first dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data. Thus, if we want to reduce this data to one dimension, we can set

:<math>\begin{align} \tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re. \end{align}</math>

More generally, if <math>\textstyle x \in \Re^n</math> and we want to reduce it to a <math>\textstyle k</math> dimensional representation <math>\textstyle \tilde{x} \in \Re^k</math> (where <math>\textstyle k < n</math>), we would take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to the top <math>\textstyle k</math> directions of variation.
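Continuing the sketch above (same hypothetical <tt>X</tt> and <tt>U</tt>), the rotation, its inverse, and the reduction to <math>\textstyle k</math> dimensions are each one line of NumPy:

<pre>
# Rotate every example into the eigenvector basis: row i of X_rot is
# x_rot^{(i)} = U^T x^{(i)} (examples are rows, so we right-multiply by U).
X_rot = X @ U

# Rotation is lossless because U is orthogonal (U^T U = U U^T = I):
X_back = X_rot @ U.T
assert np.allclose(X, X_back)   # recovers X up to floating-point error

# Reduce to k dimensions by keeping the top k rotated components.
k = 1
X_tilde = X_rot[:, :k]          # row i is (u_1^T x^{(i)}, ..., u_k^T x^{(i)})
</pre>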
Another way of explaining PCA is that <math>\textstyle x_{\rm rot}</math> is an <math>\textstyle n</math> dimensional vector, where the first few components are likely to be large (e.g., in our example, we saw that <math>\textstyle x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)}</math> takes reasonably large values for most examples <math>\textstyle i</math>), and the later components are likely to be small (e.g., in our example, <math>\textstyle x_{{\rm rot},2}^{(i)} = u_2^Tx^{(i)}</math> was more likely to be small). What PCA does is drop the later (smaller) components of <math>\textstyle x_{\rm rot}</math>, and just approximate them with 0's. Concretely, our definition of <math>\textstyle \tilde{x}</math> can also be arrived at by using an approximation to <math>\textstyle x_{{\rm rot}}</math> where all but the first <math>\textstyle k</math> components are zeros. In other words, we have:

:<math>\begin{align} \tilde{x} = \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix} \approx \begin{bmatrix} x_{{\rm rot},1} \\ \vdots \\ x_{{\rm rot},k} \\ x_{{\rm rot},k+1} \\ \vdots \\ x_{{\rm rot},n} \end{bmatrix} = x_{\rm rot} \end{align}</math>

In our example, this gives us the following plot of <math>\textstyle \tilde{x}</math> (using <math>\textstyle n=2, k=1</math>):

[Figure: the dataset projected onto its first principal component, with the second component zeroed out]

However, since the final <math>\textstyle n-k</math> components of <math>\textstyle \tilde{x}</math> as defined above would always be zero, there is no need to keep these zeros around, and so we define <math>\textstyle \tilde{x}</math> as a <math>\textstyle k</math>-dimensional vector with just the first <math>\textstyle k</math> (non-zero) components. This also explains why we wanted to express our data in the <math>\textstyle u_1, u_2, \ldots, u_n</math> basis: deciding which components to keep becomes just keeping the top <math>\textstyle k</math> components. When we do this, we also say that we are "retaining the top <math>\textstyle k</math> PCA (or principal) components."

== Recovering an Approximation of the Data ==

Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation of the original <math>\textstyle x \in \Re^n</math>. Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to the original value of <math>\textstyle x</math>? From the section on rotating the data, we know that <math>\textstyle x = U x_{\rm rot}</math>. Further, we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have set the last <math>\textstyle n-k</math> components to zeros. Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>. Finally, we pre-multiply by <math>\textstyle U</math> to get our approximation to <math>\textstyle x</math>. Concretely, we get

:<math>\begin{align} \hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^k u_i \tilde{x}_i. \end{align}</math>

The final equality above comes from the definition of <math>\textstyle U</math> given earlier: since the columns of <math>\textstyle U</math> are <math>\textstyle u_1, \ldots, u_n</math>, multiplying <math>\textstyle U</math> by the zero-padded vector yields a weighted sum of its first <math>\textstyle k</math> columns.
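As a sketch of the recovery step (again reusing the hypothetical <tt>X_rot</tt>, <tt>X_tilde</tt>, <tt>U</tt>, and <tt>k</tt> from above), zero-padding <math>\textstyle \tilde{x}</math> and rotating back gives <math>\textstyle \hat{x}</math>, and agrees with multiplying by just the first <math>\textstyle k</math> columns of <math>\textstyle U</math>:

<pre>
# Recover x_hat by padding x_tilde with n - k zeros, then rotating back.
X_rot_approx = np.zeros_like(X_rot)
X_rot_approx[:, :k] = X_tilde
X_hat = X_rot_approx @ U.T      # row i is U [x_tilde^{(i)}; 0; ...; 0]

# Equivalent shortcut: x_hat = sum_{i=1}^k u_i * x_tilde_i, i.e. multiply
# x_tilde by the first k columns of U (no multiplications by zero).
X_hat_fast = X_tilde @ U[:, :k].T
assert np.allclose(X_hat, X_hat_fast)
</pre>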
(In a practical implementation, we wouldn't actually zero-pad <math>\textstyle \tilde{x}</math> and then multiply by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just multiply <math>\textstyle \tilde{x} \in \Re^k</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math>, as in the final expression above.) Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:

[Figure: PCA-xhat.png, the recovered approximation <math>\textstyle \hat{x}</math> of the dataset]

We are thus using a 1-dimensional approximation to the original dataset.

If you are training an autoencoder or other unsupervised feature learning algorithm, the running time of your algorithm will depend on the dimension of the input. If you feed <math>\textstyle \tilde{x} \in \Re^k</math> into your learning algorithm instead of <math>\textstyle x</math>, then you'll be training on a lower-dimensional input, and thus your algorithm might run significantly faster. For many datasets, the lower dimensional <math>\textstyle \tilde{x}</math> representation can be an extremely good approximation to the original, and using PCA this way can significantly speed up your algorithm while introducing very little approximation error.

== References ==

* CS229 Lecture Notes on PCA: http://cs229.stanford.edu