### Introduction

Principal Components Analysis (PCA) is a dimensionality reduction algorithm that can be used to significantly speed up your unsupervised feature learning algorithm. More importantly, understanding PCA will enable us to later implement **whitening**, which is an important pre-processing step for many algorithms.

Suppose you are training your algorithm on images. Then the input will be somewhat redundant, because the values of adjacent pixels in an image are highly correlated. Concretely, suppose we are training on 16x16 grayscale image patches. Then $x \in \mathbb{R}^{256}$ are 256-dimensional vectors, with one feature $x_j$ corresponding to the intensity of each pixel. Because of the correlation between adjacent pixels, PCA will allow us to approximate the input with a much lower-dimensional one, while incurring very little error.

### Example and Mathematical Background

For our running example, we will use a dataset $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ with $n=2$ dimensional inputs, so that $x^{(i)} \in \mathbb{R}^2$. Suppose we want to reduce the data from 2 dimensions to 1.

This data has already been pre-processed so that each of the features $x_1$ and $x_2$ have about the same mean (zero) and variance.

For the purpose of illustration, we have also colored each of the points one of three colors, depending on their $x_1$ value; these colors are not used by the algorithm and are shown for illustration only.

PCA will find a lower-dimensional subspace onto which to project our data.

From visually examining the data, it appears that $u_1$ is the principal direction of variation of the data, and $u_2$ the secondary direction of variation.

I.e., the data varies much more in the direction $u_1$ than $u_2$. To more formally find the directions $u_1$ and $u_2$, we first compute the matrix $\Sigma$ as follows:

$$\Sigma = \frac{1}{m} \sum_{i=1}^m \left(x^{(i)}\right)\left(x^{(i)}\right)^T.$$

If $x$ has zero mean, then $\Sigma$ is exactly the covariance matrix of $x$. (The symbol "$\Sigma$", pronounced "Sigma", is the standard notation for denoting the covariance matrix. Unfortunately it looks just like the summation symbol, but these are two different things.)

It can then be shown that $u_1$, the principal direction of variation of the data, is the top (principal) eigenvector of $\Sigma$, and $u_2$ is the second eigenvector.

Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at bottom of this page). You won’t need to do so to follow along this course, however.

You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes). Concretely, let us compute the eigenvectors of $\Sigma$, and stack the eigenvectors in columns to form the matrix $U$:

$$U = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{bmatrix}.$$

Here, $u_1$ is the principal eigenvector (corresponding to the largest eigenvalue), $u_2$ is the second eigenvector, and so on. Also, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the corresponding eigenvalues.

The vectors $u_1$ and $u_2$ in our example form a new basis in which we can represent the data. Concretely, let $x \in \mathbb{R}^2$ be some training example. Then $u_1^T x$ is the length (magnitude) of the projection of $x$ onto the vector $u_1$.

Similarly, $u_2^T x$ is the magnitude of $x$ projected onto the vector $u_2$.

### Rotating the Data

Thus, we can represent $x$ in the $(u_1, u_2)$-basis by computing

$$x_{\rm rot} = U^T x = \begin{bmatrix} u_1^T x \\ u_2^T x \end{bmatrix}.$$

(The subscript "rot" comes from the observation that this corresponds to a rotation (and possibly reflection) of the original data.) Let's take the entire training set, and compute $x_{\rm rot}^{(i)} = U^T x^{(i)}$ for every $i$.

This is the training set rotated into the $u_1$, $u_2$ basis. In the general case, $U^T x$ will be the training set rotated into the basis $u_1$, $u_2$, ..., $u_n$.

One of the properties of $U$ is that it is an "orthogonal" matrix, which means that it satisfies $U^T U = U U^T = I$. So if you ever need to go from the rotated vectors $x_{\rm rot}$ back to the original data $x$, you can compute

$$x = U x_{\rm rot},$$

because $U x_{\rm rot} = U U^T x = x$.
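The tutorial's own code is in Matlab; as a sketch of the steps so far, here is a NumPy translation on a made-up correlated 2D dataset (the data and variable names are our own, not from the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy zero-mean 2D dataset: one training example per column (n-by-m).
x = rng.standard_normal((2, 1000))
x[1] += 0.9 * x[0]                  # make the two features correlated
x -= x.mean(axis=1, keepdims=True)  # ensure zero mean

m = x.shape[1]
sigma = x @ x.T / m                 # covariance matrix (data is zero mean)

# Eigenvectors of the symmetric PSD matrix sigma, via SVD.
U, S, _ = np.linalg.svd(sigma)

x_rot = U.T @ x                     # rotate the data into the u1, u2 basis

# U is orthogonal, so the rotation is exactly invertible: x = U @ x_rot.
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(U @ x_rot, x)
```

Note that with one example per column, rotating the whole training set is a single matrix product, matching the Matlab code later in these notes.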

### Reducing the Data Dimension

We see that the principal direction of variation of the data is the first dimension $x_{\rm rot,1}$ of this rotated data. Thus, if we want to reduce this data to one dimension, we can set

$$\tilde{x}^{(i)} = x_{\rm rot,1}^{(i)} = u_1^T x^{(i)} \in \mathbb{R}.$$

More generally, if $x \in \mathbb{R}^n$ and we want to reduce it to a $k$-dimensional representation $\tilde{x} \in \mathbb{R}^k$ (where $k < n$), we would take the first $k$ components of $x_{\rm rot}$, which correspond to the top $k$ directions of variation.

Another way of explaining PCA is that $x_{\rm rot}$ is an $n$-dimensional vector, where the first few components are likely to be large and the later components are likely to be small. What PCA does is drop the later (smaller) components of $x_{\rm rot}$, and just approximate them with 0's.

In our example, this gives us the following plot of $\tilde{x}$ (obtained using $n=2$, $k=1$):

However, since the final $n-k$ components of $\tilde{x}$ as defined above would always be zero, there is no need to keep these zeros around, and so we define $\tilde{x}$ as a $k$-dimensional vector containing just the first $k$ (non-zero) components.

This also explains why we wanted to express our data in the $u_1, u_2, \ldots, u_n$ basis: Deciding which components to keep becomes just keeping the top $k$ components. When we do this, we also say that we are "retaining the top $k$ PCA (or principal) components."

### Recovering an Approximation of the Data

Now, $\tilde{x} \in \mathbb{R}^k$ is a lower-dimensional, "compressed" representation of the original $x \in \mathbb{R}^n$. Given $\tilde{x}$, how can we recover an approximation $\hat{x}$ to the original value of $x$? From the previous section, we know that $x = U x_{\rm rot}$. Further, we can think of $\tilde{x}$ as an approximation to $x_{\rm rot}$, where we have set the last $n-k$ components to zeros. Thus, given $\tilde{x} \in \mathbb{R}^k$, we can pad it out with $n-k$ zeros to get our approximation to $x_{\rm rot} \in \mathbb{R}^n$. Finally, we pre-multiply by $U$ to get our approximation to $x$. Concretely, we get

$$\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sum_{i=1}^k u_i \tilde{x}_i.$$

The final equality above comes from the definition of $U$ given earlier. (In a practical implementation, we wouldn't actually zero-pad $\tilde{x}$ and then multiply by $U$, since that would mean multiplying a lot of things by zeros; instead, we'd just multiply $\tilde{x} \in \mathbb{R}^k$ with the first $k$ columns of $U$, as in the final expression above.)

We are thus using a 1 dimensional approximation to the original dataset.
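The reduce-then-recover round trip can be sketched in NumPy as follows (a toy correlated dataset of our own; the original notes use Matlab):

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated zero-mean 2D data, one example per column.
x = rng.standard_normal((2, 500))
x[1] += 0.9 * x[0]
x -= x.mean(axis=1, keepdims=True)

sigma = x @ x.T / x.shape[1]
U, S, _ = np.linalg.svd(sigma)

k = 1
x_tilde = U[:, :k].T @ x    # k-dimensional representation
x_hat = U[:, :k] @ x_tilde  # approximation to x, without explicit zero-padding

# The reconstruction error lives entirely in the discarded directions.
err = np.mean((x - x_hat) ** 2)
print("mean squared reconstruction error:", err)
```

Multiplying by `U[:, :k]` directly implements the final expression $\sum_{i=1}^k u_i \tilde{x}_i$ above.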

If you are training an autoencoder or other unsupervised feature learning algorithm, the running time of your algorithm will depend on the dimension of the input. If you feed $\tilde{x} \in \mathbb{R}^k$ into your learning algorithm instead of $x$, then you'll be training on a lower-dimensional input, and thus your algorithm might run significantly faster. For many datasets, the lower-dimensional $\tilde{x}$ representation can be an extremely good approximation to the original, and using PCA this way can significantly speed up your algorithm while introducing very little approximation error.

### Number of components to retain

How do we set $k$; i.e., how many PCA components should we retain? In our simple 2D example, it seemed natural to retain 1 of the 2 components, but for higher-dimensional data, this decision is less trivial. If $k$ is too large, then we won't be compressing the data much; in the limit of $k=n$, we're just using the original data (but rotated into a different basis). Conversely, if $k$ is too small, then we might be using a very bad approximation to the data.

To decide how to set $k$, we will usually look at the "percentage of variance retained" for different values of $k$. Concretely, if $k=n$, then we have an exact approximation to the data, and we say that 100% of the variance is retained: all of the variation of the original data is preserved. Conversely, if $k=0$, then we are approximating all the data with the zero vector, and thus 0% of the variance is retained.

More generally, let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of $\Sigma$ (sorted in decreasing order), so that $\lambda_j$ is the eigenvalue corresponding to the eigenvector $u_j$. Then if we retain $k$ principal components, the percentage of variance retained is given by:

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.$$

In our simple 2D example above, $\lambda_1 = 7.29$ and $\lambda_2 = 0.69$. Thus, if we keep $k=1$ principal components, we would have retained $7.29/(7.29+0.69) = 0.913$, or 91.3% of the variance.

A more formal definition of percentage of variance retained is beyond the scope of these notes. However, it is possible to show that $\lambda_j = \frac{1}{m} \sum_{i=1}^m \left(x_{\rm rot,j}^{(i)}\right)^2$. Thus, if $\lambda_j \approx 0$, that shows that $x_{\rm rot,j}$ is usually near 0 anyway, and we lose relatively little if we approximate it with a constant 0. This also explains why we retain the top principal components (corresponding to the larger values of $\lambda_j$) instead of the bottom ones: the top components are more variable and take on larger values, so setting them to zero would incur a greater approximation error.

In the case of images, one common heuristic is to choose $k$ so as to retain 99% of the variance. In other words, we pick the smallest value of $k$ that satisfies

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99.$$
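This heuristic amounts to a cumulative sum over the sorted eigenvalues. A short NumPy sketch (the eigenvalues below are made-up illustration values, not from any real dataset):

```python
import numpy as np

# lam holds the eigenvalues of Sigma, sorted in decreasing order
# (e.g., the diagonal of S from the SVD). These values are made up.
lam = np.array([7.29, 0.69, 0.05, 0.01])

frac = np.cumsum(lam) / np.sum(lam)       # variance retained for k = 1..n
k = int(np.searchsorted(frac, 0.99) + 1)  # smallest k retaining >= 99%
print(k, frac)
```

Because `frac` is non-decreasing, `searchsorted` finds the first index at which the 99% threshold is met.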

Depending on the application, if you are willing to incur some additional error, values in the 90-98% range are also sometimes used. When you describe to others how you applied PCA, saying that you chose $k$ to retain 95% of the variance will also be a much more easily interpretable description than saying that you retained 120 (or whatever other number of) components.

### PCA on Images

For PCA to work, usually we want each of the features $x_1, x_2, \ldots, x_n$ to have a similar range of values to the others (and to have a mean close to zero). If you've used PCA on other applications before, you may therefore have separately pre-processed each feature to have zero mean and unit variance, by separately estimating the mean and variance of each feature $x_j$. However, that isn't the pre-processing we will apply to most types of images. Specifically, suppose we are training our algorithm on **natural images**, so that $x_j$ is the value of pixel $j$. By "natural images," we informally mean the type of image that a typical animal or person might see over their lifetime.

Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm. But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry or have strange artifacts, should work.

When training on natural images, it makes little sense to estimate a separate mean and variance for each pixel, because the statistics in one part of the image should (theoretically) be the same as any other.

This property of images is called "stationarity."

In detail, in order for PCA to work well, informally we require that (i) The features have approximately zero mean, and (ii) The different features have similar variances to each other. With natural images, (ii) is already satisfied even without variance normalization, and so we won’t perform any variance normalization.

(If you are training on audio data—say, on spectrograms—or on text data—say, bag-of-word vectors—we will usually not perform variance normalization either.)

In fact, PCA is invariant to the scaling of the data, and will return the same eigenvectors regardless of the scaling of the input. More formally, if you multiply each feature vector $x$ by some positive number (thus scaling every feature in every training example by the same number), PCA's output eigenvectors will not change.

So, we won’t use variance normalization. The only normalization we need to perform then is mean normalization, to ensure that the features have a mean around zero. Depending on the application, very often we are not interested in how bright the overall input image is. For example, in object recognition tasks, the overall brightness of the image doesn’t affect what objects there are in the image. More formally, we are not interested in the mean intensity value of an image patch; thus, we can subtract out this value, as a form of mean normalization.

Concretely, if $x^{(i)} \in \mathbb{R}^n$ are the (grayscale) intensity values of a 16x16 image patch ($n=256$), we might normalize the intensity of each image $x^{(i)}$ as follows:

$$\mu^{(i)} := \frac{1}{n} \sum_{j=1}^n x_j^{(i)}$$

$$x_j^{(i)} := x_j^{(i)} - \mu^{(i)}$$

for all $j$.

Note that the two steps above are done separately for each image $x^{(i)}$, and that $\mu^{(i)}$ here is the mean intensity of the image $x^{(i)}$. In particular, this is not the same thing as estimating a mean value separately for each pixel $x_j$.
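Per-image mean normalization can be sketched in NumPy as follows (random patches stand in for real image data; the layout, one patch per column, matches the Matlab code later in these notes):

```python
import numpy as np

rng = np.random.default_rng(2)

# x: one 16x16 patch per column, flattened to n = 256 pixels (n-by-m).
x = rng.uniform(0.0, 1.0, size=(256, 64))

# Per-image mean normalization: subtract each patch's own mean intensity.
mu = x.mean(axis=0, keepdims=True)  # one mean per patch, NOT per pixel
x = x - mu

# Every patch now has (approximately) zero mean intensity.
assert np.allclose(x.mean(axis=0), 0.0)
```

Averaging over `axis=0` (pixels within a patch) rather than `axis=1` (the same pixel across patches) is exactly the distinction the paragraph above draws.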

If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent. But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.

### Whitening

We have used PCA to reduce the dimension of the data. There is a closely related preprocessing step called **whitening** (or, in some literature, **sphering**) which is needed for some algorithms. If we are training on images, the raw input is redundant, since adjacent pixel values are highly correlated. The goal of whitening is to make the input less redundant; more formally, our desiderata are that our learning algorithm sees a training input where (i) the features are less correlated with each other, and (ii) the features all have the same variance.

### 2D example

We will first describe whitening using our previous 2D example. We will then describe how this can be combined with smoothing, and finally how to combine this with PCA.

How can we make our input features uncorrelated with each other? We had already done this when computing $x_{\rm rot}^{(i)} = U^T x^{(i)}$.

Repeating our previous figure, our plot for $x_{\rm rot}$ is:

The covariance matrix of this data is given by:

$$\begin{bmatrix} 7.29 & 0 \\ 0 & 0.69 \end{bmatrix}.$$

(Note: Technically, many of the statements in this section about the “covariance” will be true only if the data has zero mean. In the rest of this section, we will take this assumption as implicit in our statements. However, even if the data’s mean isn’t exactly zero, the intuitions we’re presenting here still hold true, and so this isn’t something that you should worry about.)

It is no accident that the diagonal values are $\lambda_1$ and $\lambda_2$. Further, the off-diagonal entries are zero; thus, $x_{\rm rot,1}$ and $x_{\rm rot,2}$ are uncorrelated, satisfying one of our desiderata for whitened data (that the features be less correlated).

To make each of our input features have unit variance, we can simply rescale each feature $x_{\rm rot,i}$ by $1/\sqrt{\lambda_i}$. Concretely, we define our whitened data $x_{\rm PCAwhite} \in \mathbb{R}^n$ as follows:

$$x_{\rm PCAwhite,i} = \frac{x_{\rm rot,i}}{\sqrt{\lambda_i}}.$$

Plotting $x_{\rm PCAwhite}$, we get:

This data now has covariance equal to the identity matrix $I$. We say that $x_{\rm PCAwhite}$ is our **PCA whitened** version of the data: The different components of $x_{\rm PCAwhite}$ are uncorrelated and have unit variance.
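The unit-covariance property can be checked numerically. A NumPy sketch on made-up correlated data (the original notes use Matlab):

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated zero-mean 2D data, one example per column.
x = rng.standard_normal((2, 5000))
x[1] += 0.9 * x[0]
x -= x.mean(axis=1, keepdims=True)

m = x.shape[1]
U, S, _ = np.linalg.svd(x @ x.T / m)

# PCA whitening: rotate, then rescale each feature by 1/sqrt(lambda_i).
x_pca_white = np.diag(1.0 / np.sqrt(S)) @ U.T @ x

# The whitened data has covariance equal to the identity (up to float error).
cov = x_pca_white @ x_pca_white.T / m
print(np.round(cov, 6))
```

Note that the identity covariance is exact here (not just approximate) because `S` holds the eigenvalues of the *sample* covariance of this very dataset.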

**Whitening combined with dimensionality reduction.** If you want to have data that is whitened and which is lower dimensional than the original input, you can also optionally keep only the top $k$ components of $x_{\rm PCAwhite}$. When we combine PCA whitening with regularization (described later), the last few components of $x_{\rm PCAwhite}$ will be nearly zero anyway, and thus can safely be dropped.

### ZCA Whitening

Finally, it turns out that this way of getting the data to have covariance identity $I$ isn't unique. Concretely, if $R$ is any orthogonal matrix, so that it satisfies $R R^T = R^T R = I$ (less formally, if $R$ is a rotation/reflection matrix), then $R \, x_{\rm PCAwhite}$ will also have identity covariance.

In **ZCA whitening**, we choose $R = U$. We define

$$x_{\rm ZCAwhite} = U x_{\rm PCAwhite}.$$

Plotting $x_{\rm ZCAwhite}$, we get:

It can be shown that out of all possible choices for $R$, this choice of rotation causes $x_{\rm ZCAwhite}$ to be as close as possible to the original input data $x$.

When using ZCA whitening (unlike PCA whitening), we usually keep all $n$ dimensions of the data, and do not try to reduce its dimension.
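ZCA whitening is PCA whitening followed by a rotation back into the original pixel basis. A NumPy sketch, again on made-up correlated data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated zero-mean 2D data, one example per column.
x = rng.standard_normal((2, 5000))
x[1] += 0.9 * x[0]
x -= x.mean(axis=1, keepdims=True)

m = x.shape[1]
U, S, _ = np.linalg.svd(x @ x.T / m)

# ZCA whitening: choose R = U, i.e. rotate back after PCA whitening.
x_zca_white = U @ np.diag(1.0 / np.sqrt(S)) @ U.T @ x

# Still identity covariance, but stays as close as possible to x itself.
cov = x_zca_white @ x_zca_white.T / m
assert np.allclose(cov, np.eye(2))
```

The whitening matrix $U \, {\rm diag}(1/\sqrt{\lambda_i}) \, U^T$ is symmetric, which is one way of seeing that ZCA treats the original coordinate axes evenhandedly.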

### Regularization

When implementing PCA whitening or ZCA whitening in practice, sometimes some of the eigenvalues $\lambda_i$ will be numerically close to 0, and thus the scaling step, where we divide by $\sqrt{\lambda_i}$, would involve dividing by a value close to zero; this might cause the data to blow up (take on large values) or otherwise be numerically unstable. In practice, we therefore implement this scaling step using a small amount of regularization, and add a small constant $\epsilon$ to the eigenvalues before taking their square root and inverse:

$$x_{\rm PCAwhite,i} = \frac{x_{\rm rot,i}}{\sqrt{\lambda_i + \epsilon}}.$$

When $x$ takes values around $[-1,1]$, a value of $\epsilon \approx 10^{-5}$ might be typical.

For the case of images, adding $\epsilon$ here also has the effect of slightly smoothing (or low-pass filtering) the input image. This also has a desirable effect of removing aliasing artifacts caused by the way pixels are laid out in an image, and can improve the features learned (details are beyond the scope of these notes).

ZCA whitening is a form of pre-processing of the data that maps it from $x$ to $x_{\rm ZCAwhite}$. It turns out that this is also a rough model of how the biological eye (the retina) processes images.

### Implementing PCA Whitening

In this section, we summarize the PCA, PCA whitening and ZCA whitening algorithms, and also describe how you can implement them using efficient linear algebra libraries.

First, we need to ensure that the data has (approximately) zero-mean. For natural images, we achieve this (approximately) by subtracting the mean value of each image patch.

We achieve this by computing the mean of each patch and subtracting it. In Matlab, we can do this by using:

```
avg = mean(x, 1); % Compute the mean pixel intensity value separately for each patch.
x = x - repmat(avg, size(x, 1), 1);
```

Next, we need to compute $\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T$. If you're implementing this in Matlab (or in C++, Java, etc., with access to an efficient linear algebra library), doing it as an explicit sum would be inefficient. Instead, we can compute it in one fell swoop as:

```
sigma = x * x' / size(x, 2);
```

(Check the math yourself for correctness.) Here, we assume that $x$ is a data structure that contains one training example per column (so, $x$ is an $n$-by-$m$ matrix).

Next, PCA computes the eigenvectors of $\Sigma$. One could do this using Matlab's `eig` function. However, because $\Sigma$ is a symmetric positive semi-definite matrix, it is more numerically reliable to do this using the `svd` function. Concretely, if you implement

```
[U,S,V] = svd(sigma);
```

then the matrix $U$ will contain the eigenvectors of $\Sigma$ (one eigenvector per column, sorted from top to bottom eigenvector), and the diagonal entries of the matrix $S$ will contain the corresponding eigenvalues (also sorted in decreasing order). The matrix $V$ will be equal to $U^T$, and can be safely ignored.

(Note: The `svd` function actually computes the singular vectors and singular values of a matrix, which for the special case of a symmetric positive semi-definite matrix—which is all that we’re concerned with here—is equal to its eigenvectors and eigenvalues. A full discussion of singular vectors vs. eigenvectors is beyond the scope of these notes.)

Finally, you can compute $x_{\rm rot}$ and $\tilde{x}$ as follows:

```
xRot = U' * x; % rotated version of the data.
xTilde = U(:,1:k)' * x; % reduced dimension representation of the data,
% where k is the number of eigenvectors to keep
```

This gives your PCA representation of the data in terms of the matrix $U$ and the reduced representation $\tilde{x}$ (`xTilde` in the code above). If $x$ is an $n$-by-$m$ matrix containing all your training data, `xTilde` will be a $k$-by-$m$ matrix.

To compute the PCA whitened data $x_{\rm PCAwhite}$, use

```
xPCAwhite = diag(1./sqrt(diag(S) + epsilon)) * U' * x;
```

Since the diagonal of $S$ contains the eigenvalues $\lambda_i$, this turns out to be a compact way of computing $x_{\rm PCAwhite,i} = x_{\rm rot,i}/\sqrt{\lambda_i + \epsilon}$ simultaneously for all $i$.

Finally, you can also compute the ZCA whitened data $x_{\rm ZCAwhite}$ as:

```
xZCAwhite = U * diag(1./sqrt(diag(S) + epsilon)) * U' * x;
```