# Data Preprocessing

### From Ufldl

Line 46: | Line 46: | ||

=== Reconstruction Based Models === | === Reconstruction Based Models === | ||

- | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data. | + | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data. |
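The effect of <tt>epsilon</tt> described above can be explored with a minimal NumPy sketch. It assumes one example per row (the tutorial itself may store examples as columns), and the function name `zca_whiten` and the toy data are illustrative only, not part of the original text.

```python
import numpy as np

def zca_whiten(X, epsilon):
    """ZCA-whiten X of shape (n_examples, n_features).

    Returns the whitened data and the ZCA matrix W.
    epsilon regularizes the eigenvalues; larger values damp
    low-variance (typically high-frequency) directions more strongly.
    """
    X_centered = X - X.mean(axis=0)                  # zero-mean each feature
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)           # cov is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return X_centered @ W, W

# Toy correlated data standing in for image patches (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16)) @ rng.normal(size=(16, 16))

for epsilon in (1e-8, 1e-1, 10.0):
    X_white, _ = zca_whiten(X, epsilon)
    variances = X_white.var(axis=0)
    print(f"epsilon={epsilon:g}: mean per-feature variance = {variances.mean():.4f}")
```

With real image patches one would reshape rows of `X_white` back into images and compare them visually with the originals, as the paragraph suggests; the printed variances here are only a numeric stand-in for that check.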

+ | |||

+ | |||

{{Quote| | {{Quote| | ||

Line 57: | Line 59: | ||

{{quote| | {{quote| | ||

Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | ||
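One way to follow this note is to separate fitting from applying, as in the sketch below. The helper names `fit_zca` and `apply_zca`, the default `epsilon`, and the synthetic data are assumptions made for illustration; examples are again stored one per row.

```python
import numpy as np

def fit_zca(X_train, epsilon=1e-2):
    """Estimate the mean vector and ZCA matrix from the training set only."""
    mean = X_train.mean(axis=0)
    X_centered = X_train - mean
    cov = X_centered.T @ X_centered / X_centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return mean, W

def apply_zca(X, mean, W):
    """Apply previously saved parameters to any data set (train or test)."""
    return (X - mean) @ W

# Synthetic train/test split drawn from the same distribution (an assumption).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 8))
X_train = rng.normal(size=(1000, 8)) @ mixing + 3.0
X_test = rng.normal(size=(200, 8)) @ mixing + 3.0

mean, W = fit_zca(X_train)                        # parameters come from training data
X_train_white = apply_zca(X_train, mean, W)
X_test_white = apply_zca(X_test, mean, W)         # test set reuses the saved values
```

The test set is never used to estimate `mean` or `W`, so no information from it leaks into the preprocessing.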

+ | |||

== Large Images == | == Large Images == | ||

Line 64: | Line 67: | ||

== Standard Pipeline == | == Standard Pipeline == | ||

+ | |||

+ | In this section, we describe several "standard pipelines" that have worked well for some datasets: | ||

+ | |||

+ | |||

+ | === Natural Grey-scale Images === | ||

+ | |||

+ | Since grey-scale images have the stationarity property, we usually first remove the mean-component from each data example separately (remove DC). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data. | ||
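The per-example mean removal (removing DC) might look like the following sketch, which assumes patches are flattened into rows; `remove_dc` and the random patches are hypothetical stand-ins. PCA/ZCA whitening would then be applied to the result.

```python
import numpy as np

def remove_dc(patches):
    """Subtract each patch's own mean intensity.

    patches has shape (n_patches, n_pixels); the mean is taken per row,
    i.e. per example, not per pixel position.
    """
    return patches - patches.mean(axis=1, keepdims=True)

# Random values standing in for flattened 8x8 grey-scale patches (an assumption).
rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(100, 64))
patches_no_dc = remove_dc(patches)
```

Note the axis: this differs from the per-feature mean used when zero-meaning the data set before whitening.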

+ | |||

+ | === Color Images === | ||

+ | |||

- | == | + | === Audio (MFCC/Spectrograms) === |

- | |||

- | |||

- | === | + | === MNIST Handwritten Digits === |

- | + | The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to map the data into the range <math>[0, 1]</math>. A sparse autoencoder often works well after this simple normalization. While one could also elect to use PCA/ZCA whitening if desired, this is not often done in practice. ''Note: Since the 0 value is meaningful in MNIST, we do ''not'' perform per-example mean normalization.'' |
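A minimal sketch of this rescaling, assuming the images arrive as an unsigned 8-bit NumPy array; the function name `rescale_pixels` and the tiny example array are illustrative, not taken from the tutorial.

```python
import numpy as np

def rescale_pixels(images_uint8):
    """Map pixel values from [0, 255] to [0, 1] by dividing by 255.

    No mean is subtracted, so background pixels with value 0 stay at 0.
    """
    return images_uint8.astype(np.float64) / 255.0

# A tiny made-up array standing in for MNIST pixel data.
images = np.array([[0, 128, 255], [64, 0, 32]], dtype=np.uint8)
scaled = rescale_pixels(images)
```

Converting to floating point before dividing avoids integer truncation.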