Data Preprocessing

From Ufldl

Line 46: Line 46:
=== Reconstruction Based Models ===
=== Reconstruction Based Models ===
-
In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data.
+
In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data.  
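
The effect of <tt>epsilon</tt> described above can be explored with a small NumPy sketch. It is only an illustration under stated assumptions: the function name <tt>zca_whiten</tt>, the toy data and the three <tt>epsilon</tt> values are hypothetical and do not come from the original text. Rather than plotting, it prints the mean per-feature variance after whitening, which falls further below 1 as <tt>epsilon</tt> grows.

```python
import numpy as np

def zca_whiten(X, epsilon):
    """ZCA-whiten X (n_examples x n_features); epsilon regularizes small eigenvalues."""
    # Zero-mean every feature across the dataset.
    X_centered = X - X.mean(axis=0)
    # Feature covariance matrix (n_features x n_features).
    sigma = X_centered.T @ X_centered / X_centered.shape[0]
    # sigma is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigvals, U = np.linalg.eigh(sigma)
    eigvals = np.maximum(eigvals, 0.0)  # guard against tiny negative round-off
    # Scale each principal direction by 1/sqrt(lambda + epsilon), then rotate back (ZCA).
    W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
    return X_centered @ W

# Hypothetical toy data: 16 correlated features with different scales.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 16)) * np.linspace(0.5, 3.0, 16)
X = Z + 0.5 * np.roll(Z, 1, axis=1)

for epsilon in (1e-8, 1e-1, 1e1):  # illustrative values only
    Xw = zca_whiten(X, epsilon)
    cov = Xw.T @ Xw / Xw.shape[0]
    # A larger epsilon damps low-variance directions more strongly.
    print(epsilon, np.diag(cov).mean())
```

For image patches, each row of the whitened output would be reshaped back into a patch and displayed next to the original in order to judge whether the result looks noisy or blurred.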
 +
 
 +
 
{{Quote|
{{Quote|
Line 57: Line 59:
{{quote|
{{quote|
Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the average vector that was used to zero-mean the data, (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }}
Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the average vector that was used to zero-mean the data, (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }}
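
The note above can be sketched in NumPy as a fit/apply pair. This is a minimal illustration, not a reference implementation: the helper names <tt>fit_zca</tt> and <tt>apply_zca</tt>, the random data and the value of <tt>epsilon</tt> are all assumptions made for the example.

```python
import numpy as np

def fit_zca(X_train, epsilon):
    """Compute the mean vector and ZCA matrix from the training set only."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    sigma = Xc.T @ Xc / Xc.shape[0]
    eigvals, U = np.linalg.eigh(sigma)
    eigvals = np.maximum(eigvals, 0.0)  # guard against tiny negative round-off
    W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
    return mean, W

def apply_zca(X, mean, W):
    """Apply previously saved parameters; nothing is re-estimated here."""
    return (X - mean) @ W

# Hypothetical data standing in for a real train/test split.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, size=(400, 8))
X_test = rng.normal(loc=2.0, size=(100, 8))

mean, W = fit_zca(X_train, epsilon=1e-5)    # (a) mean vector, (b) whitening matrix
X_train_w = apply_zca(X_train, mean, W)
X_test_w = apply_zca(X_test, mean, W)       # test set reuses the saved values
```

Because the test set is transformed with the training statistics, its mean after preprocessing is close to zero but generally not exactly zero, which is expected.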
 +
== Large Images ==
== Large Images ==
Line 64: Line 67:
== Standard Pipeline ==
== Standard Pipeline ==
 +
 +
In this section, we describe several "standard pipelines" that have worked well for some datasets:
 +
 +
 +
=== Natural Grey-scale Images ===
 +
 +
Since natural grey-scale images have the stationarity property, we usually first remove the mean component from each data example separately (i.e., remove the DC component). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data.
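
The per-example mean removal mentioned above might look as follows in NumPy. The function name <tt>remove_dc</tt> and the random "patches" are hypothetical; the sketch covers only the DC-removal step, with whitening assumed to follow as a separate step.

```python
import numpy as np

def remove_dc(patches):
    """Subtract each example's own mean intensity (rows are flattened patches)."""
    # axis=1 averages over the pixels of one patch; keepdims lets it broadcast back.
    return patches - patches.mean(axis=1, keepdims=True)

# Hypothetical stand-in for grey-scale patches: 5 patches of 8x8 pixels,
# each with a different overall brightness offset.
rng = np.random.default_rng(0)
offsets = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
patches = rng.uniform(0.0, 0.2, size=(5, 64)) + offsets[:, None]

patches_no_dc = remove_dc(patches)
print(patches_no_dc.mean(axis=1))  # each entry is numerically zero
```

Note that the mean here is taken per example (over pixels), which differs from the per-feature mean (over examples) that is subtracted before computing a whitening matrix.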
 +
 +
=== Color Images ===
 +
-
== Model Idiosyncrasies ==
+
=== Audio (MFCC/Spectrograms) ===
-
=== Sparse Autoencoder ===
 
-
==== Sigmoid Decoders ====
 
-
==== Linear Decoders ====
+
=== MNIST Handwritten Digits ===
-
=== Independent Component Analysis ===
+
The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to map the data into the range <math>[0, 1]</math>. A sparse autoencoder often works well after this simple normalization. While one could also elect to use PCA/ZCA whitening if desired, this is not often done in practice. ''Note: Since the 0 value is meaningful in MNIST, we do ''not'' perform per-example mean normalization.''
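
A minimal sketch of this rescaling, assuming the digits are available as an unsigned 8-bit array; the function name <tt>rescale_mnist</tt> and the tiny example array are illustrative and are not part of any MNIST loader.

```python
import numpy as np

def rescale_mnist(images_uint8):
    """Map pixel values from [0, 255] to [0, 1] without shifting the zero level."""
    # Convert to float first so the division is not an integer operation.
    return images_uint8.astype(np.float64) / 255.0

# Hypothetical stand-in for two flattened digit images.
toy = np.array([[0, 128, 255],
                [0, 0, 64]], dtype=np.uint8)
scaled = rescale_mnist(toy)
```

Dividing by 255 leaves background pixels at exactly 0, which matches the note about not subtracting a per-example mean.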

Revision as of 06:48, 29 April 2011
