Data Preprocessing

From Ufldl

In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value that achieves low-pass filtering. One way to check this is to pick a value for epsilon, run ZCA whitening, and then visualize the data before and after whitening. If epsilon is set too low, the data will look very noisy; conversely, if it is set too high, you will see a "blurred" version of the original data.
In reconstruction-based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. If <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise, which the model will then need to reconstruct well. As a result, it is very important for reconstruction-based models to be trained on data that has been low-pass filtered.
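As a concrete sketch, ZCA whitening with a regularizing <tt>epsilon</tt> can be written in a few lines of NumPy. This is a minimal illustration, not code from this tutorial; the function name <tt>zca_whiten</tt> and the example epsilon value are our own. Larger epsilon damps the small-eigenvalue (high-frequency) directions more strongly, which is exactly the low-pass filtering discussed above.

```python
import numpy as np

def zca_whiten(X, epsilon=0.1):
    """ZCA-whiten X (n_examples x n_features). Illustrative sketch."""
    # Feature mean-normalization before computing the covariance.
    X = X - X.mean(axis=0)
    # Covariance of the zero-mean data.
    sigma = X.T @ X / X.shape[0]
    # Eigendecomposition of the (symmetric) covariance via SVD.
    U, S, _ = np.linalg.svd(sigma)
    # epsilon regularizes small eigenvalues; a larger value gives
    # stronger low-pass filtering of the data.
    W = U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T
    return X @ W
```

With a near-zero epsilon the whitened data has (approximately) identity covariance; raising epsilon trades exact whitening for noise suppression.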
=== Color Images ===
For color images, the stationarity property does not hold across color channels. Hence, we usually start by rescaling the data (making sure it is in <math>[0, 1]</math>) and then applying PCA/ZCA with a sufficiently large <tt>epsilon</tt>. Note that it is important to perform feature mean-normalization before computing the PCA transformation.
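The order of operations for color patches can be sketched as follows. This is an illustrative NumPy sketch under our own assumptions (raw pixel values in <math>[0, 255]</math>, an example epsilon of 0.1); the function name is hypothetical.

```python
import numpy as np

def preprocess_color_patches(X, epsilon=0.1):
    """X: n_patches x n_pixels, raw values in [0, 255]. Illustrative sketch."""
    X = X / 255.0            # rescale into [0, 1] first
    X = X - X.mean(axis=0)   # feature mean-normalization before PCA/ZCA
    sigma = X.T @ X / X.shape[0]
    U, S, _ = np.linalg.svd(sigma)
    # ZCA with a sufficiently large epsilon, as recommended for color images.
    return X @ U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T
```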
=== Audio (MFCC/Spectrograms) ===
For audio data (MFCC and Spectrograms), each dimension usually has a different scale (variance). This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, preprocessing usually starts with simple data standardization (zero mean, unit variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>).
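The per-dimension standardization step can be sketched as follows (a minimal NumPy illustration; the small guard constant is our own addition to avoid division by zero on constant dimensions):

```python
import numpy as np

def standardize(X):
    """Zero mean, unit variance per data dimension. X: n_frames x n_coeffs."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    # Small constant guards against zero-variance dimensions (assumption).
    return (X - mu) / (sd + 1e-8)
```

After this step, every coefficient (including any temporal-derivative dimensions) contributes on a comparable scale, so the subsequent PCA/ZCA whitening is not dominated by the high-variance dimensions.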
=== MNIST Handwritten Digits ===
The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. A sparse autoencoder often works well after this simple normalization. One could also use PCA/ZCA whitening if desired, but this is not often done in practice. ''Note: since the 0 value is meaningful in MNIST, we do '''not''' perform per-example mean normalization.''
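The simple rescaling above amounts to a one-line transform (an illustrative sketch; the function name is our own):

```python
import numpy as np

def rescale_mnist(X):
    """X: n_images x 784, uint8 pixel values in [0, 255]. Illustrative sketch."""
    # Simple rescaling into [0, 1]. No per-example mean subtraction,
    # since the 0 (background) value is meaningful in MNIST.
    return X.astype(np.float64) / 255.0
```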

Revision as of 06:53, 29 April 2011
