Data Preprocessing

== Overview ==

Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the right parameters for data preprocessing are usually not apparent without considerable experience working with the algorithms. On this page, we hope to demystify some of the preprocessing methods and also provide tips (and a "standard pipeline") for preprocessing data.

{{quote |
Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract from each data point its own mean (also known as removing the DC component, local mean subtraction, or subtractive normalization). While this makes sense for data such as natural images, it is less obviously appropriate for data with a natural "zero" point such as MNIST images (where all data points use the same value of 0 to represent an empty background).
}}


== Data Normalization ==

A standard first step in data preprocessing is data normalization. There are several possible approaches, and the appropriate one is usually clear from the data. Common methods for feature normalization are:

* Simple Rescaling
* Per-example mean subtraction (a.k.a. remove DC)
* Feature Standardization (zero-mean and unit variance for each feature across the dataset)

=== Simple Rescaling ===

In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range <math>[0, 1]</math> or <math>[-1, 1]</math> (depending on your dataset). This is useful for later processing, as many ''default'' parameters (e.g., epsilon in PCA-whitening) assume that the data has been scaled to a reasonable range.

'''Example: ''' When processing natural images, we often obtain pixel values in the range <math>[0, 255]</math>. It is a common operation to rescale these values to  <math>[0, 1]</math> by dividing the data by 255.
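
As a rough illustration of the rescaling described above, the following NumPy sketch divides pixel values by 255 and also shows a per-dimension min-max variant. The function names <code>rescale_pixels</code> and <code>rescale_to_range</code> are hypothetical, and the layout (one example per row, one dimension per column) is an assumption made for this sketch rather than something fixed by this page.

```python
import numpy as np

def rescale_pixels(images, max_value=255.0):
    """Map raw pixel intensities from [0, max_value] to [0, 1]."""
    return np.asarray(images, dtype=np.float64) / max_value

def rescale_to_range(data, low=0.0, high=1.0):
    """Min-max rescale each column (dimension) independently to [low, high].

    Assumes rows are examples and columns are dimensions (an assumption
    of this sketch).
    """
    data = np.asarray(data, dtype=np.float64)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # A constant column has zero range; use 1 so it maps to `low`
    # instead of triggering a division by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return low + (data - col_min) / col_range * (high - low)

# Two tiny "images" with three pixels each, stored as 8-bit integers.
pixels = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)
scaled = rescale_pixels(pixels)                           # values in [0, 1]
symmetric = rescale_to_range(pixels, low=-1.0, high=1.0)  # values in [-1, 1]
```

Dividing by a fixed constant such as 255 keeps the relative scale of the dimensions intact, whereas the min-max variant stretches every dimension to the full target range; which one is appropriate depends on the dataset.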

=== Per-example mean subtraction ===

If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean value of each example, computed separately for every example.

'''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here.
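
A minimal sketch of per-example mean subtraction, assuming each row of the array holds one flattened image patch (this layout and the name <code>remove_dc</code> are assumptions of the sketch). It shows that two copies of the same patch under different brightness become identical once the per-example mean is removed.

```python
import numpy as np

def remove_dc(patches):
    """Subtract from every example (row) its own mean value."""
    patches = np.asarray(patches, dtype=np.float64)
    # keepdims=True keeps shape (n_examples, 1) so the mean of each row
    # broadcasts across that row only.
    return patches - patches.mean(axis=1, keepdims=True)

# The same patch content under two illumination levels.
dark = np.array([0.2, 0.4, 0.6, 0.8])
bright = dark + 0.1
centered = remove_dc(np.vstack([dark, bright]))
# Both rows of `centered` now hold the same zero-mean pattern.
```

Note that the mean is taken along the example axis, not across the dataset; subtracting the dataset-wide mean of each dimension is a different operation and is part of feature standardization.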

=== Feature Standardization ===

Feature standardization refers to (independently) setting each dimension of the data to have zero mean and unit variance. This is the most common method for normalization and is widely used (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracting it from that dimension. Next, each dimension is divided by its standard deviation.

'''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
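
The two steps above (subtract the per-dimension mean, then divide by the per-dimension standard deviation) can be sketched as follows. The helper names <code>fit_standardizer</code> and <code>standardize</code>, the row-per-example layout, and the handling of constant dimensions are all assumptions of this sketch, not part of the method as described on this page.

```python
import numpy as np

def fit_standardizer(train, eps=1e-8):
    """Compute the per-dimension mean and standard deviation across the dataset.

    Assumes rows are examples and columns are dimensions.
    """
    train = np.asarray(train, dtype=np.float64)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption of this sketch: a (near-)constant dimension carries no scale,
    # so divide it by 1 rather than by ~0.
    std = np.where(std < eps, 1.0, std)
    return mean, std

def standardize(data, mean, std):
    """Apply previously computed statistics to data with the same dimensions."""
    return (np.asarray(data, dtype=np.float64) - mean) / std

# Three examples; the second dimension has a much larger scale than the
# first, and the third is constant.
train = np.array([[1.0, 100.0, 7.0],
                  [2.0, 300.0, 7.0],
                  [3.0, 500.0, 7.0]])
mean, std = fit_standardizer(train)
z = standardize(train, mean, std)
# Each non-constant column of `z` has zero mean and unit variance.
```

Keeping the statistics separate from their application makes it possible to reuse the mean and standard deviation computed on one set of data when transforming another, which is a common design choice rather than a requirement.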

== PCA/ZCA Whitening ==

How to choose epsilon? Do we need low-pass filtering?

== Large Images ==

1/f Whitening


== Standard Pipeline ==


== Model Idiosyncrasies ==

=== Sparse Autoencoder ===

==== Sigmoid Decoders ====

==== Linear Decoders ====

=== Independent Component Analysis ===