Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the exact parameters for data preprocessing are usually not immediately apparent unless one has much experience working with the algorithms. On this page, we hope to demystify some of the preprocessing methods and also provide tips (and a "standard pipeline") for preprocessing data.
A standard first step in data preprocessing is data normalization. While there are a few possible approaches, the appropriate choice is usually clear from the data itself. The common methods for feature normalization are:
- Simple Rescaling
- Per-example mean subtraction (a.k.a. remove DC)
- Feature Standardization (zero-mean and unit variance for each feature across the dataset; see the sketch after this list)
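As a concrete illustration of feature standardization, here is a minimal NumPy sketch. The data matrix `X` and the small stabilizing constant are assumptions for this example, not prescribed values.

```python
import numpy as np

# X: hypothetical data matrix, one example per row, one feature per column.
X = np.random.rand(100, 20) * 50.0

# Feature standardization: zero mean and unit variance for each
# feature, computed across the whole dataset (axis=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_standardized = (X - mean) / (std + 1e-8)  # small constant guards against division by zero
```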
In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range [0,1] or [-1,1] (depending on your dataset). This is useful for later processing as many default parameters (e.g., epsilon in PCA-whitening) treat the data as if it has been scaled to a reasonable range.
Example: When processing natural images, we often obtain pixel values in the range [0,255]. It is a common operation to rescale these values to [0,1] by dividing the data by 255.
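The following sketch shows this rescaling in NumPy; the batch of 8-bit images is a hypothetical placeholder, and the [-1,1] variant is included only as an alternative some pipelines expect.

```python
import numpy as np

# images: hypothetical batch of 8-bit images with pixel values in [0, 255].
images = np.random.randint(0, 256, size=(10, 32, 32), dtype=np.uint8)

# Rescale to [0, 1] by dividing by the maximum possible pixel value.
images_01 = images.astype(np.float64) / 255.0

# Alternatively, rescale to [-1, 1] if the later pipeline expects it.
images_pm1 = images_01 * 2.0 - 1.0
```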
Per-example mean subtraction
If the data is stationary (i.e., the statistics of each data dimension follow the same distribution), then it can be useful to subtract the mean value of each example (computed per-example). For images, this corresponds to removing the average brightness of each example; for audio, it removes the DC component, which is why this step is also known as "remove DC".
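A minimal sketch of per-example mean subtraction follows; the data matrix `X` (flattened patches or frames, one example per row) is an assumption for illustration.

```python
import numpy as np

# X: hypothetical data matrix, one example per row (e.g., flattened
# image patches or audio frames).
X = np.random.rand(100, 64)

# Per-example mean subtraction: subtract each example's own mean,
# computed across that example's dimensions (axis=1). For images this
# removes the average brightness; for audio it removes the DC offset.
X_centered = X - X.mean(axis=1, keepdims=True)
```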
Two practical questions arise when whitening the data: how should epsilon be chosen, and do we need low-pass filtering first?
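To make concrete where epsilon enters, here is a minimal PCA-whitening sketch, assuming a zero-mean data matrix with one example per column; the value 1e-5 for epsilon is only a placeholder, not a recommendation.

```python
import numpy as np

# X: hypothetical data matrix, one example per column; features are
# made zero-mean across the dataset before computing the covariance.
X = np.random.rand(64, 1000)
X = X - X.mean(axis=1, keepdims=True)

# Covariance matrix and its eigendecomposition.
sigma = X @ X.T / X.shape[1]
U, S, _ = np.linalg.svd(sigma)

# PCA whitening: epsilon regularizes the small eigenvalues so that
# low-variance (often noise) directions are not blown up. Its value
# here (1e-5) is a placeholder; a sensible choice depends on the
# data's scale, which is why rescaling to a standard range matters.
epsilon = 1e-5
X_pca_white = np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ X
```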