# Data Preprocessing

### From Ufldl

Line 15: | Line 15: | ||

* Per-example mean subtraction (a.k.a. remove DC) | * Per-example mean subtraction (a.k.a. remove DC) | ||

* Feature Standardization (zero-mean and unit variance for each feature across the dataset) | * Feature Standardization (zero-mean and unit variance for each feature across the dataset) | ||

+ | |||

=== Simple Rescaling === | === Simple Rescaling === | ||

Line 26: | Line 27: | ||

If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean-value for each example (computed per-example). | If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean-value for each example (computed per-example). | ||

- | '''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here. | + | '''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here. '''Note:''' While this method is generally used for images, one might want to take more care when applying this to color images. In particular, the stationarity property does not generally apply across pixels in different color channels. |
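A minimal sketch of per-example mean subtraction, assuming grayscale patches stored one per row in a NumPy array. The function name `remove_dc` and the toy patches are hypothetical and chosen only for illustration.

```python
import numpy as np

def remove_dc(patches):
    """Subtract each example's own mean (one example per row)."""
    patches = np.asarray(patches, dtype=np.float64)
    # axis=1: the mean is taken over the pixels of a single example,
    # not over the dataset.
    return patches - patches.mean(axis=1, keepdims=True)

# Two hypothetical 4-pixel patches with the same content but
# different overall brightness.
dark = np.array([0.1, 0.2, 0.1, 0.2])
bright = dark + 0.5
centered = remove_dc(np.stack([dark, bright]))
```

After the subtraction both rows are identical, which reflects the point of the example: the brightness offset is discarded and only the content remains.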

=== Feature Standardization === | === Feature Standardization === | ||

Line 33: | Line 34: | ||

'''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently. | '''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently. | ||
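A minimal sketch of feature standardization, assuming the data is a NumPy array with one example per row and one feature per column. The function name `standardize_features`, the small constant `eps` (added to guard against division by zero), and the synthetic "MFCC-like" matrix are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def standardize_features(X, eps=1e-8):
    """Give every feature (column) zero mean and unit variance."""
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)   # per-feature mean across the dataset
    std = X.std(axis=0)     # per-feature standard deviation
    return (X - mean) / (std + eps), mean, std

rng = np.random.default_rng(0)
# Hypothetical stand-in for MFCC frames: 200 frames x 3 components,
# where the first component has a much larger scale and offset.
X = rng.normal(size=(200, 3)) * np.array([50.0, 1.0, 0.5])
X = X + np.array([100.0, 0.0, 0.0])

X_std, mean, std = standardize_features(X)
```

Returning `mean` and `std` alongside the standardized data makes it possible to reuse the same statistics on other data later.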

+ | |||

== PCA/ZCA Whitening == | == PCA/ZCA Whitening == | ||

- | + | After the simple normalizations, whitening is often the next preprocessing step, and it helps our algorithms work better. In practice, many deep learning algorithms rely on whitening to learn good features. | |

+ | |||

+ | In performing PCA/ZCA whitening, it is important to first zero-mean the features (across the dataset) so that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. '''The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.''' | ||
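A small sketch of why the zero-mean step comes first, assuming the covariance is estimated as the average outer product of the examples (one example per row). The synthetic data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 examples, 2 features, with a large non-zero mean.
X = rng.normal(size=(500, 2)) + np.array([10.0, -5.0])
m = X.shape[0]

# Without centering, X^T X / m is the second-moment matrix,
# which also contains the contribution of the mean.
second_moment = X.T @ X / m

# Zero-mean each feature across the dataset, then form the covariance.
X_centered = X - X.mean(axis=0)
sigma = X_centered.T @ X_centered / m

# Reference value from NumPy (bias=True normalizes by m, like above).
reference = np.cov(X, rowvar=False, bias=True)
```

Here `sigma` agrees with `reference`, while `second_moment` does not, so whitening built on the uncentered matrix would be scaled by the mean rather than by the spread of the data.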

+ | |||

+ | |||

+ | {{quote| | ||

+ | Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the average vector that was used to zero-mean the data, (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | ||
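The note above can be sketched as a fit/apply pair, assuming one example per row. The function names `fit_whitening` and `apply_whitening`, the regularization constant `epsilon`, and the synthetic train/test data are assumptions for illustration, not a definitive implementation.

```python
import numpy as np

def fit_whitening(X_train, epsilon=1e-5):
    """Estimate the mean vector and PCA/ZCA whitening matrices
    from the training set only."""
    X_train = np.asarray(X_train, dtype=np.float64)
    mean = X_train.mean(axis=0)
    Xc = X_train - mean                      # zero-mean across the dataset
    sigma = Xc.T @ Xc / Xc.shape[0]          # covariance matrix
    eigvals, U = np.linalg.eigh(sigma)       # sigma = U diag(eigvals) U^T
    scale = 1.0 / np.sqrt(eigvals + epsilon) # epsilon is an assumed regularizer
    W_pca = scale[:, None] * U.T             # diag(scale) @ U^T
    W_zca = U @ W_pca                        # U @ diag(scale) @ U^T
    return mean, W_pca, W_zca

def apply_whitening(X, mean, W):
    """Apply saved parameters to any split (rows are examples)."""
    return (np.asarray(X, dtype=np.float64) - mean) @ W.T

rng = np.random.default_rng(0)
# Hypothetical correlated data with a non-zero mean.
A = np.array([[2.0, 0.0, 0.0], [1.5, 1.0, 0.0], [0.5, 0.3, 0.2]])
X_train = rng.normal(size=(1000, 3)) @ A.T + 3.0
X_test = rng.normal(size=(200, 3)) @ A.T + 3.0

mean, W_pca, W_zca = fit_whitening(X_train)
train_white = apply_whitening(X_train, mean, W_zca)
# The test set reuses the saved mean and matrix; nothing is re-estimated.
test_white = apply_whitening(X_test, mean, W_zca)
```

The whitened training data has approximately identity covariance. The test data is transformed with the same saved values, so its covariance is only close to identity to the extent that the test set resembles the training set.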

== Large Images == | == Large Images == |