Data Preprocessing

From Ufldl

'''Example:''' When processing natural images, we often obtain pixel values in the range <math>[0, 255]</math>. It is common to rescale these values to <math>[0, 1]</math> by dividing the data by 255.
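
The rescaling described above can be sketched in a few lines of plain Python. This is only an illustration: the function name <code>rescale_pixels</code> and the parameter <code>max_value</code> are hypothetical and do not come from the tutorial.

```python
def rescale_pixels(pixels, max_value=255.0):
    """Map raw intensities in [0, max_value] to floats in [0, 1].

    `pixels` is a flat list of numbers; `max_value` is an assumed
    upper bound of the raw range (255 for 8-bit images).
    """
    return [p / max_value for p in pixels]


# Three raw 8-bit pixel values, chosen only for illustration.
raw = [0, 51, 255]
scaled = rescale_pixels(raw)
```

With real image data one would typically apply the same division to a whole array at once using a numerical library.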
 
=== Per-example mean subtraction ===
If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean value of each example (computed per example).
 
'''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here.
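
A minimal sketch of per-example mean subtraction follows, assuming each example is stored as a flat list of pixel intensities. The names <code>subtract_example_mean</code> and <code>patches</code> are hypothetical, and the values are made up for illustration.

```python
def subtract_example_mean(example):
    """Subtract the mean of one example from each of its values.

    The mean is computed over the values of this single example,
    not across the dataset.
    """
    mean = sum(example) / len(example)
    return [x - mean for x in example]


# Two tiny "image patches" with different average brightness.
patches = [
    [0.2, 0.4, 0.6],
    [0.7, 0.8, 0.9],
]
centered = [subtract_example_mean(p) for p in patches]
```

After this step every patch has zero mean, so a uniform change in brightness applied to a patch no longer changes its representation.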
 
=== Feature Standardization ===
 
Feature standardization refers to (independently) setting each dimension of the data to have zero mean and unit variance. This is the most common method of normalization and is widely used (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracting it from that dimension. Next, each dimension is divided by its standard deviation.
 
'''Example:''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
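
The two steps of feature standardization (subtract the per-dimension mean, then divide by the per-dimension standard deviation) might be sketched as follows. This is an illustration under stated assumptions: the function name <code>standardize_features</code> is hypothetical, the population standard deviation is used (the text does not specify population versus sample), and dimensions with zero variance are left at zero instead of being divided by zero.

```python
import statistics


def standardize_features(data):
    """Give each dimension (column) zero mean and unit variance.

    `data` is a list of examples, each a list of feature values.
    Statistics are computed per dimension across the dataset.
    """
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation: an assumption, see the lead-in.
    stds = [statistics.pstdev(col) for col in columns]
    # Assumption: a constant dimension is only centered, not scaled.
    scales = [s if s > 0 else 1.0 for s in stds]
    return [
        [(x - m) / s for x, m, s in zip(row, means, scales)]
        for row in data
    ]


# Two dimensions on very different scales, as with a dominant
# first component; the values are made up for illustration.
data = [
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
]
standardized = standardize_features(data)
```

After standardization both dimensions have the same scale, so neither dominates purely because of its units or magnitude.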
== PCA/ZCA Whitening ==

Revision as of 06:18, 29 April 2011
