# Data Preprocessing

### From Ufldl

Line 4: | Line 4: | ||

{{quote | | {{quote | | ||

- | Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to opt to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as remove DC, local mean subtraction, subtractive normalization). While this makes sense for data such as natural images, it is less obvious for data | + | Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to opt to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as remove DC, local mean subtraction, subtractive normalization). While this makes sense for data such as natural images, it is less obvious for data where stationarity does not hold. |

}} | }} | ||

Line 42: | Line 42: | ||

In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.) | In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.) | ||

- | Next, one needs to select the value of < | + | Next, one needs to select the value of <tt>epsilon</tt> to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this was the regularization term that has an effect of ''low-pass filtering'' the data). It turns out that selecting this value can also play an important role for feature learning, we discuss two cases for selecting <tt>epsilon</tt>: |

=== Reconstruction Based Models === | === Reconstruction Based Models === | ||

- | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data. | + | In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of epsilon is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data. A good way to get a feel for the magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e. choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise. |

+ | |||

+ | [[File:ZCA_Eigenvalues_Plot.png]] | ||

In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered. | In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered. | ||

Line 52: | Line 54: | ||

{{Quote| | {{Quote| | ||

Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>. | Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>. | ||

+ | }} | ||

=== ICA-based Models (with orthogonalization) === | === ICA-based Models (with orthogonalization) === | ||

For ICA-based models with orthogonalization, it is ''very'' important for the data to be as close to white (identity covariance) as possible. This is a side-effect of using orthogonalization to decorrelate the features learned (more details in [[Independent Component Analysis | ICA]]). Hence, in this case, you will want to use an <tt>epsilon</tt> that is as small as possible (e.g., <math>epsilon = 1e-6</math>). | For ICA-based models with orthogonalization, it is ''very'' important for the data to be as close to white (identity covariance) as possible. This is a side-effect of using orthogonalization to decorrelate the features learned (more details in [[Independent Component Analysis | ICA]]). Hence, in this case, you will want to use an <tt>epsilon</tt> that is as small as possible (e.g., <math>epsilon = 1e-6</math>). | ||

+ | |||

+ | |||

+ | {{Quote| | ||

+ | Tip: In PCA whitening, one also has the option of performing dimension reduction while whitening the data. This is usually an excellent idea since it can greatly speed up the algorithms (less computation and less parameters). A simple rule of thumb to choose how many principle components to retain is to keep enough components to have 99% of the variance retained (more details at [[PCA#Number_of_components_to_retain | PCA]]) | ||

+ | }} | ||

+ | |||

{{quote| | {{quote| | ||

Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters used be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters used be saved for use with the test set: (a) average vector that was used to zero-mean the data, (b) whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }} | ||

- | |||

== Large Images == | == Large Images == | ||

Line 66: | Line 74: | ||

- | == Standard | + | == Standard Pipelines == |

In this section, we describe several "standard pipelines" that have worked well for some datasets: | In this section, we describe several "standard pipelines" that have worked well for some datasets: | ||

- | |||

=== Natural Grey-scale Images === | === Natural Grey-scale Images === | ||

Line 81: | Line 88: | ||

=== Audio (MFCC/Spectrograms) === | === Audio (MFCC/Spectrograms) === | ||

- | For audio data (MFCC and Spectrograms), each dimension usually have different scales (variances). This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, the preprocessing usually starts with simple data standardization (zero-mean, unit-variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>). | + | For audio data (MFCC and Spectrograms), each dimension usually have different scales (variances); the first component of MFCCs, for example, is the DC component and usually has a larger magnitude than the other components. This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, the preprocessing usually starts with simple data standardization (zero-mean, unit-variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>). |

=== MNIST Handwritten Digits === | === MNIST Handwritten Digits === | ||

- | The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. | + | The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean-value per example can also help feature learning. ''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.'' |

+ | |||

+ | |||

+ | |||

+ | {{Languages|数据预处理|中文}} |