# Data Preprocessing

Revision as of 03:47, 8 May 2011 by Cyfoo (→PCA/ZCA Whitening); older revision as of 01:20, 30 April 2011 (→Audio (MFCC/Spectrograms)).

In performing PCA/ZCA whitening, it is important to first zero-mean the features (across the dataset) to ensure that $\frac{1}{m} \sum_i x^{(i)} = 0$. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions/pixels.)

Next, one needs to select the value of epsilon to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this is the regularization term, which has the effect of ''low-pass filtering'' the data). Selecting this value can also play an important role in feature learning. We discuss two cases for selecting epsilon:

=== Reconstruction Based Models ===

In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set epsilon to a value such that low-pass filtering is achieved. One way to check this is to set a value for epsilon, run ZCA whitening, and then visualize the data before and after whitening.
If the value of epsilon is set too low, the data will look very noisy; conversely, if epsilon is set too high, you will see a "blurred" version of the original data. A good way to get a feel for the magnitude of epsilon to try is to plot the eigenvalues of the covariance matrix on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high-frequency noise components. You will want to choose epsilon such that most of the "long tail" is filtered out, i.e., choose epsilon greater than most of the small eigenvalues corresponding to the noise.

[[File:ZCA_Eigenvalue_Plot.png]]

In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. If epsilon is set too ''low'', the data will contain a lot of noise, which the model will then need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered.
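
As a rough illustration of the steps described so far (zero-mean the features, compute the covariance matrix, inspect its eigenvalues, then whiten with a regularization term epsilon), here is a minimal NumPy sketch. The function names `fit_zca` and `apply_zca`, the synthetic data, and the use of `np.linalg.eigh` for the eigendecomposition are assumptions made for this example and are not taken from the original page. The mean and the whitening matrix are returned separately so that they can be stored and applied again to other data.

```python
import numpy as np

def fit_zca(X, epsilon):
    """Estimate ZCA whitening parameters from X of shape (m, n).

    Hypothetical helper: m examples in rows, n features in columns.
    Returns the feature mean, the ZCA matrix, and the covariance
    eigenvalues sorted in descending order (useful for plotting).
    """
    mean = X.mean(axis=0)                  # per-feature mean across the dataset
    Xc = X - mean                          # zero-mean before computing covariance
    sigma = Xc.T @ Xc / X.shape[0]         # covariance matrix, shape (n, n)
    eigvals, U = np.linalg.eigh(sigma)     # eigh returns ascending eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    # ZCA: rotate, rescale by 1/sqrt(lambda + epsilon), rotate back.
    W = U @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ U.T
    return mean, W, eigvals[::-1]

def apply_zca(X, mean, W):
    """Whiten X with previously computed parameters (W is symmetric)."""
    return (X - mean) @ W

# Synthetic data (an assumption for this sketch): the last feature has very
# small variance and stands in for a "long tail" noise component.
rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.0, 0.05])
X_train = rng.normal(size=(500, 4)) * scales
X_test = rng.normal(size=(100, 4)) * scales

epsilon = 0.1
mean, W, eigvals = fit_zca(X_train, epsilon)
print("eigenvalues (descending):", eigvals)
print("eigenvalues smaller than epsilon:", int((eigvals < epsilon).sum()))

# Reuse the stored mean and matrix on data not used to compute them.
X_test_white = apply_zca(X_test, mean, W)
```

Printing or plotting `eigvals` gives a sense of where the small eigenvalues lie, so that a candidate epsilon can be compared against them before visualizing the whitened data.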
{{quote|
Tip: If your data has been scaled reasonably (e.g., to $[0, 1]$), start with $\epsilon = 0.01$ or $\epsilon = 0.1$.
}}

=== ICA-based Models (with orthogonalization) ===

{{quote|
Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the average vector that was used to zero-mean the data, and (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values.
}}

== Large Images ==