Data Preprocessing
== Overview ==

Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the exact parameters for data preprocessing are usually not immediately apparent unless one has much experience working with the algorithms. In this page, we hope to demystify some of the preprocessing methods and also provide tips (and a "standard pipeline") for preprocessing data.

{{quote | Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to opt to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as remove DC, local mean subtraction, subtractive normalization). While this makes sense for data such as natural images, it is less obvious for data where stationarity does not hold. }}

== Data Normalization ==

A standard first step in data preprocessing is data normalization. While there are a few possible approaches, the right choice is usually clear given the data. The common methods for feature normalization are:

* Simple Rescaling
* Per-example mean subtraction (a.k.a. remove DC)
* Feature Standardization (zero-mean and unit variance for each feature across the dataset)

=== Simple Rescaling ===

In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range <math>[0, 1]</math> or <math>[-1, 1]</math> (depending on your dataset). This is useful for later processing, as many ''default'' parameters (e.g., <tt>epsilon</tt> in PCA whitening) treat the data as if it has been scaled to a reasonable range.

'''Example:''' When processing natural images, we often obtain pixel values in the range <math>[0, 255]</math>. It is a common operation to rescale these values to <math>[0, 1]</math> by dividing the data by 255.
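For concreteness, here is a minimal NumPy sketch of simple rescaling. The tutorial itself gives no code, so the <tt>images</tt> array and its shape are hypothetical:

<pre>
import numpy as np

# Hypothetical example: 'images' is an (m, n) array of m image vectors
# with raw pixel intensities in [0, 255].
images = np.random.randint(0, 256, size=(100, 784)).astype(np.float64)

# Simple rescaling: divide by the maximum possible value so that
# every entry lies in [0, 1].
images_rescaled = images / 255.0

assert images_rescaled.min() >= 0.0 and images_rescaled.max() <= 1.0
</pre>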
=== Per-example mean subtraction ===

If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean value of each example from itself (computed per example).

'''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here.

'''Note:''' While this method is generally used for images, one might want to take more care when applying it to color images. In particular, the stationarity property does not generally hold across pixels in different color channels.

=== Feature Standardization ===

Feature standardization refers to (independently) setting each dimension of the data to have zero mean and unit variance. This is the most common method of normalization and is used widely (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracting it from each dimension. Next, each dimension is divided by its standard deviation.

'''Example:''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance among the components is to standardize the values in each component independently.
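The following is a minimal NumPy sketch of both normalizations above, assuming a data matrix <tt>X</tt> with one example per row (all names are illustrative, not from the original tutorial):

<pre>
import numpy as np

# Hypothetical data matrix: m examples (rows), n features (columns).
X = np.random.rand(100, 64)

# Per-example mean subtraction (remove DC): subtract each example's
# own mean from every one of its dimensions.
X_dc_removed = X - X.mean(axis=1, keepdims=True)

# Feature standardization: zero mean and unit variance per feature,
# with the statistics computed across the dataset.
mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature standard deviation
X_standardized = (X - mu) / sigma
</pre>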
== PCA/ZCA Whitening ==

After these simple normalizations, whitening is often the next preprocessing step employed to help our algorithms work better. In practice, many deep learning algorithms rely on whitening to learn good features.

In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction has been performed and the data is stationary across dimensions/pixels.)

Next, one needs to select the value of <tt>epsilon</tt> to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this is the regularization term that has the effect of ''low-pass filtering'' the data). It turns out that selecting this value can also play an important role in feature learning. We discuss two cases for selecting <tt>epsilon</tt>:

=== Reconstruction Based Models ===

In models based on reconstruction (including autoencoders, sparse coding, RBMs, and k-means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and then visualize the data before and after whitening. If the value of <tt>epsilon</tt> is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data.

A good way to get a feel for the magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high-frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e., choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise.

[[File:ZCA_Eigenvalues_Plot.png]]

In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs. If <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise, which the model would then need to reconstruct well. As a result, it is very important for reconstruction based models to be trained on data that has been low-pass filtered.

{{Quote| Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>\epsilon = 0.01</math> or <math>\epsilon = 0.1</math>. }}

=== ICA-based Models (with orthogonalization) ===

For ICA-based models with orthogonalization, it is ''very'' important for the data to be as close to white (identity covariance) as possible. This is a side effect of using orthogonalization to decorrelate the features learned (more details in [[Independent Component Analysis | ICA]]). Hence, in this case, you will want to use an <tt>epsilon</tt> that is as small as possible (e.g., <math>\epsilon = 10^{-6}</math>).

{{Quote| Tip: In PCA whitening, one also has the option of performing dimensionality reduction while whitening the data. This is usually an excellent idea since it can greatly speed up the algorithms (less computation and fewer parameters). A simple rule of thumb for choosing how many principal components to retain is to keep enough components to retain 99% of the variance (more details at [[PCA#Number_of_components_to_retain | PCA]]). }}

{{quote| Note: When working in a classification framework, one should compute the PCA/ZCA whitening matrices based only on the training set. The following parameters should be saved for use with the test set: (a) the average vector that was used to zero-mean the data, and (b) the whitening matrices. The test set should undergo the same preprocessing steps using these saved values. }}
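To make the procedure concrete, here is a minimal NumPy sketch of ZCA whitening under the assumptions above: the parameters (mean vector and whitening matrix) are estimated on the training set only and then reused on the test set. The function and variable names are illustrative, not part of the original tutorial:

<pre>
import numpy as np

def fit_zca(X_train, epsilon=0.1):
    """Estimate ZCA whitening parameters on the training set only.

    X_train: (m, n) array, one example per row.
    epsilon: regularization term; larger values low-pass filter the data.
    """
    mean = X_train.mean(axis=0)          # (a) average vector
    Xc = X_train - mean                  # zero-mean the features
    sigma = Xc.T @ Xc / Xc.shape[0]      # covariance matrix
    U, S, _ = np.linalg.svd(sigma)       # eigenvectors / eigenvalues
    # Inspecting S (the sorted eigenvalues) is how one plots the "long
    # tail": pick epsilon larger than most of the small, noisy values.
    W = U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T  # (b) whitening matrix
    return mean, W

def apply_zca(X, mean, W):
    """Apply the saved preprocessing parameters (e.g., to the test set)."""
    return (X - mean) @ W.T

# Usage: fit on training data, reuse the saved values on test data.
X_train = np.random.rand(200, 64)
X_test = np.random.rand(50, 64)
mean, W = fit_zca(X_train, epsilon=0.1)
X_train_white = apply_zca(X_train, mean, W)
X_test_white = apply_zca(X_test, mean, W)
</pre>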
== Large Images ==

For large images, PCA/ZCA based whitening methods are impractical as the covariance matrix is too large. For these cases, we defer to 1/f-whitening methods. (more details to come)

== Standard Pipelines ==

In this section, we describe several "standard pipelines" that have worked well for some datasets:

=== Natural Grey-scale Images ===

Since grey-scale images have the stationarity property, we usually first remove the mean component from each data example separately (remove DC). After this step, PCA/ZCA whitening is often employed with a value of <tt>epsilon</tt> set large enough to low-pass filter the data.

=== Color Images ===

For color images, the stationarity property does not hold across color channels. Hence, we usually start by rescaling the data (making sure it is in <math>[0, 1]</math>) and then applying PCA/ZCA whitening with a sufficiently large <tt>epsilon</tt>. Note that it is important to perform feature mean-normalization before computing the PCA transformation.

=== Audio (MFCC/Spectrograms) ===

For audio data (MFCCs and spectrograms), each dimension usually has a different scale (variance); the first component of MFCCs, for example, is the DC component and usually has a larger magnitude than the other components. This is especially so when one includes the temporal derivatives (a common practice in audio processing). As a result, the preprocessing usually starts with simple data standardization (zero mean, unit variance per data dimension), followed by PCA/ZCA whitening (with an appropriate <tt>epsilon</tt>).

=== MNIST Handwritten Digits ===

The MNIST dataset has pixel values in the range <math>[0, 255]</math>. We thus start with simple rescaling to shift the data into the range <math>[0, 1]</math>. In practice, removing the mean value per example can also help feature learning.

''Note: While one could also elect to use PCA/ZCA whitening on MNIST if desired, this is not often done in practice.''
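As an illustration of how the steps compose, here is a minimal, self-contained sketch of the natural grey-scale image pipeline (per-example mean subtraction followed by ZCA whitening); the patch array and its dimensions are hypothetical:

<pre>
import numpy as np

# Hypothetical batch of natural grey-scale image patches, one per row.
patches = np.random.rand(500, 256)

# Step 1: remove DC (per-example mean subtraction); sensible because
# grey-scale natural images are stationary across pixels.
patches = patches - patches.mean(axis=1, keepdims=True)

# Step 2: ZCA whitening with epsilon large enough to low-pass filter.
epsilon = 0.1
mean = patches.mean(axis=0)
Xc = patches - mean
U, S, _ = np.linalg.svd(Xc.T @ Xc / Xc.shape[0])
W = U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T
patches_white = Xc @ W.T
</pre>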