数据预处理

From Ufldl

Jump to: navigation, search
Line 63: Line 63:
'''例子''':在处理自然图像时,我们获得的像素值在<math>[0, 255]</math>的区间中,常用的处理是将数据除以255使值缩放到<math>[0, 1]</math>.
'''例子''':在处理自然图像时,我们获得的像素值在<math>[0, 255]</math>的区间中,常用的处理是将数据除以255使值缩放到<math>[0, 1]</math>.
-
=== Per-example mean subtraction ===
 
 +
【一审】
 +
特征缩放通过在每一个(可能相互独立)维度上对数据进行缩放,使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内(根据数据情况而定)。这对后续的处理十分重要,因为很多''默认''参数(如PCA白化中的epsilon) 都假定数据已被缩放到合理区间。
 +
 +
'''例子''':在处理自然图像时,我们获得的像素值在<math>[0, 255]</math>区间中,常用的处理是将数据除以255使其值缩放到<math>[0, 1]</math>.
 +
 +
=== Per-example mean subtraction/分量均值归零 ===
 +
【原文】
If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean-value for each example (computed per-example).  
If your data is ''stationary'' (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean-value for each example (computed per-example).  
'''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here. '''Note:''' While this method is generally used for images, one might want to take more care when applying this to color images. In particular, the stationarity property does not generally apply across pixels in different color channels.
'''Example:''' In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here. '''Note:''' While this method is generally used for images, one might want to take more care when applying this to color images. In particular, the stationarity property does not generally apply across pixels in different color channels.
-
=== Feature Standardization ===
+
【初译】
 +
如果数据是''平稳''的(数据的每一维都服从相同分布)可以考虑在每个样本上减去均值(每个样本逐一计算)。
 +
'''例子''':对于图像,归一化具有移除数据平均亮度值(密度值)的特性。很多情况下我们对亮度情况并不感兴趣,而是关注其内容,这样去减去像素均值是合理的。值得'''注意'''的是虽然该方法广泛地应用于图像,但在处理彩色图像时需要更加小心,特别是平稳性在不同色彩通道间并不存在。
 +
 +
【一审】
 +
如果数据是''平稳''的(即数据每一个维度的统计量都服从相同分布),可以考虑在每个样本上减去该样本的均值(每个样本分别计算)。
 +
 +
'''例子''':对于图像,归一化可以移除图像的平均亮度值。很多情况下我们对光照情况不感兴趣,而是更关注图像内容,这样减去像素均值是合理的。值得'''注意'''的是虽然该方法广泛地应用于图像,但在处理彩色图像时需要小心,平稳性在不同色彩通道间并不存在(译注:即不同颜色通道具有不同的统计特性)。
 +
 +
=== Feature Standardization/特征标准化 ===
 +
【原文】
Feature standardization refers to (independently) setting each dimension of the data to have zero-mean and unit-variance. This is the most common method for normalization and is generally used widely (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracts this from each dimension. Next, each dimension is divided by its standard deviation.  
Feature standardization refers to (independently) setting each dimension of the data to have zero-mean and unit-variance. This is the most common method for normalization and is generally used widely (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracts this from each dimension. Next, each dimension is divided by its standard deviation.  
'''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadow the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
'''Example: ''' When working with audio data, it is common to use [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs] as the data representation. However, the first component (representing the DC) of the MFCC features often overshadow the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
 +
【初译】
 +
特征标准化指的是(独立的)使得数据的每一维都是零均值和单位方差的。这是归一化中最常用的方法(如在使用支持向量机时特征标准化常被建议为预处理的一部分)。在实际中,首先计算每一维度均值并在相应维度减除,然后每一维度上除以标准差。
 +
 +
'''例子''':处理音频数据时,常用[http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs]来表征数据。然而MFCC特征的第一组件(表示直流)常常会掩盖其他组件。因此一种重新平衡组件的方法是独立的对每一组件进行标准化。
 +
 +
【一审】
 +
特征标准化指的是(独立地)使得数据的每一维具有零均值和单位方差。这是归一化中最常用的方法(通常建议在使用SVM时首先对训练数据做特征标准化预处理)。在实际应用中,特征标准化的具体做法是:首先计算训练集的样本均值,每一个样本都减去该均值,然后在样本的每一维度上除以该维度上的样本标准差。
 +
'''例子''':处理音频数据时,常用Mel倒频系数[http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs]来表示数据。然而MFCC特征的第一个分量(表示直流分量)数值太大,常常会掩盖其他分量。这种情况下,为了平衡各个分量的影响,通常对特征的每个分量做标准化处理。
== PCA/ZCA Whitening ==
== PCA/ZCA Whitening ==

Revision as of 08:18, 8 March 2013

Personal tools