Data Preprocessing

'''Example''': When working with audio data, it is common to represent the data using Mel-frequency cepstral coefficients ([http://en.wikipedia.org/wiki/Mel-frequency_cepstrum MFCCs]). However, the first component of the MFCC features (representing the DC component) is much larger in magnitude than the others and tends to overwhelm them. In such cases, each component of the features is usually standardized so that all components have a comparable influence.
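
To make this standardization step concrete, here is a minimal numpy sketch; the matrix <tt>X</tt>, its shape, and its layout are illustrative assumptions rather than part of the original text.

<pre>
import numpy as np

# X: data matrix, one example per row and one feature component per
# column (e.g., MFCC frames); the layout is an assumption of this sketch.
X = np.random.randn(1000, 13)              # placeholder data

mu = X.mean(axis=0)                        # per-component mean
sigma = X.std(axis=0) + 1e-8               # per-component std; the small constant guards against division by zero

X_standardized = (X - mu) / sigma          # every component now has zero mean and roughly unit variance
</pre>
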
== PCA/ZCA Whitening ==
After doing the simple normalizations, whitening is often the next preprocessing step employed that helps make our algorithms work better. In practice, many deep learning algorithms rely on whitening to learn good features.

When performing PCA/ZCA whitening, first zero-mean the features so that <math> \frac{1}{m} \sum_i x^{(i)} = 0 </math>. In particular, this must be done before computing the covariance matrix (the only exception is when per-example mean subtraction has already been performed and the data is stationary across dimensions/pixels).
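
A minimal numpy sketch of this step, assuming a data matrix <tt>X</tt> with one example <math>x^{(i)}</math> per column (the variable names and layout are illustrative, not prescribed by the text):

<pre>
import numpy as np

# X: data matrix with one example x^(i) per column, so that X has
# shape (n features) x (m examples); an assumption of this sketch.
X = np.random.randn(64, 10000)             # placeholder data

X = X - X.mean(axis=1, keepdims=True)      # zero-mean each feature across the dataset,
                                           # enforcing (1/m) * sum_i x^(i) = 0

Sigma = (X @ X.T) / X.shape[1]             # covariance matrix, computed only after mean removal
</pre>
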
Next, one needs to select the value of <tt>epsilon</tt> to use when performing [[Whitening | PCA/ZCA whitening]] (recall that this was the regularization term that has the effect of ''low-pass filtering'' the data). It turns out that selecting this value can also play an important role for feature learning; we discuss two cases for selecting <tt>epsilon</tt>:

=== Reconstruction Based Models ===

In models based on reconstruction (including Autoencoders, Sparse Coding, RBMs, k-Means), it is often preferable to set <tt>epsilon</tt> to a value such that low-pass filtering is achieved. (Typically the high-frequency components of the data are regarded as noise; PCA assumes that the useful information lies in the directions of high variance, so the low-variance directions, i.e. the small eigenvalues, mostly capture that noise. This is why the choice of <tt>epsilon</tt> below is tied to the eigenvalues.) One way to check this is to set a value for <tt>epsilon</tt>, run ZCA whitening, and thereafter visualize the data before and after whitening. If the value of <tt>epsilon</tt> is set too low, the data will look very noisy; conversely, if <tt>epsilon</tt> is set too high, you will see a "blurred" version of the original data. A good way to get a feel for what magnitude of <tt>epsilon</tt> to try is to plot the eigenvalues on a graph. As visible in the example graph below, you may get a "long tail" corresponding to the high-frequency noise components. You will want to choose <tt>epsilon</tt> such that most of the "long tail" is filtered out, i.e. choose <tt>epsilon</tt> such that it is greater than most of the small eigenvalues corresponding to the noise.

[[File:ZCA_Eigenvalues_Plot.png]]
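
As a rough illustration of this procedure, the sketch below plots the eigenvalue spectrum and then applies ZCA whitening with a candidate <tt>epsilon</tt>; the placeholder data, the chosen value, and the plotting details are all assumptions of the sketch.

<pre>
import numpy as np
import matplotlib.pyplot as plt

# Zero-mean data, one example per column (same assumed layout as above).
X = np.random.randn(64, 10000)             # placeholder data
X = X - X.mean(axis=1, keepdims=True)

Sigma = (X @ X.T) / X.shape[1]             # covariance of the zero-mean data
U, S, _ = np.linalg.svd(Sigma)             # eigenvectors U, eigenvalues S (in decreasing order)

plt.plot(S)                                # inspect the spectrum for the "long tail" of small eigenvalues
plt.xlabel('eigenvalue index')
plt.ylabel('eigenvalue')
plt.show()

epsilon = 0.1                              # candidate value; choose it above most of the tail
X_zca = U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ X   # ZCA-whitened data

# Visualize examples from X and X_zca side by side: too small an epsilon
# leaves the data noisy, too large an epsilon blurs it.
</pre>
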
In reconstruction based models, the loss function includes a term that penalizes reconstructions that are far from the original inputs (for example, an autoencoder must be able to approximately recover its input after encoding and decoding it). Then, if <tt>epsilon</tt> is set too ''low'', the data will contain a lot of noise which the model will need to reconstruct well. As a result, it is very important for reconstruction based models to have data that has been low-pass filtered.
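
As a minimal sketch of the penalty term just described (<tt>X_hat</tt> and the placeholder reconstruction are illustrative; a real model would produce the reconstruction via, e.g., an encoder and decoder):

<pre>
import numpy as np

# X: original inputs; X_hat stands in for a model's reconstructions
# (e.g., the output of an autoencoder's decoder). Both are placeholders.
X = np.random.randn(64, 1000)
X_hat = X + 0.01 * np.random.randn(64, 1000)

# Squared-error reconstruction term: it penalizes reconstructions that
# are far from the inputs, so any noise left in X must be fit as well.
reconstruction_loss = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(reconstruction_loss)
</pre>
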
{{Quote|
Tip: If your data has been scaled reasonably (e.g., to <math>[0, 1]</math>), start with <math>epsilon = 0.01</math> or <math>epsilon = 0.1</math>.
}}

=== ICA-based Models (with orthogonalization) ===

For ICA-based models with orthogonalization, it is ''very'' important for the data to be as close to white (identity covariance) as possible. This is a side-effect of using orthogonalization to decorrelate the features learned (more details in [[Independent Component Analysis | ICA]]). Hence, in this case, you will want to use an <tt>epsilon</tt> that is as small as possible (e.g., <math>epsilon = 1e-6</math>).
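
A small sketch of this requirement, under the same assumed data layout as the earlier sketches: whiten with a very small <tt>epsilon</tt> and check that the resulting covariance is close to the identity.

<pre>
import numpy as np

# Zero-mean data, one example per column (same assumed layout as above).
X = np.random.randn(64, 10000)             # placeholder data
X = X - X.mean(axis=1, keepdims=True)

Sigma = (X @ X.T) / X.shape[1]
U, S, _ = np.linalg.svd(Sigma)

epsilon = 1e-6                             # as small as possible for ICA with orthogonalization
X_white = U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ X

# The covariance of the whitened data should be very close to the identity.
cov = (X_white @ X_white.T) / X_white.shape[1]
print(np.allclose(cov, np.eye(cov.shape[0]), atol=1e-3))   # expect True
</pre>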
