数据预处理

Revision as of 10:06, 9 March 2013

Kandeng

Revision as of 17:35, 13 March 2013

Kandeng

Line 3:

-

== ~~Overview/~~概要 ==

+

== 概要 ==

-

~~【原文】~~

+

-

Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the exact parameters for data preprocessing are usually not immediately apparent unless one has much experience working with the algorithms. In this page, we hope to demystify some of the preprocessing methods and also provide tips (and a "standard pipeline") for preprocessing data.

+

-

+

-

~~【初译】~~

+

-

+

数据预处理在众多深度学习算法中都起着重要作用，实际情况中，将数据做归一化和白化处理后，很多算法能够发挥最佳效果。然而除非对这些算法有丰富的使用经验，否则预处理的精确参数并非显而易见。在本页中，我们希望能够揭开预处理方法的神秘面纱，同时为预处理数据提供技巧（和标准流程）

-

~~【一审】~~

-

数据预处理在众多深度学习算法中都起着重要作用，实际情况中，将数据做归一化和白化处理后，很多算法能够发挥最佳效果。然而除非对这些算法有丰富的使用经验，否则预处理的精确参数并非显而易见。在本页中，我们希望能够揭开预处理方法的神秘面纱，同时为预处理数据提供技巧（和标准流程）

-

~~【原文】~~

{{quote |

-

Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to opt to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as remove DC, local mean subtraction, subtractive normalization). While this makes sense for data such as natural images, it is less obvious for data where stationarity does not hold.

+

提示：当我们开始处理数据时，首先要做的事是观察数据并获知其特性。本部分将介绍一些通用的技术，在实际中应该针对具体数据选择合适的预处理技术。例如一种标准的预处理方法是对每一个数据点都减去它的均值（也被称为移除直流分量，局部均值消减，消减归一化），这一方法对诸如自然图像这类数据是有效的，但对非平稳的数据则不然。

}}
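The per-example mean subtraction mentioned in the tip can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `remove_dc` and the sample patch values are assumptions made for this example, not part of the original tutorial.

```python
def remove_dc(example):
    """Subtract an example's own mean from each of its components.

    `example` is one data point (e.g. a flattened image patch) given as a
    non-empty list of numbers. The name `remove_dc` is hypothetical.
    """
    mean = sum(example) / len(example)
    return [value - mean for value in example]


# Hypothetical flattened 2x2 image patch with four pixel intensities.
patch = [0.2, 0.4, 0.6, 0.8]
centered = remove_dc(patch)
```

After this step the components of each example sum to (approximately) zero. Whether that is desirable depends on the data, as the tip notes: it is a natural choice for image patches but less clearly so for data where stationarity does not hold.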

-

~~【初译】~~

+

== 数据归一化 ==

-

~~{{quote |~~

+

数据预处理中，标准的第一步是数据归一化。虽然这里有一系列可行的方法，但是这一步通常是根据数据的具体情况而明确选择的。特征归一化常用的方法包含如下几种：

-

提示：获得数据后首先要做的事是查看数据并获知其特性，而后针对数据选择采取相应的处理。例如一个标准的预处理方法是减去所有数据点的均值（也被称为移除直流，局部均值消减，消减归一化），这一方法对一些数据是有效的，如自然图像，但对非平稳的数据并非如此。

+

-

}}

+

-

~~【一审】~~

+

* 简单缩放

-

~~{{quote |~~

+

* 之前提到的分量均值归一化(也称为移除直流分量)

-

提示：获得数据后首先要做的事是观察数据并获知其特性。本部分将介绍一些通用的技术，在实际中应该针对具体数据选择合适的预处理技术。例如一种标准的预处理方法是对每一个数据点都减去它的均值（也被称为移除直流分量，局部均值消减，消减归一化），这一方法对诸如自然图像这类数据是有效的，但对非平稳的数据则不然。

+

-

}}

+

-

+

-

+

-

~~== Data Normalization/数据归一化 ==~~

+

-

~~【原文】~~

+

-

A standard first step to data preprocessing is data normalization. While there are a few possible approaches, this step is usually clear depending on the data. The common methods for feature normalization are:

+

-

+

-

* Simple Rescaling

+

-

* Per-example mean subtraction (a.k.a. remove DC)

+

-

* Feature Standardization (zero-mean and unit variance for each feature across the dataset)

+

-

+

-

~~【初译】~~

+

-

~~数据预处理标准的第一步是数据归一化，由于已有一些适用的方法，根据数据的情况这一步通常是清晰地。特征归一化常用的方法包含如下几种：~~

+

-

+

-

* ~~简单重缩放~~

+

-

* ~~上例中的均值消减~~(~~也被称为移除直流~~)

+

* 特征标准化(使数据集中所有特征都具有零均值和单位方差)
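Feature standardization, the last item in the list, can be sketched as follows. This is an illustrative standard-library version, not the tutorial's own code: the function name `standardize_features` is hypothetical, the population standard deviation (dividing by the number of examples) is an assumption, and constant features are mapped to zero to avoid division by zero.

```python
import math


def standardize_features(dataset):
    """Give every feature zero mean and unit variance across the dataset.

    `dataset` is a non-empty list of examples, each a list of feature
    values of equal length. Uses the population standard deviation
    (an assumption made for this sketch).
    """
    n = len(dataset)
    dims = len(dataset[0])
    means = [sum(row[j] for row in dataset) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in dataset) / n)
        for j in range(dims)
    ]
    return [
        [
            # A constant feature has zero spread; map it to 0.0.
            (row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
            for j in range(dims)
        ]
        for row in dataset
    ]


# Hypothetical dataset: three examples with two features on different scales.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
standardized = standardize_features(data)
```

Unlike per-example mean subtraction, the statistics here are computed per feature (per column) over all examples, so the two features above end up on the same scale.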

-

~~【一审】~~

+

=== 简单缩放 ===

-

数据预处理的标准的第一步是数据归一化。已有一些常用的方法，根据数据的具体情况可以明确地确定这一步可以采用的方法。特征归一化常用的方法包含如下几种：

+

在简单缩放中，我们的目的是通过对数据的每一个维度的值进行重新调节（这些维度可能是相互独立的），使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内（根据数据情况而定）。这对后续的处理十分重要，因为很多''默认''参数（如PCA-白化中的epsilon）都假定数据已被缩放到合理区间。

-

+

-

* 特征缩放

+

-

* 分量均值归一化(也称为移除直流分量)

+

-

* 特征标准化(使数据集中所有特征都具有零均值和单位方差)

+

-

+

-

=== ~~Simple Rescaling/特征缩放~~ ===

+

-

~~【原文】~~

+

-

In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range <math>[0, 1]</math> or <math>[-1, 1]</math> (depending on your dataset). This is useful for later processing as many ''default'' parameters (e.g., epsilon in PCA-whitening) treat the data as if it has been scaled to a reasonable range.

+

-

+

-

'''Example: ''' When processing natural images, we often obtain pixel values in the range <math>[0, 255]</math>. It is a common operation to rescale these values to <math>[0, 1]</math> by dividing the data by 255.

+

-

+

-

~~【初译】~~

+

-

~~简单重缩放的目的在于通过在每一维度上（可能相互独立）对数据进行的重缩放，使得最终的数据向量落在~~<math>[0, 1]</math>或<math>[-1, 1]</math>~~的区间内（根据数据情况）。这对后续的处理十分重要，因为很多~~''默认''~~参数（如主成分分析~~-~~白化中的epsilon）都基于数据已被缩放到合理区间的假定。~~

+

-

+

-

~~'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>的区间中，常用的处理是将数据除以255使值缩放到<math>[0, 1]</math>.~~

+

-

+

-

~~【一审】~~

+

-

特征缩放通过在每一个（可能相互独立）维度上对数据进行缩放，使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内（根据数据情况而定）。这对后续的处理十分重要，因为很多''默认''参数（如PCA白化中的epsilon）都假定数据已被缩放到合理区间。

+

-

'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>~~区间中，常用的处理是将数据除以255使其值缩放到~~<math>[0, 1]</math>.

+

'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>区间中，常用的处理是将这些像素值除以255，使它们缩放到<math>[0, 1]</math>中.
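The pixel example above can be written out as a short sketch. The function names `rescale_pixels` and `to_symmetric_range` are hypothetical, and the second function shows one possible way to reach the <math>[-1, 1]</math> range mentioned in the text, assuming the input has already been scaled to <math>[0, 1]</math>.

```python
def rescale_pixels(pixels, max_value=255.0):
    """Map pixel values from [0, max_value] to [0, 1] by division."""
    return [p / max_value for p in pixels]


def to_symmetric_range(unit_values):
    """Map values from [0, 1] to [-1, 1] with the affine map 2x - 1."""
    return [2.0 * v - 1.0 for v in unit_values]


# Hypothetical 8-bit pixel intensities.
raw = [0, 51, 255]
unit = rescale_pixels(raw)
symmetric = to_symmetric_range(unit)
```

Dividing by a fixed constant such as 255 keeps the relative intensities of all pixels unchanged, which is why it suits image data whose range is known in advance.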

=== Per-example mean subtraction/分量均值归零 ===

From Ufldl


@@ Line 3: / Line 3: @@
-== Overview/概要 ==
+== 概要 ==
-【原文】
-Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the exact parameters for data preprocessing are usually not immediately apparent unless one has much experience working with the algorithms. In this page, we hope to demystify some of the preprocessing methods and also provide tips (and a "standard pipeline") for preprocessing data.
-【初译】
 数据预处理在众多深度学习算法中都起着重要作用，实际情况中，将数据做归一化和白化处理后，很多算法能够发挥最佳效果。然而除非对这些算法有丰富的使用经验，否则预处理的精确参数并非显而易见。在本页中，我们希望能够揭开预处理方法的神秘面纱，同时为预处理数据提供技巧（和标准流程）
-【一审】
-数据预处理在众多深度学习算法中都起着重要作用，实际情况中，将数据做归一化和白化处理后，很多算法能够发挥最佳效果。然而除非对这些算法有丰富的使用经验，否则预处理的精确参数并非显而易见。在本页中，我们希望能够揭开预处理方法的神秘面纱，同时为预处理数据提供技巧（和标准流程）
-【原文】
 {{quote |
-Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to opt to do certain things differently given your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as remove DC, local mean subtraction, subtractive normalization). While this makes sense for data such as natural images, it is less obvious for data where stationarity does not hold.
+提示：当我们开始处理数据时，首先要做的事是观察数据并获知其特性。本部分将介绍一些通用的技术，在实际中应该针对具体数据选择合适的预处理技术。例如一种标准的预处理方法是对每一个数据点都减去它的均值（也被称为移除直流分量，局部均值消减，消减归一化），这一方法对诸如自然图像这类数据是有效的，但对非平稳的数据则不然。
 }}
-【初译】
+== 数据归一化 ==
-{{quote |
+数据预处理中，标准的第一步是数据归一化。虽然这里有一系列可行的方法，但是这一步通常是根据数据的具体情况而明确选择的。特征归一化常用的方法包含如下几种：
-提示：获得数据后首先要做的事是查看数据并获知其特性，而后针对数据选择采取相应的处理。例如一个标准的预处理方法是减去所有数据点的均值（也被称为移除直流，局部均值消减，消减归一化），这一方法对一些数据是有效的，如自然图像，但对非平稳的数据并非如此。
-}}
-【一审】
+* 简单缩放
-{{quote |
+* 之前提到的分量均值归一化(也称为移除直流分量)
-提示：获得数据后首先要做的事是观察数据并获知其特性。本部分将介绍一些通用的技术，在实际中应该针对具体数据选择合适的预处理技术。例如一种标准的预处理方法是对每一个数据点都减去它的均值（也被称为移除直流分量，局部均值消减，消减归一化），这一方法对诸如自然图像这类数据是有效的，但对非平稳的数据则不然。
-}}
-== Data Normalization/数据归一化 ==
-【原文】
-A standard first step to data preprocessing is data normalization. While there are a few possible approaches, this step is usually clear depending on the data. The common methods for feature normalization are:
-* Simple Rescaling
-* Per-example mean subtraction (a.k.a. remove DC)
-* Feature Standardization (zero-mean and unit variance for each feature across the dataset)
-【初译】
-数据预处理标准的第一步是数据归一化，由于已有一些适用的方法，根据数据的情况这一步通常是清晰地。特征归一化常用的方法包含如下几种：
-* 简单重缩放
-* 上例中的均值消减(也被称为移除直流)
 * 特征标准化(使数据集中所有特征都具有零均值和单位方差)
-【一审】
+=== 简单缩放 ===
-数据预处理的标准的第一步是数据归一化。已有一些常用的方法，根据数据的具体情况可以明确地确定这一步可以采用的方法。特征归一化常用的方法包含如下几种：
+在简单缩放中，我们的目的是通过对数据的每一个维度的值进行重新调节（这些维度可能是相互独立的），使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内（根据数据情况而定）。这对后续的处理十分重要，因为很多''默认''参数（如PCA-白化中的epsilon）都假定数据已被缩放到合理区间。
-* 特征缩放
-* 分量均值归一化(也称为移除直流分量)
-* 特征标准化(使数据集中所有特征都具有零均值和单位方差)
-=== Simple Rescaling/特征缩放 ===
-【原文】
-In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range <math>[0, 1]</math> or  <math>[-1, 1]</math>  (depending on your dataset). This is useful for later processing as many ''default'' parameters (e.g., epsilon in PCA-whitening) treat the data as if it has been scaled to a reasonable range.
-'''Example: ''' When processing natural images, we often obtain pixel values in the range <math>[0, 255]</math>. It is a common operation to rescale these values to  <math>[0, 1]</math> by dividing the data by 255.
-【初译】
-简单重缩放的目的在于通过在每一维度上（可能相互独立）对数据进行的重缩放，使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内（根据数据情况）。这对后续的处理十分重要，因为很多''默认''参数（如主成分分析-白化中的epsilon） 都基于数据已被缩放到合理区间的假定。
-'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>的区间中，常用的处理是将数据除以255使值缩放到<math>[0, 1]</math>.
-【一审】
-特征缩放通过在每一个（可能相互独立）维度上对数据进行缩放，使得最终的数据向量落在<math>[0, 1]</math>或<math>[-1, 1]</math>的区间内（根据数据情况而定）。这对后续的处理十分重要，因为很多''默认''参数（如PCA白化中的epsilon） 都假定数据已被缩放到合理区间。
-'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>区间中，常用的处理是将数据除以255使其值缩放到<math>[0, 1]</math>.
+'''例子''':在处理自然图像时，我们获得的像素值在<math>[0, 255]</math>区间中，常用的处理是将这些像素值除以255，使它们缩放到<math>[0, 1]</math>中.
 === Per-example mean subtraction/分量均值归零 ===