稀疏编码

Revision as of 03:21, 8 March 2013 (view source)

Kandeng (Talk | contribs)

← Older edit

Latest revision as of 08:32, 8 April 2013 (view source)

Kandeng (Talk | contribs)

Line 1:

-

~~'''Sparse Coding'''~~

+

== 稀疏编码 ==

-

~~'''稀疏编码'''~~

+

稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。稀疏编码算法的目的就是找到一组基向量 <math>\mathbf{\phi}_i</math> ，使得我们能将输入向量 <math>\mathbf{x}</math> 表示为这些基向量的线性组合：

-

~~初译：@寅莹~~

-

~~一审：@大黄蜂的思索~~

-

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:

-

【初译】针对超完备基的学习集，稀疏编码是一类有效表示数据的非监督方法。稀疏编码的目的是找到一组基向量 <math>\mathbf{\phi}_i</math> 集合，以致能将输入向量 <math>\mathbf{x}</math> 表示为这组基向量的线性组合。

-

【一审】稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。稀疏编码算法的目的就是找到一组基向量 <math>\mathbf{\phi}_i</math> ，使得我们能将输入向量 <math>\mathbf{x}</math> 表示为这些基向量的线性组合。

:<math>\begin{align}

\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}

\end{align}</math>

-

While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness.

-

【初译】因主成分分析技术允许有效地学习一组完备基向量集合，希望能够学习一组过完备的基向量集合，来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （即,如<math>k > n</math>）。过完备基的优点是该基向量能更好地捕捉输入数据内在的结构和模式。然而，对于过完备基，其系数 <math>a_i</math> 不在由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编中引入稀疏的附加标准，解决因引入过完备基所导致的退化问题。

-

【一审】虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出输入数据的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。

-

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.

-

【初译】此处，定义的稀疏性为具有很少的非零分量或很少不接近零的分量。针对系数 <math>a_i</math> 的稀疏处理是对于输入向量，要求得到尽可能少的不为零的系数。作为输入表示数据的期望特征，即稀疏性的选择可由多数传感器数据描述受到启发，如自然图像可被描述为少量的元成分（表面或边缘）的叠加。其他理由，如和主要视觉皮层的属性相比已取得进展。

-

~~【一审】这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数~~ <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。要使输入数据的表示方式符合我们所期望的这个样子，稀疏性的选择可以从对感观数据的观察得到启示，比如，自然图像有时可以通过少量的基本元素如图像表层或边缘的叠加来表示。其它的选择标准如初级视觉皮层神经元特性的比较，这在之前的课程中也有提及。

+

虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组'''“超完备”'''基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出隐含在输入数据内部的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math> 唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准'''“稀疏性”'''来解决因超完备而导致的退化（degeneracy）问题。

-

~~We define the sparse coding cost function on a set of <math>m</math> input vectors as~~

-

~~【初译】在一组m个输入向量上定义稀疏编码的代价函数为：~~

+

这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。选择使用具有稀疏性的分量来表示我们的输入数据是有原因的，因为绝大多数的感官数据，比如自然图像，可以被表示成少量基本元素的叠加，在图像中这些基本元素可以是面或者线。同时，比如与初级视觉皮层的类比过程也因此得到了提升。

-

~~【一审】我们把有m个输入向量的稀疏编码代价函数定义为：~~

+

我们把有 <math>m</math> 个输入向量的稀疏编码代价函数定义为：

:<math>\begin{align}

Line 38:

Line 20:

\end{align}</math>

-

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.

-

~~【初译】此处，~~ <math>S(.)</math> ~~是一个稀疏代价函数，它的惩罚系数~~ <math>a_i</math> ~~不为零。第一项解释为稀疏编码的目标项，它作为重建项，约束算法给~~ <math>\mathbf{x}</math> ~~提供一个好的表示；第二项作为稀疏惩罚性，它约束对~~ <math>\mathbf{x}</math> ~~的表示是稀疏的；常量~~ <math>\lambda</math> ~~是一个尺度常量，决定两项贡献的相对重要性。~~

+

此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为一个重构项，这一项迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。

-

【一审】此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为稀疏编码的重构项，这项公式迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。

-

~~Although the most direct measure of sparsity is the~~ "<math>L_0</math>" ~~norm~~ (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)~~, it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost~~ <math>S(.)</math> ~~are the~~ <math>L_1</math> ~~penalty~~ <math>S(a_i)=\left|a_i\right|_1 </math> ~~and the log penalty~~ <math>S(a_i)=\log(1+a_i^2)</math>.

+

虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)，但这是不可微的，而且通常很难进行优化。在实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math> 。

-

【初译】尽管最直接的稀疏性测度是 "<math>L_0</math>" 范数(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>) ，该范数通常不可微且难以优化。在实际中，稀疏代价函数 <math>S(.)</math> 的选择是<math>L_1</math> 征罚 <math>S(a_i)=\left|a_i\right|_1 </math> ， <math>S(a_i)=\log(1+a_i^2)</math>.

-

【一审】虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)，但这是不可微的，而且通常很难进行优化，而实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math>。

+

此外，很有可能因为减小 <math>a_i</math> 或增加 <math>\mathbf{\phi}_i</math> 至很大的常量，使得稀疏惩罚变得非常小。为防止此类事件发生，我们将限制 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 要小于某常量 <math>C</math> 。包含了限制条件的稀疏编码代价函数的完整形式如下：

-

+

-

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is

+

-

+

-

【初译】此外，常可通过缩小 <math>a_i</math> ，放大<math>\mathbf{\phi}_i</math> ，通过使用一些大的常量，任意地选择稀疏惩罚项。为了阻止这一可能的发生，增加约束，使 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 小于一些常量 <math>C</math>。包括 <math>\mathbf{\phi}</math> 约束完整的稀疏编码代价函数定义如下：

+

-

+

-

~~【一审】此外，稀疏代价函数也可以通过减小~~ <math>a_i</math> 或增加<math>\mathbf{\phi}_i</math> ~~至很大的常量来使其自身变得极其地小。为防止此类事件发生，我们将~~ <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> ~~限制为比某常量~~ <math>C</math>~~更小的数。包含了限制条件的稀疏编码代价函数的完整形式如下：~~

+

:<math>\begin{array}{rc}

Line 63:

Line 36:

\end{array}</math>

-

~~== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==概率解释[该理论基于1996年Olshausen 与 Field的模型]~~

-

So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.

-

~~Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:~~

+

== 概率解释 [基于1996年Olshausen与Field的理论] ==

-

【初译】到目前为止，在寻找一个稀疏的上下文中，考虑了稀疏编码，输入空间上的完备基向量。相应地，我们也可以从概率角度来处理稀疏编码，将它看一个生成模型。

+

到目前为止，我们所考虑的稀疏编码，是为了寻找到一个稀疏的、超完备基向量集，来覆盖我们的输入数据空间。现在换一种方式，我们可以从概率的角度出发，将稀疏编码算法当作一种“生成模型”。

-

~~把自然图像的建模问题看作是 <math>k</math> 个维独立的原特征 <math>\mathbf{\phi}_i</math> 和附加噪声 <math>\nu</math>的一个线性叠加~~

+

-

【一审】到目前为止，我们已通过确定稀疏系数及超完备基向量的方法论述了稀疏编码算法。不过，我们或许可从概率的角度为稀疏编码算法找到一种“生成模型”。

+

我们将自然图像建模问题看成是一种线性叠加，叠加元素包括 <math>k</math> 个独立的源特征 <math>\mathbf{\phi}_i</math> 以及加性噪声 <math>\nu</math> ：

-

~~举个自然图像建模的例子，这种建模方式通过~~ <math>k</math> ~~个独立的特征向量~~ <math>\mathbf{\phi}_i</math> ~~进行线性叠加，这里的具有一些噪音~~ <math>\nu</math>~~，如下式：~~

+

:<math>\begin{align}

Line 80:

Line 49:

\end{align}</math>

-

Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:

-

~~【初译】我们的目标是寻找一组基特征向量~~ <math>\mathbf{\phi}</math> ~~，以致于图像的分布~~ <math>P(\mathbf{x}\mid\mathbf{\phi})</math> ~~尽可能的近似输入数据~~ <math>P^*(\mathbf{x})</math>~~的经验分布。一类方法是最小化KL~~ <math>P^*(\mathbf{x})</math> ~~和 <math>P^*(\mathbf{x})</math> 之间的散度，这里 KL 散度定义如下:~~

+

我们的目标是找到一组特征基向量 <math>\mathbf{\phi}</math> ，它使得图像的分布函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能地近似于输入数据的经验分布函数 <math>P^*(\mathbf{x})</math> 。一种实现方式是，最小化 <math>P^*(\mathbf{x})</math> 与 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 之间的 KL 散度，此 KL 散度表示如下：

-

+

-

~~【一审】我们的目标是找到一组特征向量 <math>\mathbf{\phi}</math> ，因此，图像的分布函数~~ <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 就可以尽可能地近似于输入数据的经验分布函数 <math>P^*(\mathbf{x})</math>。这么做的一种方法是，最小化 <math>P^*(\mathbf{x})</math> 与 <math>P^*(\mathbf{x})</math> 之间的KL离差，KL离差表示如下：

+

:<math>\begin{align}

Line 90:

Line 56:

\end{align}</math>

-

Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.

-

~~Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that~~

-

~~【初译】因为通过我们对~~ <math>\mathbf{\phi}</math>~~的选择，经验分布~~ <math>P^*(\mathbf{x})</math> ~~是不变量，这相当于最大化对数似然 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。~~

+

因为无论我们如何选择 <math>\mathbf{\phi}</math> ，经验分布函数 <math>P^*(\mathbf{x})</math> 都是常量，也就是说我们只需要最大化对数似然函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 。

-

~~假设 <math>\nu</math> 是方差为 <math>\sigma^2</math>高斯白噪声，有下式成立。~~

+

假设 <math>\nu</math> 是具有方差 <math>\sigma^2</math> 的高斯白噪音，则有下式：

-

+

-

~~【一审】因为经验分布函数 <math>P^*(\mathbf{x})</math> 对于所有的 <math>\mathbf{\phi}</math>其结果是常量，这就等于说要最大化对数似然函数~~ <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。

+

-

假设 <math>\nu</math> 是具有方差 <math>\sigma^2</math>的高斯白噪音，则有下式：

+

:<math>\begin{align}

Line 103:

Line 64:

\end{align}</math>

-

In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as

-

~~【初译】为决定~~ <math>P(\mathbf{x}\mid\mathbf{\phi})</math>~~的分布，需要描述先验分布~~<math>P(\mathbf{a})</math>~~。假设原特征分布独立，可分解先验分布如下式：~~

+

为了确定分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> ，我们需要指定先验分布 <math>P(\mathbf{a})</math> 。假定我们的特征变量是独立的，我们就可以将先验概率分解为：

-

【一审】为了确定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，我们同时要确定先验分布<math>P(\mathbf{a})</math>，假定我们的特征变量是独立的，我们就可得到先验概率为：

+

:<math>\begin{align}

Line 112:

Line 71:

\end{align}</math>

-

At this point, we would like to incorporate our sparsity assumption -- the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is

+

此时，我们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。因此，我们希望 <math>a_i</math> 的概率分布在零值附近是凸起的，而且峰值很高。一个方便的参数化先验分布就是：

+

:<math>\begin{align}

P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))

\end{align}</math>

-

~~Where <math>S(a_i)</math> is a function determining the shape of the prior distribution.~~

-

~~【初译】在这一点，我们将纳入稀疏假设---假设任一个图像都可能是相对少数源特征的乘积。因此，我们希望~~ <math>a_i</math> ~~的分布在零点得到峰值，且有较高的峰度。先验分布的一个合适的参数化是~~

+

这里 <math>S(a_i)</math> 是决定先验分布的形状的函数。

+

当定义了 <math>P(\mathbf{x} \mid \mathbf{a} , \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math> 后，我们就可以写出在由 <math>\mathbf{\phi}</math> 定义的模型之下的数据 <math>\mathbf{x}</math> 的概率分布：

:<math>\begin{align}

-

P(~~a_i~~) = \~~frac~~{1}{Z}\~~exp(-~~\~~beta S~~(~~a_i)~~)

+

P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}

\end{align}</math>

-

~~这里 <math>S(a_i)</math> 是一个函数，决定先验分布的形状。~~

+

那么，我们的问题就简化为寻找：

-

+

-

【一审】至此，我们将“稀疏”的设想加入进来――任一图像都可能是由很少一部分相关特征生成的。因此，我们希望 <math>a_i</math> 的概率分布在零值附近是凸起的，而且形态很陡峭。一个简易的参数化先验分布就是：

+

:<math>\begin{align}

-

~~P(a_i)~~ = \~~frac~~{1}{Z}\~~exp~~(~~-\beta S~~(~~a_i~~))

+

\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >

\end{align}</math>

-

~~这里函数~~ <math>~~S(a_i)~~</math> ~~决定了先验分布的形状。~~

+

这里 <math><.></math> 表示的是输入数据的期望值。

-

~~Having defined~~ <math>P(\mathbf{x} \mid ~~\mathbf{a},~~ \mathbf{\phi})</math> ~~and~~ <math> P(\mathbf{a})</math>~~, we can write the probability of the data~~ <math>\mathbf{x}</math> ~~under the model defined by~~ <math>\mathbf{\phi}</math> as

+

不幸的是，通过对 <math>\mathbf{a}</math> 的积分计算 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 通常是难以实现的。虽然如此，我们注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的分布（对于相应的 <math>\mathbf{a}</math> ）足够陡峭的话，我们就可以用 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的最大值来估算以上积分。估算方法如下：

:<math>\begin{align}

-

P(\mathbf{x} \~~mid~~ \mathbf{\phi}~~) =~~ \~~int~~ P(\mathbf{x} \mid ~~\mathbf{a},~~ \mathbf{\phi}) ~~P(\mathbf{a}~~) ~~d\mathbf{a}~~

+

\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >

\end{align}</math>

-

~~and our problem reduces to finding~~

-

:<math>~~\begin{align}~~

+

跟之前一样，我们可以通过减小 <math>a_i</math> 或增大 <math>\mathbf{\phi}</math> 来增加概率的估算值（因为 <math>P(a_i)</math> 在零值附近陡升）。因此我们要对特征向量 <math>\mathbf{\phi}</math> 加一个限制以防止这种情况发生。

-

\mathbf{\phi~~}^*=\text{argmax}_{\mathbf{\phi}~~} < ~~\log(~~P(~~\mathbf{x} \mid~~ \mathbf{\phi~~})) >~~

+

-

~~\end{align~~}</math>

+

-

~~Where <math><.></math> denotes expectation over our input data.~~

+

最后，我们可以定义一种线性生成模型的能量函数，从而将原先的代价函数重新表述为：

-

~~【初译】定义了~~ <math>P(\mathbf{x} ~~\mid~~ \mathbf{a}, \mathbf{\phi})~~</math> 和 <math>~~ P(\mathbf{a}~~)</math>，在由 <math>~~\mathbf{\phi}~~</math> 定义的模型下，我们可以与出数据 <math>~~\mathbf{x}~~</math> 的概率为：~~

+

:<math>\begin{array}{rl}

-

+

E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\

-

~~:<math>~~\~~begin{align}~~

+

&= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

-

P(\mathbf{x} \~~mid~~ \~~mathbf{~~\~~phi~~}~~) =~~ \~~int P(~~\mathbf{x} \~~mid \mathbf~~{a}, \mathbf{\phi}~~) P(~~\~~mathbf~~{a}~~) d~~\~~mathbf~~{a}

+

\end{array}</math>

-

\end{~~align~~}</math>

+

-

~~我们的问题演化为发现~~

-

~~:<math>\begin{align}~~

-

~~\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >~~

-

~~\end{align}</math>~~

-

其中 <math>~~<.>~~</math> ~~表示期望我们的输入数据。~~

+

其中 <math>\lambda = 2\sigma^2\beta</math> ，并且关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数，我们就可以将原先的优化问题重新表述为：

-

【一审】当定义了 <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math>后，我们就可以通过由 <math>\mathbf{\phi}</math> 生成的模型将数据集 <math>\mathbf{x}</math> 的概率确定为：

-

:<math>\begin{align}

-

P(\mathbf{x} \~~mid~~ \mathbf{\phi}) = \~~int P(~~\mathbf{x} \~~mid \mathbf~~{a}, \mathbf{\phi}~~) P(~~\~~mathbf~~{a}~~) d~~\~~mathbf~~{a}

+

\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

\end{align}</math>

-

~~那么，我们的问题就简缩为：~~

-

~~:<math>\begin{align}~~

-

~~\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >~~

-

~~\end{align}</math>~~

-

这里 <math><.></math> ~~指的是输入数据的期望值。~~

+

使用概率理论来分析，我们可以发现，选择 <math>L_1</math> 惩罚和 <math>\log(1+a_i^2)</math> 惩罚作为函数 <math>S(.)</math> ，分别对应于使用了拉普拉斯概率 <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> 和柯西先验概率 <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> 。

+

-

Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the distribution of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> and obtain a approximate solution

+

== 学习算法 ==

-

~~【初译】不幸的是，基于所有~~ <math>~~\mathbf{a}</math> 获得 <math>P(~~\mathbf{x} ~~\mid \mathbf{\phi})~~</math> ~~难以解决。注意到如果~~ <math>~~P(\mathbf{x} \mid \mathbf{\phi})~~</math> ~~能够充分地达到峰值（w.r.t.~~ <math>~~\mathbf{a}</math>），我能近似其整数，通过最大化 <math>P(\mathbf{x} \mid~~ \mathbf{\phi})</math> ~~而得到一个近似解~~

+

使用稀疏编码算法学习基向量集的方法，是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 <math>\mathbf{x}</math> 来优化系数 <math>a_i</math> ，第二个是一次性处理多个样本对基向量 <math>\mathbf{\phi}</math> 进行优化。

-

【一审】不幸的是，通过对 <math>\mathbf{a}</math> 的积分计算 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 通常是难以实现的。虽然如此，我们注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的分布（对于相应的<math>\mathbf{a}</math>）足够陡峭的话，我们就可以用 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的最大值来估算以上积分。估算方法如下：

-

:<math>~~\begin~~{~~align}~~

+

如果使用 <math>L_1</math> 范式作为稀疏惩罚函数，对 <math>a^{(j)}_i</math> 的学习过程就简化为求解由 <math>L_1</math> 范式正则化的最小二乘法问题，这个问题函数在域 <math>a^{(j)}_i</math> 内为凸，已经有很多技术方法来解决这个问题（诸如CVX之类的凸优化软件可以用来解决L1正则化的最小二乘法问题）。如果 <math>S(.)</math> 是可微的，比如是对数惩罚函数，则可以采用基于梯度算法的方法，如共轭梯度法。

-

~~\mathbf{\phi~~}^{*'}~~=\text{argmax}_{~~\mathbf{\phi}} < ~~\max_{\mathbf{a}} \log(P(\mathbf{x} \mid~~ \mathbf{\phi~~})) >~~

+

-

~~\end{align~~}</math>

+

用 <math>L_2</math> 范式约束来学习基向量，同样可以简化为一个带有二次约束的最小二乘问题，其问题函数在域 <math>\mathbf{\phi}</math> 内也为凸。标准的凸优化软件（如CVX）或其它迭代方法就可以用来求解 <math>\mathbf{\phi}</math>，虽然已经有了更有效的方法，比如求解拉格朗日对偶函数（Lagrange dual）。

+

根据前面的的描述，稀疏编码是有一个明显的局限性的，这就是即使已经学习得到一组基向量，如果为了对新的数据样本进行“编码”，我们必须再次执行优化过程来得到所需的系数。这个显著的“实时”消耗意味着，即使是在测试中,实现稀疏编码也需要高昂的计算成本，尤其是与典型的前馈结构算法相比。

+

==中英文对照==

+

:稀疏编码 Sparse Coding

+

:无监督学习 unsupervised method

+

:超完备基 over-complete bases

+

:主成分分析 PCA

+

:稀疏性 sparsity

+

:退化 degeneracy

+

:代价函数 cost function

+

:重构项 reconstruction term

+

:稀疏惩罚项 sparsity penalty

+

:范式 norm

+

:生成模型 generative model

+

:线性叠加 linear superposition

+

:加性噪声 additive noise

+

:特征基向量 basis feature vectors

+

:经验分布函数 the empirical distribution

+

:KL 散度 KL divergence

+

:对数似然函数 the log-likelihood

+

:高斯白噪音 Gaussian white noise

+

:先验分布 the prior distribution

+

:先验概率 prior probability

+

:源特征 source features

+

:能量函数 the energy function

+

:正则化 regularized

+

:最小二乘法 least squares

+

:凸优化软件convex optimization software

+

:共轭梯度法 conjugate gradient methods

+

:二次约束 quadratic constraints

+

:拉格朗日对偶函数 the Lagrange dual

+

:前馈结构算法 feedforward architectures

+

==中文译者==

+

柳翠寅（liucuiyin@163.com），林锋（xlfg@yeah.net），王方（fangkey@gmail.com）

+

From Ufldl

Latest revision as of 08:32, 8 April 2013

Views

Personal tools

ufldl resources

wiki

Search

Toolbox

@@ Line 1: / Line 1: @@
-'''Sparse Coding'''
+== 稀疏编码 ==
-'''稀疏编码'''
+稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。稀疏编码算法的目的就是找到一组基向量 <math>\mathbf{\phi}_i</math> ，使得我们能将输入向量 <math>\mathbf{x}</math> 表示为这些基向量的线性组合：
-初译：@寅莹
-一审：@大黄蜂的思索
-Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:
-【初译】针对超完备基的学习集，稀疏编码是一类有效表示数据的非监督方法。稀疏编码的目的是找到一组基向量 <math>\mathbf{\phi}_i</math> 集合，以致能将输入向量 <math>\mathbf{x}</math> 表示为这组基向量的线性组合。
-【一审】稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。稀疏编码算法的目的就是找到一组基向量 <math>\mathbf{\phi}_i</math> ，使得我们能将输入向量 <math>\mathbf{x}</math> 表示为这些基向量的线性组合。
 :<math>\begin{align}
 \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}
 \end{align}</math>
-While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness.
-【初译】因主成分分析技术允许有效地学习一组完备基向量集合，希望能够学习一组过完备的基向量集合，来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （即,如<math>k > n</math>）。过完备基的优点是该基向量能更好地捕捉输入数据内在的结构和模式。然而，对于过完备基，其系数 <math>a_i</math> 不在由输入向量  <math>\mathbf{x}</math>唯一确定。因此，在稀疏编中引入稀疏的附加标准，解决因引入过完备基所导致的退化问题。
-【一审】虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出输入数据的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。
-Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.
-【初译】此处，定义的稀疏性为具有很少的非零分量或很少不接近零的分量。针对系数 <math>a_i</math> 的稀疏处理是对于输入向量，要求得到尽可能少的不为零的系数。作为输入表示数据的期望特征，即稀疏性的选择可由多数传感器数据描述受到启发，如自然图像可被描述为少量的元成分（表面或边缘）的叠加。其他理由，如和主要视觉皮层的属性相比已取得进展。
-【一审】这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。要使输入数据的表示方式符合我们所期望的这个样子，稀疏性的选择可以从对感观数据的观察得到启示，比如，自然图像有时可以通过少量的基本元素如图像表层或边缘的叠加来表示。其它的选择标准如初级视觉皮层神经元特性的比较，这在之前的课程中也有提及。
+虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组'''“超完备”'''基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出隐含在输入数据内部的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math> 唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准'''“稀疏性”'''来解决因超完备而导致的退化（degeneracy）问题。
-We define the sparse coding cost function on a set of <math>m</math> input vectors as
-【初译】在一组m个输入向量上定义稀疏编码的代价函数为：
+这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。选择使用具有稀疏性的分量来表示我们的输入数据是有原因的，因为绝大多数的感官数据，比如自然图像，可以被表示成少量基本元素的叠加，在图像中这些基本元素可以是面或者线。同时，比如与初级视觉皮层的类比过程也因此得到了提升。
-【一审】我们把有m个输入向量的稀疏编码代价函数定义为：
+我们把有 <math>m</math> 个输入向量的稀疏编码代价函数定义为：
 :<math>\begin{align}
@@ Line 38: / Line 20: @@
 \end{align}</math>
-where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.
-【初译】此处， <math>S(.)</math> 是一个稀疏代价函数，它的惩罚系数 <math>a_i</math> 不为零。第一项解释为稀疏编码的目标项，它作为重建项，约束算法给 <math>\mathbf{x}</math> 提供一个好的表示；第二项作为稀疏惩罚性，它约束对 <math>\mathbf{x}</math> 的表示是稀疏的；常量 <math>\lambda</math> 是一个尺度常量，决定两项贡献的相对重要性。
+此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为一个重构项，这一项迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。
-【一审】此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为稀疏编码的重构项，这项公式迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。
-Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>S(a_i)=\left|a_i\right|_1 </math> and the log penalty <math>S(a_i)=\log(1+a_i^2)</math>.
+虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)，但这是不可微的，而且通常很难进行优化。在实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math> 。
-【初译】尽管最直接的稀疏性测度是 "<math>L_0</math>" 范数(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>) ，该范数通常不可微且难以优化。在实际中，稀疏代价函数 <math>S(.)</math> 的选择是<math>L_1</math> 征罚 <math>S(a_i)=\left|a_i\right|_1 </math> ， <math>S(a_i)=\log(1+a_i^2)</math>.
-【一审】虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)， 但这是不可微的，而且通常很难进行优化，而实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math>。
+此外，很有可能因为减小 <math>a_i</math> 或增加 <math>\mathbf{\phi}_i</math> 至很大的常量，使得稀疏惩罚变得非常小。为防止此类事件发生，我们将限制 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 要小于某常量 <math>C</math> 。包含了限制条件的稀疏编码代价函数的完整形式如下：
-In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is
-【初译】此外，常可通过缩小 <math>a_i</math> ，放大<math>\mathbf{\phi}_i</math> ，通过使用一些大的常量，任意地选择稀疏惩罚项。为了阻止这一可能的发生，增加约束，使 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 小于一些常量 <math>C</math>。包括 <math>\mathbf{\phi}</math> 约束完整的稀疏编码代价函数定义如下：
-【一审】此外，稀疏代价函数也可以通过减小 <math>a_i</math> 或增加<math>\mathbf{\phi}_i</math> 至很大的常量来使其自身变得极其地小。为防止此类事件发生，我们将 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 限制为比某常量 <math>C</math>更小的数。包含了限制条件的稀疏编码代价函数的完整形式如下：
 :<math>\begin{array}{rc}
@@ Line 63: / Line 36: @@
 \end{array}</math>
-== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==概率解释[该理论基于1996年Olshausen 与 Field的模型]
-So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.
-Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:
+== 概率解释 [基于1996年Olshausen与Field的理论] ==
-【初译】到目前为止，在寻找一个稀疏的上下文中，考虑了稀疏编码，输入空间上的完备基向量。相应地，我们也可以从概率角度来处理稀疏编码，将它看一个生成模型。
+到目前为止，我们所考虑的稀疏编码，是为了寻找到一个稀疏的、超完备基向量集，来覆盖我们的输入数据空间。现在换一种方式，我们可以从概率的角度出发，将稀疏编码算法当作一种“生成模型”。
-把自然图像的建模问题看作是 <math>k</math> 个维独立的原特征 <math>\mathbf{\phi}_i</math> 和附加噪声 <math>\nu</math>的一个线性叠加
-【一审】到目前为止，我们已通过确定稀疏系数及超完备基向量的方法论述了稀疏编码算法。不过，我们或许可从概率的角度为稀疏编码算法找到一种“生成模型”。
+我们将自然图像建模问题看成是一种线性叠加，叠加元素包括 <math>k</math> 个独立的源特征 <math>\mathbf{\phi}_i</math> 以及加性噪声 <math>\nu</math> ：
-举个自然图像建模的例子，这种建模方式通过 <math>k</math> 个独立的特征向量 <math>\mathbf{\phi}_i</math> 进行线性叠加，这里的 具有一些噪音 <math>\nu</math>，如下式：
 :<math>\begin{align}
@@ Line 80: / Line 49: @@
 \end{align}</math>
-Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:
-【初译】我们的目标是寻找一组基特征向量 <math>\mathbf{\phi}</math> ，以致于图像的分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能的近似输入数据 <math>P^*(\mathbf{x})</math>的经验分布。一类方法是最小化KL <math>P^*(\mathbf{x})</math> 和 <math>P^*(\mathbf{x})</math> 之间的散度，这里 KL 散度定义如下:
+我们的目标是找到一组特征基向量 <math>\mathbf{\phi}</math> ，它使得图像的分布函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能地近似于输入数据的经验分布函数 <math>P^*(\mathbf{x})</math> 。一种实现方式是，最小化 <math>P^*(\mathbf{x})</math> 与 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 之间的 KL 散度，此 KL 散度表示如下：
-【一审】我们的目标是找到一组特征向量 <math>\mathbf{\phi}</math> ，因此，图像的分布函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 就可以尽可能地近似于输入数据的经验分布函数 <math>P^*(\mathbf{x})</math>。这么做的一种方法是，最小化 <math>P^*(\mathbf{x})</math> 与 <math>P^*(\mathbf{x})</math> 之间的KL离差，KL离差表示如下：
 :<math>\begin{align}
@@ Line 90: / Line 56: @@
 \end{align}</math>
-Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
-Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that
-【初译】因为通过我们对 <math>\mathbf{\phi}</math>的选择，经验分布 <math>P^*(\mathbf{x})</math> 是不变量，这相当于最大化对数似然 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。
+因为无论我们如何选择 <math>\mathbf{\phi}</math> ，经验分布函数 <math>P^*(\mathbf{x})</math> 都是常量，也就是说我们只需要最大化对数似然函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 。
-假设 <math>\nu</math> 是方差为 <math>\sigma^2</math>高斯白噪声，有下式成立。
+假设 <math>\nu</math> 是具有方差 <math>\sigma^2</math> 的高斯白噪音，则有下式：
-【一审】因为经验分布函数 <math>P^*(\mathbf{x})</math> 对于所有的 <math>\mathbf{\phi}</math>其结果是常量，这就等于说要最大化对数似然函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。
-假设 <math>\nu</math> 是具有方差 <math>\sigma^2</math>的高斯白噪音，则有下式：
 :<math>\begin{align}
@@ Line 103: / Line 64: @@
 \end{align}</math>
-In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as
-【初译】为决定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，需要描述先验分布<math>P(\mathbf{a})</math>。假设原特征分布独立，可分解先验分布如下式：
+为了确定分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> ，我们需要指定先验分布 <math>P(\mathbf{a})</math> 。假定我们的特征变量是独立的，我们就可以将先验概率分解为：
-【一审】为了确定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，我们同时要确定先验分布<math>P(\mathbf{a})</math>，假定我们的特征变量是独立的，我们就可得到先验概率为：
 :<math>\begin{align}
@@ Line 112: / Line 71: @@
 \end{align}</math>
-At this point, we would like to incorporate our sparsity assumption -- the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is
+此时，我们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。因此，我们希望 <math>a_i</math> 的概率分布在零值附近是凸起的，而且峰值很高。一个方便的参数化先验分布就是：
 :<math>\begin{align}
 P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
 \end{align}</math>
-Where <math>S(a_i)</math> is a function determining the shape of the prior distribution.
-【初译】在这一点，我们将纳入稀疏假设---假设任一个图像都可能是相对少数源特征的乘积。因此，我们希望 <math>a_i</math> 的分布在零点得到峰值，且有较高的峰度。先验分布的一个合适的参数化是
+这里 <math>S(a_i)</math> 是决定先验分布的形状的函数。
+当定义了 <math>P(\mathbf{x} \mid \mathbf{a} , \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math> 后，我们就可以写出在由 <math>\mathbf{\phi}</math> 定义的模型之下的数据 <math>\mathbf{x}</math> 的概率分布：
 :<math>\begin{align}
-P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
+P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
 \end{align}</math>
-这里 <math>S(a_i)</math> 是一个函数，决定先验分布的形状。
+那么，我们的问题就简化为寻找：
-【一审】至此，我们将“稀疏”的设想加入进来――任一图像都可能是由很少一部分相关特征生成的。因此，我们希望 <math>a_i</math> 的概率分布在零值附近是凸起的，而且形态很陡峭。一个简易的参数化先验分布就是：
 :<math>\begin{align}
-P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
+\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
 \end{align}</math>
-这里函数 <math>S(a_i)</math> 决定了先验分布的形状。
+这里 <math><.></math> 表示的是输入数据的期望值。
-Having defined <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> and <math> P(\mathbf{a})</math>, we can write the probability of the data <math>\mathbf{x}</math> under the model defined by <math>\mathbf{\phi}</math> as
+不幸的是，通过对 <math>\mathbf{a}</math> 的积分计算 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 通常是难以实现的。虽然如此，我们注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的分布（对于相应的 <math>\mathbf{a}</math> ）足够陡峭的话，我们就可以用 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的最大值来估算以上积分。估算方法如下：
 :<math>\begin{align}
-P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
+\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
 \end{align}</math>
-and our problem reduces to finding
-:<math>\begin{align}
+跟之前一样，我们可以通过减小 <math>a_i</math> 或增大 <math>\mathbf{\phi}</math> 来增加概率的估算值（因为 <math>P(a_i)</math> 在零值附近陡升）。因此我们要对特征向量 <math>\mathbf{\phi}</math> 加一个限制以防止这种情况发生。
-\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
-\end{align}</math>
-Where <math><.></math> denotes expectation over our input data.
+最后，我们可以定义一种线性生成模型的能量函数，从而将原先的代价函数重新表述为：
-【初译】定义了 <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math>，在由 <math>\mathbf{\phi}</math> 定义的模型下，我们可以与出数据 <math>\mathbf{x}</math> 的概率为：
+:<math>\begin{array}{rl}
+E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\
-:<math>\begin{align}
+ &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
-P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
+\end{array}</math>
-\end{align}</math>
-我们的问题演化为发现
-:<math>\begin{align}
-\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
-\end{align}</math>
-其中 <math><.></math> 表示期望我们的输入数据。
+其中 <math>\lambda = 2\sigma^2\beta</math> ，并且关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数，我们就可以将原先的优化问题重新表述为：
-【一审】当定义了 <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math>后，我们就可以通过由 <math>\mathbf{\phi}</math> 生成的模型将数据集 <math>\mathbf{x}</math> 的概率确定为：
 :<math>\begin{align}
-P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
+\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
 \end{align}</math>
-那么，我们的问题就简缩为：
-:<math>\begin{align}
-\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
-\end{align}</math>
-这里 <math><.></math> 指的是输入数据的期望值。
+使用概率理论来分析，我们可以发现，选择 <math>L_1</math> 惩罚和 <math>\log(1+a_i^2)</math> 惩罚作为函数 <math>S(.)</math> ，分别对应于使用了拉普拉斯概率 <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> 和柯西先验概率 <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> 。
-Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the distribution of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> and obtain a approximate solution
+== 学习算法 ==
-【初译】不幸的是，基于所有 <math>\mathbf{a}</math>  获得 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 难以解决。注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 能够充分地达到峰值（w.r.t. <math>\mathbf{a}</math>），我能近似其整数，通过最大化  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 而得到一个近似解
+使用稀疏编码算法学习基向量集的方法，是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 <math>\mathbf{x}</math> 来优化系数 <math>a_i</math> ，第二个是一次性处理多个样本对基向量 <math>\mathbf{\phi}</math> 进行优化。
-【一审】不幸的是，通过对 <math>\mathbf{a}</math> 的积分计算 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 通常是难以实现的。虽然如此，我们注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的分布（对于相应的<math>\mathbf{a}</math>）足够陡峭的话，我们就可以用  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的最大值来估算以上积分。估算方法如下：
-:<math>\begin{align}
+如果使用 <math>L_1</math> 范式作为稀疏惩罚函数，对 <math>a^{(j)}_i</math> 的学习过程就简化为求解 由 <math>L_1</math> 范式正则化的最小二乘法问题，这个问题函数在域 <math>a^{(j)}_i</math> 内为凸，已经有很多技术方法来解决这个问题（诸如CVX之类的凸优化软件可以用来解决L1正则化的最小二乘法问题）。如果 <math>S(.)</math> 是可微的，比如是对数惩罚函数，则可以采用基于梯度算法的方法，如共轭梯度法。
-\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
-\end{align}</math>
+用 <math>L_2</math> 范式约束来学习基向量，同样可以简化为一个带有二次约束的最小二乘问题，其问题函数在域 <math>\mathbf{\phi}</math> 内也为凸。标准的凸优化软件（如CVX）或其它迭代方法就可以用来求解 <math>\mathbf{\phi}</math>，虽然已经有了更有效的方法，比如求解拉格朗日对偶函数（Lagrange dual）。
+根据前面的的描述，稀疏编码是有一个明显的局限性的，这就是即使已经学习得到一组基向量，如果为了对新的数据样本进行“编码”，我们必须再次执行优化过程来得到所需的系数。这个显著的“实时”消耗意味着，即使是在测试中,实现稀疏编码也需要高昂的计算成本，尤其是与典型的前馈结构算法相比。
+==中英文对照==
+:稀疏编码 Sparse Coding
+:无监督学习 unsupervised method
+:超完备基 over-complete bases
+:主成分分析 PCA
+:稀疏性 sparsity
+:退化 degeneracy
+:代价函数 cost function
+:重构项 reconstruction term
+:稀疏惩罚项 sparsity penalty
+:范式 norm
+:生成模型 generative model
+:线性叠加 linear superposition
+:加性噪声 additive noise
+:特征基向量 basis feature vectors
+:经验分布函数 the empirical distribution
+:KL 散度 KL divergence
+:对数似然函数 the log-likelihood
+:高斯白噪音 Gaussian white noise
+:先验分布 the prior distribution
+:先验概率 prior probability
+:源特征 source features
+:能量函数 the energy function
+:正则化 regularized
+:最小二乘法 least squares
+:凸优化软件convex optimization software
+:共轭梯度法 conjugate gradient methods
+:二次约束 quadratic constraints
+:拉格朗日对偶函数 the Lagrange dual
+:前馈结构算法 feedforward architectures
+==中文译者==
+柳翠寅（liucuiyin@163.com），林锋（xlfg@yeah.net），王方（fangkey@gmail.com）
+{{Languages|Sparse_Coding|English}}