稀疏编码

'''Sparse Coding'''

'''稀疏编码'''

初译：@寅莹

一审：@大黄蜂的思索

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors: 

【初译】针对超完备基的学习集，稀疏编码是一类有效表示数据的非监督方法。稀疏编码的目的是找到一组基向量 <math>\mathbf{\phi}_i</math> 集合，以致能将输入向量 <math>\mathbf{x}</math> 表示为这组基向量的线性组合。

【一审】稀疏编码算法是一种无监督学习方法，它用来寻找一组“超完备”基向量来更高效地表示样本数据。稀疏编码算法的目的就是找到一组基向量 <math>\mathbf{\phi}_i</math> ，使得我们能将输入向量 <math>\mathbf{x}</math> 表示为这些基向量的线性组合。
:<math>\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} 
\end{align}</math>
While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness. 

【初译】因主成分分析技术允许有效地学习一组完备基向量集合，希望能够学习一组过完备的基向量集合，来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （即,如<math>k > n</math>）。过完备基的优点是该基向量能更好地捕捉输入数据内在的结构和模式。然而，对于过完备基，其系数 <math>a_i</math> 不在由输入向量  <math>\mathbf{x}</math>唯一确定。因此，在稀疏编中引入稀疏的附加标准，解决因引入过完备基所导致的退化问题。

【一审】虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出输入数据的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced. 

【初译】此处，定义的稀疏性为具有很少的非零分量或很少不接近零的分量。针对系数 <math>a_i</math> 的稀疏处理是对于输入向量，要求得到尽可能少的不为零的系数。作为输入表示数据的期望特征，即稀疏性的选择可由多数传感器数据描述受到启发，如自然图像可被描述为少量的元成分（表面或边缘）的叠加。其他理由，如和主要视觉皮层的属性相比已取得进展。

【一审】这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。要使输入数据的表示方式符合我们所期望的这个样子，稀疏性的选择可以从对感观数据的观察得到启示，比如，自然图像有时可以通过少量的基本元素如图像表层或边缘的叠加来表示。其它的选择标准如初级视觉皮层神经元特性的比较，这在之前的课程中也有提及。

We define the sparse coding cost function on a set of <math>m</math> input vectors as

【初译】在一组m个输入向量上定义稀疏编码的代价函数为：

【一审】我们把有m个输入向量的稀疏编码代价函数定义为：


:<math>\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}</math>

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions. 

【初译】此处， <math>S(.)</math> 是一个稀疏代价函数，它的惩罚系数 <math>a_i</math> 不为零。第一项解释为稀疏编码的目标项，它作为重建项，约束算法给 <math>\mathbf{x}</math> 提供一个好的表示；第二项作为稀疏惩罚性，它约束对 <math>\mathbf{x}</math> 的表示是稀疏的；常量 <math>\lambda</math> 是一个尺度常量，决定两项贡献的相对重要性。

【一审】此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为稀疏编码的重构项，这项公式迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。 

Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>S(a_i)=\left|a_i\right|_1 </math> and the log penalty <math>S(a_i)=\log(1+a_i^2)</math>.

【初译】尽管最直接的稀疏性测度是 "<math>L_0</math>" 范数(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>) ，该范数通常不可微且难以优化。在实际中，稀疏代价函数 <math>S(.)</math> 的选择是<math>L_1</math> 征罚 <math>S(a_i)=\left|a_i\right|_1 </math> ， <math>S(a_i)=\log(1+a_i^2)</math>.

【一审】虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)， 但这是不可微的，而且通常很难进行优化，而实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math>。

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is

【初译】此外，常可通过缩小 <math>a_i</math> ，放大<math>\mathbf{\phi}_i</math> ，通过使用一些大的常量，任意地选择稀疏惩罚项。为了阻止这一可能的发生，增加约束，使 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 小于一些常量 <math>C</math>。包括 <math>\mathbf{\phi}</math> 约束完整的稀疏编码代价函数定义如下：

【一审】此外，稀疏代价函数也可以通过减小 <math>a_i</math> 或增加<math>\mathbf{\phi}_i</math> 至很大的常量来使其自身变得极其地小。为防止此类事件发生，我们将 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 限制为比某常量 <math>C</math>更小的数。包含了限制条件的稀疏编码代价函数的完整形式如下：

:<math>\begin{array}{rc}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\\
\text{subject to}  &  \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k 
\\
\end{array}</math>

== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==概率解释[该理论基于1996年Olshausen 与 Field的模型]

So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model. 
Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:


【初译】到目前为止，在寻找一个稀疏的上下文中，考虑了稀疏编码，输入空间上的完备基向量。相应地，我们也可以从概率角度来处理稀疏编码，将它看一个生成模型。
把自然图像的建模问题看作是 <math>k</math> 个维独立的原特征 <math>\mathbf{\phi}_i</math> 和附加噪声 <math>\nu</math>的一个线性叠加


【一审】到目前为止，我们已通过确定稀疏系数及超完备基向量的方法论述了稀疏编码算法。不过，我们或许可从概率的角度为稀疏编码算法找到一种“生成模型”。
举个自然图像建模的例子，这种建模方式通过 <math>k</math> 个独立的特征向量 <math>\mathbf{\phi}_i</math> 进行线性叠加，这里的 具有一些噪音 <math>\nu</math>，如下式：

:<math>\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
\end{align}</math>

Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:

【初译】我们的目标是寻找一组基特征向量 <math>\mathbf{\phi}</math> ，以致于图像的分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能的近似输入数据 <math>P^*(\mathbf{x})</math>的经验分布。一类方法是最小化KL <math>P^*(\mathbf{x})</math> 和 <math>P^*(\mathbf{x})</math> 之间的散度，这里 KL 散度定义如下: 

【一审】我们的目标是找到一组特征向量 <math>\mathbf{\phi}</math> ，因此，图像的分布函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 就可以尽可能地近似于输入数据的经验分布函数 <math>P^*(\mathbf{x})</math>。这么做的一种方法是，最小化 <math>P^*(\mathbf{x})</math> 与 <math>P^*(\mathbf{x})</math> 之间的KL离差，KL离差表示如下：

:<math>\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
\end{align}</math> 

Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that 

【初译】因为通过我们对 <math>\mathbf{\phi}</math>的选择，经验分布 <math>P^*(\mathbf{x})</math> 是不变量，这相当于最大化对数似然 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。
假设 <math>\nu</math> 是方差为 <math>\sigma^2</math>高斯白噪声，有下式成立。

【一审】因为经验分布函数 <math>P^*(\mathbf{x})</math> 对于所有的 <math>\mathbf{\phi}</math>其结果是常量，这就等于说要最大化对数似然函数 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。
假设 <math>\nu</math> 是具有方差 <math>\sigma^2</math>的高斯白噪音，则有下式：

:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
\end{align}</math>

In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as 

【初译】为决定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，需要描述先验分布<math>P(\mathbf{a})</math>。假设原特征分布独立，可分解先验分布如下式：
【一审】为了确定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，我们同时要确定先验分布<math>P(\mathbf{a})</math>，假定我们的特征变量是独立的，我们就可得到先验概率为：

:<math>\begin{align}
P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
\end{align}</math>

At this point, we would like to incorporate our sparsity assumption -- the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is 

:<math>\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}</math>

Where <math>S(a_i)</math> is a function determining the shape of the prior distribution.

【初译】在这一点，我们将纳入稀疏假设---假设任一个图像都可能是相对少数源特征的乘积。因此，我们希望 <math>a_i</math> 的分布在零点得到峰值，且有较高的峰度。先验分布的一个合适的参数化是
 
:<math>\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}</math>


这里 <math>S(a_i)</math> 是一个函数，决定先验分布的形状。

【一审】至此，我们将“稀疏”的设想加入进来――任一图像都可能是由很少一部分相关特征生成的。因此，我们希望 <math>a_i</math> 的概率分布在零值附近是凸起的，而且形态很陡峭。一个简易的参数化先验分布就是：
 
:<math>\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}</math>


这里函数 <math>S(a_i)</math> 决定了先验分布的形状。

Having defined <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> and <math> P(\mathbf{a})</math>, we can write the probability of the data <math>\mathbf{x}</math> under the model defined by <math>\mathbf{\phi}</math> as 

:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}</math>

and our problem reduces to finding

:<math>\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

Where <math><.></math> denotes expectation over our input data. 

【初译】定义了 <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math>，在由 <math>\mathbf{\phi}</math> 定义的模型下，我们可以与出数据 <math>\mathbf{x}</math> 的概率为：
 
:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}</math>

我们的问题演化为发现
 
:<math>\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

其中 <math><.></math> 表示期望我们的输入数据。

【一审】当定义了 <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> 和 <math> P(\mathbf{a})</math>后，我们就可以通过由 <math>\mathbf{\phi}</math> 生成的模型将数据集 <math>\mathbf{x}</math> 的概率确定为：
 
:<math>\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}</math>

那么，我们的问题就简缩为：
 
:<math>\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

这里 <math><.></math> 指的是输入数据的期望值。

Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the distribution of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> and obtain a approximate solution 

【初译】不幸的是，基于所有 <math>\mathbf{a}</math>  获得 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 难以解决。注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 能够充分地达到峰值（w.r.t. <math>\mathbf{a}</math>），我能近似其整数，通过最大化  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 而得到一个近似解

【一审】不幸的是，通过对 <math>\mathbf{a}</math> 的积分计算 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 通常是难以实现的。虽然如此，我们注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的分布（对于相应的<math>\mathbf{a}</math>）足够陡峭的话，我们就可以用  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 的最大值来估算以上积分。估算方法如下：

:<math>\begin{align}
\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
\end{align}</math>

As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero) , we therefore impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.

【初译】如前所述，增加估计的概率通过缩小 <math>a_i</math> 和放大 <math>\mathbf{\phi}</math> (因为<math>P(a_i)</math> 在零点得到峰值)，因此，在特征 <math>\mathbf{\phi}</math> 上增加了一个范数约束避免这种可能。

【一审】跟之前一样，我们就可以通过减小 <math>a_i</math> 或增大 <math>\mathbf{\phi}</math> 来增加概率的估算值（因为<math>P(a_i)</math> 在零值附近陡升）。因此我们要对特征向量 <math>\mathbf{\phi}</math> 加一个限制以防止这种情况发生。

Finally, we can recover our original cost function by defining the energy function of this linear generative model

【初译】最后，通过定义线性生成模型的能量函数覆盖最初的代价函数。

【一审】最后，我们就可以通过定义一种线性生成模型的能量函数来将原先的代价函数重新表述为：

:<math>\begin{array}{rl}
E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\
 &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{array}</math>

where <math>\lambda = 2\sigma^2\beta</math> and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:

【初译】此处， <math>\lambda = 2\sigma^2\beta</math> 且不相关的常量已被隐藏。因为最大化对数似然等价于最小化能量函数，覆盖最初的的优化问题：

【一审】此处， <math>\lambda = 2\sigma^2\beta</math> ，其中关系不大的常量已被隐藏起来。因为最大化对数似然函数等同于最小化能量函数，我们将原先的优化问题重新表述为：

:<math>\begin{align}
\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) 
\end{align}</math>