稀疏编码

Revision as of 07:01, 8 March 2013 (view source)

Revision as of 07:07, 8 March 2013 (view source)

Line 8:

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:

+

【初译】

针对超完备基的学习集，稀疏编码是一类有效表示数据的非监督方法。稀疏编码的目的是找到一组基向量 <math>\mathbf{\phi}_i</math> 集合，以致能将输入向量 <math>\mathbf{x}</math> 表示为这组基向量的线性组合。

+

【一审】

Line 20:

Line 22:

\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}

\end{align}</math>

+

【原文】

While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness.

+

【初译】

因主成分分析技术允许有效地学习一组完备基向量集合，希望能够学习一组过完备的基向量集合，来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （即,如<math>k > n</math>）。过完备基的优点是该基向量能更好地捕捉输入数据内在的结构和模式。然而，对于过完备基，其系数 <math>a_i</math> 不在由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编中引入稀疏的附加标准，解决因引入过完备基所导致的退化问题。

+

【一审】

虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出输入数据的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。

+

【原文】

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.

+

【初译】

此处，定义的稀疏性为具有很少的非零分量或很少不接近零的分量。针对系数 <math>a_i</math> 的稀疏处理是对于输入向量，要求得到尽可能少的不为零的系数。作为输入表示数据的期望特征，即稀疏性的选择可由多数传感器数据描述受到启发，如自然图像可被描述为少量的元成分（表面或边缘）的叠加。其他理由，如和主要视觉皮层的属性相比已取得进展。

+

【一审】

这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。要使输入数据的表示方式符合我们所期望的这个样子，稀疏性的选择可以从对感观数据的观察得到启示，比如，自然图像有时可以通过少量的基本元素如图像表层或边缘的叠加来表示。其它的选择标准如初级视觉皮层神经元特性的比较，这在之前的课程中也有提及。

+

【原文】

We define the sparse coding cost function on a set of <math>m</math> input vectors as

+

【初译】

在一组m个输入向量上定义稀疏编码的代价函数为：

+

【一审】

Line 60:

Line 71:

\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

\end{align}</math>

+

【原文】

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.

+

【初译】

此处， <math>S(.)</math> 是一个稀疏代价函数，它的惩罚系数 <math>a_i</math> 不为零。第一项解释为稀疏编码的目标项，它作为重建项，约束算法给 <math>\mathbf{x}</math> 提供一个好的表示；第二项作为稀疏惩罚性，它约束对 <math>\mathbf{x}</math> 的表示是稀疏的；常量 <math>\lambda</math> 是一个尺度常量，决定两项贡献的相对重要性。

+

【一审】

此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为稀疏编码的重构项，这项公式迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。

+

【原文】

Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>S(a_i)=\left|a_i\right|_1 </math> and the log penalty <math>S(a_i)=\log(1+a_i^2)</math>.

+

【初译】

尽管最直接的稀疏性测度是 "<math>L_0</math>" 范数(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>) ，该范数通常不可微且难以优化。在实际中，稀疏代价函数 <math>S(.)</math> 的选择是<math>L_1</math> 征罚 <math>S(a_i)=\left|a_i\right|_1 </math> ， <math>S(a_i)=\log(1+a_i^2)</math>.

+

【一审】

虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)，但这是不可微的，而且通常很难进行优化，而实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math>。

+

【原文】

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is

+

【初译】

此外，常可通过缩小 <math>a_i</math> ，放大<math>\mathbf{\phi}_i</math> ，通过使用一些大的常量，任意地选择稀疏惩罚项。为了阻止这一可能的发生，增加约束，使 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 小于一些常量 <math>C</math>。包括 <math>\mathbf{\phi}</math> 约束完整的稀疏编码代价函数定义如下：

+

【一审】

Line 104:

Line 124:

\end{array}</math>

-

~~== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==概率解释[一审：该理论基于1996年Olshausen 与 Field的模型]~~

【原文】

+

==Probabilistic Interpretation [Based on Olshausen and Field 1996]==概率解释[一审：该理论基于1996年Olshausen 与 Field的模型]

+

So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.

Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:

+

Line 116:

Line 139:

到目前为止，在寻找一个稀疏的上下文中，考虑了稀疏编码，输入空间上的完备基向量。相应地，我们也可以从概率角度来处理稀疏编码，将它看一个生成模型。

把自然图像的建模问题看作是 <math>k</math> 个维独立的原特征 <math>\mathbf{\phi}_i</math> 和附加噪声 <math>\nu</math>的一个线性叠加

+

Line 126:

Line 150:

\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})

\end{align}</math>

+

【原文】

Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:

+

【初译】

我们的目标是寻找一组基特征向量 <math>\mathbf{\phi}</math> ，以致于图像的分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能的近似输入数据 <math>P^*(\mathbf{x})</math>的经验分布。一类方法是最小化KL <math>P^*(\mathbf{x})</math> 和 <math>P^*(\mathbf{x})</math> 之间的散度，这里 KL 散度定义如下:

+

【一审】

Line 142:

Line 169:

D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}

\end{align}</math>

+

【原文】

Line 147:

Line 175:

Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.

Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that

+

【初译】

Line 152:

Line 181:

因为通过我们对 <math>\mathbf{\phi}</math>的选择，经验分布 <math>P^*(\mathbf{x})</math> 是不变量，这相当于最大化对数似然 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。

假设 <math>\nu</math> 是方差为 <math>\sigma^2</math>高斯白噪声，有下式成立。

+

【一审】

Line 161:

Line 191:

P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)

\end{align}</math>

+

【原文】

In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as

+

【初译】

为决定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，需要描述先验分布<math>P(\mathbf{a})</math>。假设原特征分布独立，可分解先验分布如下式：

+

【一审】

Line 177:

Line 210:

P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)

\end{align}</math>

+

【原文】

Line 187:

Line 221:

Where <math>S(a_i)</math> is a function determining the shape of the prior distribution.

+

【初译】

Line 197:

Line 232:

这里 <math>S(a_i)</math> 是一个函数，决定先验分布的形状。

+

【一审】

Line 207:

Line 243:

这里函数 <math>S(a_i)</math> 决定了先验分布的形状。

+

【原文】

Line 223:

Line 260:

Where <math><.></math> denotes expectation over our input data.

+

【初译】

Line 239:

Line 277:

其中 <math><.></math> 表示期望我们的输入数据。

+

【一审】

Line 255:

Line 294:

这里 <math><.></math> 指的是输入数据的期望值。

+

【原文】

Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the distribution of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> and obtain a approximate solution

+

【初译】

不幸的是，基于所有 <math>\mathbf{a}</math> 获得 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 难以解决。注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 能够充分地达到峰值（w.r.t. <math>\mathbf{a}</math>），我能近似其整数，通过最大化 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 而得到一个近似解

+

【一审】

Line 271:

Line 313:

\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >

\end{align}</math>

+

【原文】

As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero) , we therefore impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.

+

【初译】

如前所述，增加估计的概率通过缩小 <math>a_i</math> 和放大 <math>\mathbf{\phi}</math> (因为<math>P(a_i)</math> 在零点得到峰值)，因此，在特征 <math>\mathbf{\phi}</math> 上增加了一个范数约束避免这种可能。

+

【一审】

跟之前一样，我们就可以通过减小 <math>a_i</math> 或增大 <math>\mathbf{\phi}</math> 来增加概率的估算值（因为<math>P(a_i)</math> 在零值附近陡升）。因此我们要对特征向量 <math>\mathbf{\phi}</math> 加一个限制以防止这种情况发生。

+

【原文】

Finally, we can recover our original cost function by defining the energy function of this linear generative model

+

【初译】

最后，通过定义线性生成模型的能量函数覆盖最初的代价函数。

+

【一审】

Line 300:

Line 348:

&= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

\end{array}</math>

+

【原文】

where <math>\lambda = 2\sigma^2\beta</math> and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:

+

【初译】

此处， <math>\lambda = 2\sigma^2\beta</math> 且不相关的常量已被隐藏。因为最大化对数似然等价于最小化能量函数，覆盖最初的的优化问题：

+

【一审】

Line 316:

Line 367:

\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

\end{align}</math>

+

【原文】

Using a probabilistic approach, it can also be seen that the choices of the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math> for <math>S(.)</math> correspond to the use of the Laplacian <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> and the Cauchy prior <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> respectively.

+

【初译】

使用概率方法，可看作是对 <math>L_1</math> 惩罚项 <math>\left|a_i\right|_1 </math> 和对数惩罚项 <math>\log(1+a_i^2)</math> 的选择，对于 <math>S(.)</math> 对应着Laplacian <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> 和 Cauchy先验 <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> 的使用。

+

【一审】

Line 330:

Line 384:

== Learning ==

+

学习（一审：算法学习）

+

【原文】

Learning a set of basis vectors <math>\mathbf{\phi}</math> using sparse coding consists of performing two separate optimizations, the first being an optimization over coefficients <math>a_i</math> for each training example <math>\mathbf{x}</math> and the second an optimization over basis vectors <math>\mathbf{\phi}</math> across many training examples at once.

+

【初译】

学习一组集基向量 <math>\mathbf{\phi}</math> 使用稀疏编码组成两种独立的优化，第一类对于每个训练样本基于系数 <math>a_i</math> 的优化，第二类是对多个训练样本，基于向量 <math>\mathbf{\phi}</math> 的优化。

+

【一审】

使用稀疏编码算法学习基向量集的方法由两项各自独立的优化方法来实现，其一是逐个套用样本 <math>\mathbf{x}</math> 对系数 <math>a_i</math> 进行优化，其二是一次性处理多个样本对基向量 <math>\mathbf{\phi}</math> 进行优化。

+

【原文】

Assuming an <math>L_1</math> sparsity penalty, learning <math>a^{(j)}_i</math> reduces to solving a <math>L_1</math> regularized least squares problem which is convex in <math>a^{(j)}_i</math> for which several techniques have been developed (convex optimization software such as CVX can also be used to perform L1 regularized least squares). Assuming a differentiable <math>S(.)</math> such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.

+

【初译】

假设 <math>L_1</math> 稀疏惩罚，学习 <math>a^{(j)}_i</math> 减化为求解一个 <math>L_1</math> 的正规化最小二乘问题，即在 <math>a^{(j)}_i</math> 上的凸集问题，针对此已有若干技术 (凸优化软件，如CVX 也被用于解决L1正规化最小二乘)。假定一个可微 <math>S(.)</math> ，如对数惩罚，基于梯度的方法，如共轭梯度方法用来解决优化问题。

+

【一审】

以 <math>L_1</math> 范式稀疏惩罚（代价函数）为例，对 <math>a^{(j)}_i</math> 的学习过程就简化为求解 <math>L_1</math> 范式的正则化最小二乘法问题，这个问题函式在域 <math>a^{(j)}_i</math> 内为凸，已经有很多技术方法可以求得这个凸集（诸如CVX之类的凸优化软件可以用来解决L1范式的正则化最小二乘法问题）。如果可微函数 <math>S(.)</math> 是一个类似对数惩罚函数，则可以采用基于梯度算法的方法如共轭梯度法。

+

【原文】

Learning a set of basis vectors with a <math>L_2</math> norm constraint also reduces to a least squares problem with quadratic constraints which is convex in <math>\mathbf{\phi}</math>. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for <math>\mathbf{\phi}</math> although significantly more efficient methods such as solving the Lagrange dual have also been developed.

+

【初译】

<math>L_2</math> 范数约束下学习一组基向量，常简化为有二次约束的最小二乘问题，也是基向量 <math>\mathbf{\phi}</math>上的凸问题。虽然已有更有效的方法，如解决拉格朗日对偶方法,标准凸优化软件（例如 CVX）或其它迭代方法都可用来求解基向量 <math>\mathbf{\phi}</math>。

+

【一审】

用 <math>L_2</math> 范式学习基向量同样可以简化为求解二次约束最小二乘问题，其问题函式在域 <math>\mathbf{\phi}</math>内也为凸。标准凸优化软件（如CVX）或其它迭代方法就可以用来求解 <math>\mathbf{\phi}</math>，虽然已经有了求解拉格朗日对偶函数这样的更具效率的方法。

+

【原文】

As described above, a significant limitation of sparse coding is that even after a set of basis vectors have been learnt, in order to "encode" a new data example, optimization must be performed to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive to implement even at test time especially compared to typical feedforward architectures.

+

【初译】

如前所述，稀疏编码有效的限制是：即使在学习一组基向量后，“编码”一个新的数据样本，要获得需要的系数，必须执行优化。这个有效的“运行”代价指的是稀疏编码计算成本高昂，特别当与典型的前馈结构构相比，即使是测试计算代价仍很高。

+

【一审】

综上所述，即使已经学习得到一组基向量，稀疏编码算法还是极有局限性的。为了对一组新样本进行“编码”，必须执行优化才可以得到所需要的系数。即使只是在测试的时候，特别是相比于“前馈”模式，这种巨大的“运行时”成本意味着稀疏编码是极耗运算性能的。

From Ufldl

Revision as of 07:07, 8 March 2013

Views

Personal tools

ufldl resources

wiki

Search

Toolbox

@@ Line 8: / Line 8: @@
 Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:
 【初译】
 针对超完备基的学习集，稀疏编码是一类有效表示数据的非监督方法。稀疏编码的目的是找到一组基向量 <math>\mathbf{\phi}_i</math> 集合，以致能将输入向量 <math>\mathbf{x}</math> 表示为这组基向量的线性组合。
 【一审】
@@ Line 20: / Line 22: @@
 \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}
 \end{align}</math>
 【原文】
 While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an '''over-complete''' set of basis vectors to represent input vectors <math>\mathbf{x}\in\mathbb{R}^n</math> (i.e. such that <math>k > n</math>). The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data. However, with an over-complete basis, the coefficients <math>a_i</math> are no longer uniquely determined by the input vector <math>\mathbf{x}</math>. Therefore, in sparse coding, we introduce the additional criterion of '''sparsity''' to resolve the degeneracy introduced by over-completeness.
 【初译】
 因主成分分析技术允许有效地学习一组完备基向量集合，希望能够学习一组过完备的基向量集合，来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （即,如<math>k > n</math>）。过完备基的优点是该基向量能更好地捕捉输入数据内在的结构和模式。然而，对于过完备基，其系数 <math>a_i</math> 不在由输入向量  <math>\mathbf{x}</math>唯一确定。因此，在稀疏编中引入稀疏的附加标准，解决因引入过完备基所导致的退化问题。
 【一审】
 虽然形如主成分分析技术（PCA）能使我们方便地找到一组“完备”基向量，但是这里我们想要做的是找到一组“超完备”基向量来表示输入向量 <math>\mathbf{x}\in\mathbb{R}^n</math> （也就是说，<math>k > n</math>）。超完备基的好处是它们能更有效地找出输入数据的结构与模式。然而，对于超完备基来说，系数 <math>a_i</math> 不再由输入向量 <math>\mathbf{x}</math>唯一确定。因此，在稀疏编码算法中，我们另加了一个评判标准“稀疏性”来解决因超完备而导致的退化（degeneracy）问题。
 【原文】
 Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients <math>a_i</math> be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.
 【初译】
 此处，定义的稀疏性为具有很少的非零分量或很少不接近零的分量。针对系数 <math>a_i</math> 的稀疏处理是对于输入向量，要求得到尽可能少的不为零的系数。作为输入表示数据的期望特征，即稀疏性的选择可由多数传感器数据描述受到启发，如自然图像可被描述为少量的元成分（表面或边缘）的叠加。其他理由，如和主要视觉皮层的属性相比已取得进展。
 【一审】
 这里，我们把“稀疏性”定义为：只有很少的几个非零元素或只有很少的几个远大于零的元素。要求系数 <math>a_i</math> 是稀疏的意思就是说：对于一组输入向量，我们只想有尽可能少的几个系数远大于零。要使输入数据的表示方式符合我们所期望的这个样子，稀疏性的选择可以从对感观数据的观察得到启示，比如，自然图像有时可以通过少量的基本元素如图像表层或边缘的叠加来表示。其它的选择标准如初级视觉皮层神经元特性的比较，这在之前的课程中也有提及。
 【原文】
 We define the sparse coding cost function on a set of <math>m</math> input vectors as
 【初译】
 在一组m个输入向量上定义稀疏编码的代价函数为：
 【一审】
@@ Line 60: / Line 71: @@
 \text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
 \end{align}</math>
 【原文】
 where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.
 【初译】
 此处， <math>S(.)</math> 是一个稀疏代价函数，它的惩罚系数 <math>a_i</math> 不为零。第一项解释为稀疏编码的目标项，它作为重建项，约束算法给 <math>\mathbf{x}</math> 提供一个好的表示；第二项作为稀疏惩罚性，它约束对 <math>\mathbf{x}</math> 的表示是稀疏的；常量 <math>\lambda</math> 是一个尺度常量，决定两项贡献的相对重要性。
 【一审】
 此处 <math>S(.)</math> 是一个稀疏代价函数，由它来对远大于零的 <math>a_i</math> 进行“惩罚”。我们可以把稀疏编码目标函式的第一项解释为稀疏编码的重构项，这项公式迫使稀疏编码算法能为输入向量 <math>\mathbf{x}</math> 提供一个高拟合度的线性表达式，而公式第二项即“稀疏惩罚”项，它使 <math>\mathbf{x}</math> 的表达式变得“稀疏”。常量 <math>\lambda</math> 是一个变换量，由它来控制这两项式子的相对重要性。
 【原文】
 Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>S(a_i)=\left|a_i\right|_1 </math> and the log penalty <math>S(a_i)=\log(1+a_i^2)</math>.
 【初译】
 尽管最直接的稀疏性测度是 "<math>L_0</math>" 范数(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>) ，该范数通常不可微且难以优化。在实际中，稀疏代价函数 <math>S(.)</math> 的选择是<math>L_1</math> 征罚 <math>S(a_i)=\left|a_i\right|_1 </math> ， <math>S(a_i)=\log(1+a_i^2)</math>.
 【一审】
 虽然“稀疏性”的最直接测度标准是 "<math>L_0</math>" 范式(<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>)， 但这是不可微的，而且通常很难进行优化，而实际中，稀疏代价函数 <math>S(.)</math> 的普遍选择是<math>L_1</math> 范式代价函数 <math>S(a_i)=\left|a_i\right|_1 </math> 及对数代价函数 <math>S(a_i)=\log(1+a_i^2)</math>。
 【原文】
 In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is
 【初译】
 此外，常可通过缩小 <math>a_i</math> ，放大<math>\mathbf{\phi}_i</math> ，通过使用一些大的常量，任意地选择稀疏惩罚项。为了阻止这一可能的发生，增加约束，使 <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> 小于一些常量 <math>C</math>。包括 <math>\mathbf{\phi}</math> 约束完整的稀疏编码代价函数定义如下：
 【一审】
@@ Line 104: / Line 124: @@
 \end{array}</math>
-== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==概率解释[一审：该理论基于1996年Olshausen 与 Field的模型]
 【原文】
+==Probabilistic Interpretation [Based on Olshausen and Field 1996]==概率解释[一审：该理论基于1996年Olshausen 与 Field的模型]
 So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.
 Consider the problem of modelling natural images as the linear superposition of <math>k</math> independent source features <math>\mathbf{\phi}_i</math> with some additive noise <math>\nu</math>:
@@ Line 116: / Line 139: @@
 到目前为止，在寻找一个稀疏的上下文中，考虑了稀疏编码，输入空间上的完备基向量。相应地，我们也可以从概率角度来处理稀疏编码，将它看一个生成模型。
 把自然图像的建模问题看作是 <math>k</math> 个维独立的原特征 <math>\mathbf{\phi}_i</math> 和附加噪声 <math>\nu</math>的一个线性叠加
@@ Line 126: / Line 150: @@
 \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
 \end{align}</math>
 【原文】
 Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:
 【初译】
 我们的目标是寻找一组基特征向量 <math>\mathbf{\phi}</math> ，以致于图像的分布 <math>P(\mathbf{x}\mid\mathbf{\phi})</math> 尽可能的近似输入数据 <math>P^*(\mathbf{x})</math>的经验分布。一类方法是最小化KL <math>P^*(\mathbf{x})</math> 和 <math>P^*(\mathbf{x})</math> 之间的散度，这里 KL 散度定义如下:
 【一审】
@@ Line 142: / Line 169: @@
 D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
 \end{align}</math>
 【原文】
@@ Line 147: / Line 175: @@
 Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
 Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that
 【初译】
@@ Line 152: / Line 181: @@
 因为通过我们对 <math>\mathbf{\phi}</math>的选择，经验分布 <math>P^*(\mathbf{x})</math> 是不变量，这相当于最大化对数似然 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>。
 假设 <math>\nu</math> 是方差为 <math>\sigma^2</math>高斯白噪声，有下式成立。
 【一审】
@@ Line 161: / Line 191: @@
 P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
 \end{align}</math>
 【原文】
 In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as
 【初译】
 为决定 <math>P(\mathbf{x}\mid\mathbf{\phi})</math>的分布，需要描述先验分布<math>P(\mathbf{a})</math>。假设原特征分布独立，可分解先验分布如下式：
 【一审】
@@ Line 177: / Line 210: @@
 P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
 \end{align}</math>
 【原文】
@@ Line 187: / Line 221: @@
 Where <math>S(a_i)</math> is a function determining the shape of the prior distribution.
 【初译】
@@ Line 197: / Line 232: @@
 这里 <math>S(a_i)</math> 是一个函数，决定先验分布的形状。
 【一审】
@@ Line 207: / Line 243: @@
 这里函数 <math>S(a_i)</math> 决定了先验分布的形状。
 【原文】
@@ Line 223: / Line 260: @@
 Where <math><.></math> denotes expectation over our input data.
 【初译】
@@ Line 239: / Line 277: @@
 其中 <math><.></math> 表示期望我们的输入数据。
 【一审】
@@ Line 255: / Line 294: @@
 这里 <math><.></math> 指的是输入数据的期望值。
 【原文】
 Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the distribution of <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate its integral with the maximum value of  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> and obtain a approximate solution
 【初译】
 不幸的是，基于所有 <math>\mathbf{a}</math>  获得 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 难以解决。注意到如果 <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 能够充分地达到峰值（w.r.t. <math>\mathbf{a}</math>），我能近似其整数，通过最大化  <math>P(\mathbf{x} \mid \mathbf{\phi})</math> 而得到一个近似解
 【一审】
@@ Line 271: / Line 313: @@
 \mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) >
 \end{align}</math>
 【原文】
 As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero) , we therefore impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.
 【初译】
 如前所述，增加估计的概率通过缩小 <math>a_i</math> 和放大 <math>\mathbf{\phi}</math> (因为<math>P(a_i)</math> 在零点得到峰值)，因此，在特征 <math>\mathbf{\phi}</math> 上增加了一个范数约束避免这种可能。
 【一审】
 跟之前一样，我们就可以通过减小 <math>a_i</math> 或增大 <math>\mathbf{\phi}</math> 来增加概率的估算值（因为<math>P(a_i)</math> 在零值附近陡升）。因此我们要对特征向量 <math>\mathbf{\phi}</math> 加一个限制以防止这种情况发生。
 【原文】
 Finally, we can recover our original cost function by defining the energy function of this linear generative model
 【初译】
 最后，通过定义线性生成模型的能量函数覆盖最初的代价函数。
 【一审】
@@ Line 300: / Line 348: @@
   &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
 \end{array}</math>
 【原文】
 where <math>\lambda = 2\sigma^2\beta</math> and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:
 【初译】
 此处， <math>\lambda = 2\sigma^2\beta</math> 且不相关的常量已被隐藏。因为最大化对数似然等价于最小化能量函数，覆盖最初的的优化问题：
 【一审】
@@ Line 316: / Line 367: @@
 \mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
 \end{align}</math>
 【原文】
 Using a probabilistic approach, it can also be seen that the choices of the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math> for <math>S(.)</math> correspond to the use of the Laplacian <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> and the Cauchy prior <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> respectively.
 【初译】
 使用概率方法，可看作是对 <math>L_1</math> 惩罚项 <math>\left|a_i\right|_1 </math> 和对数惩罚项 <math>\log(1+a_i^2)</math> 的选择，对于 <math>S(.)</math> 对应着Laplacian <math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math> 和 Cauchy先验 <math>P(a_i) \propto \frac{\beta}{1+a_i^2}</math> 的使用。
 【一审】
@@ Line 330: / Line 384: @@
 == Learning ==
 学习 （一审：算法学习）
 【原文】
 Learning a set of basis vectors <math>\mathbf{\phi}</math> using sparse coding consists of performing two separate optimizations, the first being an optimization over coefficients <math>a_i</math> for each training example <math>\mathbf{x}</math> and the second an optimization over basis vectors <math>\mathbf{\phi}</math> across many training examples at once.
 【初译】
 学习一组集基向量 <math>\mathbf{\phi}</math> 使用稀疏编码组成两种独立的优化， 第一类对于每个训练样本基于系数 <math>a_i</math> 的优化，第二类是对多个训练样本，基于向量 <math>\mathbf{\phi}</math> 的优化。
 【一审】
 使用稀疏编码算法学习基向量集的方法由两项各自独立的优化方法来实现，其一是逐个套用样本 <math>\mathbf{x}</math> 对系数 <math>a_i</math> 进行优化，其二是一次性处理多个样本对基向量 <math>\mathbf{\phi}</math> 进行优化。
 【原文】
 Assuming an <math>L_1</math> sparsity penalty, learning <math>a^{(j)}_i</math> reduces to solving a <math>L_1</math> regularized least squares problem which is convex in <math>a^{(j)}_i</math> for which several techniques have been developed (convex optimization software such as CVX can also be used to perform L1 regularized least squares). Assuming a differentiable <math>S(.)</math> such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.
 【初译】
 假设 <math>L_1</math> 稀疏惩罚，学习 <math>a^{(j)}_i</math> 减化为求解 一个 <math>L_1</math> 的正规化最小二乘问题，即在 <math>a^{(j)}_i</math> 上的凸集问题，针对此已有若干技术 (凸优化软件，如CVX 也被用于解决L1正规化最小二乘)。假定一个可微 <math>S(.)</math> ，如对数惩罚，基于梯度的方法，如共轭梯度方法用来解决优化问题。
 【一审】
 以 <math>L_1</math> 范式稀疏惩罚（代价函数）为例，对 <math>a^{(j)}_i</math> 的学习过程就简化为求解 <math>L_1</math> 范式的正则化最小二乘法问题，这个问题函式在域 <math>a^{(j)}_i</math> 内为凸，已经有很多技术方法可以求得这个凸集（诸如CVX之类的凸优化软件可以用来解决L1范式的正则化最小二乘法问题）。如果可微函数 <math>S(.)</math> 是一个类似对数惩罚函数，则可以采用基于梯度算法的方法如共轭梯度法。
 【原文】
 Learning a set of basis vectors with a <math>L_2</math> norm constraint also reduces to a least squares problem with quadratic constraints which is convex in <math>\mathbf{\phi}</math>. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for <math>\mathbf{\phi}</math> although significantly more efficient methods such as solving the Lagrange dual have also been developed.
 【初译】
 <math>L_2</math> 范数约束下学习一组基向量，常简化为有二次约束的最小二乘问题，也是基向量 <math>\mathbf{\phi}</math>上的凸问题。虽然已有更有效的方法，如解决拉格朗日对偶方法,标准凸优化软件（例如 CVX）或其它迭代方法都可用来求解基向量 <math>\mathbf{\phi}</math>。
 【一审】
 用 <math>L_2</math> 范式学习基向量同样可以简化为求解二次约束最小二乘问题，其问题函式在域 <math>\mathbf{\phi}</math>内也为凸。标准凸优化软件（如CVX）或其它迭代方法就可以用来求解 <math>\mathbf{\phi}</math>，虽然已经有了求解拉格朗日对偶函数这样的更具效率的方法。
 【原文】
 As described above, a significant limitation of sparse coding is that even after a set of basis vectors have been learnt, in order to "encode" a new data example, optimization must be performed to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive to implement even at test time especially compared to typical feedforward architectures.
 【初译】
 如前所述，稀疏编码有效的限制是：即使在学习一组基向量后，“编码”一个新的数据样本，要获得需要的系数，必须执行优化。这个有效的“运行”代价指的是稀疏编码计算成本高昂，特别当与典型的前馈结构构相比，即使是测试计算代价仍很高。
 【一审】
 综上所述，即使已经学习得到一组基向量，稀疏编码算法还是极有局限性的。为了对一组新样本进行“编码”，必须执行优化才可以得到所需要的系数。即使只是在测试的时候，特别是相比于“前馈”模式，这种巨大的“运行时”成本意味着稀疏编码是极耗运算性能的。