Sparse Coding: Autoencoder Interpretation
== Sparse coding ==

In the sparse autoencoder, we tried to learn a set of weights <math>W</math> (and associated biases <math>b</math>) that would give us sparse features <math>\sigma(Wx + b)</math> useful in reconstructing the input <math>x</math>.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Formally, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. Our objective function is hence:

:<math> J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 </math>

(If you are unfamiliar with the notation, <math>\lVert x \rVert_k</math> refers to the L<math>k</math> norm of <math>x</math>, which is equal to <math>\left( \sum{ \left| x_i \right|^k } \right)^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of the absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term that encourages the learned features to be sparse.

However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the reconstruction error. Hence, we include the additional constraint that for every column <math>A_j</math> of <math>A</math>, <math>A_j^TA_j \le 1</math>.
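To see the degeneracy that this constraint rules out, here is a small numpy sketch (the sizes and values are hypothetical placeholders, not part of the original text): scaling <math>A</math> up by a constant <math>c</math> and <math>s</math> down by <math>1/c</math> leaves the reconstruction term untouched while shrinking the L1 term by a factor of <math>c</math>.

<pre>
import numpy as np

np.random.seed(0)
A = np.random.randn(64, 25)              # basis: 64-dimensional data, 25 features (placeholder sizes)
s = np.random.randn(25)                  # sparse feature vector for one example
x = A @ s + 0.1 * np.random.randn(64)    # a data point (placeholder)
lam = 0.1                                # sparsity weight lambda

def J(A, s):
    # Unconstrained objective: reconstruction error + L1 sparsity penalty
    return np.sum((A @ s - x) ** 2) + lam * np.sum(np.abs(s))

c = 10.0
print(J(A, s))            # original objective
print(J(c * A, s / c))    # same reconstruction error, L1 term is 10x smaller
</pre>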
Our problem is thus:

:<math>
\begin{array}{rcl}
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\
\end{array}
</math>

Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given <math>A</math>, the problem of finding the <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding the <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests alternately optimizing for <math>A</math> with <math>s</math> fixed, and then optimizing for <math>s</math> with <math>A</math> fixed. It turns out that this works quite well in practice.

However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:

:<math> J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2 </math>

(Note that the third term, <math>\lVert A \rVert_2^2</math>, is simply the sum of squares of the entries of <math>A</math>, or <math>\sum_r{\sum_c{A_{rc}^2}}</math>.)

This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient-descent-based methods, we will "smooth out" the L1 norm using an approximation that allows us to use gradient descent.
To "smooth out" the L1 norm, we use <math>\sqrt{x^2 + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x^2</math>, the term <math>x^2 + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below. Our final objective function is hence:

:<math> J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2 </math>

(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)

This objective function can then be optimized iteratively, using the following procedure:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
<ol>
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
</ol>

Observe that with our modified objective function, the objective function given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>), is simply quadratic in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution is to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step has to be carried out using gradient descent or similar optimization methods.
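As an illustration of these two sub-problems, here is a minimal numpy sketch, assuming <math>s</math> holds one column of features per example and <math>x</math> one patch per column (the helper names are our own, not from the original text). The <code>solve_A</code> update is the ridge-regression-style solution one obtains from matrix calculus for the quadratic sub-problem, and <code>grad_s</code> is the gradient of the smoothed objective used for the gradient-descent step on <math>s</math>:

<pre>
import numpy as np

def objective(A, s, x, lam, gamma, eps):
    # J(A, s) = ||As - x||^2 + lambda * sum_k sqrt(s_k^2 + eps) + gamma * ||A||^2
    recon = np.sum((A @ s - x) ** 2)
    sparsity = lam * np.sum(np.sqrt(s ** 2 + eps))
    weight_decay = gamma * np.sum(A ** 2)
    return recon + sparsity + weight_decay

def grad_s(A, s, x, lam, eps):
    # Gradient of the smoothed objective with respect to s (A held fixed)
    return 2 * A.T @ (A @ s - x) + lam * s / np.sqrt(s ** 2 + eps)

def solve_A(s, x, gamma):
    # Given s, the objective is quadratic in A; setting its matrix derivative
    # to zero gives A = x s^T (s s^T + gamma I)^{-1}
    k = s.shape[0]
    return x @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(k))
</pre>

A few gradient-descent steps with <code>grad_s</code>, followed by one call to <code>solve_A</code>, then corresponds to one pass of the loop above.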
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.
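When deriving these gradients by hand, a numerical gradient check is a simple way to catch mistakes. The sketch below (reusing the hypothetical <code>objective</code> and <code>grad_s</code> helpers from the previous snippet) compares the analytic gradient with a centered finite-difference estimate:

<pre>
import numpy as np

def numeric_grad_s(A, s, x, lam, eps, h=1e-5):
    # Centered finite-difference estimate of dJ/ds, element by element
    # (gamma is passed as 0 since the weight-decay term does not depend on s)
    g = np.zeros_like(s)
    it = np.nditer(s, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        s_plus, s_minus = s.copy(), s.copy()
        s_plus[idx] += h
        s_minus[idx] -= h
        g[idx] = (objective(A, s_plus, x, lam, 0.0, eps)
                  - objective(A, s_minus, x, lam, 0.0, eps)) / (2 * h)
        it.iternext()
    return g

# The analytic and numerical gradients should agree to several decimal places:
# print(np.max(np.abs(grad_s(A, s, x, lam, eps) - numeric_grad_s(A, s, x, lam, eps))))
</pre>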
== Topographic sparse coding ==

With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal.

Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent.

Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to be similar. The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty: if we group in 3x3 regions, then instead of, say, <math>\sqrt{s_{1,1}^2 + \epsilon}</math>, we use, say, <math>\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{2,3}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}</math>. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times. Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:

:<math> J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2 </math>

In practice, the "grouping" can be accomplished using a "grouping matrix" <math>V</math>, such that the <math>r</math>th row of <math>V</math> indicates which features are grouped in the <math>r</math>th group, so <math>V_{r, c} = 1</math> if group <math>r</math> contains feature <math>c</math>. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive.
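As a concrete illustration, the sketch below builds such a grouping matrix for a hypothetical 10×10 grid of features with overlapping, wrap-around 3×3 groups (the grid size is an assumption made for the example, not fixed by the text); each row of <math>V</math> marks the members of one group:

<pre>
import numpy as np

def grouping_matrix(grid=10, region=3):
    # One row per group, one column per feature (grid*grid features in total).
    # The group starting at grid position (i, j) contains the region x region
    # block of features below and to the right of it, wrapping around the edges.
    n = grid * grid
    V = np.zeros((n, n))
    for i in range(grid):
        for j in range(grid):
            for di in range(region):
                for dj in range(region):
                    feature = ((i + di) % grid) * grid + ((j + dj) % grid)
                    V[i * grid + j, feature] = 1
    return V

V = grouping_matrix()
print(V.sum(axis=0))   # thanks to the wrap-around, every feature lies in exactly 9 groups
</pre>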
Using this grouping matrix, the objective function can be rewritten as:

:<math> J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2 </math>

(where <math>\sum{ \sqrt{Vss^T + \epsilon} }</math> is <math>\sum_r{ \sum_c { D_{r, c} } } </math> if we let <math>D = \sqrt{Vss^T + \epsilon}</math>)

This objective function can be optimized using the iterative method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.
== Sparse coding in practice ==

As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.

Recall the simple iterative algorithm proposed earlier:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
<ol>
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
</ol>

It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence:

<ol>
<li>Batching examples into "mini-batches"
<li>Good initialization of <math>s</math>
</ol>

=== Batching examples into mini-batches ===

If you try running the simple iterative algorithm on a large dataset of say 10 000 patches in one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge.

To increase the rate of convergence, you can instead run the algorithm on mini-batches. To do this, instead of running the algorithm on all 10 000 patches, in each iteration select a mini-batch - a (different) random subset of say 2000 patches out of the 10 000 - and run the algorithm on that mini-batch for the iteration instead. This accomplishes two things - firstly, it speeds up each iteration, since each iteration now operates on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence (TODO: explain why).
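For example, with numpy the mini-batch for each iteration might be drawn as follows (the <code>patches</code> matrix and its layout, one patch per column, are placeholder assumptions):

<pre>
import numpy as np

patches = np.random.randn(192, 10000)   # placeholder dataset: one patch per column

# Draw a different random subset of 2000 of the 10 000 patches each iteration
batch_idx = np.random.choice(patches.shape[1], size=2000, replace=False)
x = patches[:, batch_idx]
</pre>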
=== Good initialization of <math>s</math> ===

Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optimum for <math>s</math> is found before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:

<ol>
<li>Set <math>s \leftarrow A^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)
<li>For each feature in <math>s</math> (i.e. each row of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_r</math> is the <math>r</math>th basis vector (column) of <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }</math>.
</ol>

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>As \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both of these steps results in poor performance in practice. (TODO: a better explanation for why this initialization helps?)

=== The practical algorithm ===

With the above two tricks, the algorithm for sparse coding then becomes:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
<ol>
<li>Select a random mini-batch of 2000 patches
<li>Initialize <math>s</math> as described above
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
</ol>

With this method, you should be able to reach a good local optimum relatively quickly.
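Putting the pieces together, here is a minimal sketch of this practical loop, reusing the hypothetical <code>grad_s</code> and <code>solve_A</code> helpers from earlier; the data, hyperparameters, fixed iteration counts and learning rate are all placeholder assumptions standing in for a real convergence test and a tuned optimizer:

<pre>
import numpy as np

np.random.seed(0)
patches = np.random.randn(192, 10000)        # placeholder data, one patch per column
n, k = patches.shape[0], 100                 # 100 features (hypothetical choice)
lam, gamma, eps, lr = 0.1, 0.01, 1e-2, 1e-3  # hypothetical hyperparameters

A = np.random.randn(n, k)                    # 1. initialize A randomly

for outer in range(200):                     # 2. repeat (fixed count instead of a convergence test)
    idx = np.random.choice(patches.shape[1], size=2000, replace=False)
    x = patches[:, idx]                      # 2.1 random mini-batch of 2000 patches

    s = A.T @ x                              # 2.2 initialize s <- A^T x ...
    s /= np.linalg.norm(A, axis=0)[:, None]  # ... and divide each feature (row) by ||A_r||

    for _ in range(50):                      # 2.3 gradient descent on J(A, s) with respect to s
        s -= lr * grad_s(A, s, x, lam, eps)

    A = solve_A(s, x, gamma)                 # 2.4 closed-form minimizer of J(A, s) with respect to A
</pre>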