Sparse Coding: Autoencoder Interpretation
From Ufldl
== Sparse coding ==
In the sparse autoencoder, we learn a set of weights <math>W</math> (and associated biases <math>b</math>) in order to re-represent the input data <math>x</math> in terms of the sparse features <math>\sigma(Wx + b)</math>.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for the data directly. Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Concretely, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. We can hence construct the following objective function:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1
</math>

(<math>\lVert x \rVert_k</math> is the L<math>k</math> norm of <math>x</math>, equivalent to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of the absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse.
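
As a concrete illustration (not part of the original tutorial), the short Python/NumPy sketch below evaluates this objective for arbitrary values of <math>A</math>, <math>s</math> and <math>x</math>; the dimensions and the value of <math>\lambda</math> are purely illustrative assumptions.

<pre>
import numpy as np

# Illustrative sizes: 64-dimensional data, 100 features.
rng = np.random.default_rng(0)
x = rng.normal(size=64)          # one data example
A = rng.normal(size=(64, 100))   # basis: feature space -> data space
s = rng.normal(size=100)         # feature vector (would be sparse after optimization)
lam = 0.1                        # sparsity weight lambda (assumed value)

def objective(A, s, x, lam):
    reconstruction_error = np.sum((A @ s - x) ** 2)  # ||As - x||_2^2
    sparsity_penalty = lam * np.sum(np.abs(s))       # lambda * ||s||_1
    return reconstruction_error + sparsity_penalty

print(objective(A, s, x, lam))
</pre>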

However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that for every column <math>A_j</math> of <math>A</math>, <math>A_j^TA_j \le 1</math>. Our problem is then:

:<math>
\begin{array}{rl}
\mathrm{minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
\mathrm{s.t.} & A_j^TA_j \le 1 \; \forall j
\end{array}
</math>
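
To see the scaling degeneracy numerically, the following sketch (again illustrative, with assumed dimensions) scales <math>A</math> up by a constant and <math>s</math> down by the same constant: the reconstruction <math>As</math> is unchanged while the sparsity penalty shrinks, which is exactly what the column-norm constraint is meant to rule out.

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
A = rng.normal(size=(64, 100))
s = rng.normal(size=100)
lam = 0.1

c = 10.0                                                  # scale A up, s down by the same factor
print(np.allclose(A @ s, (c * A) @ (s / c)))              # True: reconstruction error unchanged
print(lam * np.abs(s).sum(), lam * np.abs(s / c).sum())   # sparsity cost is 10x smaller
</pre>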

Unfortunately, the objective function is non-convex, and hence cannot be optimized directly with gradient-based methods. However, given <math>A</math>, the problem of finding the <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding the <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we can solve for <math>A</math> and <math>s</math> by alternately fixing one and minimizing over the other, and this strategy turns out to work very well in practice.
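
As a side note, the <math>s</math>-step with <math>A</math> fixed is an ordinary L1-regularized least-squares (lasso) problem, so any off-the-shelf lasso solver can in principle be used for it. The sketch below uses scikit-learn's Lasso purely as an illustration; scikit-learn scales its objective by <math>1/(2n)</math>, so the alpha mapping shown is an assumption of this sketch rather than something stated in the tutorial.

<pre>
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 64, 100
x = rng.normal(size=n)
A = rng.normal(size=(n, k))
lam = 0.1

# sklearn's Lasso minimizes (1/(2n)) * ||x - As||_2^2 + alpha * ||s||_1,
# so alpha = lam / (2n) makes it proportional to ||As - x||_2^2 + lam * ||s||_1.
s_step = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
s_opt = s_step.fit(A, x).coef_
print(np.count_nonzero(s_opt), "nonzero entries out of", k)
</pre>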

However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
</math>

(note that the third term, <math>\lVert A \rVert_2^2</math>, is simply the sum of squares of the entries of <math>A</math>, equivalent to <math>\sum_r{\sum_c{A_{rc}^2}}</math>)
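
One quick way to convince yourself that the weight-decay term removes the scaling degeneracy: scaling <math>A</math> up still shrinks the L1 term, but it now inflates <math>\gamma \lVert A \rVert_2^2</math>, so the rescaled pair is no longer cheaper. A small numerical check (illustrative values only):

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
A = rng.normal(size=(64, 100))
s = rng.normal(size=100)
lam, gamma = 0.1, 1e-2           # illustrative values

def J(A, s):
    return np.sum((A @ s - x) ** 2) + lam * np.abs(s).sum() + gamma * np.sum(A ** 2)

c = 10.0
print(J(A, s), J(c * A, s / c))  # with weight decay, the rescaled pair costs more here
</pre>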

This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below.

Our final objective function is hence:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2
</math>

(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)
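
Putting the pieces together, a minimal sketch of this final smoothed objective might look as follows; the values of <math>\lambda</math>, <math>\gamma</math>, <math>\epsilon</math> and the dimensions are illustrative assumptions, not values prescribed by the tutorial.

<pre>
import numpy as np

def smoothed_objective(A, s, x, lam=0.1, gamma=1e-2, eps=1e-5):
    reconstruction = np.sum((A @ s - x) ** 2)        # ||As - x||_2^2
    sparsity = lam * np.sum(np.sqrt(s ** 2 + eps))   # lam * sum_k sqrt(s_k^2 + eps)
    weight_decay = gamma * np.sum(A ** 2)            # gamma * ||A||_2^2
    return reconstruction + sparsity + weight_decay

rng = np.random.default_rng(0)
x, A, s = rng.normal(size=64), rng.normal(size=(64, 100)), rng.normal(size=100)
print(smoothed_objective(A, s, x))
</pre>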

This objective function can then be optimized iteratively, using the following procedure:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence:
<ol>
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
</ol>
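
A minimal sketch of this alternating procedure is given below. It assumes the training examples are stacked as columns of a matrix, uses plain gradient descent with a conservative step size for the <math>s</math>-step, and uses the closed-form update for the <math>A</math>-step discussed in the next paragraph; the iteration counts and hyperparameter values are illustrative assumptions, not the tutorial's settings.

<pre>
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 64, 100, 200                 # data dim, number of features, number of examples (assumed)
X = rng.normal(size=(n, m))            # training examples stacked as columns
A = 0.1 * rng.normal(size=(n, k))      # step 1: initialize A randomly
S = np.zeros((k, m))
lam, gamma, eps = 0.1, 1e-2, 1e-5      # illustrative hyperparameters

def grad_S(A, S, X):
    # d/dS [ ||AS - X||^2 + lam * sum sqrt(S^2 + eps) ]
    return 2 * A.T @ (A @ S - X) + lam * S / np.sqrt(S ** 2 + eps)

for it in range(50):                   # step 2: repeat until (approximate) convergence
    # 2a: minimize over S with A fixed, by gradient descent; the step size is a
    # conservative choice based on a Lipschitz bound for the gradient
    lr = 1.0 / (2 * np.linalg.norm(A, 2) ** 2 + lam / np.sqrt(eps))
    for _ in range(100):
        S -= lr * grad_S(A, S, X)
    # 2b: minimize over A with S fixed; this subproblem is quadratic in A, with
    # the exact solution A = X S^T (S S^T + gamma I)^{-1} (see the next paragraph)
    A = np.linalg.solve(S @ S.T + gamma * np.eye(k), S @ X.T).T
</pre>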

Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
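
For reference, here is one way that matrix-calculus derivation can go (a sketch in the single-example notation used above; the same form holds when examples are stacked as columns of a matrix). Setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>
\nabla_A J(A; s) = 2(As - x)s^T + 2\gamma A = 0
\quad \Longrightarrow \quad
A = x s^T \left( s s^T + \gamma I \right)^{-1},
</math>

so the <math>A</math>-step amounts to solving a regularized linear system.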

In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.
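
For example, under the smoothed objective above, the gradients work out to the expressions in the sketch below (Python/NumPy, not the tutorial's code); a quick finite-difference check of the kind shown is a cheap way to catch derivation mistakes.

<pre>
import numpy as np

def gradients(A, s, x, lam=0.1, gamma=1e-2, eps=1e-5):
    """Gradients of J(A, s) = ||As - x||^2 + lam*sum(sqrt(s^2 + eps)) + gamma*||A||^2."""
    residual = A @ s - x
    grad_s = 2 * A.T @ residual + lam * s / np.sqrt(s ** 2 + eps)
    grad_A = 2 * np.outer(residual, s) + 2 * gamma * A
    return grad_A, grad_s

# Quick finite-difference check on one coordinate of s (illustrative sizes).
rng = np.random.default_rng(0)
x, A, s = rng.normal(size=8), rng.normal(size=(8, 12)), rng.normal(size=12)
J = lambda A, s: np.sum((A @ s - x) ** 2) + 0.1 * np.sum(np.sqrt(s ** 2 + 1e-5)) + 1e-2 * np.sum(A ** 2)
h = 1e-6
e0 = np.zeros_like(s); e0[0] = h
print((J(A, s + e0) - J(A, s - e0)) / (2 * h), gradients(A, s, x)[1][0])  # should agree closely
</pre>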