Sparse Coding: Autoencoder Interpretation
From Ufldl
== Sparse coding ==
In the sparse autoencoder, we learn a set of weights <math>W</math> (and associated biases <math>b</math>) in order to re-represent the input data <math>x</math> in terms of the sparse features <math>\sigma(Wx + b)</math>.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for the data directly. Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Concretely, in sparse coding, we have some data <math>x</math> we would like to learn features on. In particular, we would like to learn <math>s</math>, a set of sparse features useful for representing the data, and <math>A</math>, a basis for transforming the features from the feature space to the data space. We can hence construct the following objective function:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1
</math>

(<math>\lVert x \rVert_k</math> is the L<math>k</math> norm of <math>x</math>, equivalent to <math>\left( \sum{ \left| x_i^k \right| } \right) ^{\frac{1}{k}}</math>. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of the absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse.
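
As a concrete illustration (not part of the original tutorial), the short Python/NumPy sketch below evaluates this objective for arbitrary values of <math>A</math>, <math>s</math> and <math>x</math>; the dimensions and the value of <math>\lambda</math> are purely illustrative assumptions.

<pre>
import numpy as np

# Illustrative sizes: 64-dimensional data, 100 features.
rng = np.random.default_rng(0)
x = rng.normal(size=64)          # one data example
A = rng.normal(size=(64, 100))   # basis: feature space -> data space
s = rng.normal(size=100)         # feature vector (would be sparse after optimization)
lam = 0.1                        # sparsity weight lambda (assumed value)

def objective(A, s, x, lam):
    reconstruction_error = np.sum((A @ s - x) ** 2)  # ||As - x||_2^2
    sparsity_penalty = lam * np.sum(np.abs(s))       # lambda * ||s||_1
    return reconstruction_error + sparsity_penalty

print(objective(A, s, x, lam))
</pre>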

However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling <math>A</math> by some constant and scaling <math>s</math> by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that for every column <math>A_j</math> of <math>A</math>, <math>A_j^TA_j \le 1</math>. Our problem is then:

:<math>
\begin{array}{rl}
\mathrm{minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
\mathrm{s.t.} & A_j^TA_j \le 1 \; \forall j
\end{array}
</math>
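
To see the scaling degeneracy numerically, the following sketch (again illustrative, with assumed dimensions) scales <math>A</math> up by a constant and <math>s</math> down by the same constant: the reconstruction <math>As</math> is unchanged while the sparsity penalty shrinks, which is exactly what the column-norm constraint is meant to rule out.

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
A = rng.normal(size=(64, 100))
s = rng.normal(size=100)
lam = 0.1

c = 10.0                                                  # scale A up, s down by the same factor
print(np.allclose(A @ s, (c * A) @ (s / c)))              # True: reconstruction error unchanged
print(lam * np.abs(s).sum(), lam * np.abs(s / c).sum())   # sparsity cost is 10x smaller
</pre>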

Unfortunately, the objective function is non-convex, and hence cannot be optimized directly with gradient-based methods. However, given <math>A</math>, the problem of finding the <math>s</math> that minimizes <math>J(A, s)</math> is convex. Similarly, given <math>s</math>, the problem of finding the <math>A</math> that minimizes <math>J(A, s)</math> is also convex. This suggests that we can solve for <math>A</math> and <math>s</math> by alternately fixing one and minimizing over the other, and this strategy turns out to work very well in practice.
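
As a side note, the <math>s</math>-step with <math>A</math> fixed is an ordinary L1-regularized least-squares (lasso) problem, so any off-the-shelf lasso solver can in principle be used for it. The sketch below uses scikit-learn's Lasso purely as an illustration; scikit-learn scales its objective by <math>1/(2n)</math>, so the alpha mapping shown is an assumption of this sketch rather than something stated in the tutorial.

<pre>
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 64, 100
x = rng.normal(size=n)
A = rng.normal(size=(n, k))
lam = 0.1

# sklearn's Lasso minimizes (1/(2n)) * ||x - As||_2^2 + alpha * ||s||_1,
# so alpha = lam / (2n) makes it proportional to ||As - x||_2^2 + lam * ||s||_1.
s_step = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=10000)
s_opt = s_step.fit(A, x).coef_
print(np.count_nonzero(s_opt), "nonzero entries out of", k)
</pre>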

However, the form of our problem presents another difficulty - the constraint that <math>A_j^TA_j \le 1 \; \forall j</math> cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of <math>A</math> small. This gives us a new objective function:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2
</math>

(note that the third term, <math>\lVert A \rVert_2^2</math>, is simply the sum of squares of the entries of <math>A</math>, equivalent to <math>\sum_r{\sum_c{A_{rc}^2}}</math>)
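
One quick way to convince yourself that the weight-decay term removes the scaling degeneracy: scaling <math>A</math> up still shrinks the L1 term, but it now inflates <math>\gamma \lVert A \rVert_2^2</math>, so the rescaled pair is no longer cheaper. A small numerical check (illustrative values only):

<pre>
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
A = rng.normal(size=(64, 100))
s = rng.normal(size=100)
lam, gamma = 0.1, 1e-2           # illustrative values

def J(A, s):
    return np.sum((A @ s - x) ** 2) + lam * np.abs(s).sum() + gamma * np.sum(A ** 2)

c = 10.0
print(J(A, s), J(c * A, s / c))  # with weight decay, the rescaled pair costs more here
</pre>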

This objective function presents one last problem - the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other non-gradient descent-based methods, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use <math>\sqrt{x + \epsilon}</math> in place of <math>\left| x \right|</math>, where <math>\epsilon</math> is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when <math>\epsilon</math> is large compared to <math>x</math>, <math>x + \epsilon</math> is dominated by <math>\epsilon</math>, and taking the square root yields approximately <math>\sqrt{\epsilon}</math>). This "smoothing" will come in handy later when considering topographic sparse coding below.

Our final objective function is hence:

:<math>
J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2
</math>

(where <math>\sqrt{s^2 + \epsilon}</math> is shorthand for <math>\sum_k{\sqrt{s_k^2 + \epsilon}}</math>)
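
Putting the pieces together, a minimal sketch of this final smoothed objective might look as follows; the values of <math>\lambda</math>, <math>\gamma</math>, <math>\epsilon</math> and the dimensions are illustrative assumptions, not values prescribed by the tutorial.

<pre>
import numpy as np

def smoothed_objective(A, s, x, lam=0.1, gamma=1e-2, eps=1e-5):
    reconstruction = np.sum((A @ s - x) ** 2)        # ||As - x||_2^2
    sparsity = lam * np.sum(np.sqrt(s ** 2 + eps))   # lam * sum_k sqrt(s_k^2 + eps)
    weight_decay = gamma * np.sum(A ** 2)            # gamma * ||A||_2^2
    return reconstruction + sparsity + weight_decay

rng = np.random.default_rng(0)
x, A, s = rng.normal(size=64), rng.normal(size=(64, 100)), rng.normal(size=100)
print(smoothed_objective(A, s, x))
</pre>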

This objective function can then be optimized iteratively, using the following procedure:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence:
<ol>
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
</ol>
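
A minimal sketch of this alternating procedure is given below. It assumes the training examples are stacked as columns of a matrix, uses plain gradient descent with a conservative step size for the <math>s</math>-step, and uses the closed-form update for the <math>A</math>-step discussed in the next paragraph; the iteration counts and hyperparameter values are illustrative assumptions, not the tutorial's settings.

<pre>
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 64, 100, 200                 # data dim, number of features, number of examples (assumed)
X = rng.normal(size=(n, m))            # training examples stacked as columns
A = 0.1 * rng.normal(size=(n, k))      # step 1: initialize A randomly
S = np.zeros((k, m))
lam, gamma, eps = 0.1, 1e-2, 1e-5      # illustrative hyperparameters

def grad_S(A, S, X):
    # d/dS [ ||AS - X||^2 + lam * sum sqrt(S^2 + eps) ]
    return 2 * A.T @ (A @ S - X) + lam * S / np.sqrt(S ** 2 + eps)

for it in range(50):                   # step 2: repeat until (approximate) convergence
    # 2a: minimize over S with A fixed, by gradient descent; the step size is a
    # conservative choice based on a Lipschitz bound for the gradient
    lr = 1.0 / (2 * np.linalg.norm(A, 2) ** 2 + lam / np.sqrt(eps))
    for _ in range(100):
        S -= lr * grad_S(A, S, X)
    # 2b: minimize over A with S fixed; this subproblem is quadratic in A, with
    # the exact solution A = X S^T (S S^T + gamma I)^{-1} (see the next paragraph)
    A = np.linalg.solve(S @ S.T + gamma * np.eye(k), S @ X.T).T
</pre>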

Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
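
For reference, here is one way that matrix-calculus derivation can go (a sketch in the single-example notation used above; the same form holds when examples are stacked as columns of a matrix). Setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>
\nabla_A J(A; s) = 2(As - x)s^T + 2\gamma A = 0
\quad \Longrightarrow \quad
A = x s^T \left( s s^T + \gamma I \right)^{-1},
</math>

so the <math>A</math>-step amounts to solving a regularized linear system.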

In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.
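
For example, under the smoothed objective above, the gradients work out to the expressions in the sketch below (Python/NumPy, not the tutorial's code); a quick finite-difference check of the kind shown is a cheap way to catch derivation mistakes.

<pre>
import numpy as np

def gradients(A, s, x, lam=0.1, gamma=1e-2, eps=1e-5):
    """Gradients of J(A, s) = ||As - x||^2 + lam*sum(sqrt(s^2 + eps)) + gamma*||A||^2."""
    residual = A @ s - x
    grad_s = 2 * A.T @ residual + lam * s / np.sqrt(s ** 2 + eps)
    grad_A = 2 * np.outer(residual, s) + 2 * gamma * A
    return grad_A, grad_s

# Quick finite-difference check on one coordinate of s (illustrative sizes).
rng = np.random.default_rng(0)
x, A, s = rng.normal(size=8), rng.normal(size=(8, 12)), rng.normal(size=12)
J = lambda A, s: np.sum((A @ s - x) ** 2) + 0.1 * np.sum(np.sqrt(s ** 2 + 1e-5)) + 1e-2 * np.sum(A ** 2)
h = 1e-6
e0 = np.zeros_like(s); e0[0] = h
print((J(A, s + e0) - J(A, s - e0)) / (2 * h), gradients(A, s, x)[1][0])  # should agree closely
</pre>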