# Sparse Coding: Autoencoder Interpretation

== Sparse coding ==

In the sparse autoencoder, we tried to learn a set of weights $W$ (and associated biases $b$) that would give us sparse features $\sigma(Wx + b)$ useful in reconstructing the input $x$.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Formally, in sparse coding, we have some data $x$ we would like to learn features on. In particular, we would like to learn $s$, a set of sparse features useful for representing the data, and $A$, a basis for transforming the features from the feature space to the data space. Our objective function is hence:

:<math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1</math>

(If you are unfamiliar with the notation, $\lVert x \rVert_k$ refers to the L$k$ norm of $x$, which is equal to $\left( \sum{ \left| x_i \right|^k } \right)^{\frac{1}{k}}$. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of the absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term that encourages the learned features to be sparse.

However, the objective function as it stands is not properly constrained - it is possible to reduce the sparsity cost (the second term) by scaling $A$ by some constant and scaling $s$ by the inverse of the same constant, without changing the reconstruction error. Hence, we include the additional constraint that for every column $A_j$ of $A$, $A_j^TA_j \le 1$. Our problem is thus:

:<math>\begin{array}{rl} \text{minimize}_{A,s} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\ \text{s.t.} & A_j^TA_j \le 1 \; \forall j \end{array}</math>
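The scaling issue behind this constraint is easy to check numerically. The sketch below (all names and dimensions are illustrative, not from the original text) evaluates the unconstrained objective and shows that rescaling $A$ and $s$ in opposite directions lowers the sparsity cost while leaving the reconstruction untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 16                      # data dim and number of features (illustrative)
A = rng.normal(size=(n, k))       # basis: feature space -> data space
s = rng.normal(size=k)            # feature vector
x = rng.normal(size=n)            # one data point
lam = 0.1                         # sparsity weight

def J(A, s, x, lam):
    # reconstruction error plus L1 sparsity penalty
    return np.sum((A @ s - x) ** 2) + lam * np.sum(np.abs(s))

c = 10.0
# Scaling A up by c and s down by c leaves the reconstruction unchanged
# but shrinks the L1 penalty, so the objective can be driven down for free:
assert np.allclose(A @ s, (c * A) @ (s / c))
assert J(c * A, s / c, x, lam) < J(A, s, x, lam)
```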
Unfortunately, the objective function is non-convex, and hence impossible to optimize well using gradient-based methods. However, given $A$, the problem of finding the $s$ that minimizes $J(A, s)$ is convex. Similarly, given $s$, the problem of finding the $A$ that minimizes $J(A, s)$ is also convex. This suggests alternately optimizing for $A$ with $s$ held fixed, and then for $s$ with $A$ held fixed. It turns out that this works quite well in practice.

However, the form of our problem presents another difficulty: the constraint that $A_j^TA_j \le 1 \; \forall j$ cannot be enforced using simple gradient-based methods. Hence, in practice, the constraint is weakened to a "weight decay" term designed to keep the entries of $A$ small. This gives us a new objective function:

:<math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2</math>

(Note that the third term, $\lVert A \rVert_2^2$, is simply the sum of squares of the entries of $A$, that is, $\sum_r{\sum_c{A_{rc}^2}}$.)

This objective function presents one last problem: the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem could be solved using other, non-gradient-descent-based methods, we will instead "smooth out" the L1 norm using an approximation that allows us to use gradient descent. To "smooth out" the L1 norm, we use $\sqrt{x^2 + \epsilon}$ in place of $\left| x \right|$, where $\epsilon$ is a "smoothing parameter" that can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when $\epsilon$ is large compared to $x^2$, the quantity $x^2 + \epsilon$ is dominated by $\epsilon$, and taking the square root yields approximately $\sqrt{\epsilon}$). This "smoothing" will come in handy later when considering topographic sparse coding below.

Our final objective function is hence:

:<math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2</math>

(where $\sqrt{s^2 + \epsilon}$ is shorthand for $\sum_k{\sqrt{s_k^2 + \epsilon}}$)
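As a quick sanity check on the smoothing, the sketch below (illustrative names, not part of the original text) compares the smoothed penalty with the true L1 norm and shows that its gradient is well defined at zero:

```python
import numpy as np

eps = 1e-2                        # smoothing / "sparsity" parameter

def smoothed_l1(s):
    # sum_k sqrt(s_k^2 + eps): a differentiable stand-in for sum_k |s_k|
    return np.sum(np.sqrt(s ** 2 + eps))

def smoothed_l1_grad(s):
    # d/ds_k sqrt(s_k^2 + eps) = s_k / sqrt(s_k^2 + eps), finite everywhere
    return s / np.sqrt(s ** 2 + eps)

s = np.array([-2.0, 0.0, 3.0])
# Close to the true L1 norm when the entries dwarf sqrt(eps)...
assert abs(smoothed_l1(s) - np.sum(np.abs(s))) < 0.2
# ...and, unlike |s_k|, differentiable at 0 (the gradient there is 0):
assert smoothed_l1_grad(s)[1] == 0.0
```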
This objective function can then be optimized iteratively, using the following procedure:

# Initialize $A$ randomly
# Repeat until convergence:
## Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
## Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

Observe that with our modified objective function, the objective given $s$, that is $J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2$ (the L1 term in $s$ can be omitted since it is not a function of $A$), is simply a quadratic in $A$, and hence has an easily derived analytic solution for $A$. A quick way to derive this solution is to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given $A$ does not have a similarly nice analytic solution, so that minimization step has to be carried out using gradient descent or a similar optimization method.

In theory, optimizing this objective function using the iterative method above should (eventually) yield features (the basis vectors of $A$) similar to those learned using the sparse autoencoder. However, in practice, quite a few tricks are required for good convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function can be slightly tricky as well; using matrix calculus or [[Deriving gradients using the backpropagation idea | the backpropagation intuition]] can be helpful.
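The alternating scheme can be sketched as follows. This is a minimal illustration, not the tutorial's reference implementation: the closed-form $A$-step comes from setting the gradient of the quadratic objective in $A$ to zero, giving $A = X S^T (S S^T + \gamma I)^{-1}$, and the $s$-step here is plain gradient descent on the smoothed objective; all dimensions and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 10, 20, 50               # data dim, #features, #examples (illustrative)
X = rng.normal(size=(n, m))        # data, one example per column
lam, gamma, eps = 0.1, 0.1, 1e-2   # sparsity, weight-decay, smoothing params

def J(A, S):
    # smoothed objective: reconstruction + smoothed L1 + weight decay
    return (np.sum((A @ S - X) ** 2)
            + lam * np.sum(np.sqrt(S ** 2 + eps))
            + gamma * np.sum(A ** 2))

def solve_A(S):
    # dJ/dA = 2(AS - X)S^T + 2*gamma*A = 0  =>  A = X S^T (S S^T + gamma I)^-1
    return X @ S.T @ np.linalg.inv(S @ S.T + gamma * np.eye(k))

def step_S(A, S, lr=1e-3):
    # one gradient-descent step on J with respect to S (A held fixed)
    grad = 2 * A.T @ (A @ S - X) + lam * S / np.sqrt(S ** 2 + eps)
    return S - lr * grad

A = rng.normal(size=(n, k))
S = rng.normal(size=(k, m))
j_start = J(A, S)
for _ in range(30):                # alternate: several S steps, then exact A step
    for _ in range(20):
        S = step_S(A, S)
    A = solve_A(S)
```

Each sweep first improves $S$ for the current basis, then solves exactly for the best $A$ given $S$, so the objective decreases from one sweep to the next.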
== Topographic sparse coding ==

With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would also like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations. These neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations: one neuron might detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and, moving further along the hypercolumn, the neurons detect edges oriented further and further off the horizontal.

Inspired by this example, we would like our learned features to be similarly "topographically ordered". What does this imply for the learned features? Intuitively, if "adjacent" features are "similar", we would expect that when one feature is activated, its neighbors will also be activated, to a lesser extent.

Concretely, suppose we (arbitrarily) organize our features into a square matrix. We would then like adjacent features in the matrix to be similar. This is accomplished by grouping adjacent features together in the smoothed L1 penalty: if we group in 3x3 regions, then instead of, say, $\sqrt{s_{1,1}^2 + \epsilon}$, we use, say, $\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{2,3}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}$. The groups are usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.

Hence, in place of the smoothed L1 penalty, we use the sum of the smoothed L1 penalties over all the groups, so our new objective function is:

:<math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2</math>

In practice, the "grouping" can be accomplished using a "grouping matrix" $V$, such that the $r$th row of $V$ indicates which features are grouped in the $r$th group: $V_{r, c} = 1$ if group $r$ contains feature $c$. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive.
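One way to build such a grouping matrix for the overlapping, wrap-around 3x3 groups described above is sketched below (the function name and grid size are illustrative assumptions, not from the original text):

```python
import numpy as np

def grouping_matrix(side, region=3):
    # One row of V per group: each group is a region x region window on a
    # side x side grid of features, wrapping around as if on a torus.
    k = side * side
    V = np.zeros((k, k))
    for r in range(side):
        for c in range(side):
            g = r * side + c                       # group index
            for dr in range(region):
                for dc in range(region):
                    f = ((r + dr) % side) * side + ((c + dc) % side)
                    V[g, f] = 1                    # group g contains feature f
    return V

V = grouping_matrix(5)                             # 25 features on a 5x5 grid
eps = 1e-2
s = np.random.default_rng(0).normal(size=25)
penalty = np.sum(np.sqrt(V @ (s * s) + eps))       # grouped smoothed-L1 term
```

Because of the wrap-around, every group contains exactly 9 features and every feature appears in exactly 9 groups, which is what makes each feature "counted an equal number of times".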
Using this grouping matrix, the objective function can be rewritten as:

:<math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2</math>

(where $\sum{ \sqrt{Vss^T + \epsilon} }$ is $\sum_r{ \sum_c { D_{r, c} } }$ if we let $D = \sqrt{Vss^T + \epsilon}$)

This objective function can be optimized using the iterative method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.

== Sparse coding in practice ==

As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.

Recall the simple iterative algorithm proposed earlier:
# Initialize $A$ randomly
# Repeat until convergence:
## Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
## Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence:

# Batching examples into "mini-batches"
# Good initialization of $s$

=== Batching examples into mini-batches ===

If you try running the simple iterative algorithm on a large dataset of, say, 10 000 patches in one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can run the algorithm on mini-batches instead: in each iteration, rather than using all 10 000 patches, select a mini-batch - a (different) random subset of, say, 2000 of the 10 000 patches - and run the iteration on that mini-batch. This accomplishes two things - firstly, it speeds up each iteration, since each iteration now operates on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence ([[TODO]]: explain why).

=== Good initialization of $s$ ===

Another important trick for obtaining faster and better convergence is good initialization of the feature matrix $s$ before using gradient descent (or another method) to optimize the objective function for $s$ given $A$. In practice, initializing $s$ randomly at each iteration can result in poor convergence unless a good optimum is found for $s$ before moving on to optimize for $A$. A better way to initialize $s$ is the following:

# Set $s \leftarrow W^Tx$ (where $x$ is the matrix of patches in the mini-batch)
# For each feature in $s$ (i.e. each column of $s$), divide the feature by the norm of the corresponding basis vector in $A$. That is, if $s_{r, c}$ is the $r$th feature for the $c$th example, and $A_c$ is the $c$th basis vector in $A$, then set $s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_c \rVert }$.
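A sketch of this initialization follows. Two assumptions are made here that are not pinned down by the text above: $A^TX$ is used as the stand-in for $W^Tx$, and step 2 is read as scaling each feature (each row of $S$, with examples as columns) by the norm of its corresponding basis vector (column of $A$).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 10, 20, 50              # data dim, #features, mini-batch size
A = rng.normal(size=(n, k))       # current basis
X = rng.normal(size=(n, m))       # mini-batch of patches, one per column

# Step 1: seed the features from the data; A^T X is used here as the
# stand-in for W^T x (an assumption of this sketch).
S = A.T @ X
# Step 2: scale each feature (row of S) by the norm of the
# corresponding basis vector (column of A).
S = S / np.linalg.norm(A, axis=0)[:, None]
```

A side effect of the normalization is that rescaling any basis vector leaves its initialized feature row unchanged, which is consistent with keeping the sparsity penalty small.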
Very roughly and informally speaking, this initialization helps because the first step attempts to find a good $s$ such that $Ws \approx x$, while the second step "normalizes" $s$ in an attempt to keep the sparsity penalty small. It turns out that initializing $s$ using only one of these steps, rather than both, results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)

=== The practical algorithm ===

With the above two tricks, the algorithm for sparse coding becomes:

# Initialize $A$ randomly
# Repeat until convergence:
## Select a random mini-batch of 2000 patches
## Initialize $s$ as described above
## Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
## Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

With this method, you should be able to reach a good local optimum relatively quickly.

== Chinese-English glossary ==

:稀疏编码 - sparse coding
:自编码 - autoencoder
:目标函数 - objective function
:稀疏代价 - sparsity cost
:反向传播 - backpropagation
:基于梯度的 - gradient-based
:非凸的 - non-convex
:权重衰变 - weight decay
:拓扑稀疏编码 - topographic sparse coding
:拓扑秩序 - topographically ordered
:平滑的一范数惩罚 - smoothed L1 penalty
:迷你块 - mini-batches
:收敛速度 - the rate of convergence
:梯度下降 - gradient descent
:局部最优解 - local optima

== Chinese translators ==

许超 (xuchaowill@gmail.com), 张睿卿 (zrqjennifer@gmail.com), 林锋 (xlfg@yeah.net)

{{Sparse_Autoencoder}}
{{Languages|Sparse_Coding:_Autoencoder_Interpretation|English}}