# Sparse Coding: Autoencoder Interpretation

Revision as of 07:08, 8 March 2013 by Kandeng (compared with the revision as of 07:01, 8 March 2013)

== Sparse coding ==

In the sparse autoencoder, we tried to learn a set of weights $W$ (and associated biases $b$) that would give us sparse features $\sigma(Wx + b)$ useful in reconstructing the input $x$.

[[File:STL_SparseAE.png | 240px]]

Sparse coding can be seen as a modification of the sparse autoencoder method in which we try to learn the set of features for some data "directly". Together with an associated basis for transforming the learned features from the feature space to the data space, we can then reconstruct the data from the learned features.

Formally, in sparse coding, we have some data $x$ we would like to learn features on. In particular, we would like to learn $s$, a set of sparse features useful for representing the data, and $A$, a basis for transforming the features from the feature space to the data space.
Our objective function is hence:

:$J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1$

(If you are unfamiliar with the notation, $\lVert x \rVert_k$ refers to the L$k$ norm of $x$, which is equal to $\left( \sum_i{ \left| x_i \right|^k } \right) ^{\frac{1}{k}}$. The L2 norm is the familiar Euclidean norm, while the L1 norm is the sum of absolute values of the elements of the vector.)

The first term is the error in reconstructing the data from the features using the basis, and the second term is a sparsity penalty term to encourage the learned features to be sparse.

However, the objective function as it stands is not properly constrained: it is possible to reduce the sparsity cost (the second term) by scaling $A$ by some constant and scaling $s$ by the inverse of the same constant, without changing the error. Hence, we include the additional constraint that for every column $A_j$ of $A$, $A_j^TA_j \le 1$.
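The scaling argument above can be checked numerically. The following is a minimal sketch, assuming NumPy, one example per column of `x` and `s`, and arbitrary illustrative sizes and values for `lam` and `c`; the function name `sparse_coding_cost` is hypothetical and not part of the original text.

```python
import numpy as np

def sparse_coding_cost(A, s, x, lam):
    """Return the two terms of J(A, s): reconstruction error and L1 penalty."""
    reconstruction = np.sum((A @ s - x) ** 2)   # ||As - x||_2^2
    sparsity = lam * np.sum(np.abs(s))          # lambda * ||s||_1
    return reconstruction, sparsity

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))    # 8-dimensional data, 5 examples (one per column)
A = rng.normal(size=(8, 12))   # 12 basis vectors (columns of A)
s = rng.normal(size=(12, 5))   # 12 features for each of the 5 examples

c = 10.0                       # illustrative scaling constant
err, pen = sparse_coding_cost(A, s, x, lam=0.1)
err_scaled, pen_scaled = sparse_coding_cost(c * A, s / c, x, lam=0.1)

print(err, err_scaled)   # the reconstruction error is unchanged
print(pen, pen_scaled)   # the sparsity penalty shrinks by a factor of c
```

Because $(cA)(s/c) = As$, only the penalty changes, which is why some bound on the size of $A$ is needed.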
Our problem is thus:

:$\begin{array}{rl}
{\rm minimize} & \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 \\
{\rm s.t.} & A_j^TA_j \le 1 \; \forall j \\
\end{array}$

Unfortunately, the objective function is non-convex, and hence difficult to optimize well using gradient-based methods. However, given $A$, the problem of finding $s$ that minimizes $J(A, s)$ is convex. Similarly, given $s$, the problem of finding $A$ that minimizes $J(A, s)$ is also convex. This suggests that we might try alternately optimizing for $A$ for a fixed $s$, and then optimizing for $s$ given a fixed $A$. It turns out that this works quite well in practice.

However, the form of our problem presents another difficulty: the constraint that $A_j^TA_j \le 1 \; \forall j$ cannot be enforced using simple gradient-based methods. Hence, in practice, this constraint is weakened to a "weight decay" term designed to keep the entries of $A$ small.
This gives us a new objective function:

:$J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2$

(note that the third term, $\lVert A \rVert_2^2$, is simply the sum of squares of the entries of $A$, or $\sum_r{\sum_c{A_{rc}^2}}$)

This objective function presents one last problem: the L1 norm is not differentiable at 0, and hence poses a problem for gradient-based methods. While the problem can be solved using other methods that are not based on gradient descent, we will "smooth out" the L1 norm using an approximation which will allow us to use gradient descent. To "smooth out" the L1 norm, we use $\sqrt{x^2 + \epsilon}$ in place of $\left| x \right|$, where $\epsilon$ is a "smoothing parameter" which can also be interpreted as a sort of "sparsity parameter" (to see this, observe that when $\epsilon$ is large compared to $x^2$, the sum $x^2 + \epsilon$ is dominated by $\epsilon$, and taking the square root yields approximately $\sqrt{\epsilon}$). This "smoothing" will come in handy later when considering topographic sparse coding below.
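To make the smoothing concrete, here is a small sketch of the smoothed absolute value $\sqrt{x^2 + \epsilon}$ and its derivative, assuming NumPy. The function names and the particular values of `epsilon` are illustrative assumptions, not taken from the original text.

```python
import numpy as np

def smooth_abs(x, epsilon):
    """Smoothed absolute value: sqrt(x^2 + epsilon)."""
    return np.sqrt(x ** 2 + epsilon)

def smooth_abs_grad(x, epsilon):
    """Derivative of sqrt(x^2 + epsilon); defined everywhere, including x = 0."""
    return x / np.sqrt(x ** 2 + epsilon)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

approx = smooth_abs(x, epsilon=1e-6)       # small epsilon: close to |x|
grad = smooth_abs_grad(x, epsilon=1e-6)    # close to sign(x), and exactly 0 at x = 0
large_eps = smooth_abs(x, epsilon=100.0)   # large epsilon: roughly sqrt(epsilon) everywhere

print(approx)
print(grad)
print(large_eps)
```

With a small `epsilon` the approximation tracks $\left| x \right|$ closely, while with a large `epsilon` every entry is pushed toward the same constant $\sqrt{\epsilon}$, so the penalty barely distinguishes between feature values.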
Our final objective function is hence:

:$J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sqrt{s^2 + \epsilon} + \gamma \lVert A \rVert_2^2$

(where $\sqrt{s^2 + \epsilon}$ is shorthand for $\sum_k{\sqrt{s_k^2 + \epsilon}}$)

This objective function can then be optimized iteratively, using the following procedure:

1. Initialize $A$ randomly
2. Repeat until convergence:
    1. Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
    2. Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

Observe that with our modified objective function, the objective function $J(A, s)$ given $s$, that is $J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2$ (the L1 term in $s$ can be omitted since it is not a function of $A$), is simply quadratic in $A$, and hence has an easily derivable analytic solution in $A$. A quick way to derive this solution would be to use matrix calculus; some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given $A$ does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
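As a sketch of the analytic solution mentioned above: setting the gradient of $J(A; s)$ with respect to $A$ to zero, $2(As - x)s^T + 2\gamma A = 0$, gives $A = x s^T (s s^T + \gamma I)^{-1}$. The code below assumes NumPy and treats $x$ and $s$ as matrices with one example per column; the function names are hypothetical, and the derivation is a standard ridge-regression-style result rather than something quoted from the original text.

```python
import numpy as np

def solve_basis(s, x, gamma):
    """Minimize ||As - x||^2 + gamma * ||A||^2 over A for fixed s."""
    k = s.shape[0]
    gram = s @ s.T + gamma * np.eye(k)      # (k, k), symmetric, invertible for gamma > 0
    # A @ gram = x @ s.T; solve the transposed system instead of forming an inverse.
    return np.linalg.solve(gram, s @ x.T).T

def basis_gradient(A, s, x, gamma):
    """Gradient of ||As - x||^2 + gamma * ||A||^2 with respect to A."""
    return 2.0 * (A @ s - x) @ s.T + 2.0 * gamma * A

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))     # data, one example per column
s = rng.normal(size=(12, 5))    # features, one example per column

A_opt = solve_basis(s, x, gamma=0.1)
residual = np.max(np.abs(basis_gradient(A_opt, s, x, gamma=0.1)))
print(residual)   # numerically zero at the analytic solution
```

Using `np.linalg.solve` on the symmetric system avoids computing an explicit matrix inverse.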
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of $A$) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[ Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using the backpropagation idea | using the backpropagation intuition]] can be helpful.

== Topographic sparse coding ==

With sparse coding, we can learn a set of features useful for representing the data. However, drawing inspiration from the brain, we would like to learn a set of features that are "orderly" in some manner. For instance, consider visual features. As suggested earlier, the V1 cortex of the brain contains neurons which detect edges at particular orientations.
However, these neurons are also organized into hypercolumns in which adjacent neurons detect edges at similar orientations. One neuron could detect a horizontal edge, its neighbors edges oriented slightly off the horizontal, and moving further along the hypercolumn, the neurons detect edges oriented further off the horizontal.

Inspired by this example, we would like to learn features which are similarly "topographically ordered". What does this imply for our learned features? Intuitively, if "adjacent" features are "similar", we would expect that if one feature is activated, its neighbors will also be activated to a lesser extent.

Concretely, suppose we (arbitrarily) organized our features into a square matrix. We would then like adjacent features in the matrix to be similar.
The way this is accomplished is to group these adjacent features together in the smoothed L1 penalty, so that instead of, say, $\sqrt{s_{1,1}^2 + \epsilon}$, we use $\sqrt{s_{1,1}^2 + s_{1,2}^2 + s_{1,3}^2 + s_{2,1}^2 + s_{2,2}^2 + s_{2,3}^2 + s_{3,1}^2 + s_{3,2}^2 + s_{3,3}^2 + \epsilon}$ if we group in 3x3 regions. The grouping is usually overlapping, so that the 3x3 region starting at the 1st row and 1st column is one group, the 3x3 region starting at the 1st row and 2nd column is another group, and so on. Further, the grouping is also usually done wrapping around, as if the matrix were a torus, so that every feature is counted an equal number of times.
Hence, in place of the smoothed L1 penalty, we use the sum of smoothed L1 penalties over all the groups, so our new objective function is:

:$J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_{\text{all groups } g}{\sqrt{ \left( \sum_{\text{all } s \in g}{s^2} \right) + \epsilon} } + \gamma \lVert A \rVert_2^2$

In practice, the "grouping" can be accomplished using a "grouping matrix" $V$, such that the $r$th row of $V$ indicates which features are grouped in the $r$th group, so $V_{r, c} = 1$ if group $r$ contains feature $c$. Thinking of the grouping as being achieved by a grouping matrix makes the computation of the gradients more intuitive. Using this grouping matrix, the objective function can be rewritten as:

:$J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum{ \sqrt{Vss^T + \epsilon} } + \gamma \lVert A \rVert_2^2$

(where $\sum{ \sqrt{Vss^T + \epsilon} }$ is $\sum_r{ \sum_c { D_{r, c} } }$ if we let $D = \sqrt{Vss^T + \epsilon}$)

This objective function can be optimized using the iterated method described in the earlier section.
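One possible way to build such a grouping matrix for overlapping, wrap-around 3x3 groups is sketched below, assuming NumPy and features laid out row by row on a `side` x `side` grid. The function names are hypothetical. The penalty is computed as `V @ (s * s)`, that is, $V$ applied to the element-wise squares of $s$ with one example per column; this is an interpretation of the grouped sum of squares in the objective, not a formula quoted from the original text.

```python
import numpy as np

def grouping_matrix(side, group=3):
    """V[r, c] = 1 if group r contains feature c, on a side x side torus."""
    n = side * side
    V = np.zeros((n, n))
    for row in range(side):
        for col in range(side):
            r = row * side + col                 # group whose top-left corner is (row, col)
            for dr in range(group):
                for dc in range(group):
                    rr = (row + dr) % side       # wrap around vertically
                    cc = (col + dc) % side       # wrap around horizontally
                    V[r, rr * side + cc] = 1.0
    return V

def topographic_penalty(V, s, epsilon):
    """Sum over groups and examples of sqrt(sum of squared group members + epsilon)."""
    return np.sum(np.sqrt(V @ (s * s) + epsilon))

V = grouping_matrix(side=5)
print(V.sum(axis=1))   # features per group
print(V.sum(axis=0))   # groups each feature belongs to
```

Because of the wrap-around, every row and every column of `V` sums to the same value (9 for 3x3 groups when `side` is at least 3), which matches the requirement that every feature is counted an equal number of times.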
Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.

== Sparse coding in practice ==

As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to good optima requires a bit of finesse.

Recall the simple iterative algorithm proposed earlier:

1. Initialize $A$ randomly
2. Repeat until convergence:
    1. Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
    2. Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence:

• Batching examples into mini-batches
• Good initialization of $s$

=== Batching examples into mini-batches ===

If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch (a different random subset of say 2000 patches from the 10 000 patches) and run the algorithm on that mini-batch for the iteration. This accomplishes two things: firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence ([[TODO]]: explain why).

=== Good initialization of $s$ ===
• For each feature in $s$ (i.e. each row of $s$), divide the feature by the norm of the corresponding basis vector in $A$. That is, if $s_{r, c}$ is the $r$th feature for the $c$th example, and $A_r$ is the $r$th basis vector in $A$, then set $s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }.$
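The normalization step can be sketched as follows, assuming NumPy, one example per column of `x`, and one basis vector per column of `A`, so that "the corresponding basis vector" for feature $r$ is read as the $r$th column of $A$. The first line of the function, the projection `A.T @ x`, is purely an assumption used as a stand-in for finding an $s$ that roughly reconstructs $x$; the function name is hypothetical.

```python
import numpy as np

def initialize_features(A, x):
    """Initialize s for a mini-batch x, given the current basis A."""
    s = A.T @ x                            # ASSUMPTION: simple projection as a starting guess
    norms = np.linalg.norm(A, axis=0)      # norm of each basis vector (column of A)
    return s / norms[:, np.newaxis]        # divide every entry of feature r by ||A_r||

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 12))   # 12 basis vectors of dimension 8
x = rng.normal(size=(8, 5))    # mini-batch of 5 examples

s0 = initialize_features(A, x)
print(s0.shape)   # one row per feature, one column per example
```

After this step, each row of `s0` is the projection of the data onto a unit-length version of the corresponding basis vector, so the scale of the features no longer depends on the scale of $A$.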

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good $s$ such that $As \approx x$, and the second step "normalizes" $s$ in an attempt to keep the sparsity penalty small. It turns out that initializing $s$ using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)

=== The practical algorithm ===

With the above two tricks, the algorithm for sparse coding becomes:

1. Initialize $A$ randomly
2. Repeat until convergence:
    1. Select a random mini-batch of 2000 patches
    2. Initialize $s$ as described above
    3. Find the $s$ that minimizes $J(A, s)$ for the $A$ found in the previous step
    4. Solve for the $A$ that minimizes $J(A, s)$ for the $s$ found in the previous step

With this method, you should be able to reach a good local optimum relatively quickly.
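The steps of the practical algorithm can be put together in a short sketch, assuming NumPy and the non-topographic smoothed objective $J(A, s) = \lVert As - x \rVert_2^2 + \lambda \sum_k \sqrt{s_k^2 + \epsilon} + \gamma \lVert A \rVert_2^2$. Everything beyond the step order is an assumption made for illustration: the function names, the hyperparameter values, the projection used as the first initialization step, the fixed number of gradient steps standing in for "minimize over $s$", and the step size chosen from a bound on the curvature.

```python
import numpy as np

def cost(A, s, x, lam, gamma, eps):
    """Smoothed sparse coding objective J(A, s)."""
    return (np.sum((A @ s - x) ** 2)
            + lam * np.sum(np.sqrt(s ** 2 + eps))
            + gamma * np.sum(A ** 2))

def init_s(A, x):
    """Initialize s: projection (an assumption), then divide by basis vector norms."""
    return (A.T @ x) / np.linalg.norm(A, axis=0)[:, np.newaxis]

def optimize_s(A, s, x, lam, eps, steps=50):
    """Gradient descent on s for fixed A, with step size 1/L (L bounds the curvature)."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + lam / np.sqrt(eps)
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ s - x) + lam * s / np.sqrt(s ** 2 + eps)
        s = s - grad / L
    return s

def optimize_A(s, x, gamma):
    """Analytic minimizer of ||As - x||^2 + gamma * ||A||^2 for fixed s."""
    gram = s @ s.T + gamma * np.eye(s.shape[0])
    return np.linalg.solve(gram, s @ x.T).T

def sparse_coding(patches, n_features, batch_size, n_iters,
                  lam=0.1, gamma=0.01, eps=1e-2, seed=0):
    """Alternate between s and A on a fresh random mini-batch each iteration."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(patches.shape[0], n_features))     # 1. initialize A randomly
    for _ in range(n_iters):                                 # 2. fixed count instead of a convergence test
        idx = rng.choice(patches.shape[1], size=batch_size, replace=False)
        x = patches[:, idx]                                  # 2.1 random mini-batch
        s = init_s(A, x)                                     # 2.2 initialize s
        s = optimize_s(A, s, x, lam, eps)                    # 2.3 improve s for fixed A
        A = optimize_A(s, x, gamma)                          # 2.4 solve for A for fixed s
    return A

patches = np.random.default_rng(1).normal(size=(16, 200))    # synthetic stand-in for image patches
A = sparse_coding(patches, n_features=20, batch_size=50, n_iters=5)
print(A.shape)
```

The loop runs for a fixed number of iterations rather than testing for convergence, and the synthetic `patches` array is only there to make the sketch executable; with real data, the mini-batch size and the hyperparameters would need tuning.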