Sparse Coding: Autoencoder Interpretation

=== Good initialization of $s$ ===

Another important trick in obtaining faster and better convergence is good initialization of the feature matrix $s$ before using gradient descent (or other methods) to optimize the objective function for $s$ given $A$. In practice, initializing $s$ randomly at each iteration can result in poor convergence unless a good optimum is found for $s$ before moving on to optimize for $A$. A better way to initialize $s$ is the following:

1. Set $s \leftarrow W^Tx$ (where $x$ is the matrix representation of the patches in the mini-batch)
2. Normalize $s$: divide each feature in $s$ (i.e. each row of $s$) by the norm of the corresponding basis vector in $A$. That is, if $s_{r, c}$ denotes the $r$-th feature of the $c$-th example, and $A_r$ denotes the $r$-th basis vector in $A$, then set $s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }.$

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good $s$ such that $Ws \approx x$, and the second step "normalizes" $s$ in an attempt to keep the sparsity penalty small. It turns out that initializing $s$ using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)
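The two initialization steps above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the tutorial's own code: the shapes and names (`A`, `x`, `num_features`, etc.) are assumptions, and it takes $W$ to be the basis matrix $A$ itself, as suggested by the $Ws \approx x$ remark above.

```python
import numpy as np

# Assumed shapes (hypothetical, for illustration only):
#   A: (input_dim, num_features)  -- basis matrix, column A_r is the r-th basis vector
#   x: (input_dim, num_examples)  -- mini-batch of patches, one column per patch
rng = np.random.default_rng(0)
input_dim, num_features, num_examples = 64, 25, 100
A = rng.standard_normal((input_dim, num_features))
x = rng.standard_normal((input_dim, num_examples))

# Step 1: s <- W^T x, taking W = A (project each patch onto the basis vectors)
s = A.T @ x                              # shape: (num_features, num_examples)

# Step 2: divide each feature (row r of s) by the norm of the r-th basis vector A_r
basis_norms = np.linalg.norm(A, axis=0)  # ||A_r|| for each of the num_features columns
s = s / basis_norms[:, None]

print(s.shape)  # (25, 100)
```

Row $r$ of `s` now holds the $r$-th feature across all examples, scaled down when the corresponding basis vector is large, which is what keeps the sparsity penalty from starting out large.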