Sparse Coding: Autoencoder Interpretation
Observe that with our modified objective function, the objective given <math>s</math>, namely <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>), is simply quadratic in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution is to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
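Concretely, treating <math>x</math> and <math>s</math> as the matrices whose columns are the input examples and their corresponding feature vectors, a sketch of that derivation: setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>
\nabla_A J(A; s) = 2(As - x)s^T + 2\gamma A = 0
\quad \Rightarrow \quad
A = x s^T \left( s s^T + \gamma I \right)^{-1}.
</math>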
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.
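For reference, writing <math>\lambda</math> for the weight on the L1 term (the symbol is an assumption here) and using the common smooth approximation <math>\sqrt{s^2 + \epsilon}</math> in place of the non-differentiable <math>\lVert s \rVert_1</math>, the gradient with respect to <math>s</math> works out to

:<math>
\nabla_s J(s; A) = 2A^T(As - x) + \lambda \frac{s}{\sqrt{s^2 + \epsilon}}
</math>

(elementwise in the second term), which can be fed directly to gradient descent or L-BFGS.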
== Topographic sparse coding ==
== Sparse coding in practice ==
=== Batching examples into mini-batches ===
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration select a mini-batch - a (different) random subset of 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence ([[TODO]]: explain why).
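For instance, a mini-batch can be drawn each iteration along the following lines (a NumPy sketch; the array names and shapes are illustrative assumptions):

<pre>
import numpy as np

X = np.random.randn(64, 10000)   # stand-in dataset: 10 000 patches as columns

# Each iteration draws a fresh random subset of 2000 patch indices,
# so every iteration sees a different mini-batch.
idx = np.random.choice(X.shape[1], size=2000, replace=False)
X_batch = X[:, idx]              # the mini-batch used for this iteration
</pre>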
=== Good initialization of <math>s</math> ===
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optimum is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:
<ol>
<li>Set <math>s \leftarrow A^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)
<li>For each feature in <math>s</math> (i.e. each row of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_r</math> is the <math>r</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }.</math>
</ol>

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>As \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)
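As a concrete illustration, a minimal NumPy sketch of these two steps (assuming patches are stored as the columns of <math>x</math> and basis vectors as the columns of <math>A</math>; all names are illustrative):

<pre>
import numpy as np

def initialize_features(A, x):
    # Step 1: project the mini-batch onto the bases, s = A^T x.
    # Assumed shapes: A is (input_dim, num_features),
    # x is (input_dim, num_examples) with patches as columns.
    s = A.T @ x
    # Step 2: divide each feature (row r of s) by the norm of the
    # corresponding basis vector A_r, keeping the sparsity penalty small.
    s = s / np.linalg.norm(A, axis=0)[:, None]
    return s
</pre>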
=== The practical algorithm ===
<ol>
<li>Select a random mini-batch of 2000 patches
<li>Initialize <math>s</math> as described above
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous iteration
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
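Putting the pieces together, a minimal end-to-end sketch of this loop in Python/NumPy (all sizes, weights, and names here are illustrative assumptions; the smooth <math>\sqrt{s^2 + \epsilon}</math> stand-in for the L1 penalty matches the gradient sketched earlier):

<pre>
import numpy as np
from scipy.optimize import minimize

# Illustrative sizes and weights; none of these values come from the text.
input_dim, num_features, batch_size = 64, 121, 2000
gamma, lam, eps = 1e-2, 1e-1, 1e-6

X = np.random.randn(input_dim, 10000)          # stand-in patch dataset
A = np.random.randn(input_dim, num_features)   # random initial basis matrix

for iteration in range(100):
    # Step 1: select a random mini-batch of patches (columns of X).
    idx = np.random.choice(X.shape[1], size=batch_size, replace=False)
    Xb = X[:, idx]

    # Step 2: initialize s as described above.
    s = (A.T @ Xb) / np.linalg.norm(A, axis=0)[:, None]

    # Step 3: minimize J over s with A fixed, via L-BFGS with an analytic
    # gradient. The gamma*||A||^2 term is constant here and omitted.
    def J_s(s_flat):
        S = s_flat.reshape(num_features, batch_size)
        R = A @ S - Xb
        smooth = np.sqrt(S ** 2 + eps)          # smooth surrogate for |S|
        value = (R ** 2).sum() + lam * smooth.sum()
        grad = 2.0 * A.T @ R + lam * S / smooth
        return value, grad.ravel()

    res = minimize(J_s, s.ravel(), jac=True, method="L-BFGS-B",
                   options={"maxiter": 50})
    s = res.x.reshape(num_features, batch_size)

    # Step 4: solve for A analytically: A = x s^T (s s^T + gamma I)^{-1}.
    A = Xb @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(num_features))
</pre>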