Sparse Coding: Autoencoder Interpretation

To optimize the modified objective function <math>J(A, s) = \lVert As - x \rVert_2^2 + \lambda \lVert s \rVert_1 + \gamma \lVert A \rVert_2^2</math>, we alternate between the two sets of variables using the following iterative algorithm:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
   <ol>
     <li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
     <li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
   </ol>
</ol>
Observe that with our modified objective function, the objective function <math>J(A, s)</math> given <math>s</math>, that is <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>) is simply a quadratic term in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution would be to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective function given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
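
For reference, the closed-form solution follows directly: treating <math>x</math> and <math>s</math> as matrices whose columns hold the examples and their corresponding feature vectors, setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>\frac{\partial J(A; s)}{\partial A} = 2(As - x)s^T + 2\gamma A = 0 \quad\Longrightarrow\quad A = x s^T \left( s s^T + \gamma I \right)^{-1}.</math>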
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.
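
As a sanity check on those gradients, here is a minimal numpy sketch (an illustration, not the exercise's reference code). It assumes <math>x</math> and <math>s</math> store one example per column, uses placeholder values for <math>\lambda</math> and <math>\gamma</math>, and smooths the L1 term as <math>\sqrt{s^2 + \epsilon}</math> so that gradient descent on <math>s</math> is well-behaved:

<pre>
import numpy as np

def sparse_coding_cost_and_grads(A, s, x, lam=5e-5, gamma=1e-2, eps=1e-6):
    """J(A, s) = ||A s - x||^2 + lam * sum(sqrt(s^2 + eps)) + gamma * ||A||^2,
    together with its gradients; lam, gamma and eps are illustrative placeholders."""
    residual = A @ s - x                                 # reconstruction error
    smooth_l1 = np.sqrt(s ** 2 + eps)                    # smoothed |s|
    cost = np.sum(residual ** 2) + lam * np.sum(smooth_l1) + gamma * np.sum(A ** 2)
    grad_s = 2 * A.T @ residual + lam * s / smooth_l1    # used for the descent step on s
    grad_A = 2 * residual @ s.T + 2 * gamma * A          # setting this to zero gives the analytic A
    return cost, grad_s, grad_A
</pre>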
== Topographic sparse coding ==
The topographic sparse coding objective function can be optimized using the iterative method described in the earlier section. Topographic sparse coding will learn features similar to those learned by sparse coding, except that the features will now be "ordered" in some way.
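
As a rough illustration of where this "ordering" comes from, the topographic sparsity term can be computed with a grouping matrix <math>V</math> whose rows select overlapping neighbourhoods of features, so that neighbouring features are encouraged to be active together. The precise matrix form below is an assumption made only to illustrate the grouping idea:

<pre>
import numpy as np

def topographic_sparsity(s, V, eps=1e-6):
    """Grouped sparsity penalty: for each group, sqrt(sum of squared features in that group + eps).
    V is a (num_groups x num_features) 0/1 grouping matrix over neighbouring features;
    s holds one feature vector per column. Illustrative only."""
    return np.sum(np.sqrt(V @ (s ** 2) + eps))
</pre>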

== Sparse coding in practice ==

As suggested in the earlier sections, while the theory behind sparse coding is quite simple, writing a good implementation that actually works and converges reasonably quickly to a good optimum requires a bit of finesse.

Recall the simple iterative algorithm proposed earlier:
<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
  <ol>
    <li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
    <li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
  </ol>
</ol>

It turns out that running this algorithm out of the box will not produce very good results, if any results are produced at all. There are two main tricks to achieve faster and better convergence:

<ol>
<li>Batching examples into "mini-batches"
<li>Good initialization of <math>s</math>
</ol>
=== Batching examples into mini-batches ===

If you try running the simple iterative algorithm on a large dataset of, say, 10 000 patches in one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can instead run the algorithm on mini-batches. To do this, instead of running the algorithm on all 10 000 patches, in each iteration, select a mini-batch - a (different) random subset of 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch instead. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence (TODO: explain why).
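
A minimal way to draw such a mini-batch with numpy (the sizes mirror the 10 000 / 2000 figures above; <code>x_all</code> is a stand-in name for the full patch matrix):

<pre>
import numpy as np

num_patches = 10000                       # total number of patches
batch_size = 2000                         # patches per mini-batch
x_all = np.random.randn(64, num_patches)  # stand-in for the real data, one patch per column

# pick a (different) random subset of patches for this iteration
batch_idx = np.random.choice(num_patches, size=batch_size, replace=False)
x = x_all[:, batch_idx]
</pre>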
=== Good initialization of <math>s</math> ===

Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optimum is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is to set <math>s \leftarrow W^Tx</math>, where <math>x</math> is the matrix of patches in the mini-batch, and then normalize each column of <math>s</math> so that all the feature vectors have norm 1.
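
In code, this initialization might look as follows (a sketch: the text writes <math>W^Tx</math>, and under the autoencoder interpretation a natural choice for those encoding weights is the current basis <math>A</math>, which is an assumption here):

<pre>
import numpy as np

def init_features(A, x, eps=1e-8):
    """Initialize s for the current mini-batch as A^T x (the W^T x of the text, reading
    W as the current basis A - an assumption), then normalize each column of s so that
    every feature vector has unit norm."""
    s = A.T @ x
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)  # eps guards against division by zero
    return s
</pre>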
=== The practical algorithm ===

With the above two tricks, the algorithm for sparse coding then becomes:

<ol>
<li>Initialize <math>A</math> randomly
<li>Repeat until convergence
  <ol>
    <li>Select a random mini-batch of 2000 patches
    <li>Initialize <math>s</math> to <math>W^Tx</math>, and normalize each feature (each column of <math>s</math>) to unit norm
    <li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous step
    <li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
  </ol>
</ol>
With this method, you should be able to reach a good local optimum relatively quickly.
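
Putting the pieces together, here is a compact numpy sketch of this procedure (an illustration under the assumptions used above: a smoothed L1 penalty, placeholder hyperparameters and step size, and <math>s</math> initialized from the current basis <math>A</math>). The step for <math>s</math> uses plain gradient descent, and the step for <math>A</math> uses the closed-form solution derived earlier:

<pre>
import numpy as np

def sparse_coding(x_all, num_features=121, num_iters=200, batch_size=2000,
                  lam=5e-5, gamma=1e-2, eps=1e-6, s_steps=20, step_size=1e-2):
    """Alternating minimization of J(A, s) with mini-batches.
    x_all: (input_dim x num_patches) matrix, one patch per column.
    All hyperparameter values here are placeholders, not tuned settings."""
    input_dim, num_patches = x_all.shape
    A = np.random.randn(input_dim, num_features)

    for _ in range(num_iters):
        # select a random mini-batch of patches
        idx = np.random.choice(num_patches, size=batch_size, replace=False)
        x = x_all[:, idx]

        # initialize s from the current basis and normalize each column to unit norm
        s = A.T @ x
        s /= np.linalg.norm(s, axis=0, keepdims=True) + 1e-8

        # find an s that (approximately) minimizes J(A, s) for the current A
        for _ in range(s_steps):
            grad_s = 2 * A.T @ (A @ s - x) + lam * s / np.sqrt(s ** 2 + eps)
            s -= step_size * grad_s

        # solve for the A that minimizes J(A, s): A = x s^T (s s^T + gamma I)^{-1}
        A = x @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(num_features))

    return A
</pre>

In practice you might replace the explicit matrix inverse with a call to <code>np.linalg.solve</code> for numerical stability, but the structure of the loop is the same.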
