Sparse Coding: Autoencoder Interpretation
Observe that with our modified objective function, the objective given <math>s</math>, namely <math>J(A; s) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math> (the L1 term in <math>s</math> can be omitted since it is not a function of <math>A</math>), is simply quadratic in <math>A</math>, and hence has an easily derivable analytic solution in <math>A</math>. A quick way to derive this solution is to use matrix calculus - some pages about matrix calculus can be found in the [[Useful Links | useful links]] section. Unfortunately, the objective given <math>A</math> does not have a similarly nice analytic solution, so that minimization step will have to be carried out using gradient descent or similar optimization methods.
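Concretely, treating <math>x</math> and <math>s</math> as the matrices whose columns are the input examples and their corresponding feature vectors, a sketch of that derivation: setting the gradient of <math>J(A; s)</math> with respect to <math>A</math> to zero gives

:<math>
\nabla_A J(A; s) = 2(As - x)s^T + 2\gamma A = 0
\quad \Rightarrow \quad
A = x s^T \left( s s^T + \gamma I \right)^{-1}.
</math>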
In theory, optimizing for this objective function using the iterative method as above should (eventually) yield features (the basis vectors of <math>A</math>) similar to those learned using the sparse autoencoder. However, in practice, there are quite a few tricks required for better convergence of the algorithm, and these tricks are described in greater detail in the later section on [[Sparse Coding: Autoencoder Interpretation#Sparse coding in practice | sparse coding in practice]]. Deriving the gradients for the objective function may be slightly tricky as well, and using matrix calculus or [[Deriving gradients using backpropagation | using the backpropagation intuition]] can be helpful.
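For reference, writing <math>\lambda</math> for the weight on the L1 term (the symbol is an assumption here) and using the common smooth approximation <math>\sqrt{s^2 + \epsilon}</math> in place of the non-differentiable <math>\lVert s \rVert_1</math>, the gradient with respect to <math>s</math> works out to

:<math>
\nabla_s J(s; A) = 2A^T(As - x) + \lambda \frac{s}{\sqrt{s^2 + \epsilon}}
</math>

(elementwise in the second term), which can be fed directly to gradient descent or L-BFGS.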
== Topographic sparse coding ==
== Sparse coding in practice ==
=== Batching examples into mini-batches ===
If you try running the simple iterative algorithm on a large dataset of say 10 000 patches at one go, you will find that each iteration takes a long time, and the algorithm may hence take a long time to converge. To increase the rate of convergence, you can run the algorithm on mini-batches instead. To do this, instead of running the algorithm on all 10 000 patches, in each iteration select a mini-batch - a (different) random subset of 2000 patches from the 10 000 patches - and run the algorithm on that mini-batch for the iteration. This accomplishes two things - firstly, it speeds up each iteration, since now each iteration is operating on 2000 rather than 10 000 patches; secondly, and more importantly, it increases the rate of convergence ([[TODO]]: explain why).
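For instance, a mini-batch can be drawn each iteration along the following lines (a NumPy sketch; the array names and shapes are illustrative assumptions):

<pre>
import numpy as np

X = np.random.randn(64, 10000)   # stand-in dataset: 10 000 patches as columns

# Each iteration draws a fresh random subset of 2000 patch indices,
# so every iteration sees a different mini-batch.
idx = np.random.choice(X.shape[1], size=2000, replace=False)
X_batch = X[:, idx]              # the mini-batch used for this iteration
</pre>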
=== Good initialization of <math>s</math> ===
Another important trick in obtaining faster and better convergence is good initialization of the feature matrix <math>s</math> before using gradient descent (or other methods) to optimize for the objective function for <math>s</math> given <math>A</math>. In practice, initializing <math>s</math> randomly at each iteration can result in poor convergence unless a good optimum is found for <math>s</math> before moving on to optimize for <math>A</math>. A better way to initialize <math>s</math> is the following:
<ol>
<li>Set <math>s \leftarrow A^Tx</math> (where <math>x</math> is the matrix of patches in the mini-batch)
<li>For each feature in <math>s</math> (i.e. each row of <math>s</math>), divide the feature by the norm of the corresponding basis vector in <math>A</math>. That is, if <math>s_{r, c}</math> is the <math>r</math>th feature for the <math>c</math>th example, and <math>A_r</math> is the <math>r</math>th basis vector in <math>A</math>, then set <math>s_{r, c} \leftarrow \frac{ s_{r, c} } { \lVert A_r \rVert }.</math>
</ol>

Very roughly and informally speaking, this initialization helps because the first step is an attempt to find a good <math>s</math> such that <math>As \approx x</math>, and the second step "normalizes" <math>s</math> in an attempt to keep the sparsity penalty small. It turns out that initializing <math>s</math> using only one but not both steps results in poor performance in practice. ([[TODO]]: a better explanation for why this initialization helps?)
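As a concrete illustration, a minimal NumPy sketch of these two steps (assuming patches are stored as the columns of <math>x</math> and basis vectors as the columns of <math>A</math>; all names are illustrative):

<pre>
import numpy as np

def initialize_features(A, x):
    # Step 1: project the mini-batch onto the bases, s = A^T x.
    # Assumed shapes: A is (input_dim, num_features),
    # x is (input_dim, num_examples) with patches as columns.
    s = A.T @ x
    # Step 2: divide each feature (row r of s) by the norm of the
    # corresponding basis vector A_r, keeping the sparsity penalty small.
    s = s / np.linalg.norm(A, axis=0)[:, None]
    return s
</pre>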
=== The practical algorithm ===
<ol>
<li>Select a random mini-batch of 2000 patches
<li>Initialize <math>s</math> as described above
<li>Find the <math>s</math> that minimizes <math>J(A, s)</math> for the <math>A</math> found in the previous iteration
<li>Solve for the <math>A</math> that minimizes <math>J(A, s)</math> for the <math>s</math> found in the previous step
</ol>
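Putting the pieces together, a minimal end-to-end sketch of this loop in Python/NumPy (all sizes, weights, and names here are illustrative assumptions; the smooth <math>\sqrt{s^2 + \epsilon}</math> stand-in for the L1 penalty matches the gradient sketched earlier):

<pre>
import numpy as np
from scipy.optimize import minimize

# Illustrative sizes and weights; none of these values come from the text.
input_dim, num_features, batch_size = 64, 121, 2000
gamma, lam, eps = 1e-2, 1e-1, 1e-6

X = np.random.randn(input_dim, 10000)          # stand-in patch dataset
A = np.random.randn(input_dim, num_features)   # random initial basis matrix

for iteration in range(100):
    # Step 1: select a random mini-batch of patches (columns of X).
    idx = np.random.choice(X.shape[1], size=batch_size, replace=False)
    Xb = X[:, idx]

    # Step 2: initialize s as described above.
    s = (A.T @ Xb) / np.linalg.norm(A, axis=0)[:, None]

    # Step 3: minimize J over s with A fixed, via L-BFGS with an analytic
    # gradient. The gamma*||A||^2 term is constant here and omitted.
    def J_s(s_flat):
        S = s_flat.reshape(num_features, batch_size)
        R = A @ S - Xb
        smooth = np.sqrt(S ** 2 + eps)          # smooth surrogate for |S|
        value = (R ** 2).sum() + lam * smooth.sum()
        grad = 2.0 * A.T @ R + lam * S / smooth
        return value, grad.ravel()

    res = minimize(J_s, s.ravel(), jac=True, method="L-BFGS-B",
                   options={"maxiter": 50})
    s = res.x.reshape(num_features, batch_size)

    # Step 4: solve for A analytically: A = x s^T (s s^T + gamma I)^{-1}.
    A = Xb @ s.T @ np.linalg.inv(s @ s.T + gamma * np.eye(num_features))
</pre>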