Sparse Coding

Revision as of 11:29, 21 March 2011 (view source)

Zhenghao (Talk | contribs)

← Older edit

Revision as of 13:30, 21 March 2011 (view source)

Zhenghao (Talk | contribs)

Newer edit →

Line 1:

-

== ~~Background and Motivation~~ ==

+

== Sparse Coding ==

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:

Line 16:

\end{align}</math>

-

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math>.

+

where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.

+

Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math>.
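The three sparsity costs mentioned above can be compared on a small example. The following is a minimal sketch, not part of the original article: the function names and the two coefficient vectors are hypothetical, chosen so that both vectors have the same Euclidean norm.

```python
import numpy as np

def l0_penalty(a):
    """Number of non-zero coefficients (the "L0" measure of sparsity)."""
    return float(np.count_nonzero(a))

def l1_penalty(a):
    """Sum of absolute values of the coefficients."""
    return float(np.sum(np.abs(a)))

def log_penalty(a):
    """Sum of log(1 + a_i^2) over the coefficients."""
    return float(np.sum(np.log1p(a ** 2)))

# Hypothetical coefficient vectors with the same Euclidean norm (3.0).
a_sparse = np.array([0.0, 0.0, 3.0, 0.0])
a_dense = np.array([1.5, 1.5, 1.5, 1.5])

for name, penalty in [("L0", l0_penalty), ("L1", l1_penalty), ("log", log_penalty)]:
    print(name, penalty(a_sparse), penalty(a_dense))
```

In this example all three costs assign the smaller value to the vector with a single active coefficient, although only the <math>L_1</math> and log penalties vary continuously with <math>a_i</math>.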

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is

Line 34:

Line 36:

\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})

\end{align}</math>

-

~~We thus wish~~ to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>.

+

Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:

+

:<math>\begin{align}

+

D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}

+

\end{align}</math>

+

Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
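Expanding the logarithm makes this equivalence explicit. In the short derivation below, <math>H(P^*)</math> denotes the entropy of the empirical distribution (a symbol introduced here for illustration) and <math><.></math> denotes expectation over the input data:

:<math>\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) &= \int P^*(\mathbf{x}) \log P^*(\mathbf{x}) d\mathbf{x} - \int P^*(\mathbf{x}) \log P(\mathbf{x}\mid\mathbf{\phi}) d\mathbf{x} \\
&= -H(P^*) - \left< \log P(\mathbf{x}\mid\mathbf{\phi}) \right>
\end{align}</math>

The first term does not depend on <math>\mathbf{\phi}</math>, so minimizing the divergence amounts to maximizing the second expectation.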

+

Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that

:<math>\begin{align}

-

P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2})

+

P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)

\end{align}</math>

In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as

+

:<math>\begin{align}

-

P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2})

+

P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)

\end{align}</math>

+

At this point, we would like to incorporate our sparsity assumption -- the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is

+

:<math>\begin{align}

+

P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))

+

\end{align}</math>

+

where <math>S(a_i)</math> is a function determining the shape of the prior distribution.

+

Having defined <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> and <math> P(\mathbf{a})</math>, we can write the probability of the data <math>\mathbf{x}</math> under the model defined by <math>\mathbf{\phi}</math> as

+

:<math>\begin{align}

+

P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}

+

\end{align}</math>

+

and our problem reduces to finding

+

:<math>\begin{align}

+

\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >

+

\end{align}</math>

+

where <math><.></math> denotes expectation over our input data.

+

Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the integrand <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate the integral with the maximum value of the integrand and obtain an approximate solution

+

:<math>\begin{align}

+

\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a})) >

+

\end{align}</math>

+

As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero); we therefore impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.

+

Finally, we can recover our original cost function by defining the energy function of this linear generative model

+

:<math>\begin{array}{rl}

+

E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a})P(\mathbf{a}) \right) \\

+

&= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

+

\end{array}</math>

+

where <math>\lambda = 2\sigma^2\beta</math>, the energy has been rescaled by the positive factor <math>2\sigma^2</math>, and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:

+

:<math>\begin{align}

+

\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)

+

\end{align}</math>
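The relation <math>\lambda = 2\sigma^2\beta</math> can be checked by writing out the negative log of the two factors for a single example. This is a short worked step under the Gaussian noise model and the prior <math>P(a_i) \propto \exp(-\beta S(a_i))</math> defined earlier, with all normalization terms collected into "const":

:<math>\begin{align}
-\log \left( P(\mathbf{x}\mid \mathbf{a},\mathbf{\phi})P(\mathbf{a}) \right) &= \frac{1}{2\sigma^2} \left|\left| \mathbf{x} - \sum_{i=1}^k a_i \mathbf{\phi}_{i}\right|\right|^{2} + \beta \sum_{i=1}^{k}S(a_i) + \text{const} \\
&= \frac{1}{2\sigma^2} \left( \left|\left| \mathbf{x} - \sum_{i=1}^k a_i \mathbf{\phi}_{i}\right|\right|^{2} + 2\sigma^2\beta \sum_{i=1}^{k}S(a_i) \right) + \text{const}
\end{align}</math>

Dropping the positive factor <math>\frac{1}{2\sigma^2}</math> and the constant does not change the minimizer, which leaves the reconstruction term plus <math>\lambda \sum_{i} S(a_i)</math>.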

+

Using a probabilistic approach, it can also be seen that the choices of the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math> for <math>S(.)</math> correspond to the use of the Laplacian prior (<math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math>) and a Cauchy-type prior (<math>P(a_i) \propto (1+a_i^2)^{-\beta}</math>, which is the Cauchy density <math>\frac{1}{1+a_i^2}</math> up to normalization when <math>\beta = 1</math>) respectively.

+

== Learning ==

+

Learning a set of basis vectors <math>\mathbf{\phi}</math> using sparse coding consists of performing two separate optimizations, the first being an optimization over coefficients <math>a_i</math> for each training example <math>\mathbf{x}</math> and the second an optimization over basis vectors <math>\mathbf{\phi}</math> across many training examples at once.
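Both optimizations minimize the same cost, which can be evaluated as follows. This is a minimal sketch assuming an <math>L_1</math> penalty, with the sum over examples applied to both the reconstruction and the sparsity term. The matrix layout (examples as columns of `X`, basis vectors as columns of `Phi`, coefficients as columns of `A`) and the function name are assumptions made for illustration.

```python
import numpy as np

def sparse_coding_cost(X, Phi, A, lam):
    """Reconstruction error plus L1 sparsity penalty, summed over all examples.

    X:   (n, m) array, one training example x^(j) per column
    Phi: (n, k) array, one basis vector phi_i per column
    A:   (k, m) array, coefficients a^(j) per column
    lam: weight of the sparsity penalty (lambda in the text)
    """
    residual = X - Phi @ A
    return float(np.sum(residual ** 2) + lam * np.sum(np.abs(A)))

# Hypothetical toy data: two examples reconstructed exactly by an identity basis.
Phi = np.eye(2)
A = np.array([[1.0, -2.0],
              [0.0, 3.0]])
X = Phi @ A
print(sparse_coding_cost(X, Phi, A, lam=0.5))
```

With `A` held fixed the cost is a function of `Phi` only, and with `Phi` held fixed it separates into one independent problem per column of `A`.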

+

Assuming an <math>L_1</math> sparsity penalty, learning <math>a^{(j)}_i</math> reduces to solving an <math>L_1</math> regularized least squares problem, which is convex in <math>a^{(j)}_i</math> and for which several techniques have been developed (convex optimization software such as CVX can also be used to perform <math>L_1</math> regularized least squares). Assuming a differentiable <math>S(.)</math> such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.
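One simple way to solve the <math>L_1</math> regularized least squares problem for a single example is iterative soft-thresholding (ISTA). The sketch below is one possible approach, not the method prescribed by this article; the function names, iteration count, and toy data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, setting small entries to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, Phi, lam, n_iter=500):
    """Approximately minimize ||x - Phi a||^2 + lam * ||a||_1 over a.

    x:   (n,) input vector
    Phi: (n, k) array with one basis vector per column (held fixed)
    """
    # Lipschitz constant of the gradient of the reconstruction term.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ a - x)          # gradient of ||x - Phi a||^2
        a = soft_threshold(a - grad / L, lam / L)    # gradient step, then shrinkage
    return a

# Hypothetical over-complete basis: 16 unit-norm vectors in 8 dimensions.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((8, 16))
Phi /= np.linalg.norm(Phi, axis=0)
x = 2.0 * Phi[:, 3] - Phi[:, 10]   # built from two basis vectors
a = ista(x, Phi, lam=0.1)
print(np.round(a, 2))
```

The shrinkage step is what produces exact zeros in the coefficients, which a plain gradient method on the non-differentiable <math>L_1</math> term would not do.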

+

Learning a set of basis vectors with an <math>L_2</math> norm constraint also reduces to a least squares problem with quadratic constraints, which is convex in <math>\mathbf{\phi}</math>. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for <math>\mathbf{\phi}</math>, although significantly more efficient methods such as solving the Lagrange dual have also been developed.
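As an illustration of the "other iterative methods" mentioned above, the basis update can be sketched with projected gradient descent. This is a swapped-in simple technique, not the Lagrange dual method; it assumes the constraint is applied to each basis vector separately (<math>\left|\left|\mathbf{\phi}_i\right|\right|^2 \le C</math>), and all names and toy data are hypothetical.

```python
import numpy as np

def project_columns(Phi, C=1.0):
    """Rescale any column whose squared norm exceeds C back onto the constraint."""
    norms = np.linalg.norm(Phi, axis=0)
    scale = np.minimum(1.0, np.sqrt(C) / np.maximum(norms, 1e-12))
    return Phi * scale

def update_basis(X, A, Phi, C=1.0, n_iter=200):
    """Projected gradient descent on ||X - Phi A||^2 with ||phi_i||^2 <= C.

    X: (n, m) examples as columns; A: (k, m) fixed coefficients; Phi: (n, k).
    """
    # Lipschitz constant of the gradient with respect to Phi.
    L = max(2.0 * np.linalg.norm(A, 2) ** 2, 1e-12)
    for _ in range(n_iter):
        grad = 2.0 * (Phi @ A - X) @ A.T
        Phi = project_columns(Phi - grad / L, C)
    return Phi

# Hypothetical toy problem: 20 examples in 5 dimensions, 4 basis vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
A = rng.standard_normal((4, 20))
Phi0 = project_columns(rng.standard_normal((5, 4)))
Phi1 = update_basis(X, A, Phi0)
print(np.linalg.norm(Phi1, axis=0))
```

Alternating this update with a coefficient solver for each example gives the two-step learning procedure described in this section.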

+

As described above, a significant limitation of sparse coding is that even after a set of basis vectors has been learnt, an optimization must still be performed to obtain the coefficients needed to "encode" a new data example. This "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared to typical feedforward architectures.

Sparse Coding

From Ufldl


@@ Line 1: / Line 1: @@
-== Background and Motivation ==
+== Sparse Coding ==
 Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors <math>\mathbf{\phi}_i</math> such that we can represent an input vector <math>\mathbf{x}</math> as a linear combination of these basis vectors:
@@ Line 16: / Line 16: @@
 \end{align}</math>
-where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math>.
+where <math>S(.)</math> is a sparsity cost function which penalizes <math>a_i</math> for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of <math>\mathbf{x}</math> and the second term as a sparsity penalty which forces our representation of <math>\mathbf{x}</math> to be sparse. The constant <math>\lambda</math> is a scaling constant to determine the relative importance of these two contributions.
+Although the most direct measure of sparsity is the "<math>L_0</math>" norm (<math>S(a_i) = \mathbf{1}(|a_i|>0)</math>), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost <math>S(.)</math> are the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math>.
 In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down <math>a_i</math> and scaling <math>\mathbf{\phi}_i</math> up by some large constant. To prevent this from happening, we will constrain <math>\left|\left|\mathbf{\phi}\right|\right|^2</math> to be less than some constant <math>C</math>. The full sparse coding cost function including our constraint on <math>\mathbf{\phi}</math> is
@@ Line 34: / Line 36: @@
 \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
 \end{align}</math>
-We thus wish to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>.
+Our goal is to find a set of basis feature vectors <math>\mathbf{\phi}</math> such that the distribution of images <math>P(\mathbf{x}\mid\mathbf{\phi})</math> is as close as possible to the empirical distribution of our input data <math>P^*(\mathbf{x})</math>. One method of doing so is to minimize the KL divergence between <math>P^*(\mathbf{x})</math> and <math>P(\mathbf{x}\mid\mathbf{\phi})</math> where the KL divergence is defined as:
+:<math>\begin{align}
+D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
+\end{align}</math>
+Since the empirical distribution <math>P^*(\mathbf{x})</math> is constant across our choice of <math>\mathbf{\phi}</math>, this is equivalent to maximizing the log-likelihood of <math>P(\mathbf{x}\mid\mathbf{\phi})</math>.
 Assuming <math>\nu</math> is Gaussian white noise with variance <math>\sigma^2</math>, we have that
 :<math>\begin{align}
-P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2})
+P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right)
 \end{align}</math>
 In order to determine the distribution <math>P(\mathbf{x}\mid\mathbf{\phi})</math>, we also need to specify the prior distribution <math>P(\mathbf{a})</math>. Assuming the independence of our source features, we can factorize our prior probability as
 :<math>\begin{align}
-P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2})
+P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
 \end{align}</math>
+At this point, we would like to incorporate our sparsity assumption -- the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of <math>a_i</math> to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is
+:<math>\begin{align}
+P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
+\end{align}</math>
+where <math>S(a_i)</math> is a function determining the shape of the prior distribution.
+Having defined <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})</math> and <math> P(\mathbf{a})</math>, we can write the probability of the data <math>\mathbf{x}</math> under the model defined by <math>\mathbf{\phi}</math> as
+:<math>\begin{align}
+P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
+\end{align}</math>
+and our problem reduces to finding
+:<math>\begin{align}
+\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) >
+\end{align}</math>
+where <math><.></math> denotes expectation over our input data.
+Unfortunately, the integral over <math>\mathbf{a}</math> to obtain <math>P(\mathbf{x} \mid \mathbf{\phi})</math> is generally intractable. We note though that if the integrand <math>P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a})</math> is sufficiently peaked (w.r.t. <math>\mathbf{a}</math>), we can approximate the integral with the maximum value of the integrand and obtain an approximate solution
+:<math>\begin{align}
+\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a})) >
+\end{align}</math>
+As before, we may increase the estimated probability by scaling down <math>a_i</math> and scaling up <math>\mathbf{\phi}</math> (since <math>P(a_i)</math> peaks about zero); we therefore impose a norm constraint on our features <math>\mathbf{\phi}</math> to prevent this.
+Finally, we can recover our original cost function by defining the energy function of this linear generative model
+:<math>\begin{array}{rl}
+E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a})P(\mathbf{a}) \right) \\
+ &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
+\end{array}</math>
+where <math>\lambda = 2\sigma^2\beta</math>, the energy has been rescaled by the positive factor <math>2\sigma^2</math>, and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:
+:<math>\begin{align}
+\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
+\end{align}</math>
+Using a probabilistic approach, it can also be seen that the choices of the <math>L_1</math> penalty <math>\left|a_i\right|_1 </math> and the log penalty <math>\log(1+a_i^2)</math> for <math>S(.)</math> correspond to the use of the Laplacian prior (<math>P(a_i) \propto \exp\left(-\beta|a_i|\right)</math>) and a Cauchy-type prior (<math>P(a_i) \propto (1+a_i^2)^{-\beta}</math>, which is the Cauchy density <math>\frac{1}{1+a_i^2}</math> up to normalization when <math>\beta = 1</math>) respectively.
+== Learning ==
+Learning a set of basis vectors <math>\mathbf{\phi}</math> using sparse coding consists of performing two separate optimizations, the first being an optimization over coefficients <math>a_i</math> for each training example <math>\mathbf{x}</math> and the second an optimization over basis vectors <math>\mathbf{\phi}</math> across many training examples at once.
+Assuming an <math>L_1</math> sparsity penalty, learning <math>a^{(j)}_i</math> reduces to solving an <math>L_1</math> regularized least squares problem, which is convex in <math>a^{(j)}_i</math> and for which several techniques have been developed (convex optimization software such as CVX can also be used to perform <math>L_1</math> regularized least squares). Assuming a differentiable <math>S(.)</math> such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.
+Learning a set of basis vectors with an <math>L_2</math> norm constraint also reduces to a least squares problem with quadratic constraints, which is convex in <math>\mathbf{\phi}</math>. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for <math>\mathbf{\phi}</math>, although significantly more efficient methods such as solving the Lagrange dual have also been developed.
+As described above, a significant limitation of sparse coding is that even after a set of basis vectors has been learnt, an optimization must still be performed to obtain the coefficients needed to "encode" a new data example. This "runtime" cost means that sparse coding is computationally expensive even at test time, especially compared to typical feedforward architectures.