# Sparse Coding

Latest revision as of 04:28, 8 April 2013 (Kandeng), compared with the revision as of 11:20, 21 March 2011 (Zhenghao).

== Sparse Coding ==

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. The aim of sparse coding is to find a set of basis vectors $\mathbf{\phi}_i$ such that we can represent an input vector $\mathbf{x}$ as a linear combination of these basis vectors:

:$\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i}
\end{align}$

The sparse coding cost function on a set of $m$ input vectors is

:$\begin{align}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}$

where $S(.)$ is a sparsity cost function which penalizes $a_i$ for being far from zero. We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $\mathbf{x}$, and the second term as a sparsity penalty which forces our representation of $\mathbf{x}$ to be sparse. The constant $\lambda$ is a scaling constant that determines the relative importance of these two contributions.
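As a rough illustration of the two terms of the objective, the following Python sketch evaluates the reconstruction term plus $\lambda$ times the sparsity term, using either the $L_1$ penalty $|a_i|$ or the log penalty $\log(1+a_i^2)$ as $S$. The function name, the row-wise array layout and the toy numbers are assumptions made for this example only, and the sparsity term is summed over all $m$ examples.

```python
import numpy as np

def sparse_coding_cost(X, Phi, A, lam, penalty="l1"):
    """Reconstruction error plus lam times the sparsity penalty.

    X:   (m, n) array, one input vector x^(j) per row
    Phi: (k, n) array, one basis vector phi_i per row
    A:   (m, k) array, coefficients a^(j)_i
    """
    residual = X - A @ Phi                    # x^(j) - sum_i a^(j)_i phi_i
    reconstruction = np.sum(residual ** 2)    # summed squared L2 norms
    if penalty == "l1":
        sparsity = np.sum(np.abs(A))          # S(a) = |a|
    elif penalty == "log":
        sparsity = np.sum(np.log1p(A ** 2))   # S(a) = log(1 + a^2)
    else:
        raise ValueError("penalty must be 'l1' or 'log'")
    return reconstruction + lam * sparsity

# Toy over-complete basis: k = 3 basis vectors in n = 2 dimensions.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.array([[1.0, 0.0], [0.0, 2.0]])

# Two coefficient sets that both reconstruct X exactly.
A_sparse = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
A_dense = np.array([[0.5, -0.5, 0.5], [-1.0, 1.0, 1.0]])

print(sparse_coding_cost(X, Phi, A_sparse, lam=0.5))
print(sparse_coding_cost(X, Phi, A_dense, lam=0.5))
```

Because the basis is over-complete, both coefficient sets give zero reconstruction error; only the sparsity term distinguishes them, and it favours the sparser code.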
Although the most direct measure of sparsity is the "$L_0$" norm ($S(a_i) = \mathbf{1}(|a_i|>0)$), it is non-differentiable and difficult to optimize in general. In practice, common choices for the sparsity cost $S(.)$ are the $L_1$ penalty $S(a_i)=\left|a_i\right|_1$ and the log penalty $S(a_i)=\log(1+a_i^2)$.

In addition, it is possible to make the sparsity penalty arbitrarily small by scaling down $a_i$ and scaling $\mathbf{\phi}_i$ up by some large constant. To prevent this from happening, we constrain $\left|\left|\mathbf{\phi}_i\right|\right|^2$ to be no greater than some constant $C$. The full sparse coding cost function including our constraint on $\mathbf{\phi}$ is

:$\begin{array}{rc}
\text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \\
\text{subject to} & \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k
\end{array}$

== Probabilistic Interpretation [Based on Olshausen and Field 1996] ==

So far, we have considered sparse coding in the context of finding a sparse, over-complete set of basis vectors to span our input space. Alternatively, we may also approach sparse coding from a probabilistic perspective as a generative model.
Consider the problem of modelling natural images as the linear superposition of $k$ independent source features $\mathbf{\phi}_i$ with some additive noise $\nu$:

:$\begin{align}
\mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x})
\end{align}$

Our goal is to find a set of basis feature vectors $\mathbf{\phi}$ such that the distribution of images $P(\mathbf{x}\mid\mathbf{\phi})$ is as close as possible to the empirical distribution of our input data $P^*(\mathbf{x})$. One method of doing so is to minimize the KL divergence between $P^*(\mathbf{x})$ and $P(\mathbf{x}\mid\mathbf{\phi})$, where the KL divergence is defined as:

:$\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x}
\end{align}$

Since the empirical distribution $P^*(\mathbf{x})$ is constant across our choice of $\mathbf{\phi}$, this is equivalent to maximizing the log-likelihood of $P(\mathbf{x}\mid\mathbf{\phi})$.
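To make the equivalence between minimizing the KL divergence and maximizing the log-likelihood explicit, the logarithm of the ratio can be split into two integrals. The following is a short worked step using only the quantities defined above:

:$\begin{align}
D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) &= \int P^*(\mathbf{x}) \log P^*(\mathbf{x})\, d\mathbf{x} - \int P^*(\mathbf{x}) \log P(\mathbf{x}\mid\mathbf{\phi})\, d\mathbf{x}
\end{align}$

The first integral does not depend on $\mathbf{\phi}$, and the second is the expected log-likelihood of the data under $P^*(\mathbf{x})$. Minimizing the divergence over $\mathbf{\phi}$ is therefore the same as maximizing that expected log-likelihood.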
Assuming $\nu$ is Gaussian white noise with variance $\sigma^2$, we have that

:$\begin{align}
P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{\left|\left|\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i}\right|\right|^2}{2\sigma^2}\right)
\end{align}$

In order to determine the distribution $P(\mathbf{x}\mid\mathbf{\phi})$, we also need to specify the prior distribution $P(\mathbf{a})$. Assuming the independence of our source features, we can factorize our prior probability as

:$\begin{align}
P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i)
\end{align}$

At this point, we would like to incorporate our sparsity assumption: the assumption that any single image is likely to be the product of relatively few source features. Therefore, we would like the probability distribution of $a_i$ to be peaked at zero and have high kurtosis. A convenient parameterization of the prior distribution is

:$\begin{align}
P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i))
\end{align}$

where $S(a_i)$ is a function determining the shape of the prior distribution.

Having defined $P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})$ and $P(\mathbf{a})$, we can write the probability of the data $\mathbf{x}$ under the model defined by $\mathbf{\phi}$ as

:$\begin{align}
P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a}
\end{align}$

and our problem reduces to finding

:$\begin{align}
\mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} \left\langle \log(P(\mathbf{x} \mid \mathbf{\phi})) \right\rangle
\end{align}$

where $\langle . \rangle$ denotes expectation over our input data.

Unfortunately, the integral over $\mathbf{a}$ to obtain $P(\mathbf{x} \mid \mathbf{\phi})$ is generally intractable.
We note though that if the integrand $P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})P(\mathbf{a})$ is sufficiently peaked (w.r.t. $\mathbf{a}$), we can approximate the integral with the maximum value of the integrand and obtain an approximate solution

:$\begin{align}
\mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} \left\langle \max_{\mathbf{a}} \log\left(P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi})P(\mathbf{a})\right) \right\rangle
\end{align}$

As before, we may increase the estimated probability by scaling down $a_i$ and scaling up $\mathbf{\phi}$ (since $P(a_i)$ peaks about zero). We therefore impose a norm constraint on our features $\mathbf{\phi}$ to prevent this.

Finally, we can recover our original cost function by defining the energy function of this linear generative model

:$\begin{array}{rl}
E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a})P(\mathbf{a})\right) \\
&= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{array}$

where $\lambda = 2\sigma^2\beta$ and irrelevant constants have been hidden. Since maximizing the log-likelihood is equivalent to minimizing the energy function, we recover the original optimization problem:

:$\begin{align}
\mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i)
\end{align}$

Using a probabilistic approach, it can also be seen that the choices of the $L_1$ penalty $\left|a_i\right|_1$ and the log penalty $\log(1+a_i^2)$ for $S(.)$ correspond to the use of the Laplacian prior $P(a_i) \propto \exp\left(-\beta|a_i|\right)$ and the Cauchy-type prior $P(a_i) \propto \left(1+a_i^2\right)^{-\beta}$ respectively (the latter is exactly the Cauchy distribution when $\beta = 1$).
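The value $\lambda = 2\sigma^2\beta$ can be checked for a single example $\mathbf{x}$ by substituting the Gaussian likelihood and the factorized prior $P(a_i) \propto \exp(-\beta S(a_i))$ into the negative log of their product. This is a worked step under the definitions given above, with all terms that do not depend on $\mathbf{a}$ or $\mathbf{\phi}$ collected into "const":

:$\begin{align}
-\log \left( P(\mathbf{x}\mid \mathbf{a},\mathbf{\phi})P(\mathbf{a})\right) &= \frac{1}{2\sigma^2}\left|\left| \mathbf{x} - \sum_{i=1}^k a_i \mathbf{\phi}_{i}\right|\right|^{2} + \beta \sum_{i=1}^{k}S(a_i) + \text{const}
\end{align}$

Multiplying through by the positive constant $2\sigma^2$ does not change the minimizer and gives the reconstruction term with unit weight and the sparsity term with weight $2\sigma^2\beta$.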
== Learning ==

Learning a set of basis vectors $\mathbf{\phi}$ using sparse coding consists of performing two separate optimizations: the first is an optimization over coefficients $a_i$ for each training example $\mathbf{x}$, and the second is an optimization over basis vectors $\mathbf{\phi}$ across many training examples at once.

Assuming an $L_1$ sparsity penalty, learning $a^{(j)}_i$ reduces to solving an $L_1$ regularized least squares problem, which is convex in $a^{(j)}_i$ and for which several techniques have been developed (convex optimization software such as CVX can also be used to perform $L_1$ regularized least squares). Assuming a differentiable $S(.)$ such as the log penalty, gradient-based methods such as conjugate gradient methods can also be used.

Learning a set of basis vectors with an $L_2$ norm constraint also reduces to a least squares problem with quadratic constraints, which is convex in $\mathbf{\phi}$. Standard convex optimization software (e.g. CVX) or other iterative methods can be used to solve for $\mathbf{\phi}$, although significantly more efficient methods such as solving the Lagrange dual have also been developed.

A significant limitation of sparse coding is that even after a set of basis vectors has been learnt, in order to "encode" a new data example, optimization must be performed to obtain the required coefficients. This significant "runtime" cost means that sparse coding is computationally expensive to implement even at test time, especially compared to typical feedforward architectures.
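The two alternating optimizations can be sketched in Python as follows. This is a minimal illustration and not the method used by CVX or the Lagrange dual approach mentioned above: it assumes an $L_1$ penalty, uses iterative soft-thresholding (ISTA) for the coefficients and projected gradient descent for the basis vectors, and all function names, step counts and the random toy data are assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(X, Phi, A, lam):
    """Reconstruction error plus lam times the L1 penalty on the codes."""
    return np.sum((X - A @ Phi) ** 2) + lam * np.sum(np.abs(A))

def infer_codes(X, Phi, lam, n_steps=200):
    """ISTA over the coefficients A (m x k) with the basis Phi (k x n) fixed."""
    # Lipschitz constant of the gradient of the reconstruction term.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    A = np.zeros((X.shape[0], Phi.shape[0]))
    for _ in range(n_steps):
        grad = 2.0 * (A @ Phi - X) @ Phi.T
        A = soft_threshold(A - grad / L, lam / L)
    return A

def update_bases(X, Phi, A, C=1.0, n_steps=50):
    """Projected gradient over Phi with A fixed, keeping ||phi_i||^2 <= C."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12
    for _ in range(n_steps):
        grad = 2.0 * A.T @ (A @ Phi - X)
        Phi = Phi - grad / L
        # Rescale any basis vector that has left the norm ball.
        norms = np.linalg.norm(Phi, axis=1, keepdims=True)
        Phi = Phi * np.minimum(1.0, np.sqrt(C) / np.maximum(norms, 1e-12))
    return Phi

def learn_sparse_code(X, k, lam=0.1, C=1.0, n_outer=20, seed=0):
    """Alternate between inferring codes and updating the basis."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((k, X.shape[1]))
    Phi *= np.sqrt(C) / np.linalg.norm(Phi, axis=1, keepdims=True)
    for _ in range(n_outer):
        A = infer_codes(X, Phi, lam)
        Phi = update_bases(X, Phi, A, C)
    return Phi, infer_codes(X, Phi, lam)

# Toy data: 30 random 4-dimensional inputs, 6 basis vectors (over-complete).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 4))
Phi_demo, A_demo = learn_sparse_code(X_demo, k=6, lam=0.1, n_outer=5)
print(objective(X_demo, Phi_demo, A_demo, lam=0.1))
```

Note that encoding a new example after training still requires a call to `infer_codes`, which is the test-time optimization cost described above.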