Autoencoders and Sparsity <hr /> <div>So far, we have described the application of neural networks to supervised learning, in which we have labeled<br /> training examples. Now suppose we have only a set of unlabeled training examples &lt;math&gt;\textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}&lt;/math&gt;,<br /> where &lt;math&gt;\textstyle x^{(i)} \in \Re^{n}&lt;/math&gt;. An<br /> '''autoencoder''' neural network is an unsupervised learning algorithm that applies backpropagation,<br /> setting the target values to be equal to the inputs. I.e., it uses &lt;math&gt;\textstyle y^{(i)} = x^{(i)}&lt;/math&gt;.<br /> <br /> Here is an autoencoder:<br /> <br /> [[Image:Autoencoder636.png|400px|center]]<br /> <br /> The autoencoder tries to learn a function &lt;math&gt;\textstyle h_{W,b}(x) \approx x&lt;/math&gt;. In other<br /> words, it is trying to learn an approximation to the identity function, so as<br /> to output &lt;math&gt;\textstyle \hat{x}&lt;/math&gt; that is similar to &lt;math&gt;\textstyle x&lt;/math&gt;. The identity function seems a<br /> particularly trivial function to be trying to learn; but by placing constraints<br /> on the network, such as by limiting the number of hidden units, we can discover<br /> interesting structure in the data. As a concrete example, suppose the<br /> inputs &lt;math&gt;\textstyle x&lt;/math&gt; are the pixel intensity values from a &lt;math&gt;\textstyle 10 \times 10&lt;/math&gt; image (100<br /> pixels) so &lt;math&gt;\textstyle n=100&lt;/math&gt;, and there are &lt;math&gt;\textstyle s_2=50&lt;/math&gt; hidden units in layer &lt;math&gt;\textstyle L_2&lt;/math&gt;.
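To make the shapes in this example concrete, here is a minimal sketch of the 100-input, 50-hidden-unit autoencoder's forward pass in Python/NumPy. It is illustrative only: the sigmoid activation, the parameter names <code>W1, b1, W2, b2</code>, and the random initialization are assumptions for the sketch, not part of these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical setup for the example: n = 100 input pixels, s2 = 50 hidden units.
rng = np.random.default_rng(0)
n, s2 = 100, 50
W1 = rng.normal(scale=0.01, size=(s2, n)); b1 = np.zeros(s2)  # input -> hidden
W2 = rng.normal(scale=0.01, size=(n, s2)); b2 = np.zeros(n)   # hidden -> output

def autoencode(x):
    a2 = sigmoid(W1 @ x + b1)      # 50-dimensional hidden representation a^(2)
    x_hat = sigmoid(W2 @ a2 + b2)  # 100-dimensional reconstruction of x
    return a2, x_hat

x = rng.uniform(size=n)  # one "image" of pixel intensities in [0, 1)
a2, x_hat = autoencode(x)
```

Training (by backpropagation with target <code>y = x</code>) is not shown here; the sketch only fixes the dimensions used in the discussion below.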
Note that<br /> we also have &lt;math&gt;\textstyle y \in \Re^{100}&lt;/math&gt;. Since there are only 50 hidden units, the<br /> network is forced to learn a ''compressed'' representation of the input.<br /> I.e., given only the vector of hidden unit activations &lt;math&gt;\textstyle a^{(2)} \in \Re^{50}&lt;/math&gt;,<br /> it must try to '''reconstruct''' the 100-pixel input &lt;math&gt;\textstyle x&lt;/math&gt;. If the input were completely<br /> random---say, each &lt;math&gt;\textstyle x_i&lt;/math&gt; comes from an IID Gaussian independent of the other<br /> features---then this compression task would be very difficult. But if there is<br /> structure in the data, for example, if some of the input features are correlated,<br /> then this algorithm will be able to discover some of those correlations. In fact,<br /> this simple autoencoder often ends up learning a low-dimensional representation very similar<br /> to the one learned by PCA.<br /> <br /> Our argument above relied on the number of hidden units &lt;math&gt;\textstyle s_2&lt;/math&gt; being small. But<br /> even when the number of hidden units is large (perhaps even greater than the<br /> number of input pixels), we can still discover interesting structure, by<br /> imposing other constraints on the network. In particular, if we impose a<br /> '''sparsity''' constraint on the hidden units, then the autoencoder will still<br /> discover interesting structure in the data, even if the number of hidden units<br /> is large.<br /> <br /> Informally, we will think of a neuron as being &quot;active&quot; (or as &quot;firing&quot;) if<br /> its output value is close to 1, or as being &quot;inactive&quot; if its output value is<br /> close to 0. We would like to constrain the neurons to be inactive most of the<br /> time. This discussion assumes a sigmoid activation function.
If you are<br /> using a tanh activation function, then we think of a neuron as being inactive<br /> when it outputs values close to -1.<br /> <br /> Recall that &lt;math&gt;\textstyle a^{(2)}_j&lt;/math&gt; denotes the activation of hidden unit &lt;math&gt;\textstyle j&lt;/math&gt; in the<br /> autoencoder. However, this notation does not make explicit which input &lt;math&gt;\textstyle x&lt;/math&gt;<br /> led to that activation. Thus, we will write &lt;math&gt;\textstyle a^{(2)}_j(x)&lt;/math&gt; to denote the activation<br /> of this hidden unit when the network is given a specific input &lt;math&gt;\textstyle x&lt;/math&gt;. Further, let<br /> :&lt;math&gt;\begin{align}<br /> \hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]<br /> \end{align}&lt;/math&gt;<br /> be the average activation of hidden unit &lt;math&gt;\textstyle j&lt;/math&gt; (averaged over the training set).<br /> We would like to (approximately) enforce the constraint<br /> :&lt;math&gt;\begin{align}<br /> \hat\rho_j = \rho,<br /> \end{align}&lt;/math&gt;<br /> where &lt;math&gt;\textstyle \rho&lt;/math&gt; is a '''sparsity parameter''', typically a small value close to zero<br /> (say &lt;math&gt;\textstyle \rho = 0.05&lt;/math&gt;). In other words, we would like the average activation<br /> of each hidden neuron &lt;math&gt;\textstyle j&lt;/math&gt; to be close to 0.05 (say). To satisfy this<br /> constraint, the hidden unit's activations must mostly be near 0.<br /> <br /> To achieve this, we will add an extra penalty term to our optimization objective that<br /> penalizes &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; deviating significantly from &lt;math&gt;\textstyle \rho&lt;/math&gt;. Many choices of the penalty<br /> term will give reasonable results.
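The average activation &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; defined above can be computed in a single vectorized pass over the training set. A minimal Python/NumPy sketch, assuming a sigmoid hidden layer, hypothetical encoder parameters <code>W1, b1</code>, and the training examples stacked as rows of <code>X</code>:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical setup: m examples of dimension n, s2 hidden units.
rng = np.random.default_rng(0)
m, n, s2 = 200, 100, 50
X = rng.uniform(size=(m, n))               # training examples x^(1), ..., x^(m) as rows
W1 = rng.normal(scale=0.01, size=(s2, n))  # assumed encoder weights
b1 = np.zeros(s2)

A2 = sigmoid(X @ W1.T + b1)  # A2[i, j] = a2_j(x^(i)), shape (m, s2)
rho_hat = A2.mean(axis=0)    # average activation of each hidden unit, shape (s2,)
```

`rho_hat[j]` is exactly the empirical mean &lt;math&gt;\textstyle \frac{1}{m} \sum_i a^{(2)}_j(x^{(i)})&lt;/math&gt;.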
Concretely, we will choose the following penalty:<br /> :&lt;math&gt;\begin{align}<br /> \sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.<br /> \end{align}&lt;/math&gt;<br /> Here, &lt;math&gt;\textstyle s_2&lt;/math&gt; is the number of neurons in the hidden layer, and the index &lt;math&gt;\textstyle j&lt;/math&gt; sums<br /> over the hidden units in our network. If you are<br /> familiar with the concept of KL divergence, this penalty term is based on<br /> it, and can also be written<br /> :&lt;math&gt;\begin{align}<br /> \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),<br /> \end{align}&lt;/math&gt;<br /> where &lt;math&gt;\textstyle {\rm KL}(\rho || \hat\rho_j)<br /> = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}&lt;/math&gt;<br /> is the Kullback-Leibler (KL) divergence between<br /> a Bernoulli random variable with mean &lt;math&gt;\textstyle \rho&lt;/math&gt; and a Bernoulli random variable with mean &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt;.<br /> KL-divergence is a standard function for measuring how different two<br /> distributions are. (If you've not seen KL-divergence before, don't worry about<br /> it; everything you need to know about it is contained in these notes.)<br /> <br /> This penalty function has the property that &lt;math&gt;\textstyle {\rm KL}(\rho || \hat\rho_j) = 0&lt;/math&gt; if &lt;math&gt;\textstyle \hat\rho_j = \rho&lt;/math&gt;,<br /> and otherwise it increases monotonically as &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; diverges from &lt;math&gt;\textstyle \rho&lt;/math&gt;.
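These two properties are easy to check numerically. Here is a small Python/NumPy sketch of the penalty; the function name <code>kl_penalty</code> is hypothetical:

```python
import numpy as np

def kl_penalty(rho, rho_hat):
    """Sparsity penalty: sum over hidden units of KL(rho || rho_hat_j),
    treating rho and each rho_hat_j as the means of Bernoulli distributions."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))))

# The penalty is zero exactly when every rho_hat_j equals rho ...
assert kl_penalty(0.05, [0.05]) == 0.0
# ... and grows as the average activations drift further from rho.
assert kl_penalty(0.05, [0.2]) < kl_penalty(0.05, [0.5])
```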
For example, in the<br /> figure below, we have set &lt;math&gt;\textstyle \rho = 0.2&lt;/math&gt;, and plotted<br /> &lt;math&gt;\textstyle {\rm KL}(\rho || \hat\rho_j)&lt;/math&gt; for a range of values of &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt;:<br /> <br /> [[Image:KLPenaltyExample.png|400px|center]]<br /> <br /> We see that the KL-divergence reaches its minimum of 0 at<br /> &lt;math&gt;\textstyle \hat\rho_j = \rho&lt;/math&gt;, and blows up (it actually approaches &lt;math&gt;\textstyle \infty&lt;/math&gt;) as &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt;<br /> approaches 0 or 1. Thus, minimizing<br /> this penalty term has the effect of causing &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; to be close to &lt;math&gt;\textstyle \rho&lt;/math&gt;.<br /> <br /> Our overall cost function is now<br /> :&lt;math&gt;\begin{align}<br /> J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),<br /> \end{align}&lt;/math&gt;<br /> where &lt;math&gt;\textstyle J(W,b)&lt;/math&gt; is as defined previously, and &lt;math&gt;\textstyle \beta&lt;/math&gt; controls the weight of<br /> the sparsity penalty term. The term &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; (implicitly) depends on &lt;math&gt;\textstyle W,b&lt;/math&gt; also,<br /> because it is the average activation of hidden unit &lt;math&gt;\textstyle j&lt;/math&gt;, and the activation of a hidden<br /> unit depends on the parameters &lt;math&gt;\textstyle W,b&lt;/math&gt;.<br /> <br /> To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement<br /> trick involving only a small change to your code. 
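Putting the pieces together, the sparse objective can be sketched as follows in Python/NumPy. This is illustrative only: the parameter names are hypothetical, the reconstruction term is taken to be the average squared error, and the weight-decay term of &lt;math&gt;\textstyle J(W,b)&lt;/math&gt; is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_cost(W1, b1, W2, b2, X, rho=0.05, beta=3.0):
    """J_sparse(W,b): average squared reconstruction error plus beta times
    the summed KL sparsity penalty (weight decay omitted for brevity)."""
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)               # hidden activations, shape (m, s2)
    X_hat = sigmoid(A2 @ W2.T + b2)           # reconstructions, shape (m, n)
    J = np.sum((X_hat - X) ** 2) / (2.0 * m)  # reconstruction term
    rho_hat = A2.mean(axis=0)                 # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return J + beta * kl
```

Note how `rho_hat` is recomputed from the current parameters inside the function: this is the implicit dependence of &lt;math&gt;\textstyle \hat\rho_j&lt;/math&gt; on &lt;math&gt;\textstyle W,b&lt;/math&gt; mentioned above.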
The trick is as follows: where previously for<br /> the second layer (&lt;math&gt;\textstyle l=2&lt;/math&gt;), during backpropagation you would have computed<br /> :&lt;math&gt;\begin{align}<br /> \delta^{(2)}_i = \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),<br /> \end{align}&lt;/math&gt;<br /> now instead compute<br /> :&lt;math&gt;\begin{align}<br /> \delta^{(2)}_i =<br /> \left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)<br /> + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .<br /> \end{align}&lt;/math&gt;<br /> One subtlety is that you'll need to know &lt;math&gt;\textstyle \hat\rho_i&lt;/math&gt; to compute this term. Thus, you'll need<br /> to compute a forward pass on all the training examples first to compute the average<br /> activations on the training set, before computing backpropagation on any example. If your<br /> training set is small enough to fit comfortably in computer memory (this will be the case for the programming<br /> assignment), you can compute forward passes on all your examples, keep the resulting activations<br /> in memory, and compute the &lt;math&gt;\textstyle \hat\rho_i&lt;/math&gt;s. Then you can use your precomputed activations to<br /> perform backpropagation on all your examples. If your data is too large to fit in memory, you<br /> may instead have to scan through your examples, computing a forward pass on each one to accumulate (sum up) the<br /> activations and compute &lt;math&gt;\textstyle \hat\rho_i&lt;/math&gt;, discarding the result of each forward pass after you<br /> have taken its activations &lt;math&gt;\textstyle a^{(2)}_i&lt;/math&gt; into account. Then, having computed<br /> &lt;math&gt;\textstyle \hat\rho_i&lt;/math&gt;, you would redo the forward pass for each example so that you<br /> can do backpropagation on that example.
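The modified &lt;math&gt;\textstyle \delta^{(2)}&lt;/math&gt; computation can be sketched as follows in Python/NumPy, assuming the &lt;math&gt;\textstyle \hat\rho_i&lt;/math&gt; have already been precomputed by a forward pass over the training set (function and variable names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_delta2(W2, delta3, z2, rho_hat, rho=0.05, beta=3.0):
    """Hidden-layer error term for one example, with the extra sparsity term.
    W2: weights from layer 2 to layer 3, shape (s3, s2);
    delta3: output-layer error, shape (s3,);
    z2: hidden-layer pre-activations for this example, shape (s2,);
    rho_hat: precomputed average activations over the training set, shape (s2,)."""
    a2 = sigmoid(z2)
    fprime = a2 * (1.0 - a2)  # f'(z2) for the sigmoid activation
    sparsity = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    return (W2.T @ delta3 + sparsity) * fprime
```

Setting `beta=0` removes the sparsity term and recovers the ordinary backpropagation formula, which is a convenient sanity check when modifying existing code.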
In this memory-limited case, you end up computing a forward<br /> pass twice on each example in your training set, making it computationally less efficient.<br /> <br /> The full derivation showing that the algorithm above results in gradient descent is beyond the scope<br /> of these notes. But if you implement the autoencoder using backpropagation modified this way,<br /> you will be performing gradient descent exactly on the objective<br /> &lt;math&gt;\textstyle J_{\rm sparse}(W,b)&lt;/math&gt;. Using the derivative checking method, you can verify<br /> this for yourself as well.<br /> <br /> {{Sparse_Autoencoder}}</div>