So far, we have described the application of neural networks to supervised learning, in which we have labeled
training examples. Now suppose we have only a set of unlabeled training examples <math>\textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}</math>,
where <math>\textstyle x^{(i)} \in \Re^{n}</math>. An
'''autoencoder''' neural network is an unsupervised learning algorithm that applies backpropagation,
setting the target values to be equal to the inputs. I.e., it uses <math>\textstyle y^{(i)} = x^{(i)}</math>.

Here is an autoencoder:

[[Image:Autoencoder636.png|400px|center]]

The autoencoder tries to learn a function <math>\textstyle h_{W,b}(x) \approx x</math>. In other
words, it is trying to learn an approximation to the identity function, so as
to output <math>\textstyle \hat{x}</math> that is similar to <math>\textstyle x</math>. The identity function seems a
particularly trivial function to be trying to learn; but by placing constraints
on the network, such as by limiting the number of hidden units, we can discover
interesting structure in the data. As a concrete example, suppose the
inputs <math>\textstyle x</math> are the pixel intensity values from a <math>\textstyle 10 \times 10</math> image (100
pixels), so <math>\textstyle n=100</math>, and there are <math>\textstyle s_2=50</math> hidden units in layer <math>\textstyle L_2</math>. Note that
we also have <math>\textstyle y \in \Re^{100}</math>. Since there are only 50 hidden units, the
network is forced to learn a ''compressed'' representation of the input.
I.e., given only the vector of hidden unit activations <math>\textstyle a^{(2)} \in \Re^{50}</math>,
it must try to '''reconstruct''' the 100-pixel input <math>\textstyle x</math>. If the input were completely
random---say, each <math>\textstyle x_i</math> comes from an IID Gaussian independent of the other
features---then this compression task would be very difficult. But if there is
structure in the data, for example, if some of the input features are correlated,
then this algorithm will be able to discover some of those correlations. In fact,
this simple autoencoder often ends up learning a low-dimensional representation very similar
to the one learned by PCA.
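
To make this concrete in code, here is a minimal NumPy sketch (purely illustrative, not the code used in the programming assignment) of one forward pass through such a 100-50-100 autoencoder with a sigmoid activation; the weight scales and the random input are made-up values:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, s2 = 100, 50                            # 100 input pixels, 50 hidden units
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(s2, n))  # weights from layer 1 to layer 2
b1 = np.zeros(s2)
W2 = rng.normal(scale=0.01, size=(n, s2))  # weights from layer 2 to layer 3
b2 = np.zeros(n)

x = rng.random(n)                          # one made-up 10x10 image, values in [0, 1]
a2 = sigmoid(W1 @ x + b1)                  # hidden activations a^(2), length 50
x_hat = sigmoid(W2 @ a2 + b2)              # reconstruction h_{W,b}(x), length 100
</pre>

Training adjusts <math>\textstyle W, b</math> so that the reconstruction is close to <math>\textstyle x</math>; because only 50 hidden values are available, the network must find a compressed code for the 100 inputs.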

Our argument above relied on the number of hidden units <math>\textstyle s_2</math> being small. But
even when the number of hidden units is large (perhaps even greater than the
number of input pixels), we can still discover interesting structure, by
imposing other constraints on the network. In particular, if we impose a
'''sparsity''' constraint on the hidden units, then the autoencoder will still
discover interesting structure in the data, even if the number of hidden units
is large.

Informally, we will think of a neuron as being "active" (or as "firing") if
its output value is close to 1, or as being "inactive" if its output value is
close to 0. We would like to constrain the neurons to be inactive most of the
time. This discussion assumes a sigmoid activation function. If you are
using a tanh activation function, then we think of a neuron as being inactive
when it outputs values close to -1.

Recall that <math>\textstyle a^{(2)}_j</math> denotes the activation of hidden unit <math>\textstyle j</math> in the
autoencoder. However, this notation doesn't make explicit which input <math>\textstyle x</math>
led to that activation. Thus, we will write <math>\textstyle a^{(2)}_j(x)</math> to denote the activation
of this hidden unit when the network is given a specific input <math>\textstyle x</math>. Further, let
:<math>\begin{align}
\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]
\end{align}</math>
be the average activation of hidden unit <math>\textstyle j</math> (averaged over the training set).
We would like to (approximately) enforce the constraint
:<math>\begin{align}
\hat\rho_j = \rho,
\end{align}</math>
where <math>\textstyle \rho</math> is a '''sparsity parameter''', typically a small value close to zero
(say <math>\textstyle \rho = 0.05</math>). In other words, we would like the average activation
of each hidden neuron <math>\textstyle j</math> to be close to 0.05 (say). To satisfy this
constraint, each hidden unit's activations must mostly be near 0.
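
As a small illustration (NumPy again, with stand-in numbers rather than real activations), <math>\textstyle \hat\rho_j</math> is just the column-wise mean of a matrix whose rows hold the hidden activations for each training example:

<pre>
import numpy as np

# A2: m x s2 matrix; row i holds the hidden activations a^(2)(x^(i)) for example i
# (stand-in random values here; in practice these come from the forward pass above)
rng = np.random.default_rng(0)
m, s2 = 1000, 50
A2 = rng.random((m, s2))

rho_hat = A2.mean(axis=0)   # \hat\rho_j: average activation of each of the 50 hidden units
rho = 0.05                  # sparsity parameter \rho we would like each \hat\rho_j to match
</pre>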

To achieve this, we will add an extra penalty term to our optimization objective that
penalizes <math>\textstyle \hat\rho_j</math> deviating significantly from <math>\textstyle \rho</math>. Many choices of the penalty
term will give reasonable results. We will choose the following:
:<math>\begin{align}
\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.
\end{align}</math>
Here, <math>\textstyle s_2</math> is the number of neurons in the hidden layer, and the index <math>\textstyle j</math> runs
over the hidden units in our network. If you are
familiar with the concept of KL divergence, this penalty term is based on
it, and can also be written
:<math>\begin{align}
\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
\end{align}</math>
where <math>\textstyle {\rm KL}(\rho || \hat\rho_j)
= \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}</math>
is the Kullback-Leibler (KL) divergence between
a Bernoulli random variable with mean <math>\textstyle \rho</math> and a Bernoulli random variable with mean <math>\textstyle \hat\rho_j</math>.
KL divergence is a standard function for measuring how different two
distributions are. (If you've not seen KL divergence before, don't worry about
it; everything you need to know about it is contained in these notes.)
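
As a quick sanity check of the formula (an illustrative NumPy snippet with made-up values), the penalty is zero exactly for units whose average activation equals <math>\textstyle \rho</math> and positive otherwise:

<pre>
import numpy as np

def kl_penalty(rho, rho_hat):
    # Sum over hidden units of KL(rho || \hat\rho_j) between Bernoulli distributions
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.05
rho_hat = np.array([0.05, 0.10, 0.02])   # made-up average activations for three units
print(kl_penalty(rho, rho_hat))          # first unit contributes 0, the other two contribute > 0
</pre>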

This penalty function has the property that <math>\textstyle {\rm KL}(\rho || \hat\rho_j) = 0</math> if <math>\textstyle \hat\rho_j = \rho</math>,
and otherwise it increases monotonically as <math>\textstyle \hat\rho_j</math> diverges from <math>\textstyle \rho</math>. For example, in the
figure below, we have set <math>\textstyle \rho = 0.2</math>, and plotted
<math>\textstyle {\rm KL}(\rho || \hat\rho_j)</math> for a range of values of <math>\textstyle \hat\rho_j</math>:

[[Image:KLPenaltyExample.png|400px|center]]

We see that the KL divergence reaches its minimum of 0 at
<math>\textstyle \hat\rho_j = \rho</math>, and blows up (it actually approaches <math>\textstyle \infty</math>) as <math>\textstyle \hat\rho_j</math>
approaches 0 or 1. Thus, minimizing
this penalty term has the effect of causing <math>\textstyle \hat\rho_j</math> to be close to <math>\textstyle \rho</math>.

Our overall cost function is now
:<math>\begin{align}
J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
\end{align}</math>
where <math>\textstyle J(W,b)</math> is as defined previously, and <math>\textstyle \beta</math> controls the weight of
the sparsity penalty term. The term <math>\textstyle \hat\rho_j</math> (implicitly) depends on <math>\textstyle W,b</math> also,
because it is the average activation of hidden unit <math>\textstyle j</math>, and the activation of a hidden
unit depends on the parameters <math>\textstyle W,b</math>.
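
Putting the pieces together in code is straightforward. The sketch below assumes, as in the earlier notes, that <math>\textstyle J(W,b)</math> is the average squared reconstruction error plus an L2 weight-decay term; the parameter values are illustrative only:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(W1, b1, W2, b2, X, rho=0.05, beta=3.0, lam=1e-4):
    # J_sparse(W,b) = J(W,b) + beta * sum_j KL(rho || \hat\rho_j)
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)            # hidden activations, m x s2
    X_hat = sigmoid(A2 @ W2.T + b2)        # reconstructions, m x n
    J = (0.5 / m) * np.sum((X_hat - X) ** 2) \
        + (lam / 2.0) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_hat = A2.mean(axis=0)              # \hat\rho_j depends on W, b through A2
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return J + beta * kl
</pre>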

To incorporate the KL divergence term into your derivative calculation, there is a simple-to-implement
trick involving only a small change to your code. Specifically, where previously for
the second layer (<math>\textstyle l=2</math>), during backpropagation you would have computed
:<math>\begin{align}
\delta^{(2)}_i = \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),
\end{align}</math>
now instead compute
:<math>\begin{align}
\delta^{(2)}_i =
\left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).
\end{align}</math>
One subtlety is that you'll need to know <math>\textstyle \hat\rho_i</math> to compute this term. Thus, you'll need
to compute a forward pass on all the training examples first to compute the average
activations on the training set, before computing backpropagation on any example. If your
training set is small enough to fit comfortably in computer memory (this will be the case for the programming
assignment), you can compute forward passes on all your examples, keep the resulting activations
in memory, and compute the <math>\textstyle \hat\rho_i</math>s. Then you can use your precomputed activations to
perform backpropagation on all your examples. If your data is too large to fit in memory, you
may have to scan through your examples, computing a forward pass on each to accumulate (sum up) the
activations and compute <math>\textstyle \hat\rho_i</math> (discarding the result of each forward pass after you
have taken its activations <math>\textstyle a^{(2)}_i</math> into account for computing <math>\textstyle \hat\rho_i</math>). Then, after
having computed <math>\textstyle \hat\rho_i</math>, you'd have to redo the forward pass for each example so that you
can do backpropagation on that example. In this latter case, you would end up computing a forward
pass twice on each example in your training set, making it computationally less efficient.
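
For the in-memory case, the following illustrative NumPy sketch (again, not the assignment code; all names and values are made up) first forward-propagates every example to obtain <math>\textstyle \hat\rho</math>, then forms the modified <math>\textstyle \delta^{(2)}</math> and the resulting gradients for all examples at once:

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, s2, m = 100, 50, 1000
W1, b1 = rng.normal(scale=0.01, size=(s2, n)), np.zeros(s2)
W2, b2 = rng.normal(scale=0.01, size=(n, s2)), np.zeros(n)
X = rng.random((m, n))                     # made-up training set, one example per row
rho, beta = 0.05, 3.0

# First pass: forward-propagate every example and keep the activations in memory.
A2 = sigmoid(X @ W1.T + b1)                # m x s2
X_hat = sigmoid(A2 @ W2.T + b2)            # m x n
rho_hat = A2.mean(axis=0)                  # \hat\rho_i, length s2

# Backpropagation using the precomputed activations and the extra sparsity term.
delta3 = (X_hat - X) * X_hat * (1 - X_hat)                # delta^(3) for squared error, sigmoid output
sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
delta2 = (delta3 @ W2 + sparsity) * A2 * (1 - A2)         # modified delta^(2), m x s2

gradW2 = delta3.T @ A2 / m                 # gradient w.r.t. W2 (weight decay omitted here)
gradW1 = delta2.T @ X / m                  # gradient w.r.t. W1 (weight decay omitted here)
</pre>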

The full derivation showing that the algorithm above results in gradient descent is beyond the scope
of these notes. But if you implement the autoencoder using backpropagation modified this way,
you will be performing gradient descent exactly on the objective
<math>\textstyle J_{\rm sparse}(W,b)</math>. Using the derivative checking method, you will be able to verify
this for yourself as well.