= Ufldl:Copyrights =
By submitting text or other materials to this Wiki, you are asserting that, and promising us that, you wrote this yourself, or copied it from a public domain or similar free resource. Further, by submitting text or other materials to this Wiki, in consideration for having your text incorporated into the Wiki and thus potentially having others be exposed to content provided by you--which you acknowledge is valuable consideration--you agree to assign and hereby do assign all copyright, title and interest in these materials to the Stanford authors of this Wiki. Do not submit copyrighted work without permission.
<br />
= UFLDL Tutorial =
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, and gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction and may change without notice. Feel free to browse around, however; feedback and suggestions are welcome.<br />
<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
[[Useful Links]]<br />
<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen
<br />
= Exercise:Convolution and Pooling =
== Convolution and Pooling ==<br />
<br />
In this exercise, you will apply [[Feature extraction using convolution | convolution]] and [[Pooling | pooling]], together with the features you learned on 8x8 patches sampled from STL-10 images in [[Exercise:Learning color features with Sparse Autoencoders | the earlier exercise on linear decoders]], to classify images from a reduced STL-10 dataset. The reduced STL-10 dataset comprises 64x64 images from 4 classes (airplane, car, cat, dog).<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/cnn_exercise.zip cnn_exercise.zip]</tt>, we have provided some starter code. You should write your code at the places indicated by "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>cnnConvolve.m</tt>''' and '''<tt>cnnPool.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stlSubset.zip A subset of the STL10 Dataset (stlSubset.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/cnn_exercise.zip Starter Code (cnn_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderLinear.m</tt> or your saved features from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
* <tt>feedForwardAutoencoder.m</tt> (and related functions) from [[Exercise:Self-Taught Learning]]<br />
* <tt>softmaxTrain.m</tt> (and related functions) from [[Exercise:Softmax Regression]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 1: Load learned features ===<br />
<br />
In this step, you will use the features from [[Exercise:Learning color features with Sparse Autoencoders]]. If you have completed that exercise, you can load the color features you previously saved. To verify that the features are good, check that they look like the following when visualized:<br />
<br />
[[File:CNN_Features_Good.png|300px]]<br />
<br />
=== Step 2: Implement and test convolution and pooling ===<br />
<br />
In this step, you will implement convolution and pooling, and test them on a small part of the data set to ensure that you have implemented these two functions correctly. In the next step, you will actually convolve and pool the features with the STL-10 images.<br />
<br />
==== Step 2a: Implement convolution ====<br />
<br />
Implement convolution, as described in [[feature extraction using convolution]], in the function <tt>cnnConvolve</tt> in <tt>cnnConvolve.m</tt>. Implementing convolution is somewhat involved, so we will guide you through the process below.<br />
<br />
First, we want to compute <math>\sigma(Wx_{(r,c)} + b)</math> for all ''valid'' <math>(r, c)</math> (''valid'' meaning that the entire 8x8 patch is contained within the image; this is as opposed to a ''full'' convolution, which allows the patch to extend outside the image, with the area outside the image assumed to be 0), where <math>W</math> and <math>b</math> are the learned weights and biases from the input layer to the hidden layer, and <math>x_{(r,c)}</math> is the 8x8 patch with its upper-left corner at <math>(r, c)</math>. To accomplish this, one naive method is to loop over all such patches and compute <math>\sigma(Wx_{(r,c)} + b)</math> for each of them; while this is fine in theory, it can be very slow. Hence, we usually use MATLAB's built-in convolution functions, which are well optimized.<br />
<br />
Observe that the convolution above can be broken down into the following three small steps. First, compute <math>Wx_{(r,c)}</math> for all <math>(r, c)</math>. Next, add b to all the computed values. Finally, apply the sigmoid function to the resulting values. This doesn't seem to buy you anything, since the first step still requires a loop. However, you can replace the loop in the first step with one of MATLAB's optimized convolution functions, <tt>conv2</tt>, speeding up the process significantly.<br />
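<br />
As a quick illustration, the following is a minimal sketch (using made-up sizes, and not part of the starter code) of the loop-versus-<tt>conv2</tt> comparison for a single feature on a single-channel image; note that the feature must be flipped first, for the reason explained below:<br />
<br />
<syntaxhighlight lang="matlab">
% Toy comparison: convolving one 8x8 feature over a single-channel image.
image = rand(96, 96);
W = rand(8, 8);                          % one learned feature (weights only, no bias)

% Naive version: loop over every valid 8x8 patch
naive = zeros(89, 89);
for r = 1:89
  for c = 1:89
    patch = image(r:r+7, c:c+7);
    naive(r, c) = sum(sum(W .* patch));  % W x_{(r,c)} for this patch
  end
end

% conv2 version: flip W (see the tip below), then make a single call
fast = conv2(image, flipud(fliplr(W)), 'valid');

% The two agree up to numerical precision
disp(max(abs(naive(:) - fast(:))));
</syntaxhighlight>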
<br />
However, there are two important points to note in using <tt>conv2</tt>. <br />
<br />
First, <tt>conv2</tt> performs a 2-D convolution, but you have 5 "dimensions" - image number, feature number, row of image, column of image, and (color) channel of image - that you want to convolve over. Because of this, you will have to convolve each feature and image channel separately for each image, using the row and column of the image as the 2 dimensions you convolve over. This means that you will need three outer loops over the image number <tt>imageNum</tt>, feature number <tt>featureNum</tt>, and the channel number of the image <tt>channel</tt>. Inside the three nested for-loops, you will perform a <tt>conv2</tt> 2-D convolution, using the weight matrix for the <tt>featureNum</tt>-th feature and <tt>channel</tt>-th channel, and the image matrix for the <tt>imageNum</tt>-th image. <br />
<br />
Second, because of the mathematical definition of convolution, the feature matrix must be "flipped" before passing it to <tt>conv2</tt>. The following implementation tip explains the "flipping" of feature matrices when using MATLAB's convolution functions:<br />
<br />
<div style="border:1px solid black; padding: 5px"><br />
<br />
'''Implementation tip:''' Using <tt>conv2</tt> and <tt>convn</tt><br />
<br />
Because the mathematical definition of convolution involves "flipping" the matrix to convolve with (reversing its rows and its columns), to use MATLAB's convolution functions, you must first "flip" the weight matrix so that when MATLAB "flips" it according to the mathematical definition the entries will be at the correct place. For example, suppose you wanted to convolve two matrices <tt>image</tt> (a large image) and <tt>W</tt> (the feature) using <tt>conv2(image, W)</tt>, and W is a 3x3 matrix as below:<br />
<br />
<math><br />
W = <br />
\begin{pmatrix}<br />
1 & 2 & 3 \\<br />
4 & 5 & 6 \\<br />
7 & 8 & 9 \\<br />
\end{pmatrix}<br />
</math><br />
<br />
If you use <tt>conv2(image, W)</tt>, MATLAB will first "flip" <tt>W</tt>, reversing its rows and columns, before convolving <tt>W</tt> with <tt>image</tt>, as below:<br />
<br />
<math><br />
\begin{pmatrix}<br />
1 & 2 & 3 \\<br />
4 & 5 & 6 \\<br />
7 & 8 & 9 \\<br />
\end{pmatrix}<br />
<br />
\xrightarrow{flip}<br />
<br />
\begin{pmatrix}<br />
9 & 8 & 7 \\<br />
6 & 5 & 4 \\<br />
3 & 2 & 1 \\<br />
\end{pmatrix}<br />
</math><br />
<br />
If the original layout of <tt>W</tt> was correct, after flipping, it would be incorrect. For the layout to be correct after flipping, you will have to flip <tt>W</tt> before passing it into <tt>conv2</tt>, so that after MATLAB flips <tt>W</tt> in <tt>conv2</tt>, the layout will be correct. For <tt>conv2</tt>, this means reversing the rows and columns, which can be done with <tt>flipud</tt> and <tt>fliplr</tt>, as shown below:<br />
<br />
<syntaxhighlight lang="matlab"><br />
% Flip W for use in conv2<br />
W = flipud(fliplr(W));<br />
</syntaxhighlight><br />
<br />
</div><br />
<br />
Next, to each of the <tt>convolvedFeatures</tt>, add <tt>b</tt>, the corresponding bias for the <tt>featureNum</tt>-th feature.<br />
<br />
There is one additional complication, however. If you had not done any preprocessing of the input patches, you could simply follow the procedure described above, apply the sigmoid function to obtain the convolved features, and be done. However, because you preprocessed the patches before learning features on them, you must also apply the same preprocessing steps to the image patches being convolved to get the correct feature activations.<br />
<br />
In particular, you did the following to the patches:<br />
<ol><br />
<li> subtract the mean patch, <tt>meanPatch</tt>, to zero the mean of the patches<br />
<li> ZCA whiten using the whitening matrix <tt>ZCAWhite</tt>.<br />
</ol><br />
These same two steps must also be applied to the input image patches.<br />
<br />
Taking the preprocessing steps into account, the feature activations that you should compute are <math>\sigma(W(T(x-\bar{x})) + b)</math>, where <math>T</math> is the whitening matrix and <math>\bar{x}</math> is the mean patch. Expanding this, you obtain <math>\sigma(WTx - WT\bar{x} + b)</math>, which suggests that you should convolve the images with <math>WT</math> rather than <math>W</math> as earlier, and that you should add <math>(b - WT\bar{x})</math>, rather than just <math>b</math>, to <tt>convolvedFeatures</tt> before finally applying the sigmoid function.<br />
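<br />
Putting the pieces together, here is a minimal sketch of how <tt>cnnConvolve</tt> might be organized. The variable names and array layouts (<tt>images</tt> as imageDim x imageDim x 3 x numImages, <tt>convolvedFeatures</tt> indexed as featureNum x imageNum x row x column, patches vectorized one color channel at a time) are assumptions for illustration only; follow the conventions actually used in the starter code when you implement it.<br />
<br />
<syntaxhighlight lang="matlab">
% Sketch only -- adapt to the starter code's actual conventions.
% Assumed inputs: images (imageDim x imageDim x 3 x numImages),
% W (numFeatures x patchDim*patchDim*3), b (numFeatures x 1),
% ZCAWhite (the whitening matrix T), meanPatch (the mean patch, a column vector).
patchDim    = 8;
imageDim    = size(images, 1);
numImages   = size(images, 4);
numFeatures = size(W, 1);

WT = W * ZCAWhite;                       % effective weights  W*T
bT = b - WT * meanPatch;                 % effective bias     b - W*T*xbar

convolvedDim      = imageDim - patchDim + 1;
convolvedFeatures = zeros(numFeatures, numImages, convolvedDim, convolvedDim);

for imageNum = 1:numImages
  for featureNum = 1:numFeatures
    convolvedImage = zeros(convolvedDim, convolvedDim);
    for channel = 1:3
      % Pull out this feature's weights for this channel and flip them for conv2
      offset  = (channel - 1) * patchDim * patchDim;
      feature = reshape(WT(featureNum, offset+1 : offset+patchDim*patchDim), ...
                        patchDim, patchDim);
      feature = flipud(fliplr(feature));
      im = images(:, :, channel, imageNum);
      convolvedImage = convolvedImage + conv2(im, feature, 'valid');
    end
    % Add the adjusted bias and apply the sigmoid
    convolvedFeatures(featureNum, imageNum, :, :) = ...
        1 ./ (1 + exp(-(convolvedImage + bT(featureNum))));
  end
end
</syntaxhighlight>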
<br />
==== Step 2b: Check your convolution ====<br />
<br />
We have provided some code for you to check that you have done the convolution correctly. For a number of randomly selected (feature, row, column) tuples, the code compares your convolved values against the feature activations computed directly on the corresponding patches with <tt>feedForwardAutoencoder</tt>.<br />
<br />
==== Step 2c: Pooling ====<br />
<br />
Implement [[pooling]] in the function <tt>cnnPool</tt> in <tt>cnnPool.m</tt>. You should implement ''mean'' pooling (i.e., averaging over feature responses) for this part.<br />
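<br />
The following is a minimal sketch of mean pooling over non-overlapping regions (not the starter code). It assumes <tt>convolvedFeatures</tt> is indexed as featureNum x imageNum x row x column and that the convolved dimension is divisible by <tt>poolDim</tt>; adapt it to the starter code's conventions.<br />
<br />
<syntaxhighlight lang="matlab">
% Sketch only -- mean pooling over non-overlapping poolDim x poolDim regions.
numFeatures  = size(convolvedFeatures, 1);
numImages    = size(convolvedFeatures, 2);
convolvedDim = size(convolvedFeatures, 3);
numRegions   = floor(convolvedDim / poolDim);

pooledFeatures = zeros(numFeatures, numImages, numRegions, numRegions);
for imageNum = 1:numImages
  for featureNum = 1:numFeatures
    for r = 1:numRegions
      for c = 1:numRegions
        rows   = (r-1)*poolDim + 1 : r*poolDim;
        cols   = (c-1)*poolDim + 1 : c*poolDim;
        region = convolvedFeatures(featureNum, imageNum, rows, cols);
        pooledFeatures(featureNum, imageNum, r, c) = mean(region(:));
      end
    end
  end
end
</syntaxhighlight>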
<br />
==== Step 2d: Check your pooling ====<br />
<br />
We have provided some code for you to check that you have done the pooling correctly. The code runs <tt>cnnPool</tt> against a test matrix to see if it produces the expected result.<br />
<br />
=== Step 3: Convolve and pool with the dataset ===<br />
<br />
In this step, you will convolve each of the features you learned with the full 64x64 images from the STL-10 dataset to obtain the convolved features for both the training and test sets. You will then pool the convolved features to obtain the pooled features for both training and test sets. The pooled features for the training set will be used to train your classifier, which you can then test on the test set.<br />
<br />
Because the convolved features matrix is very large, the code provided does the convolution and pooling 50 features at a time to avoid running out of memory.<br />
<br />
=== Step 4: Use pooled features for classification ===<br />
<br />
In this step, you will train a softmax classifier that maps the pooled features to the class labels. The code in this section uses <tt>softmaxTrain</tt> from the softmax exercise to train the classifier on the pooled training features for 500 iterations, which should take around 5 minutes.<br />
<br />
=== Step 5: Test classifier ===<br />
<br />
Now that you have a trained softmax classifier, you can see how well it performs on the test set. The pooled features for the test set will be run through the softmax classifier, and the accuracy of the predictions will be computed. You should expect to get an accuracy of around 80%.
<br />
= Pooling =
== Pooling: Overview ==<br />
<br />
After obtaining features using convolution, we would next like to use them for classification. In theory, one could use all the extracted features with a classifier such as a softmax classifier, but this can be computationally challenging. Consider for instance images of size 96x96 pixels, and suppose we have learned 400 features over 8x8 inputs. Each convolution results in an output of size <math>(96-8+1)*(96-8+1)=7921</math>, and since we have 400 features, this results in a vector of <math>89^2 * 400 = 3,168,400</math> features per example. Learning a classifier with inputs having 3+ million features can be unwieldy, and can also be prone to over-fitting. <br />
<br />
To address this, first recall that we decided to obtain convolved features because images have the "stationarity" property, which implies that features that are useful in one region are also likely to be useful for other regions. Thus, to describe a large image, one natural approach is to aggregate statistics of these features at various locations. For example, one could compute the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results (less over-fitting). This aggregation operation is called '''pooling''', or sometimes '''mean pooling''' or '''max pooling''' (depending on the pooling operation applied).<br />
<br />
The following image shows how pooling is done over 4 non-overlapping regions of the image.<br />
<br />
[[File:Pooling_schematic.gif]]<br />
<br />
== Pooling for Invariance ==<br />
<br />
If one chooses the pooling regions to be contiguous areas in the image and only pools features generated from the same (replicated) hidden units, then these pooling units will be '''translation invariant'''. This means that the same (pooled) feature will be active even when the image undergoes (small) translations. Translation-invariant features are often desirable; in many tasks (e.g., object detection, audio recognition), the label of the example (image) is the same even when the image is translated. For example, if you were to take an MNIST digit and translate it left or right, you would want your classifier to still accurately classify it as the same digit regardless of its final position.<br />
<br />
== Formal description ==<br />
<br />
Formally, after obtaining our convolved features as described earlier, we decide the size of the region, say <math>m \times n</math>, to pool our convolved features over. Then, we divide our convolved features into disjoint <math>m \times n</math> regions, and take the mean (or maximum) feature activation over these regions to obtain the pooled convolved features. These pooled features can then be used for classification.
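<br />
As a concrete illustration, here is a small sketch of mean pooling for a single convolved feature map, assuming the map is stored as a 2-D matrix whose side length is a multiple of the pooling size (the numbers are made up):<br />
<br />
<syntaxhighlight lang="matlab">
% Mean-pool one convolved feature map over non-overlapping 3x3 regions.
poolDim    = 3;
featureMap = rand(9, 9);                        % stand-in convolved feature map
avgKernel  = ones(poolDim) / poolDim^2;         % averaging filter
smoothed   = conv2(featureMap, avgKernel, 'valid');
% Keep only the entries whose poolDim x poolDim regions do not overlap
pooled     = smoothed(1:poolDim:end, 1:poolDim:end);   % 3x3 pooled result
</syntaxhighlight>
<br />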
<br />
= Feature extraction using convolution =
== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale these ideas up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. Suppose further that this was done with an autoencoder that has 100 hidden units. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch and run it through your trained sparse autoencoder to get the feature activations. This would result in 100 sets of 89x89 convolved features.<br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
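<br />
The following short sketch spells out this definition for a single-channel image; the variables <tt>k</tt>, <tt>a</tt>, <tt>b</tt>, <tt>W1</tt>, <tt>b1</tt>, and <tt>xLarge</tt> correspond to the symbols <math>k</math>, <math>a</math>, <math>b</math>, <math>W^{(1)}</math>, <math>b^{(1)}</math>, and <math>x_{large}</math> above, and the names are illustrative rather than taken from any provided code:<br />
<br />
<syntaxhighlight lang="matlab">
% Sketch: convolved features for a single-channel r x c image xLarge,
% given k features W1 (k x a*b) and biases b1 (k x 1) learned on a x b patches.
[r, c] = size(xLarge);
fConvolved = zeros(k, r - a + 1, c - b + 1);
for i = 1:(r - a + 1)
  for j = 1:(c - b + 1)
    patch = xLarge(i:i+a-1, j:j+b-1);
    xs = patch(:);                                        % vectorize the a x b patch
    fConvolved(:, i, j) = 1 ./ (1 + exp(-(W1*xs + b1)));  % sigma(W1*xs + b1)
  end
end
</syntaxhighlight>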
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:13:31Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:12:56Z<p>Ang: /* Locally Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:59Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden units as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:31Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden unit as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:04Z<p>Ang: /* Locally Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden unit as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:03:02Z<p>Ang: /* Fully Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:02:19Z<p>Ang: /* Overview */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On relatively small images (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it is computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T17:55:59Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which allow us to scale up to work with more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On relatively small images (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it is computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (or tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-27T17:38:43Z<p>Ang: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around, however, and feedback/suggestions are welcome. <br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
<br />
----<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
[[Useful Links]]<br />
<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:38:24Z<p>Ang: /* Step 2: Learn features on small patches */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just concatenate the pixel intensities from all three color channels into one long vector, as if you were working with a grayscale image with three times as many pixels as the original image. <br />
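<br />
For instance, here is a minimal Octave/MATLAB sketch of this flattening for a single 8x8 color patch (the variable name <tt>patchRGB</tt> is just a placeholder):<br />
<pre>
% Unroll an 8x8x3 RGB patch into a single 192-dimensional column vector.
patchRGB = rand(8, 8, 3);   % stand-in for a sampled color patch
x = patchRGB(:);            % 192 x 1 vector: all red pixels, then green, then blue
</pre><br />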
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
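<br />
For reference, the change typically amounts to a couple of lines in the output-layer forward pass and error term. The following is only a hedged sketch, assuming your code uses variables named <tt>z3</tt>, <tt>a3</tt>, <tt>delta3</tt>, and <tt>data</tt> as in the earlier exercise:<br />
<pre>
% Sigmoid output layer (original sparse autoencoder):
%   a3 = sigmoid(z3);
%   delta3 = -(data - a3) .* a3 .* (1 - a3);

% Linear decoder: the output activation is the identity, so the
% sigmoid-derivative factor disappears from the output error term.
a3 = z3;
delta3 = -(data - a3);
</pre><br />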
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which, when visualized, look like edges and "opponent colors," as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might instead get images that look like one of the following:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:33:05Z<p>Ang: /* Step 1: Modify your sparse autoencoder to use a linear decoder */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just concatenate the pixel intensities from all three color channels into one long vector, as if you were working with a grayscale image with three times as many pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test 96x96 labelled color images belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck). <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which, when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:30:45Z<p>Ang: /* Step 0: Initialization */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just concatenate the pixel intensities from all three color channels into one long vector, as if you were working with a grayscale image with three times as many pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test 96x96 labelled color images belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck). <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which, when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:30:34Z<p>Ang: /* Step 0: Initialization */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just concatenate the pixel intensities from all three color channels into one long vector, as if you were working with a grayscale image with three times as many pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test 96x96 labelled color images belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck). <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which, when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-17T19:26:26Z<p>Ang: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around, however, and feedback/suggestions are welcome. <br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Linear Decoders]]<br />
* [[Exercise:Convolution and Pooling]]<br />
* [[Multiple layers of convolution and pooling]]<br />
<br />
----<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Stacked_AutoencodersStacked Autoencoders2011-05-13T20:48:43Z<p>Ang: /* Motivation */</p>
<hr />
<div>===Overview===<br />
<br />
The greedy layerwise approach for pretraining a deep network works by training each layer in turn. On this page, you will find out how autoencoders can be "stacked" in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network.<br />
<br />
A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with <math>n</math> layers. Using notation from the autoencoder section, let <math>W^{(k, 1)}, W^{(k, 2)}, b^{(k, 1)}, b^{(k, 2)}</math> denote the parameters <math>W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}</math> for the <math>k</math>th autoencoder. Then the encoding step for the stacked autoencoder is given by running the encoding step of each layer in forward order:<br />
<br />
<math><br />
\begin{align}<br />
a^{(l)} = f(z^{(l)}) \\<br />
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, 1)}<br />
\end{align}<br />
</math><br />
<br />
The decoding step is given by running the decoding stack of each autoencoder in reverse order:<br />
<br />
<math><br />
\begin{align}<br />
a^{(n + l)} = f(z^{(n + l)}) \\<br />
z^{(n + l + 1)} = W^{(n - l, 2)}a^{(n + l)} + b^{(n - l, 2)}<br />
\end{align}<br />
</math><br />
<br />
The information of interest is contained within <math>a^{(n)}</math>, which is the activation of the deepest layer of hidden units. This vector gives us a representation of the input in terms of higher-order features. <br />
<br />
The features from the stacked autoencoder can be used for classification problems by feeding <math>a^{(n)}</math> to a softmax classifier.<br />
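<br />
For concreteness, here is a minimal sketch of the encoding pass in Octave/MATLAB, assuming the encoding weights and biases of the <math>n</math> trained autoencoders are stored in cell arrays <tt>W{l}</tt> and <tt>b{l}</tt> (these names are placeholders):<br />
<pre>
% Forward (encoding) pass through n stacked, already-trained autoencoders.
a = x;                      % start from the raw input
for l = 1:n
  z = W{l} * a + b{l};      % z = W^(l,1) * a + b^(l,1)
  a = 1 ./ (1 + exp(-z));   % a = f(z), with f the sigmoid
end
% a now holds the deepest hidden-layer activation, which can be
% fed to a softmax classifier.
</pre><br />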
<br />
===Training===<br />
A good way to obtain good parameters for a stacked autoencoder is to use greedy layer-wise training. To do this, first train the first layer on the raw input to obtain parameters <math>W^{(1, 1)}, W^{(1, 2)}, b^{(1, 1)}, b^{(1, 2)}</math>. Then use the first layer to transform the raw input into a vector of hidden unit activations. Train the second layer on this vector to obtain parameters <math>W^{(2, 1)}, W^{(2, 2)}, b^{(2, 1)}, b^{(2, 2)}</math>. Repeat for subsequent layers, using the output of each layer as the input to the subsequent layer.<br />
<br />
This method trains the parameters of each layer individually while freezing parameters for the remainder of the model. To produce better results, after this phase of training is complete, [[Fine-tuning Stacked AEs | fine-tuning]] using backpropagation can be used to improve the results by tuning the parameters of all layers at the same time. <br />
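<br />
As an illustration of the procedure, here is a hypothetical sketch; the function names <tt>trainSparseAutoencoder</tt> and <tt>feedForwardAutoencoder</tt> are placeholders for whatever training and feedforward routines you use, not exact starter-code interfaces.<br />
<pre>
% Greedy layer-wise pretraining: each autoencoder is trained on the
% activations produced by the previous one, with earlier layers frozen.
theta1    = trainSparseAutoencoder(trainData, inputSize, hiddenSizeL1);
features1 = feedForwardAutoencoder(theta1, hiddenSizeL1, inputSize, trainData);

theta2    = trainSparseAutoencoder(features1, hiddenSizeL1, hiddenSizeL2);
features2 = feedForwardAutoencoder(theta2, hiddenSizeL2, hiddenSizeL1, features1);

% features2 would then be used to train a softmax classifier, after which
% the whole stack can be fine-tuned with backpropagation.
</pre><br />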
<br />
<!-- In practice, fine-tuning should be used when the parameters have been brought close to convergence through layer-wise training. Attempting to use fine-tuning with the weights initialized randomly will lead to poor results due to local optima. --><br />
<br />
{{Quote|<br />
If one is only interested in finetuning for the purposes of classification, the common practice is to then discard the "decoding" layers of the stacked autoencoder and link the last hidden layer <math>a^{(n)}</math> to the softmax classifier. The gradients from the (softmax) classification error will then be backpropagated into the encoding layers.<br />
}}<br />
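<br />
To illustrate this setup, the sketch below (an illustration, not the tutorial's implementation) performs one fine-tuning gradient step on a two-hidden-layer encoder stack whose decoding layers have been discarded, backpropagating the softmax classification error into the encoding layers; the sigmoid activations, cross-entropy loss, layer sizes and learning rate are assumptions made for this example.<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def finetune_step(x, y, W1, b1, W2, b2, Ws, bs, lr=0.1):
    """One gradient step on the classification loss for a two-hidden-layer
    encoder stack plus softmax (decoding layers discarded)."""
    # Forward pass through the encoding layers only.
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)                 # a^(n), fed to the classifier
    p = softmax(Ws @ a3 + bs)
    # Backpropagate the softmax classification error into the encoders.
    d_scores = p.copy(); d_scores[y] -= 1.0    # gradient of cross-entropy wrt scores
    delta3 = (Ws.T @ d_scores) * a3 * (1 - a3)
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)
    Ws -= lr * np.outer(d_scores, a3); bs -= lr * d_scores
    W2 -= lr * np.outer(delta3, a2);   b2 -= lr * delta3
    W1 -= lr * np.outer(delta2, x);    b1 -= lr * delta2
    return -np.log(p[y])                       # cross-entropy loss, for monitoring

# Toy usage with randomly initialized (in practice, pretrained) parameters.
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((32, 64)), np.zeros(32)
W2, b2 = 0.01 * rng.standard_normal((16, 32)), np.zeros(16)
Ws, bs = 0.01 * rng.standard_normal((10, 16)), np.zeros(10)
loss = finetune_step(rng.standard_normal(64), 3, W1, b1, W2, b2, Ws, bs)
</pre>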
<br />
===Concrete example===<br />
<br />
To give a concrete example, suppose you wished to train a stacked autoencoder with 2 hidden layers for classification of MNIST digits, as you will be doing in [[Exercise: Implement deep networks for digit classification | the next exercise]]. <br />
<br />
First, you would train a sparse autoencoder on the raw inputs <math>x^{(k)}</math> to learn primary features <math>h^{(1)(k)}</math>.<br />
<br />
[[File:Stacked_SparseAE_Features1.png|400px]]<br />
<br />
Next, you would feed the raw input into this trained sparse autoencoder, obtaining the primary feature activations <math>h^{(1)(k)}</math> for each of the inputs <math>x^{(k)}</math>. You would then use these primary features as the "raw input" to another sparse autoencoder to learn secondary features <math>h^{(2)(k)}</math> on these primary features.<br />
<br />
[[File:Stacked_SparseAE_Features2.png|400px]]<br />
<br />
Following this, you would feed the primary features into the second sparse autoencoder to obtain the secondary feature activations <math>h^{(2)(k)}</math> for each of the primary features <math>h^{(1)(k)}</math> (and hence, indirectly, for each of the original inputs <math>x^{(k)}</math>). You would then treat these secondary features as "raw input" to a softmax classifier, training it to map secondary features to digit labels.<br />
<br />
[[File:Stacked_Softmax_Classifier.png|400px]]<br />
<br />
Finally, you would combine all three layers together to form a stacked autoencoder with 2 hidden layers and a final softmax classifier layer capable of classifying the MNIST digits as desired.<br />
<br />
[[File:Stacked_Combined.png|500px]]<br />
<br />
===Discussion===<br />
<br />
A stacked autoencoder enjoys all the benefits of any deep network, including greater expressive power. <br />
<br />
Further, it often captures a useful "hierarchical grouping" or "part-whole decomposition" of the input. To see this, recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input (such as edges in an image). The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features (e.g., in terms of what edges tend to occur together--for example, to form contour or corner detectors). Higher layers of the stacked autoencoder tend to learn even higher-order features. <br />
<br />
<!-- <br />
For instance, in the context of image input, the first layers usually learns to recognize edges. The second layer usually learns features that arise from combinations of the edges, such as corners. With certain types of network configuration and input modes, the higher layers can learn meaningful combinations of features. For instance, if the input set consists of images of faces, higher layers may learn features corresponding to parts of the face such as eyes, noses or mouths.<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:40:42Z<p>Ang: /* Greedy layer-wise training */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
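<br />
A quick numerical check of this point (not from the tutorial): stacking two linear layers is exactly equivalent to a single linear layer whose weight matrix is the product of the two.<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 8))   # first "hidden" layer with a linear activation
W2 = rng.standard_normal((3, 5))   # second linear layer
x = rng.standard_normal(8)

two_layers = W2 @ (W1 @ x)         # two stacked linear layers ...
one_layer = (W2 @ W1) @ x          # ... collapse to a single linear map

print(np.allclose(two_layers, one_layer))   # True: no extra expressive power
</pre>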
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
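<br />
The contrast can be made concrete with a small illustration (not from the tutorial): a two-level OR-of-ANDs circuit for n-bit parity needs one AND term per odd-parity input pattern, i.e. 2^(n-1) terms, whereas a deeper circuit that simply chains XOR gates (each XOR built from a constant number of AND/OR/NOT gates) needs only on the order of n gates.<br />
<pre>
from itertools import product

n = 4

# Shallow, two-level circuit (OR of ANDs): one AND term per odd-parity pattern,
# so the number of terms grows as 2^(n-1).
odd_patterns = [bits for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1]
print(len(odd_patterns), "AND terms for n =", n)       # 8 == 2^(n-1)

# Deep circuit: chain XOR gates; each XOR below is a constant-size AND/OR/NOT
# sub-circuit, so the total size grows only linearly in n.
def deep_parity(bits):
    acc = 0
    for b in bits:
        acc = int((acc and not b) or (not acc and b))  # XOR from AND/OR/NOT
    return acc

assert all(deep_parity(bits) == sum(bits) % 2 for bits in product([0, 1], repeat=n))
</pre>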
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main learning algorithm that researchers were using was to randomly initialize<br />
the weights of a deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, for example by applying gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus for many<br />
problems it is difficult to get enough examples to fit the parameters of a<br />
complex model. Moreover, given the high expressive power of deep networks, training<br />
on insufficient data would also result in overfitting. <br />
<br />
===Local optima=== <br />
<br />
Training a shallow network (with 1 hidden layer) using<br />
supervised learning usually resulted in the parameters converging to reasonable values;<br />
but when we are training a deep network, this works much less well. <br />
In particular, training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). <br />
In a deep network, this problem turns out to be rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
no longer works well. <br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on deep networks<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
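<br />
The effect is easy to see numerically. The sketch below (illustrative only; the depth, width, weight scale and loss are arbitrary choices) backpropagates a squared-error gradient through a stack of sigmoid layers with small random weights and prints the norm of the backpropagated error term at each layer, which shrinks rapidly toward the earlier layers.<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width = 10, 50
Ws = [0.1 * rng.standard_normal((width, width)) for _ in range(depth)]

# Forward pass, keeping every layer's activations.
a = [rng.standard_normal(width)]
for W in Ws:
    a.append(sigmoid(W @ a[-1]))

# Backward pass for a squared-error loss against a random target.
delta = (a[-1] - rng.standard_normal(width)) * a[-1] * (1 - a[-1])
for l in range(depth - 1, 0, -1):
    print(f"layer {l + 1}: ||delta|| = {np.linalg.norm(delta):.2e}")
    delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])
print(f"layer 1: ||delta|| = {np.linalg.norm(delta):.2e}")
</pre>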
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers). <br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How can we train a deep network? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, so that we first train a network with 1 <br />
hidden layer, and only after that is done, train a network with 2 hidden layers,<br />
and so on. At each step, we take the old network with <math>k-1</math> hidden<br />
layers, and add an additional <math>k</math>-th hidden layer (that takes as <br />
input the previous hidden layer <math>k-1</math> that we had just<br />
trained). Training can be <br />
supervised (say, with classification error as the objective function on each<br />
step), but more frequently it is <br />
unsupervised (as in an autoencoder; details to be provided later). <br />
The weights from training the layers individually are then used to initialize the weights <br />
in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the labeled training set error). <br />
<br />
The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches. This often results in much better classifiers <br />
being learned. <br />
<br />
===Better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We can then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data. <br />
<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training. <br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:33:32Z<p>Ang: /* Difficulty of training deep architectures */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main learning algorithm that researchers were using was to randomly initialize<br />
the weights of a deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, for example by applying gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus for many<br />
problems it is difficult to get enough examples to fit the parameters of a<br />
complex model. Moreover, given the high expressive power of deep networks, training<br />
on insufficient data would also result in overfitting. <br />
<br />
===Local optima=== <br />
<br />
Training a shallow network (with 1 hidden layer) using<br />
supervised learning usually resulted in the parameters converging to reasonable values;<br />
but when we are training a deep network, this works much less well. <br />
In particular, training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). <br />
In a deep network, this problem turns out to be rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
no longer works well. <br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on deep networks<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers). <br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:25:20Z<p>Ang: /* Difficulty of training deep architectures */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main learning algorithm that researchers were using was to randomly initialize<br />
the weights of a deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:24:38Z<p>Ang: /* Advantages of deep networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:23:17Z<p>Ang: /* Advantages of deep networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:22:54Z<p>Ang: /* Advantages of deep networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
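<br />
To see this concretely, suppose for a moment that each hidden layer used the identity activation. Then composing two affine layers (written here with the <math>W^{(l)}, b^{(l)}</math> notation from the earlier sections) gives<br />
<br />
:<math>W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)} \;=\; \left(W^{(2)} W^{(1)}\right) x + \left(W^{(2)} b^{(1)} + b^{(2)}\right),</math><br />
<br />
which is again just a single affine (linear) function of <math>x</math>, no matter how many such layers are stacked.<br />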
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
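<br />
As a small worked illustration of this claim (an example added here for concreteness, not part of the original circuit-complexity argument), take <math>n = 3</math>. Written as a two-level OR of ANDs, parity needs one AND term for every odd-parity input pattern:<br />
<br />
:<math>x_1 \oplus x_2 \oplus x_3 \;=\; (x_1 \wedge \neg x_2 \wedge \neg x_3) \vee (\neg x_1 \wedge x_2 \wedge \neg x_3) \vee (\neg x_1 \wedge \neg x_2 \wedge x_3) \vee (x_1 \wedge x_2 \wedge x_3).</math><br />
<br />
In general a two-level circuit of this form needs <math>2^{n-1}</math> such terms, whereas a deeper circuit can compute the same function with a tree of pairwise XORs using only <math>O(n)</math> gates.<br />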
<br />
By using a deep network, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
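<br />
The effect is easy to see numerically. The following small Octave/Matlab snippet (purely illustrative; the network size, weight scale and activation are arbitrary choices, not part of this tutorial's code) backpropagates an error signal through several randomly initialized sigmoid layers and prints how quickly its norm shrinks:<br />
<br />
<pre><br />
% Illustrative demo of the "diffusion of gradients": with sigmoid units and<br />
% small random weights, the backpropagated error term shrinks at every layer.<br />
L = 8;  n = 50;                              % 8 sigmoid layers of 50 units each<br />
a = cell(L+1, 1);  W = cell(L, 1);<br />
a{1} = rand(n, 1);                           % an arbitrary input vector<br />
for l = 1:L                                  % forward pass<br />
    W{l} = 0.1 * randn(n, n);<br />
    a{l+1} = 1 ./ (1 + exp(-W{l} * a{l}));   % sigmoid activation<br />
end<br />
delta = randn(n, 1);                         % pretend error signal at the output<br />
for l = L:-1:1                               % backward pass (chain rule)<br />
    delta = W{l}' * (delta .* a{l+1} .* (1 - a{l+1}));   % f'(z) = a .* (1 - a)<br />
    fprintf('norm of delta entering layer %d: %g\n', l, norm(delta));<br />
end<br />
</pre><br />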
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn from and discover patterns in far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:22:17Z<p>Ang: /* Advantages of deep networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only 1 hidden layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn from and discover patterns in far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:21:42Z<p>Ang: /* Advantages of deep networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the logical<br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only 1 hidden layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn from and discover patterns in far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T20:21:00Z<p>Ang: /* Overview */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean network/circuit to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the logical<br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only 1 hidden layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn from and discover patterns in far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-13T19:39:50Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, the<br />
3-layer network is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers, so that we use multiple layers of<br />
computation to compute increasingly complex features from the input. Each<br />
hidden layer computes a non-linear transformation of the previous layer. By<br />
using more hidden layers, deep networks can have significantly greater<br />
expressive power (i.e., can learn significantly more complex functions)<br />
than simple ones.<br />
<br />
When training a deep network, it is important that we use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would themselves compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean network/circuit to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the logical<br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only 1 hidden layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
<br />
By using a deep network, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges. The second layer might then group together edges to<br />
detect longer contours, or perhaps simple "object parts." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on.<br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main method that researchers were using was to randomly initialize<br />
the weights of the deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, using gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus it<br />
is easy to overfit the training data and obtain a model which does not<br />
generalize well.<br />
<br />
===Local optima=== <br />
<br />
Training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). When the<br />
network is deep, this optimization problem is rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
does not work well.<br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on a deep network<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers).<br />
<br />
<!--<br />
When the last layer is used<br />
for classification, often training a network like this results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How should deep architectures be trained then? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, with the input of each layer being the<br />
output of the previous layer (which has been trained). Training can either be<br />
supervised (say, with classification error as the objective function), or<br />
unsupervised (say, with the error of the layer in reconstructing its input as<br />
the objective function, as in an autoencoder). The weights from training the<br />
layers individually are then used to initialize the weights in the deep<br />
architecture, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the training set error). The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn from and discover patterns in far more data than<br />
purely supervised approaches, and thus often results in much better hypotheses.<br />
<br />
===Regularization and better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We usually then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is also much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data.<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training.<br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:_Implement_deep_networks_for_digit_classificationExercise: Implement deep networks for digit classification2011-05-13T17:47:55Z<p>Ang: /* Step 4: Implement fine-tuning */</p>
<hr />
<div>===Overview===<br />
<br />
In this exercise, you will use a stacked autoencoder for digit classification. This exercise is very similar to the self-taught learning exercise, in which we trained a digit classifier using an autoencoder layer followed by a softmax layer. The only difference in this exercise is that we will use two autoencoder layers instead of one, and will then fine-tune the two layers together.<br />
<br />
The code you have already implemented will allow you to stack various layers and perform layer-wise training. However, to perform fine-tuning, you will need to implement backpropagation through both layers. We will see that fine-tuning significantly improves the model's performance.<br />
<br />
In the file [http://ufldl.stanford.edu/wiki/resources/stackedae_exercise.zip stackedae_exercise.zip], we have provided some starter code. You will need to complete the code in '''<tt>stackedAECost.m</tt>''', '''<tt>stackedAEPredict.m</tt>''' and '''<tt>stackedAEExercise.m</tt>'''. We have also provided <tt>params2stack.m</tt> and <tt>stack2params.m</tt> which you might find helpful in constructing deep networks.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/ MNIST Dataset]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
* [http://ufldl.stanford.edu/wiki/resources/stackedae_exercise.zip Starter Code (stackedae_exercise.zip)]<br />
<br />
You will also need your code from the following exercises:<br />
* [[Exercise:Sparse Autoencoder]]<br />
* [[Exercise:Vectorization]]<br />
* [[Exercise:Softmax Regression]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
Open <tt>stackedAEExercise.m</tt>. In this step, we set the meta-parameters to the same values that were used in the previous exercises, which should produce reasonable results. You may modify the meta-parameters if you wish.<br />
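<br />
For reference, a typical configuration looks something like the following (these particular numbers are illustrative placeholders on our part; keep whatever values the starter code already sets):<br />
<br />
<pre><br />
% Illustrative meta-parameter settings (placeholders, not the official values).<br />
inputSize     = 28 * 28;   % MNIST images are 28x28 pixels<br />
numClasses    = 10;        % digits 0-9<br />
hiddenSizeL1  = 200;       % number of hidden units in the first autoencoder<br />
hiddenSizeL2  = 200;       % number of hidden units in the second autoencoder<br />
sparsityParam = 0.1;       % desired average activation of the hidden units<br />
lambda        = 3e-3;      % weight decay parameter<br />
beta          = 3;         % weight of the sparsity penalty term<br />
</pre><br />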
<br />
=== Step 1: Train the data on the first stacked autoencoder ===<br />
<br />
Train the first autoencoder on the training images to obtain its parameters. This step is identical to the corresponding step in the sparse autoencoder and STL assignments. Complete this part of the code so as to learn a first layer of features using your <tt>sparseAutoencoderCost.m</tt> and minFunc.<br />
<br />
=== Step 2: Train the data on the second stacked autoencoder ===<br />
<br />
We first forward propagate the training set through the first autoencoder (using <tt>feedForwardAutoencoder.m</tt> that you completed in [[Exercise:Self-Taught_Learning]]) to obtain hidden unit activations. These activations are then used to train the second sparse autoencoder. Since this is just an adapted application of a standard autoencoder, it should run similarly to the first. Complete this part of the code so as to learn a second layer of features using your <tt>sparseAutoencoderCost.m</tt> and minFunc.<br />
<br />
This part of the exercise demonstrates the idea of greedy layer-wise training with the ''same'' learning algorithm reapplied multiple times.<br />
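<br />
For concreteness, Steps 1 and 2 might look roughly as follows, assuming the <tt>initializeParameters.m</tt>, <tt>sparseAutoencoderCost.m</tt> and <tt>feedForwardAutoencoder.m</tt> interfaces from the earlier exercises and minFunc as the optimizer (treat this as a sketch rather than the exact starter-code variable names):<br />
<br />
<pre><br />
% Sketch of Steps 1 and 2: greedy layer-wise training of the two autoencoders.<br />
% Function signatures are assumed to match the earlier UFLDL exercises.<br />
options = struct('Method', 'lbfgs', 'maxIter', 400);<br />
<br />
% Step 1: train the first sparse autoencoder on the raw training images.<br />
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);<br />
sae1OptTheta = minFunc(@(p) sparseAutoencoderCost(p, inputSize, hiddenSizeL1, ...<br />
    lambda, sparsityParam, beta, trainData), sae1Theta, options);<br />
<br />
% Step 2: feed the training set through the first autoencoder, then train the<br />
% second sparse autoencoder on the resulting hidden-unit activations.<br />
sae1Features = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, inputSize, trainData);<br />
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);<br />
sae2OptTheta = minFunc(@(p) sparseAutoencoderCost(p, hiddenSizeL1, hiddenSizeL2, ...<br />
    lambda, sparsityParam, beta, sae1Features), sae2Theta, options);<br />
</pre><br />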
<br />
=== Step 3: Train the softmax classifier on the L2 features ===<br />
<br />
Next, continue to forward propagate the L1 features through the second autoencoder (using <tt>feedForwardAutoencoder.m</tt>) to obtain the L2 hidden unit activations. These activations are then used to train the softmax classifier. You can either use <tt>softmaxTrain.m</tt> or directly use <tt>softmaxCost.m</tt> that you completed in [[Exercise:Softmax Regression]] to complete this part of the assignment.<br />
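<br />
Continuing the sketch above, and assuming the <tt>softmaxTrain.m</tt> interface from [[Exercise:Softmax Regression]], this step might look like:<br />
<br />
<pre><br />
% Sketch of Step 3: train the softmax classifier on the L2 features.<br />
sae2Features = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, hiddenSizeL1, ...<br />
    sae1Features);<br />
softmaxModel = softmaxTrain(hiddenSizeL2, numClasses, lambda, ...<br />
    sae2Features, trainLabels, struct('maxIter', 100));<br />
</pre><br />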
<br />
=== Step 4: Implement fine-tuning ===<br />
<br />
To implement fine-tuning, we need to consider all three layers as a single model. Implement <tt>stackedAECost.m</tt> to return the cost and gradient of the model. The cost function should be defined in terms of the log likelihood of the labels plus a weight decay term, as written out below. The gradient should be computed using [[Backpropagation Algorithm | back-propagation as discussed earlier]]. The predictions should consist of the activations of the output layer of the softmax model.<br />
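<br />
Written out (with <math>\textstyle m</math> labeled training examples and <math>\textstyle \lambda</math> the weight decay parameter; the notation here is ours, not the starter code's), the fine-tuning objective has the form<br />
<br />
:<math>J \;=\; -\frac{1}{m} \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}\right) \;+\; \frac{\lambda}{2} \sum_{\text{all weights } w} w^2 ,</math><br />
<br />
where the first term is the softmax log likelihood of the labels and the second term is the weight decay penalty discussed in the note below.<br />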
<br />
To help you check that your implementation is correct, you should also check your gradients on a small synthetic dataset. We have implemented <tt>checkStackedAECost.m</tt> to help you check your gradients. If this check passes, you will have implemented fine-tuning correctly.<br />
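<br />
Under the hood, <tt>checkStackedAECost.m</tt> performs the same numerical gradient comparison as in the earlier exercises. A sketch of the idea, run on a small synthetic dataset (the <tt>stackedAECost</tt> argument list shown here is an assumption and should match your starter code):<br />
<br />
 costFunc = @(p) stackedAECost(p, inputSize, hiddenSizeL2, numClasses, netconfig, ...<br />
                               lambda, data, labels);         % small synthetic data and labels<br />
 [cost, grad] = costFunc(stackedAETheta);<br />
 numGrad = computeNumericalGradient(costFunc, stackedAETheta);<br />
 disp(norm(numGrad - grad) / norm(numGrad + grad));           % should be very small, e.g. under 1e-9<br />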
<br />
'''Note:''' When adding the weight decay term to the cost, you should regularize '''all''' the weights in the network.<br />
<br />
'''Implementation Tip:''' It is always a good idea to implement the code modularly and check the gradient of each part of the code before writing the more complicated parts.<br />
<br />
=== Step 5: Test the model ===<br />
<br />
Finally, you will need to classify with this model; complete the code in <tt>stackedAEPredict.m</tt> to classify using the stacked autoencoder with a classification layer.<br />
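<br />
A sketch of the prediction step (illustrative names, consistent with the cost sketch above):<br />
<br />
 sigmoid = @(z) 1 ./ (1 + exp(-z));<br />
 a = data;<br />
 for d = 1:numel(stack)<br />
     a = sigmoid(bsxfun(@plus, stack{d}.w * a, stack{d}.b));<br />
 end<br />
 [~, pred] = max(softmaxTheta * a, [], 1);   % pred(i) is the predicted class of example i<br />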
<br />
After completing these steps, running the entire script in <tt>stackedAEExercise.m</tt> will perform layer-wise training of the stacked autoencoder, fine-tune the model, and measure its performance on the test set. If you've done all the steps correctly, you should get an accuracy of about 97.6% (for the 10-way classification problem).</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:57:07Z<p>Ang: </p>
<hr />
<div>In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|300px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(i)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
::::[[File:STL_Logistic_Classifier.png|380px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression).<br />
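<br />
Written out explicitly (assuming sigmoid activations <math>\textstyle \sigma(z) = 1/(1+e^{-z})</math> in both layers, as in the earlier notes, and including the bias terms <math>\textstyle b^{(1)}, b^{(2)}</math>), the combined classifier computes<br />
<br />
::<math>p(y=1|x) = \sigma\left( W^{(2)} a + b^{(2)} \right), \qquad a = \sigma\left( W^{(1)} x + b^{(1)} \right).</math><br />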
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:55:17Z<p>Ang: </p>
<hr />
<div>In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression).<br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:55:07Z<p>Ang: </p>
<hr />
<div><br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression).<br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:54:33Z<p>Ang: /* Feature Learning pipeline */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression).<br />
<br />
== Fine-tuning == <br />
But now, we notice that the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:54:08Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
== Feature Learning pipeline == <br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression). <br />
<br />
== Fine-tuning == <br />
But now, we notice that the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:52:36Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression). <br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually <br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:46:06Z<p>Ang: /* Overview */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression). <br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are also called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation; this corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. (You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually<br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning <br />
in our unsupervised feature learning or self-taught learning application, we will usually do so<br />
with a network built using the replacement representation. <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training set; in this<br />
setting, fine-tuning can significantly improve the performance of your classifier. If you<br />
have a large unlabeled dataset (for unsupervised feature learning/pre-training) and<br />
a relatively small labeled training set, then fine-tuning is less likely to help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-13T05:41:50Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. There, the features were learned using<br />
only unlabeled data. In this section, we show how you can '''fine-tune''' or further improve <br />
the learned features using the labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|200px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(l)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
[[File:STL_Logistic_Classifier.png|400px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression). <br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are also called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation; this corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. (You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation usually<br />
has little advantage over the "replacement" representation. Thus, if we are using fine-tuning <br />
in our unsupervised feature learning or self-taught learning application, we will usually do so<br />
with a network built using the replacement representation. <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training set; in this<br />
setting, fine-tuning can significantly improve the performance of your classifier. If you<br />
have a large unlabeled dataset (for unsupervised feature learning/pre-training) and<br />
a relatively small labeled training set, then fine-tuning is less likely to help.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-11T00:15:41Z<p>Ang: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. <br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Linear Decoders]]<br />
* [[Exercise:Convolution and Pooling]]<br />
* [[Pooling]]<br />
* [[Multiple layers of convolution and pooling]]<br />
<br />
----<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Anghttp://ufldl.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-10T23:49:18Z<p>Ang: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around however, and feedback/suggestions are welcome. <br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Linear Decoders]]<br />
* [[Exercise:Convolution and Pooling]]<br />
* [[Pooling]]<br />
* [[Multiple layers of convolution and pooling]]<br />
<br />
----<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Self-Taught_LearningExercise:Self-Taught Learning2011-05-10T23:45:28Z<p>Ang: /* Step 4: Training and testing the logistic regression model */</p>
<hr />
<div>===Overview===<br />
<br />
In this exercise, we will use the self-taught learning paradigm with the sparse autoencoder and softmax classifier to build a classifier for handwritten digits.<br />
<br />
You will be building upon your code from the earlier exercises. First, you will train your sparse autoencoder on an "unlabeled" training dataset of handwritten digits. This produces features that are penstroke-like. We then extract these learned features from a labeled dataset of handwritten digits. These features will then be used as inputs to the softmax classifier that you wrote in the previous exercise. <br />
<br />
Concretely, for each example <math>\textstyle x_l</math> in the labeled training dataset, we forward propagate the example to obtain the activation of the hidden units <math>\textstyle a^{(2)}</math>. We now represent this example using <math>\textstyle a^{(2)}</math> (the "replacement" representation), and use this as the new feature representation with which to train the softmax classifier. <br />
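In the notation of the sparse autoencoder sections, and assuming the sigmoid activation function <math>\textstyle f</math>, this representation is simply <math>\textstyle a^{(2)} = f(W^{(1)} x_l + b^{(1)})</math>, computed with the weights learned from the unlabeled data.<br />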
<br />
Finally, we also extract the same features from the test data to obtain predictions.<br />
<br />
In this exercise, our goal is to distinguish between the digits 0 to 4. We will use the digits 5 to 9 as our "unlabeled" dataset with which to learn the features; we will then use a labeled dataset of the digits 0 to 4 with which to train the softmax classifier. <br />
<br />
In the starter code, we have provided a file '''<tt>stlExercise.m</tt>''' that will help walk you through the steps in this exercise.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/ MNIST Dataset]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
* [http://ufldl.stanford.edu/wiki/resources/stl_exercise.zip Starter Code (stl_exercise.zip)]<br />
<br />
You will also need your code from the following exercises:<br />
* [[Exercise:Sparse Autoencoder]]<br />
* [[Exercise:Vectorization]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
===Step 1: Generate the input and test data sets===<br />
<br />
Download and decompress <tt>[http://ufldl.stanford.edu/wiki/resources/stl_exercise.zip stl_exercise.zip]</tt>, which contains starter code for this exercise. Additionally, you will need to download the datasets from the MNIST Handwritten Digit Database for this project.<br />
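<br />
The starter code already performs this split for you, but the idea is simply to divide the digits by class. A minimal sketch, assuming the <tt>loadMNISTImages</tt>/<tt>loadMNISTLabels</tt> helpers from the MNIST support functions listed above and placeholder file names (adjust the paths to wherever you saved the MNIST files):<br />
<pre>
% Load the MNIST training images (one example per column) and their labels (0-9).
mnistData   = loadMNISTImages('train-images-idx3-ubyte');
mnistLabels = loadMNISTLabels('train-labels-idx1-ubyte');

% Digits 5-9 form the "unlabeled" set; digits 0-4 form the labeled set.
unlabeledData = mnistData(:, mnistLabels >= 5);
labeledData   = mnistData(:, mnistLabels <= 4);
labeledLabels = mnistLabels(mnistLabels <= 4);
</pre>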
<br />
===Step 2: Train the sparse autoencoder===<br />
<br />
Next, use the unlabeled data (the digits from 5 to 9) to train a sparse autoencoder, using the same <tt>sparseAutoencoderCost.m</tt> function that you wrote in the previous exercise. (From the earlier exercise, you should already have a working, vectorized implementation of the sparse autoencoder.) For us, this training step took less than 25 minutes on a fast desktop. When training is complete, you should see a visualization of pen strokes like the image shown below: <br />
<br />
[[File:selfTaughtFeatures.png]]<br />
<br />
Informally, the features learned by the sparse autoencoder should correspond to penstrokes.<br />
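<br />
For reference, a minimal sketch of the training call, assuming the same <tt>minFunc</tt>/L-BFGS set-up and the same <tt>sparseAutoencoderCost</tt>/<tt>initializeParameters</tt> interfaces as in the sparse autoencoder exercise; the hyperparameter values shown here are illustrative placeholders, not the values prescribed by the starter code:<br />
<pre>
visibleSize   = 28 * 28;  % MNIST images are 28x28 pixels
hiddenSize    = 200;      % illustrative value
sparsityParam = 0.1;      % illustrative value
lambda        = 3e-3;     % illustrative value
beta          = 3;        % illustrative value

theta = initializeParameters(hiddenSize, visibleSize);

options.Method  = 'lbfgs';
options.maxIter = 400;
[opttheta, cost] = minFunc(@(p) sparseAutoencoderCost(p, visibleSize, hiddenSize, ...
                                  lambda, sparsityParam, beta, unlabeledData), ...
                           theta, options);
</pre>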
<br />
===Step 3: Extracting features===<br />
<br />
After the sparse autoencoder is trained, you will use it to extract features from the handwritten digit images. <br />
<br />
Complete <tt>feedForwardAutoencoder.m</tt> to produce a matrix whose columns are the hidden-layer activations for each example, i.e., the vector <math>a^{(2)}</math> corresponding to the activation of layer 2. (Recall that we treat the inputs as layer 1.)<br />
<br />
After completing this step, calling <tt>feedForwardAutoencoder.m</tt> should convert the raw image data to hidden unit activations <math>a^{(2)}</math>.<br />
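<br />
As a rough guide, and assuming <tt>theta</tt> is packed in the same order as in the sparse autoencoder exercise (<tt>[W1(:); W2(:); b1(:); b2(:)]</tt>), the body of <tt>feedForwardAutoencoder.m</tt> reduces to unpacking <math>\textstyle W^{(1)}, b^{(1)}</math> and applying a single sigmoid layer:<br />
<pre>
function activation = feedForwardAutoencoder(theta, hiddenSize, visibleSize, data)
% data has one example per column; activation is hiddenSize x numExamples.

% Unpack W1 and b1 (assumed layout: [W1(:); W2(:); b1(:); b2(:)]).
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = theta(2*hiddenSize*visibleSize + 1 : 2*hiddenSize*visibleSize + hiddenSize);

% a2 = sigmoid(W1 * x + b1), computed for all examples at once.
activation = 1 ./ (1 + exp(-bsxfun(@plus, W1 * data, b1)));
end
</pre>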
<br />
===Step 4: Training and testing the logistic regression model===<br />
<br />
Use your code from the softmax exercise (<tt>softmaxTrain.m</tt>) to train a softmax classifier using the training set features (<tt>trainFeatures</tt>) and labels (<tt>trainLabels</tt>).<br />
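<br />
A minimal sketch of this call, assuming the <tt>softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)</tt> signature from the softmax exercise; the weight decay value is illustrative, and depending on your implementation the labels may need to be shifted into the range 1 to <tt>numClasses</tt>:<br />
<pre>
numClasses = 5;      % digits 0 to 4
lambda     = 1e-4;   % illustrative weight decay value

options.maxIter = 100;
softmaxModel = softmaxTrain(hiddenSize, numClasses, lambda, ...
                            trainFeatures, trainLabels, options);
</pre>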
<br />
===Step 5: Classifying on the test set===<br />
<br />
Finally, complete the code to make predictions on the test set (<tt>testFeatures</tt>) and see how your learned features perform! If you've done all the steps correctly, you should get an accuracy of about '''98%'''.<br />
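<br />
For reference, a sketch of the prediction and accuracy computation, assuming the <tt>softmaxPredict(softmaxModel, data)</tt> function from the softmax exercise and a corresponding <tt>testLabels</tt> vector provided by the starter code:<br />
<pre>
% Predict a class for each column of testFeatures and measure accuracy.
pred = softmaxPredict(softmaxModel, testFeatures);
acc  = mean(testLabels(:) == pred(:));
fprintf('Test accuracy: %0.3f%%\n', acc * 100);
</pre>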
<br />
[[Category:Exercises]]</div>Ang