'''Ufldl:Copyrights'''
By submitting text or other materials to this Wiki, you are asserting that, and promising us that, you wrote this yourself, or copied it from a public domain or similar free resource. Further, by submitting text or other materials to this Wiki, in consideration for having your text incorporated into the Wiki and thus potentially having others be exposed to content provided by you--which you acknowledge is valuable consideration--you agree to assign and hereby do assign all copyright, title and interest in these materials to the Stanford authors of this Wiki. Do not submit copyrighted work without permission.

----

'''UFLDL Tutorial'''
'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction and may change without notice. Feel free to browse around, however; feedback and suggestions are welcome. <br />
<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
[[Useful Links]]<br />
<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen
----
== Convolution and Pooling ==<br />
<br />
In this exercise, you will use the features you learned on 8x8 patches sampled from STL-10 images in [[Exercise:Learning color features with Sparse Autoencoders | the earlier exercise on linear decoders]] to classify images from a reduced STL-10 dataset, applying [[Feature extraction using convolution | convolution]] and [[Pooling | pooling]]. The reduced STL-10 dataset comprises 64x64 images from 4 classes (airplane, car, cat, dog).<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/cnn_exercise.zip cnn_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated by "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to modify '''<tt>cnnConvolve.m</tt>''' and '''<tt>cnnPool.m</tt>'''.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stlSubset.zip A subset of the STL-10 dataset (stlSubset.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/cnn_exercise.zip Starter Code (cnn_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderLinear.m</tt> or your saved features from [[Exercise:Learning color features with Sparse Autoencoders]]<br />
* <tt>feedForwardAutoencoder.m</tt> (and related functions) from [[Exercise:Self-Taught Learning]]<br />
* <tt>softmaxTrain.m</tt> (and related functions) from [[Exercise:Softmax Regression]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 1: Load learned features ===<br />
<br />
In this step, you will use the features from [[Exercise:Learning color features with Sparse Autoencoders]]. If you have completed that exercise, you can load the color features that were previously saved. To verify that the features are good, visualize them; they should look like the following:<br />
<br />
[[File:CNN_Features_Good.png|300px]]<br />
<br />
=== Step 2: Implement and test convolution and pooling ===<br />
<br />
In this step, you will implement convolution and pooling, and test them on a small part of the data set to ensure that you have implemented these two functions correctly. In the next step, you will actually convolve and pool the features with the STL-10 images.<br />
<br />
==== Step 2a: Implement convolution ====<br />
<br />
Implement convolution, as described in [[feature extraction using convolution]], in the function <tt>cnnConvolve</tt> in <tt>cnnConvolve.m</tt>. Implementing convolution is somewhat involved, so we will guide you through the process below.<br />
<br />
First, we want to compute <math>\sigma(Wx_{(r,c)} + b)</math> for all ''valid'' <math>(r, c)</math> (''valid'' meaning that the entire 8x8 patch is contained within the image; this is as opposed to a ''full'' convolution, which allows the patch to extend outside the image, with the area outside the image assumed to be 0), where <math>W</math> and <math>b</math> are the learned weights and biases from the input layer to the hidden layer, and <math>x_{(r,c)}</math> is the 8x8 patch with the upper left corner at <math>(r, c)</math>. To accomplish this, one naive method is to loop over all such patches and compute <math>\sigma(Wx_{(r,c)} + b)</math> for each of them; while this is fine in theory, it can be very slow. Hence, we usually use MATLAB's built-in convolution functions, which are well optimized.<br />
<br />
Observe that the convolution above can be broken down into the following three small steps. First, compute <math>Wx_{(r,c)}</math> for all <math>(r, c)</math>. Next, add <math>b</math> to all the computed values. Finally, apply the sigmoid function to the resulting values. This doesn't seem to buy you anything, since the first step still requires a loop. However, you can replace the loop in the first step with one of MATLAB's optimized convolution functions, <tt>conv2</tt>, speeding up the process significantly.<br />
<br />
However, there are two important points to note in using <tt>conv2</tt>. <br />
<br />
First, <tt>conv2</tt> performs a 2-D convolution, but you have 5 "dimensions" - image number, feature number, row of image, column of image, and (color) channel of image - that you want to convolve over. Because of this, you will have to convolve each feature and image channel separately for each image, using the row and column of the image as the 2 dimensions you convolve over. This means that you will need three outer loops over the image number <tt>imageNum</tt>, feature number <tt>featureNum</tt>, and the channel number of the image <tt>channel</tt>. Inside the three nested for-loops, you will perform a <tt>conv2</tt> 2-D convolution, using the weight matrix for the <tt>featureNum</tt>-th feature and <tt>channel</tt>-th channel, and the image matrix for the <tt>imageNum</tt>-th image. <br />
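To make the loop structure concrete, here is a rough sketch of how <tt>cnnConvolve</tt> might be organized. The variable names and the assumed layouts of <tt>W</tt> and <tt>images</tt> are illustrative, not necessarily those used in the starter code; adapt them as needed.

<syntaxhighlight lang="matlab">
% Illustrative sketch only -- adapt the names and array layouts to the starter code.
% Assumes W is numFeatures x (patchDim*patchDim*3) and
% images is imageDim x imageDim x 3 x numImages.
convolvedDim = imageDim - patchDim + 1;
convolvedFeatures = zeros(numFeatures, numImages, convolvedDim, convolvedDim);
for imageNum = 1:numImages
  for featureNum = 1:numFeatures
    convolvedImage = zeros(convolvedDim, convolvedDim);
    for channel = 1:3
      % Weights for this feature/channel, as a patchDim x patchDim filter
      offset = (channel - 1) * patchDim^2;
      feature = reshape(W(featureNum, offset+1 : offset+patchDim^2), patchDim, patchDim);
      feature = flipud(fliplr(feature));   % flip before conv2 (see the tip below)
      im = images(:, :, channel, imageNum);
      convolvedImage = convolvedImage + conv2(im, feature, 'valid');
    end
    % The bias and sigmoid are applied after this loop (see the preprocessing
    % discussion below for how the bias changes when whitening is used).
    convolvedFeatures(featureNum, imageNum, :, :) = convolvedImage;
  end
end
</syntaxhighlight>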
<br />
Second, because of the mathematical definition of convolution, the feature matrix must be "flipped" before passing it to <tt>conv2</tt>. The following implementation tip explains the "flipping" of feature matrices when using MATLAB's convolution functions:<br />
<br />
<div style="border:1px solid black; padding: 5px"><br />
<br />
'''Implementation tip:''' Using <tt>conv2</tt> and <tt>convn</tt><br />
<br />
Because the mathematical definition of convolution involves "flipping" the matrix to convolve with (reversing its rows and its columns), to use MATLAB's convolution functions, you must first "flip" the weight matrix so that when MATLAB "flips" it according to the mathematical definition the entries will be at the correct place. For example, suppose you wanted to convolve two matrices <tt>image</tt> (a large image) and <tt>W</tt> (the feature) using <tt>conv2(image, W)</tt>, and W is a 3x3 matrix as below:<br />
<br />
<math><br />
W = <br />
\begin{pmatrix}<br />
1 & 2 & 3 \\<br />
4 & 5 & 6 \\<br />
7 & 8 & 9 \\<br />
\end{pmatrix}<br />
</math><br />
<br />
If you use <tt>conv2(image, W)</tt>, MATLAB will first "flip" <tt>W</tt>, reversing its rows and columns, before convolving <tt>W</tt> with <tt>image</tt>, as below:<br />
<br />
<math><br />
\begin{pmatrix}<br />
1 & 2 & 3 \\<br />
4 & 5 & 6 \\<br />
7 & 8 & 9 \\<br />
\end{pmatrix}<br />
<br />
\xrightarrow{flip}<br />
<br />
\begin{pmatrix}<br />
9 & 8 & 7 \\<br />
6 & 5 & 4 \\<br />
3 & 2 & 1 \\<br />
\end{pmatrix}<br />
</math><br />
<br />
If the original layout of <tt>W</tt> was correct, after flipping, it would be incorrect. For the layout to be correct after flipping, you will have to flip <tt>W</tt> before passing it into <tt>conv2</tt>, so that after MATLAB flips <tt>W</tt> in <tt>conv2</tt>, the layout will be correct. For <tt>conv2</tt>, this means reversing the rows and columns, which can be done with <tt>flipud</tt> and <tt>fliplr</tt>, as shown below:<br />
<br />
<syntaxhighlight lang="matlab"><br />
% Flip W for use in conv2<br />
W = flipud(fliplr(W));<br />
</syntaxhighlight><br />
<br />
</div><br />
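As a quick sanity check of the flipping (an illustrative snippet, not part of the starter code), you can verify that passing the flipped filter to <tt>conv2</tt> reproduces the patch-wise sum you actually want:

<syntaxhighlight lang="matlab">
% With the flip, 'valid' conv2 matches the intended elementwise sum of W
% (in its original orientation) over each image patch.
image = rand(5, 5);
W = [1 2 3; 4 5 6; 7 8 9];
wanted  = sum(sum(W .* image(1:3, 1:3)));             % desired response at (1,1)
convOut = conv2(image, flipud(fliplr(W)), 'valid');   % conv2 with the flipped filter
disp(abs(wanted - convOut(1,1)) < 1e-12);             % displays 1 (true)
</syntaxhighlight>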
<br />
Next, to each of the <tt>convolvedFeatures</tt>, you should add <tt>b</tt>, the corresponding bias for the <tt>featureNum</tt>-th feature. <br />
<br />
However, there is one additional complication. If we had not done any preprocessing of the input patches, you could simply follow the procedure described above, apply the sigmoid function to obtain the convolved features, and be done. Because you preprocessed the patches before learning features on them, however, you must also apply the same preprocessing steps to the convolved patches to get the correct feature activations. <br />
<br />
In particular, you did the following to the patches:<br />
<ol><br />
<li> subtract the mean patch, <tt>meanPatch</tt>, to zero the mean of the patches<br />
<li> ZCA whiten using the whitening matrix <tt>ZCAWhite</tt>.<br />
</ol><br />
These same two steps must also be applied to the input image patches. <br />
<br />
Taking the preprocessing steps into account, the feature activations that you should compute are <math>\sigma(W(T(x-\bar{x})) + b)</math>, where <math>T</math> is the whitening matrix and <math>\bar{x}</math> is the mean patch. Expanding this, you obtain <math>\sigma(WTx - WT\bar{x} + b)</math>, which suggests that you should convolve the images with <math>WT</math> rather than <math>W</math> as earlier, and that you should add <math>(b - WT\bar{x})</math>, rather than just <math>b</math>, to <tt>convolvedFeatures</tt> before finally applying the sigmoid function.<br />
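A minimal sketch of how this adjustment might look (assuming <tt>W</tt>, <tt>b</tt>, <tt>ZCAWhite</tt> and <tt>meanPatch</tt> were saved from the linear decoder exercise; the names and layouts are assumptions):

<syntaxhighlight lang="matlab">
% Fold the preprocessing into the convolution: convolve with W*T and use
% (b - W*T*meanPatch) as the bias.
WT   = W * ZCAWhite;            % use this in place of W in the convolution
bAdj = b - WT * meanPatch;      % adjusted bias, one value per feature
% Inside the convolution loops, after summing the per-channel conv2 results
% for featureNum:
%   convolvedImage = convolvedImage + bAdj(featureNum);
%   convolvedImage = 1 ./ (1 + exp(-convolvedImage));   % sigmoid
</syntaxhighlight>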
<br />
==== Step 2b: Check your convolution ====<br />
<br />
We have provided some code for you to check that you have done the convolution correctly. The code randomly selects a number of (feature, row, column) tuples and, for each one, compares your convolved value against the feature activation computed directly on the corresponding patch with <tt>feedForwardAutoencoder</tt> (i.e., using the sparse autoencoder). <br />
<br />
==== Step 2c: Pooling ====<br />
<br />
Implement [[pooling]] in the function <tt>cnnPool</tt> in <tt>cnnPool.m</tt>. You should implement ''mean'' pooling (i.e., averaging over feature responses) for this part.<br />
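For reference, here is a minimal sketch of mean pooling over non-overlapping regions; the 4-D layout of <tt>convolvedFeatures</tt> is an assumption, so match it to your <tt>cnnConvolve</tt> output:

<syntaxhighlight lang="matlab">
% Sketch: mean-pool each convolvedDim x convolvedDim feature map over
% disjoint poolDim x poolDim regions.
resultDim = floor(convolvedDim / poolDim);
pooledFeatures = zeros(numFeatures, numImages, resultDim, resultDim);
for r = 1:resultDim
  for c = 1:resultDim
    rows = (r-1)*poolDim + 1 : r*poolDim;
    cols = (c-1)*poolDim + 1 : c*poolDim;
    region = convolvedFeatures(:, :, rows, cols);
    pooledFeatures(:, :, r, c) = mean(mean(region, 4), 3);
  end
end
</syntaxhighlight>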
<br />
==== Step 2d: Check your pooling ====<br />
<br />
We have provided some code for you to check that you have done the pooling correctly. The code runs <tt>cnnPool</tt> against a test matrix to see if it produces the expected result.<br />
<br />
=== Step 3: Convolve and pool with the dataset ===<br />
<br />
In this step, you will convolve each of the features you learned with the full 64x64 images from the STL-10 dataset to obtain the convolved features for both the training and test sets. You will then pool the convolved features to obtain the pooled features for both training and test sets. The pooled features for the training set will be used to train your classifier, which you can then test on the test set.<br />
<br />
Because the convolved features matrix is very large, the code provided does the convolution and pooling 50 features at a time to avoid running out of memory.<br />
<br />
=== Step 4: Use pooled features for classification ===<br />
<br />
In this step, you will use the pooled features to train a softmax classifier that maps them to the class labels. The code in this section uses <tt>softmaxTrain</tt> from the softmax exercise to train a softmax classifier on the pooled features for 500 iterations, which should take around 5 minutes.<br />
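Before training, the pooled features are typically reshaped into one feature vector per image. A hedged sketch (the dimension order of <tt>pooledFeaturesTrain</tt> is an assumption):

<syntaxhighlight lang="matlab">
% Reshape pooled features (numFeatures x numImages x poolR x poolC) into a
% design matrix with one column per training image.
numTrain = size(pooledFeaturesTrain, 2);
X = permute(pooledFeaturesTrain, [1 3 4 2]);   % move the image index last
X = reshape(X, [], numTrain);                  % (numFeatures*poolR*poolC) x numTrain
% X and the training labels can now be passed to softmaxTrain, as in the
% softmax regression exercise.
</syntaxhighlight>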
<br />
=== Step 5: Test classifier ===<br />
<br />
Now that you have a trained softmax classifier, you can see how well it performs on the test set. The pooled features for the test set will be run through the softmax classifier, and the accuracy of the predictions will be computed. You should expect to get an accuracy of around 80%.
----
== Pooling: Overview ==<br />
<br />
After obtaining features using convolution, we would next like to use them for classification. In theory, one could use all the extracted features with a classifier such as a softmax classifier, but this can be computationally challenging. Consider for instance images of size 96x96 pixels, and suppose we have learned 400 features over 8x8 inputs. Each convolution results in an output of size <math>(96-8+1)*(96-8+1)=7921</math>, and since we have 400 features, this results in a vector of <math>89^2 * 400 = 3,168,400</math> features per example. Learning a classifier with inputs having 3+ million features can be unwieldy, and can also be prone to over-fitting. <br />
<br />
To address this, first recall that we decided to obtain convolved features because images have the "stationarity" property, which implies that features that are useful in one region are also likely to be useful for other regions. Thus, to describe a large image, one natural approach is to aggregate statistics of these features at various locations. For example, one could compute the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results (less over-fitting). This aggregation operation is called '''pooling''', or sometimes '''mean pooling''' or '''max pooling''' (depending on the pooling operation applied). <br />
<br />
The following image shows how pooling is done over 4 non-overlapping regions of the image.<br />
<br />
[[File:Pooling_schematic.gif]]<br />
<br />
== Pooling for Invariance ==<br />
<br />
If one chooses the pooling regions to be contiguous areas in the image and only pools features generated from the same (replicated) hidden units, then these pooling units will be '''translation invariant'''. This means that the same (pooled) feature will be active even when the image undergoes (small) translations. Translation-invariant features are often desirable; in many tasks (e.g., object detection, audio recognition), the label of the example (image) is the same even when the image is translated. For example, if you were to take an MNIST digit and translate it left or right, you would want your classifier to still accurately classify it as the same digit regardless of its final position.<br />
<br />
== Formal description ==<br />
<br />
Formally, after obtaining our convolved features as described earlier, we decide the size of the region, say <math>m \times n</math>, to pool our convolved features over. Then, we divide our convolved features into disjoint <math>m \times n</math> regions, and take the mean (or maximum) feature activation over these regions to obtain the pooled convolved features. These pooled features can then be used for classification.
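To make this concrete, here is a toy illustration (the 4x4 feature map and 2x2 pooling regions are just an example):

<syntaxhighlight lang="matlab">
% Mean-pool a single 4x4 convolved feature map over disjoint 2x2 regions.
convolvedFeature = magic(4);        % stand-in for one feature's activation map
pooled = zeros(2, 2);
for i = 1:2
  for j = 1:2
    region = convolvedFeature(2*i-1:2*i, 2*j-1:2*j);
    pooled(i, j) = mean(region(:)); % use max(region(:)) for max pooling
  end
end
</syntaxhighlight>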
----
== Feature extraction using convolution: Overview ==<br />
<br />
In the previous exercises, you worked through problems that involved relatively low-resolution images, such as small image patches and small images of hand-written digits. In this section, we will develop methods that allow us to scale these ideas up to more realistic datasets with larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. Suppose further that this was done with an autoencoder that has 100 hidden units. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in 100 sets of 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
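As a sketch of this definition, a naive, loop-based version for a single grayscale image is shown below; <tt>W1</tt> and <tt>b1</tt> stand for <math>W^{(1)}</math> and <math>b^{(1)}</math> and are assumed to come from the trained autoencoder (in practice you would use an optimized convolution routine instead of this double loop):

<syntaxhighlight lang="matlab">
% Naive patch-by-patch feature extraction for one r x c grayscale image.
[r, c] = size(xLarge);
a = 8;  bdim = 8;                       % patch size the features were learned on
k = size(W1, 1);                        % number of learned features
fConvolved = zeros(k, r - a + 1, c - bdim + 1);
for i = 1:(r - a + 1)
  for j = 1:(c - bdim + 1)
    patch = xLarge(i:i+a-1, j:j+bdim-1);
    fConvolved(:, i, j) = 1 ./ (1 + exp(-(W1 * patch(:) + b1)));  % sigma(W1*x + b1)
  end
end
</syntaxhighlight>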
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:13:31Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Convolutions ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:12:56Z<p>Ang: /* Locally Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities different than images, there is often also a natural way to select "contiguous groups" of input units to connect to a single hidden unit as well; for example, for audio, a hidden unit might be connected to only the input units corresponding to a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:59Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden units as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other regions--i.e., we can use the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional constraint, known as weight sharing (tying), between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn features convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
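<br />
To make this concrete in code, here is a minimal Octave/MATLAB sketch of the patch-by-patch computation. The variable names are hypothetical (they do not come from this page): <tt>largeImage</tt> is the 96x96 image as a matrix, and <tt>W1</tt> (100x64) and <tt>b1</tt> (100x1) are the weights and biases of the trained sparse autoencoder's hidden layer, learned from 8x8 patches flattened column-wise.<br />
<pre>
% Hypothetical sketch; assumes largeImage, W1 and b1 are defined as described above.
sigmoid = @(z) 1 ./ (1 + exp(-z));
convolvedFeatures = zeros(100, 89, 89);
for i = 1:89
  for j = 1:89
    patch = largeImage(i:i+7, j:j+7);                          % one 8x8 region
    convolvedFeatures(:, i, j) = sigmoid(W1 * patch(:) + b1);  % hidden-layer activations
  end
end
</pre>
The double loop simply applies the autoencoder's hidden-layer feedforward step at every location; it is written for clarity rather than speed.<br />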
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
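<br />
As an implementation note (not part of the description above), the same result can be obtained with 2D convolution: each row of <math>W^{(1)}</math> is reshaped into an <math>a \times b</math> filter and convolved with the image. A rough Octave/MATLAB sketch, assuming <tt>largeImage</tt>, <tt>W1</tt>, <tt>b1</tt>, <tt>a</tt>, <tt>b</tt> and <tt>k</tt> are defined to match the notation above, and that each row of <tt>W1</tt> was learned from a column-major flattened patch:<br />
<pre>
% Hypothetical sketch of convolution-based feature extraction (grayscale case).
[r, c] = size(largeImage);
sigmoid = @(z) 1 ./ (1 + exp(-z));
convolvedFeatures = zeros(k, r - a + 1, c - b + 1);
for f = 1:k
  filt = reshape(W1(f, :), a, b);   % recover the a x b filter for feature f
  filt = rot90(filt, 2);            % pre-flip so conv2 computes a correlation
  convolvedFeatures(f, :, :) = sigmoid(conv2(largeImage, filt, 'valid') + b1(f));
end
</pre>
The <tt>'valid'</tt> option keeps only the <math>(r - a + 1) \times (c - b + 1)</math> positions where the filter fits entirely inside the image, matching the array size given above.<br />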
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:31Z<p>Ang: /* Weight Sharing (Convolution) */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden unit as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:08:04Z<p>Ang: /* Locally Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to this problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a small subset of the input units. Specifically, each hidden unit will connect to only a small contiguous region of pixels in the input. (For input modalities other than vision, there is often a natural way to select "contiguous groups" of inputs to connect to a single hidden unit as well; for example, for audio, each hidden unit might be connected to only a certain time span of the input audio clip.) <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up in biology. Specifically, neurons in the visual cortex have localized receptive fields (i.e., they respond only to stimuli in a certain location).<br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:03:02Z<p>Ang: /* Fully Connected Networks */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On the relatively small images that we were working with (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it was computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T18:02:19Z<p>Ang: /* Overview */</p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which will allow us to scale up these methods to more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On relatively small images (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it is computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolutionFeature extraction using convolution2011-05-27T17:55:59Z<p>Ang: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous exercises, you worked through problems which involved images that were relatively low in resolution, such as small image patches and small images of hand-written digits. In this section, we will develop methods which allow us to scale up these methods to work with more realistic datasets that have larger images.<br />
<br />
== Fully Connected Networks ==<br />
<br />
In the sparse autoencoder, one design choice that we had made was to "fully connect" all the hidden units to all the input units. On relatively small images (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it is computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images) learning features that span the entire image (fully connected networks) is very computationally expensive--you would have about <math>10^4</math> input units, and assuming you want to learn 100 features, you would have on the order of <math>10^6</math> parameters to learn. The feedforward and backpropagation computations would also be about <math>10^2</math> times slower, compared to 28x28 images.<br />
<br />
== Locally Connected Networks ==<br />
<br />
One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels. <br />
<br />
This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location). <br />
<br />
== Weight Sharing (Convolution) ==<br />
<br />
Natural images have the property of being '''stationary''', meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applicable to other regions -- i.e., we can have the same features at all locations. <br />
<br />
<!--<br />
To capture this idea of learning the same features "everywhere in the image," one option is to add an additional added as an additional constraint known as weight sharing (tying) between the hidden units at different locations. If one chooses to have the same hidden unit replicated at every possible location, this turns out to be equivalent to a convolution of the feature (as a filter) on the image.<br />
<br />
== Fast Feature Learning and Extraction ==<br />
<br />
While in principle one can learn feature convolutionally over the entire image, the learning procedure becomes more complicated to implement and often takes longer to execute. <br />
!--><br />
<br />
More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and <br />
'''convolve''' them with the larger image, thus obtaining a different feature activation value at each location in the image. <br />
<br />
To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots (89, 89)</math>, you would extract the 8x8 patch, and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 89x89 convolved features. <br />
<br />
<!--<br />
These convolved features can later be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification. <br />
!--><br />
<br />
[[File:Convolution_schematic.gif]]<br />
<br />
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. <br />
<br />
In the next section, we further describe how to "pool" these features together to get even better features for classification.</div>Anghttp://ufldl.stanford.edu/wiki/index.php/UFLDL_TutorialUFLDL Tutorial2011-05-27T17:38:43Z<p>Ang: </p>
<hr />
<div>'''Description:''' This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, get to see them work for yourself, and learn how to apply/adapt these ideas to new problems.<br />
<br />
This tutorial assumes a basic knowledge of machine learning (specifically, familiarity with the ideas of supervised learning, logistic regression, gradient descent). If you are not familiar with these ideas, we suggest you go to this [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning Machine Learning course] and complete<br />
sections II, III, IV (up to Logistic Regression) first. <br />
<br />
<br />
'''Sparse Autoencoder'''<br />
* [[Neural Networks]]<br />
* [[Backpropagation Algorithm]]<br />
* [[Gradient checking and advanced optimization]]<br />
* [[Autoencoders and Sparsity]]<br />
* [[Visualizing a Trained Autoencoder]]<br />
* [[Sparse Autoencoder Notation Summary]] <br />
* [[Exercise:Sparse Autoencoder]]<br />
<br />
<br />
'''Vectorized implementation'''<br />
* [[Vectorization]]<br />
* [[Logistic Regression Vectorization Example]]<br />
* [[Neural Network Vectorization]]<br />
* [[Exercise:Vectorization]]<br />
<br />
<br />
'''Preprocessing: PCA and Whitening'''<br />
* [[PCA]]<br />
* [[Whitening]]<br />
* [[Implementing PCA/Whitening]]<br />
* [[Exercise:PCA in 2D]]<br />
* [[Exercise:PCA and Whitening]]<br />
<br />
<br />
'''Softmax Regression'''<br />
* [[Softmax Regression]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
<br />
'''Self-Taught Learning and Unsupervised Feature Learning''' <br />
* [[Self-Taught Learning]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
<br />
'''Building Deep Networks for Classification'''<br />
* [[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]]<br />
* [[Deep Networks: Overview]]<br />
* [[Stacked Autoencoders]]<br />
* [[Fine-tuning Stacked AEs]]<br />
* [[Exercise: Implement deep networks for digit classification]]<br />
<br />
<br />
'''Linear Decoders with Autoencoders'''<br />
* [[Linear Decoders]]<br />
* [[Exercise:Learning color features with Sparse Autoencoders]]<br />
<br />
----<br />
'''Note''': The sections above this line are stable. The sections below are still under construction, and may change without notice. Feel free to browse around, however; feedback and suggestions are welcome. <br />
<br />
'''Working with Large Images'''<br />
* [[Feature extraction using convolution]]<br />
* [[Pooling]]<br />
* [[Exercise:Convolution and Pooling]]<br />
<br />
<br />
----<br />
<br />
'''Miscellaneous''':<br />
<br />
[[MATLAB Modules]]<br />
<br />
[[Data Preprocessing]]<br />
<br />
[[Style Guide]]<br />
<br />
[[Useful Links]]<br />
<br />
<br />
'''Advanced Topics''':<br />
<br />
[[Convolutional training]] <br />
<br />
[[Restricted Boltzmann Machines]]<br />
<br />
[[Deep Belief Networks]]<br />
<br />
[[Denoising Autoencoders]]<br />
<br />
[[Sparse Coding]]<br />
<br />
[[K-means]]<br />
<br />
[[Spatial pyramids / Multiscale]]<br />
<br />
[[Slow Feature Analysis]]<br />
<br />
ICA Style Models:<br />
* [[Independent Component Analysis]]<br />
* [[Topographic Independent Component Analysis]]<br />
<br />
[[Tiled Convolution Networks]]<br />
<br />
----<br />
<br />
Material contributed by: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:38:24Z<p>Ang: /* Step 2: Learn features on small patches */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can simply concatenate the pixel intensities from all three color channels into one long vector, as if you were working with a grayscale image with three times as many pixels as the original image. <br />
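<br />
As a minimal sketch of this flattening (the variable name <tt>patchRGB</tt> is hypothetical), an 8x8 color patch simply becomes a 192-element column vector:<br />
<pre>
% Hypothetical sketch: patchRGB is an 8 x 8 x 3 array (R, G and B channels).
x = patchRGB(:);   % 192 x 1 vector: all R pixels, then all G, then all B (column-major)
% The sparse autoencoder treats x exactly as it would a 192-pixel grayscale patch.
</pre>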
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
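<br />
As a rough guide, the change typically amounts to two lines in the output layer. The sketch below uses variable names assumed from the sparse autoencoder exercise (<tt>z3</tt> and <tt>a3</tt> for the output layer's pre-activation and activation, <tt>data</tt> for the targets); your own code may use different names.<br />
<pre>
% Hypothetical fragment for a linear decoder with squared-error cost.
a3 = z3;                 % output activation is now the identity (was sigmoid(z3))
delta3 = -(data - a3);   % output error term drops the sigmoid-derivative factor
                         % (was -(data - a3) .* a3 .* (1 - a3))
</pre>
The hidden layer, sparsity penalty and weight decay terms are unchanged, and a numerical gradient check (as in the earlier gradient checking section) should still agree with your analytic gradients after the change.<br />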
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which, when visualized, look like edges and "opponent colors," as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might instead get images that look like one of the following:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:33:05Z<p>Ang: /* Step 1: Modify your sparse autoencoder to use a linear decoder */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:30:45Z<p>Ang: /* Step 0: Initialization */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Anghttp://ufldl.stanford.edu/wiki/index.php/Exercise:Learning_color_features_with_Sparse_AutoencodersExercise:Learning color features with Sparse Autoencoders2011-05-27T17:30:34Z<p>Ang: /* Step 0: Initialization */</p>
<hr />
<div>== Learning color features with Sparse Autoencoders ==<br />
<br />
In this exercise, you will implement a [[Linear Decoders | linear decoder]] (a sparse autoencoder whose output layer uses a linear activation function). You will then apply it to learn features on color images from the STL-10 dataset. These features will be used in a later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]] for classifying STL-10 images.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip linear_decoder_exercise.zip]</tt> we have provided some starter code. You should write your code at the places indicated "YOUR CODE HERE" in the files.<br />
<br />
For this exercise, you will need to copy and modify '''<tt>sparseAutoencoderCost.m</tt>''' from the [[Exercise:Sparse Autoencoder | sparse autoencoder exercise]].<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://ufldl.stanford.edu/wiki/resources/stl10_patches_100k.zip Sampled 8x8 patches from the STL-10 dataset (stl10_patches_100k.zip)]<br />
* [http://ufldl.stanford.edu/wiki/resources/linear_decoder_exercise.zip Starter Code (linear_decoder_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>sparseAutoencoderCost.m</tt> (and related functions) from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercise listed above, we strongly suggest you complete it first.''<br />
<br />
=== Learning from color image patches ===<br />
<br />
In all the exercises so far, you have been working only with grayscale images. In this exercise, you will get to work with RGB color images for the first time. <br />
<br />
Conveniently, the fact that an image has three color channels (RGB), rather than a single gray channel, presents little difficulty for the sparse autoencoder. You can just combine the intensities from all the color channels for the pixels into one long vector, as if you were working with a grayscale image with 3x the number of pixels as the original image. <br />
<br />
=== Step 0: Initialization ===<br />
<br />
In this step, we initialize some parameters used in the exercise (see starter code for details).<br />
<br />
=== Step 1: Modify your sparse autoencoder to use a linear decoder ===<br />
<br />
Copy <tt>sparseAutoencoder.m</tt> to the directory for this exercise and rename it to <tt>sparseAutoencoderLinear.m</tt>. Rename the function <tt>sparseAutoencoderCost</tt> in the file to <tt>sparseAutoencoderLinearCost</tt>, and modify it to use a [[Linear Decoders | linear decoder]]. In particular, you should change the cost and gradients returned to reflect the change from a sigmoid to a linear decoder. After making this change, check your gradients to ensure that they are correct.<br />
<br />
=== Step 2: Learn features on small patches ===<br />
<br />
You will now use your sparse autoencoder to learn features on a set of 100,000 small 8x8 patches sampled from the larger 96x96 STL-10 images (The [http://www.stanford.edu/~acoates//stl10/ STL-10 dataset] comprises 5000 training and 8000 test examples, with each example being a 96x96 labelled color image belonging to one of ten classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) <br />
<br />
The code provided in this step trains your sparse autoencoder for 400 iterations with the default parameters initialized in step 0. This should take around 45 minutes. Your sparse autoencoder should learn features which when visualized, look like edges and opponent colors, as in the figure below. <br />
<br />
[[File:CNN_Features_Good.png|480px]]<br />
<br />
If your parameters are improperly tuned (the default parameters should work), or if your implementation of the autoencoder is buggy, you might get one of the following images instead:<br />
<br />
<table cellpadding=5px><br />
<tr><td>[[File:cnn_Features_Bad1.png|240px]]</td><td>[[File:cnn_Features_Bad2.png|240px]]</td></tr><br />
</table><br />
<br />
The learned features will be saved to <tt>STL10Features.mat</tt>, which will be used in the later [[Exercise:Convolution and Pooling | exercise on convolution and pooling]].</div>Ang