http://ufldl.stanford.edu/wiki/index.php?title=Special:Contributions&feed=atom&limit=250&target=Watsuen&year=&month=Ufldl - User contributions [en]2024-03-29T07:15:33ZFrom UfldlMediaWiki 1.16.2http://ufldl.stanford.edu/wiki/index.php/Template:SoftmaxTemplate:Softmax2011-05-29T03:05:32Z<p>Watsuen: </p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:small; background-color: #eeeeee; border-style: solid; border-width: 1px; padding: 5px"><br />
[[Softmax Regression]] | [[Exercise:Softmax Regression]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:CNNTemplate:CNN2011-05-26T11:08:03Z<p>Watsuen: </p>
<hr />
<div><div style="text-align: center;font-size:x-small;background-color: #eeeeee; border-style: solid; padding: 5px"><br />
[[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]] | [[Deep Networks: Overview]] | [[Stacked Autoencoders]] | [[Fine-tuning Stacked AEs]] | [[Exercise: Implement deep networks for digit classification]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:STLTemplate:STL2011-05-26T11:07:45Z<p>Watsuen: </p>
<hr />
<div><div style="text-align: center;font-size:x-small;background-color: #eeeeee; border-style: solid; padding: 5px"><br />
[[Self-Taught Learning]] | [[Exercise:Self-Taught Learning]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:PCATemplate:PCA2011-05-26T11:07:30Z<p>Watsuen: </p>
<hr />
<div><div style="text-align: center;font-size:x-small;background-color: #eeeeee; border-style: solid; padding: 5px"><br />
[[PCA]] | [[Whitening]] | [[Implementing PCA/Whitening]] | [[Exercise:PCA in 2D]] | [[Exercise:PCA and Whitening]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:Vectorized_ImplementationTemplate:Vectorized Implementation2011-05-26T11:07:02Z<p>Watsuen: </p>
<hr />
<div><div style="text-align: center;font-size:x-small;background-color: #eeeeee; border-style: solid; padding: 5px"><br />
[[Vectorization]] | [[Logistic Regression Vectorization Example]] | [[Neural Network Vectorization]] | [[Exercise:Vectorization]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:Sparse_AutoencoderTemplate:Sparse Autoencoder2011-05-26T11:06:24Z<p>Watsuen: </p>
<hr />
<div><div style="text-align: center;font-size:x-small;background-color: #eeeeee; border-style: solid; padding: 5px;"><br />
[[Neural Networks]] | [[Backpropagation Algorithm]] | [[Gradient checking and advanced optimization]] | [[Autoencoders and Sparsity]] | [[Visualizing a Trained Autoencoder]] | [[Sparse Autoencoder Notation Summary]] | [[Exercise:Sparse Autoencoder]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:_Implement_deep_networks_for_digit_classificationExercise: Implement deep networks for digit classification2011-05-26T11:04:39Z<p>Watsuen: </p>
<hr />
<div>===Overview===<br />
<br />
In this exercise, you will use a stacked autoencoder for digit classification. This exercise is very similar to the self-taught learning exercise, in which we trained a digit classifier using an autoencoder layer followed by a softmax layer. The only difference in this exercise is that we will use two autoencoder layers instead of one, and will additionally fine-tune the two layers.<br />
<br />
The code you have already implemented will allow you to stack various layers and perform layer-wise training. However, to perform fine-tuning, you will need to implement backpropagation through both layers. We will see that fine-tuning significantly improves the model's performance.<br />
<br />
In the file [http://ufldl.stanford.edu/wiki/resources/stackedae_exercise.zip stackedae_exercise.zip], we have provided some starter code. You will need to complete the code in '''<tt>stackedAECost.m</tt>''', '''<tt>stackedAEPredict.m</tt>''' and '''<tt>stackedAEExercise.m</tt>'''. We have also provided <tt>params2stack.m</tt> and <tt>stack2params.m</tt> which you might find helpful in constructing deep networks.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/ MNIST Dataset]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
* [http://ufldl.stanford.edu/wiki/resources/stackedae_exercise.zip Starter Code (stackedae_exercise.zip)]<br />
<br />
You will also need your code from the following exercises:<br />
* [[Exercise:Sparse Autoencoder]]<br />
* [[Exercise:Vectorization]]<br />
* [[Exercise:Softmax Regression]]<br />
* [[Exercise:Self-Taught Learning]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
Open <tt>stackedAEExercise.m</tt>. In this step, we set the meta-parameters to the same values that were used in previous exercises, which should produce reasonable results. You may modify the meta-parameters if you wish.<br />
<br />
=== Step 1: Train the first autoencoder ===<br />
<br />
Train the first autoencoder on the training images to obtain its parameters. This step is identical to the corresponding step in the sparse autoencoder and STL assignments. Complete this part of the code to learn a first layer of features using your <tt>sparseAutoencoderCost.m</tt> and minFunc.<br />
<br />
=== Step 2: Train the second autoencoder ===<br />
<br />
We first forward propagate the training set through the first autoencoder (using <tt>feedForwardAutoencoder.m</tt> that you completed in [[Exercise:Self-Taught_Learning]]) to obtain hidden unit activations. These activations are then used to train the second sparse autoencoder. Since this is just an adapted application of a standard autoencoder, it should run similarly to the first. Complete this part of the code so as to learn a second layer of features using your <tt>sparseAutoencoderCost.m</tt> and minFunc.<br />
<br />
This part of the exercise demonstrates the idea of greedy layerwise training with the ''same'' learning algorithm reapplied multiple times.<br />
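The data flow above (forward propagate through the trained first layer to get the second layer's training input) can be sketched in a few lines. This is a Python/NumPy sketch of what <tt>feedForwardAutoencoder.m</tt> computes; the function name and shapes are illustrative, not part of the starter code:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_autoencoder(W1, b1, data):
    # Encoding step of a trained autoencoder: a2 = f(W1 * x + b1),
    # with one training example per column of `data`
    return sigmoid(W1 @ data + b1[:, None])

# Illustrative shapes: 64 inputs, 25 hidden units, 100 examples
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(25, 64)), np.zeros(25)
features = feedforward_autoencoder(W1, b1, rng.random((64, 100)))
# `features` is then used as the training set for the second autoencoder
```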
<br />
=== Step 3: Train the softmax classifier on the L2 features ===<br />
<br />
Next, continue to forward propagate the L1 features through the second autoencoder (using <tt>feedForwardAutoencoder.m</tt>) to obtain the L2 hidden unit activations. These activations are then used to train the softmax classifier. You can either use <tt>softmaxTrain.m</tt> or directly use <tt>softmaxCost.m</tt> that you completed in [[Exercise:Softmax Regression]] to complete this part of the assignment.<br />
<br />
=== Step 4: Implement fine-tuning ===<br />
<br />
To implement fine-tuning, we need to consider all three layers as a single model. Implement <tt>stackedAECost.m</tt> to return the cost and gradient of the model. The cost function should be defined as the log likelihood plus a weight decay term. The gradient should be computed using [[Backpropagation Algorithm | back-propagation as discussed earlier]]. The predictions should consist of the activations of the output layer of the softmax model.<br />
<br />
To help you check that your implementation is correct, you should also check your gradients on a small synthetic dataset. We have implemented <tt>checkStackedAECost.m</tt> to help you check your gradients. If this check passes, you will have implemented fine-tuning correctly.<br />
<br />
'''Note:''' When adding the weight decay term to the cost, you should regularize only the softmax weights (do not regularize the weights that compute the hidden layer activations).<br />
<br />
'''Implementation Tip:''' It is always a good idea to implement the code modularly and check (the gradient of) each part of the code before writing the more complicated parts.<br />
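The check performed by <tt>checkStackedAECost.m</tt> boils down to comparing your analytic gradient against central finite differences. A minimal Python sketch of that idea (the function name here is hypothetical; the exercise itself does this in Matlab):<br />

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Central differences: grad_i ~ (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Sanity check on J(theta) = sum(theta^2), whose gradient is 2*theta
theta = np.array([1.0, -2.0, 3.0])
num_grad = numerical_gradient(lambda t: np.sum(t ** 2), theta)
```

If the numerical gradient and your backpropagated gradient agree to several decimal places, the implementation is very likely correct.<br />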
<br />
=== Step 5: Test the model ===<br />
<br />
Finally, you will need to classify with this model; complete the code in <tt>stackedAEPredict.m</tt> to classify using the stacked autoencoder with a classification layer.<br />
<br />
After completing these steps, running the entire script in <tt>stackedAEExercise.m</tt> will perform layer-wise training of the stacked autoencoder, fine-tune the model, and measure its performance on the test set. If you've done all the steps correctly, you should get an accuracy of about 87.7% before fine-tuning and 97.6% after fine-tuning (for the 10-way classification problem).<br />
<br />
<br />
{{CNN}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Fine-tuning_Stacked_AEsFine-tuning Stacked AEs2011-05-26T11:04:23Z<p>Watsuen: </p>
<hr />
<div>=== Introduction ===<br />
Fine-tuning is a strategy commonly used in deep learning, and it can greatly improve the performance of a stacked autoencoder. From a high-level perspective, fine-tuning treats all layers of a stacked autoencoder as a single model, so that in each iteration we improve all the weights in the stacked autoencoder at once.<br />
<br />
=== General Strategy ===<br />
Fortunately, we already have all the tools necessary to implement fine tuning for stacked autoencoders! In order to compute the gradients for all the layers of the stacked autoencoder in each iteration, we use the [[Backpropagation Algorithm]], as discussed in the sparse autoencoder section. As the backpropagation algorithm can be extended to apply for an arbitrary number of layers, we can actually use this algorithm on a stacked autoencoder of arbitrary depth.<br />
<br />
=== Finetuning with Backpropagation ===<br />
For your convenience, a summary of the backpropagation algorithm using element-wise notation is given below:<br />
<br />
: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using the equations defining the forward propagation steps.<br />
: 2. For the output layer (layer <math>\textstyle n_l</math>), set <br />
::<math>\begin{align}<br />
\delta^{(n_l)}<br />
= - (\nabla_{a^{n_l}}J) \bullet f'(z^{(n_l)})<br />
\end{align}</math><br />
::(When using softmax regression, the softmax layer has <math>\nabla J = \theta^T(I-P)</math> where <math>I</math> is the input labels and <math>P</math> is the vector of conditional probabilities.)<br />
: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
::Set<br />
:::<math>\begin{align}<br />
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})<br />
\end{align}</math><br />
: 4. Compute the desired partial derivatives: <br />
::<math>\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.<br />
\end{align}</math><br />
<br />
The overall cost function, averaged over the training set, is then<br />
:<math>\begin{align}<br />
J(W,b)<br />
&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]<br />
\end{align}</math><br />
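The four steps above translate directly into a short backward pass over the stack. Below is a Python/NumPy sketch under illustrative assumptions (sigmoid activations, so <math>f'(z) = a(1-a)</math>; the function and variable names are hypothetical, and <tt>delta_top</tt> is the output-layer error from Step 2):<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def finetune_gradients(stack, a, delta_top):
    # stack: [(W, b), ...] encoder parameters, bottom layer first
    # a: forward-pass activations (Step 1); a[0] is the input
    # delta_top: delta^{(n_l)} at the top layer (Step 2), e.g. built from
    # the softmax term -theta.T @ (I - P) times f'(z) at that layer
    grads = [None] * len(stack)
    delta = delta_top
    for l in range(len(stack) - 1, -1, -1):
        W, b = stack[l]
        # Step 4: partial derivatives for W^{(l)} and b^{(l)}
        grads[l] = (delta @ a[l].T, delta.sum(axis=1))
        if l > 0:
            # Step 3: propagate the error down one layer
            delta = (W.T @ delta) * a[l] * (1 - a[l])
    return grads

# Tiny usage example with hypothetical shapes: 8 -> 5 -> 3 units, 10 examples
rng = np.random.default_rng(0)
stack = [(rng.normal(size=(5, 8)), np.zeros(5)),
         (rng.normal(size=(3, 5)), np.zeros(3))]
a = [rng.random((8, 10))]
for W, b in stack:
    a.append(sigmoid(W @ a[-1] + b[:, None]))
grads = finetune_gradients(stack, a, rng.normal(size=(3, 10)))
```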
<br />
<br />
<br />
{{Quote|<br />
Note: While one could consider the softmax classifier as an additional layer, the derivation above does not. Specifically, we consider the "last layer" of the network to be the features that go into the softmax classifier. Therefore, the derivatives (in Step 2) are computed using <math>\delta^{(n_l)} = - (\nabla_{a^{n_l}}J) \bullet f'(z^{(n_l)})</math>, where <math>\nabla J = \theta^T(I-P)</math>.<br />
}}<br />
<br />
<br />
{{CNN}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Stacked_AutoencodersStacked Autoencoders2011-05-26T11:04:08Z<p>Watsuen: </p>
<hr />
<div>===Overview===<br />
<br />
The greedy layerwise approach for pretraining a deep network works by training each layer in turn. In this page, you will find out how autoencoders can be "stacked" in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network.<br />
<br />
A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with n layers. Using notation from the autoencoder section, let <math>W^{(k, 1)}, W^{(k, 2)}, b^{(k, 1)}, b^{(k, 2)}</math> denote the parameters <math>W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}</math> for the kth autoencoder. The encoding step for the stacked autoencoder is then given by running the encoding step of each layer in forward order:<br />
<br />
<math><br />
\begin{align}<br />
a^{(l)} = f(z^{(l)}) \\<br />
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, 1)}<br />
\end{align}<br />
</math><br />
<br />
The decoding step is given by running the decoding stack of each autoencoder in reverse order:<br />
<br />
<math><br />
\begin{align}<br />
a^{(n + l)} = f(z^{(n + l)}) \\<br />
z^{(n + l + 1)} = W^{(n - l, 2)}a^{(n + l)} + b^{(n - l, 2)}<br />
\end{align}<br />
</math><br />
<br />
The information of interest is contained within <math>a^{(n)}</math>, which is the activation of the deepest layer of hidden units. This vector gives us a representation of the input in terms of higher-order features. <br />
<br />
The features from the stacked autoencoder can be used for classification problems by feeding <math>a^{(n)}</math> to a softmax classifier.<br />
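The encoding recurrence above is just repeated application of <math>z^{(l+1)} = W^{(l,1)}a^{(l)} + b^{(l,1)}</math> and <math>a^{(l+1)} = f(z^{(l+1)})</math>. A Python/NumPy sketch of the full encoding pass, with sigmoid <math>f</math> and hypothetical shapes:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_encode(stack, x):
    # stack: encoding parameters (W^{(l,1)}, b^{(l,1)}) of each autoencoder,
    # bottom layer first; x holds one example per column
    a = x
    for W, b in stack:
        # z^{(l+1)} = W^{(l,1)} a^{(l)} + b^{(l,1)};  a^{(l+1)} = f(z^{(l+1)})
        a = sigmoid(W @ a + b[:, None])
    return a   # a^{(n)}: the higher-order feature representation

# Hypothetical 2-layer stack: 64 -> 25 -> 16 units, 50 examples
rng = np.random.default_rng(0)
stack = [(rng.normal(size=(25, 64)), np.zeros(25)),
         (rng.normal(size=(16, 25)), np.zeros(16))]
features = stacked_encode(stack, rng.random((64, 50)))
```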
<br />
===Training===<br />
A good way to obtain good parameters for a stacked autoencoder is to use greedy layer-wise training. To do this, first train the first layer on the raw input to obtain parameters <math>W^{(1,1)}, W^{(1,2)}, b^{(1,1)}, b^{(1,2)}</math>. Use the first layer to transform the raw input into a vector of hidden unit activations. Train the second layer on this vector to obtain parameters <math>W^{(2,1)}, W^{(2,2)}, b^{(2,1)}, b^{(2,2)}</math>. Repeat for subsequent layers, using the output of each layer as input for the next.<br />
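The greedy layer-wise procedure can be sketched as a short loop. This is an illustrative Python sketch, not the course code: <tt>train_autoencoder</tt> is a placeholder for your own sparse autoencoder trainer, assumed to return the encoder parameters of one layer:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_layerwise_train(hidden_sizes, data, train_autoencoder):
    # train_autoencoder(inputs, n_hidden) stands in for your own sparse
    # autoencoder trainer; assumed to return encoder parameters (W1, b1)
    stack, inputs = [], data
    for n_hidden in hidden_sizes:
        W1, b1 = train_autoencoder(inputs, n_hidden)   # train this layer only
        stack.append((W1, b1))
        # hidden activations become the "raw input" for the next layer
        inputs = sigmoid(W1 @ inputs + b1[:, None])
    return stack

# Dummy "trainer" (all-zero weights) just to show the data flow and shapes
dummy = lambda X, h: (np.zeros((h, X.shape[0])), np.zeros(h))
stack = greedy_layerwise_train([6, 4], np.random.rand(10, 20), dummy)
```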
<br />
This method trains the parameters of each layer individually while freezing the parameters for the remainder of the model. To produce better results, after this phase of training is complete, [[Fine-tuning Stacked AEs | fine-tuning]] using backpropagation can be used to improve the results by tuning the parameters of all layers at the same time. <br />
<br />
<!-- In practice, fine-tuning should be use when the parameters have been brought close to convergence through layer-wise training. Attempting to use fine-tuning with the weights initialized randomly will lead to poor results due to local optima. --><br />
<br />
{{Quote|<br />
If one is only interested in finetuning for the purposes of classification, the common practice is to then discard the "decoding" layers of the stacked autoencoder and link the last hidden layer <math>a^{(n)}</math> to the softmax classifier. The gradients from the (softmax) classification error will then be backpropagated into the encoding layers.<br />
}}<br />
<br />
===Concrete example===<br />
<br />
To give a concrete example, suppose you wished to train a stacked autoencoder with 2 hidden layers for classification of MNIST digits, as you will be doing in [[Exercise: Implement deep networks for digit classification | the next exercise]]. <br />
<br />
First, you would train a sparse autoencoder on the raw inputs <math>x^{(k)}</math> to learn primary features <math>h^{(1)(k)}</math> on the raw input.<br />
<br />
[[File:Stacked_SparseAE_Features1.png|400px]]<br />
<br />
Next, you would feed the raw input into this trained sparse autoencoder, obtaining the primary feature activations <math>h^{(1)(k)}</math> for each of the inputs <math>x^{(k)}</math>. You would then use these primary features as the "raw input" to another sparse autoencoder to learn secondary features <math>h^{(2)(k)}</math> on these primary features.<br />
<br />
[[File:Stacked_SparseAE_Features2.png|400px]]<br />
<br />
Following this, you would feed the primary features into the second sparse autoencoder to obtain the secondary feature activations <math>h^{(2)(k)}</math> for each of the primary features <math>h^{(1)(k)}</math> (which correspond to the primary features of the corresponding inputs <math>x^{(k)}</math>). You would then treat these secondary features as "raw input" to a softmax classifier, training it to map secondary features to digit labels.<br />
<br />
[[File:Stacked_Softmax_Classifier.png|400px]]<br />
<br />
Finally, you would combine all three layers together to form a stacked autoencoder with 2 hidden layers and a final softmax classifier layer capable of classifying the MNIST digits as desired.<br />
<br />
[[File:Stacked_Combined.png|500px]]<br />
<br />
===Discussion===<br />
<br />
A stacked autoencoder enjoys all the benefits of any deep network, including greater expressive power. <br />
<br />
Further, it often captures a useful "hierarchical grouping" or "part-whole decomposition" of the input. To see this, recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input (such as edges in an image). The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features (e.g., in terms of what edges tend to occur together--for example, to form contour or corner detectors). Higher layers of the stacked autoencoder tend to learn even higher-order features. <br />
<br />
<br />
{{CNN}}<br />
<br />
<!-- <br />
For instance, in the context of image input, the first layers usually learns to recognize edges. The second layer usually learns features that arise from combinations of the edges, such as corners. With certain types of network configuration and input modes, the higher layers can learn meaningful combinations of features. For instance, if the input set consists of images of faces, higher layers may learn features corresponding to parts of the face such as eyes, noses or mouths.<br />
!--></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-26T11:03:52Z<p>Watsuen: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would itself compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
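The collapse of stacked linear layers into a single linear map is easy to verify numerically; the sketch below (Python/NumPy, biases omitted, though affine layers collapse the same way) shows that two linear "layers" compute exactly what one product matrix computes:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 5))   # "hidden layer" weights
W2 = rng.normal(size=(3, 4))   # "output layer" weights
x = rng.normal(size=5)

# Two linear layers are equivalent to one linear layer with weights W2 @ W1,
# so without a non-linearity the extra layer adds no expressive power
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```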
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of functions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
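To make the depth/size trade-off concrete, a deep circuit for parity is simply a chain (or balanced tree) of 2-input XOR gates, using only <math>n-1</math> gates for <math>n</math> bits; the Python sketch below is an illustration of that circuit, not a claim about any particular network architecture:<br />

```python
from functools import reduce

def parity_deep(bits):
    # Chain of 2-input XOR gates: n - 1 gates for n bits, so the circuit
    # size grows only linearly once the depth is allowed to grow with n
    return reduce(lambda acc, b: acc ^ b, bits, 0)

assert parity_deep([1, 0, 1, 1]) == 1   # odd number of 1s
assert parity_deep([1, 1, 0, 0]) == 0   # even number of 1s
```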
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main learning algorithm that researchers were using was to randomly initialize<br />
the weights of a deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, for example by applying gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus for many<br />
problems it is difficult to get enough examples to fit the parameters of a<br />
complex model. For example, given the high degree of expressive power of deep networks,<br />
training on insufficient data would also result in overfitting. <br />
<br />
===Local optima=== <br />
<br />
Training a shallow network (with 1 hidden layer) using<br />
supervised learning usually resulted in the parameters converging to reasonable values;<br />
but when we are training a deep network, this works much less well. <br />
In particular, training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). <br />
In a deep network, this problem turns out to be rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
no longer work well. <br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on deep networks<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
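The diffusion of gradients is easy to observe numerically: in a randomly initialized sigmoid network, each backward step multiplies the error by <math>W^T</math> and by <math>f'(z) \le 1/4</math>, so its magnitude collapses. The following Python/NumPy sketch uses illustrative layer sizes and initialization scales:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, depth = 30, 15
Ws = [rng.normal(scale=0.1, size=(n, n)) for _ in range(depth)]

# Forward pass through a deep sigmoid network, storing activations
a = [rng.normal(size=n)]
for W in Ws:
    a.append(sigmoid(W @ a[-1]))

# Backward pass: start from a unit error at the output and track ||delta||
delta, norms = np.ones(n), []
for l in range(depth - 1, 0, -1):
    delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])   # f'(z) = a(1-a) <= 1/4
    norms.append(np.linalg.norm(delta))

# The error signal has shrunk dramatically by the time it reaches early layers
assert norms[-1] < norms[0] / 100
```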
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers). <br />
<br />
<!--<br />
When the last layer is used<br />
for classification, training a network like this often results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How can we train a deep network? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, so that we first train a network with 1 <br />
hidden layer, and only after that is done, train a network with 2 hidden layers,<br />
and so on. At each step, we take the old network with <math>k-1</math> hidden<br />
layers, and add an additional <math>k</math>-th hidden layer (that takes as <br />
input the previous hidden layer <math>k-1</math> that we had just<br />
trained). Training can be <br />
supervised (say, with classification error as the objective function on each<br />
step), but more frequently it is <br />
unsupervised (as in an autoencoder; details to be provided later). <br />
The weights from training the layers individually are then used to initialize the weights <br />
in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the labeled training set error). <br />
<br />
The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from far more data than<br />
purely supervised approaches. This often results in much better classifiers <br />
being learned. <br />
<br />
===Better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We can then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data. <br />
<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training. <br />
<br />
<br />
{{CNN}}<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_OverviewDeep Networks: Overview2011-05-26T11:03:37Z<p>Watsuen: </p>
<hr />
<div>== Overview ==<br />
<br />
In the previous sections, you constructed a 3-layer neural network comprising<br />
an input, hidden and output layer. While fairly effective for MNIST, this<br />
3-layer model is a fairly '''shallow''' network; by this, we mean that the<br />
features (hidden layer activations <math>a^{(2)}</math>) are computed using<br />
only "one layer" of computation (the hidden layer).<br />
<br />
In this section, we begin to discuss '''deep''' neural networks, meaning ones<br />
in which we have multiple hidden layers; this will allow us to compute much <br />
more complex features of the input. Because each hidden layer computes a <br />
non-linear transformation of the previous layer, a deep network can have<br />
significantly greater representational power (i.e., can learn<br />
significantly more complex functions) than a shallow one. <br />
<br />
Note that when training a deep network, it is important to use a ''non-linear''<br />
activation function <math>f(\cdot)</math> in each hidden layer. This is<br />
because multiple layers of linear functions would itself compute only a linear<br />
function of the input (i.e., composing multiple linear functions together<br />
results in just another linear function), and thus be no more expressive than<br />
using just a single layer of hidden units.<br />
<br />
== Advantages of deep networks ==<br />
<br />
Why do we want to use a deep network? The primary advantage is<br />
that it can compactly represent a significantly larger set of fuctions<br />
than shallow networks. Formally, one can show that there are functions<br />
which a <math>k</math>-layer network can represent compactly<br />
(with a number of hidden units that is ''polynomial'' in the number<br />
of inputs), that a <math>(k-1)</math>-layer network cannot represent<br />
unless it has an exponentially large number of hidden units.<br />
<br />
To take a simple example, consider building a boolean circuit/network to<br />
compute the parity (or XOR) of <math>n</math> input bits. Suppose each node in<br />
the network can compute either the logical OR of its inputs (or the OR of the <br />
negation of the inputs), or compute the logical AND. If we have a network with<br />
only one input, one hidden, and one output layer, the parity function would require a number of nodes that<br />
is exponential in the input size <math>n</math>. If however we are allowed a<br />
deeper network, then the network/circuit size can be only polynomial in<br />
<math>n</math>.<br />
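As a quick sanity check on the polynomial-size claim, a deep circuit can compute parity with only a linear number of gates: chain together <math>n-1</math> XOR gates, each built from a constant number of AND/OR/NOT gates. A sketch (Python standing in for the boolean circuit):

```python
from functools import reduce

def xor_gate(a, b):
    # XOR expressed with a constant number of AND/OR/NOT gates.
    return (a or b) and not (a and b)

def parity(bits):
    # A chain of len(bits) - 1 XOR gates: circuit size is linear in n,
    # but the circuit is deep (one "layer" per gate in the chain).
    return reduce(xor_gate, bits)

print(parity([True, False, True, True]))  # True (three 1s: odd parity)
```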
<br />
By using a deep network, in the case of images, one can also start to learn part-whole decompositions.<br />
For example, the first layer might learn to group together pixels in an image<br />
in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to<br />
detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer<br />
might then group together these contours or detect even more complex features.<br />
<br />
Finally, cortical computations (in the brain) also have multiple layers of<br />
processing. For example, visual images are processed in multiple stages by the<br />
brain, by cortical area "V1", followed by cortical area "V2" (a different part<br />
of the brain), and so on. <br />
<br />
<!--<br />
Informally, one way a deep network helps in representing functions compactly is<br />
through ''factorization''. Factorization, as the name suggests, occurs when the<br />
network represents at lower layers functions of the input that are then reused<br />
multiple times at higher layers. To gain some intuition for this, consider an<br />
arithmetic network for computing the values of polynomials, in which alternate<br />
layers implement addition and multiplication. In this network, an intermediate<br />
layer could compute the values of terms which are then used repeatedly in the<br />
next higher layer, the results of which are used repeatedly in the next higher<br />
layer, and so on.<br />
!--><br />
<br />
== Difficulty of training deep architectures ==<br />
<br />
While the theoretical benefits of deep networks in terms of their compactness<br />
and expressive power have been appreciated for many decades, until recently<br />
researchers had little success training deep architectures.<br />
<br />
The main learning algorithm that researchers were using was to randomly initialize<br />
the weights of a deep network, and then train it using a labeled<br />
training set <math>\{ (x^{(1)}_l, y^{(1)}), \ldots, (x^{(m_l)}_l, y^{(m_l)}) \}</math><br />
using a supervised learning objective, for example by applying gradient descent to try to<br />
drive down the training error. However, this usually did not work well.<br />
There were several reasons for this.<br />
<br />
===Availability of data=== <br />
<br />
With the method described above, one relies only on<br />
labeled data for training. However, labeled data is often scarce, and thus for many<br />
problems it is difficult to get enough examples to fit the parameters of a<br />
complex model. Moreover, given the high expressive power of deep networks,<br />
training on insufficient data results in overfitting. <br />
<br />
===Local optima=== <br />
<br />
Training a shallow network (with 1 hidden layer) using<br />
supervised learning usually resulted in the parameters converging to reasonable values;<br />
but when training a deep network, this approach works much less well. <br />
In particular, training a neural network using supervised learning<br />
involves solving a highly non-convex optimization problem (say, minimizing the<br />
training error <math>\textstyle \sum_i ||h_W(x^{(i)}) - y^{(i)}||^2</math> as a<br />
function of the network parameters <math>\textstyle W</math>). <br />
In a deep network, this problem turns out to be rife with bad local optima, and<br />
training with gradient descent (or methods like conjugate gradient and L-BFGS)<br />
no longer works well. <br />
<br />
===Diffusion of gradients=== <br />
<br />
There is an additional technical reason,<br />
pertaining to the gradients becoming very small, that explains why gradient<br />
descent (and related algorithms like L-BFGS) do not work well on deep networks<br />
with randomly initialized weights. Specifically, when using backpropagation to<br />
compute the derivatives, the gradients that are propagated backwards (from the<br />
output layer to the earlier layers of the network) rapidly diminish in<br />
magnitude as the depth of the network increases. As a result, the derivative of<br />
the overall cost with respect to the weights in the earlier layers is very<br />
small. Thus, when using gradient descent, the weights of the earlier layers<br />
change slowly, and the earlier layers fail to learn much. This problem<br />
is often called the "diffusion of gradients."<br />
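The effect is easy to demonstrate numerically. The following sketch (illustrative only: random weights and a made-up unit error signal at the output) backpropagates through several sigmoid layers and watches the gradient norm shrink:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, depth = 20, 8
Ws = [rng.normal(0, 0.1, (n, n)) for _ in range(depth)]

# Forward pass, remembering each layer's activations.
a = sigmoid(rng.normal(size=n))
acts = [a]
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# Backward pass: delta at a layer is (W^T delta_above) .* f'(z),
# where f'(z) = a(1-a) for the sigmoid.
delta = np.ones(n)                       # error signal at the output layer
norms = [np.linalg.norm(delta)]
for W, a_below in zip(reversed(Ws), reversed(acts[:-1])):
    delta = (W.T @ delta) * a_below * (1 - a_below)
    norms.append(np.linalg.norm(delta))

print(norms[0], norms[-1])   # the gradient norm collapses toward zero
```

Each backward step multiplies the signal by the weight matrix and by <math>f'(z) \le 0.25</math>, so with small random weights the gradient magnitude decays geometrically with depth.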
<br />
A closely related problem to the diffusion of gradients is that if the last few<br />
layers in a neural network have a large enough number of neurons, it may be<br />
possible for them to model the labeled data alone without the help of the<br />
earlier layers. Hence, training the entire network at once with all the layers<br />
randomly initialized ends up giving similar performance to training a<br />
shallow network (the last few layers) on corrupted input (the result of<br />
the processing done by the earlier layers). <br />
<br />
<!--<br />
When the last layer is used<br />
for classification, training a network like this often results in low<br />
training error but high test error, suggesting that<br />
the last few layers are over-fitting the training data.<br />
!--><br />
<br />
== Greedy layer-wise training ==<br />
<br />
How can we train a deep network? One method that has seen some<br />
success is the '''greedy layer-wise training''' method. We describe this<br />
method in detail in later sections, but briefly, the main idea is to train the<br />
layers of the network one at a time, so that we first train a network with 1 <br />
hidden layer, and only after that is done, train a network with 2 hidden layers,<br />
and so on. At each step, we take the old network with <math>k-1</math> hidden<br />
layers, and add an additional <math>k</math>-th hidden layer (that takes as <br />
input the previous hidden layer <math>k-1</math> that we had just<br />
trained). Training can be<br />
supervised (say, with classification error as the objective function on each<br />
step), but more frequently it is<br />
unsupervised (as in an autoencoder; details to be provided later). <br />
The weights from training the layers individually are then used to initialize the weights <br />
in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e.,<br />
trained together to optimize the labeled training set error). <br />
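The loop below sketches this procedure, with a minimal autoencoder standing in for the sparse autoencoder of the later sections (the helper names are hypothetical and the code is illustrative, not the course implementation): each layer is trained to reconstruct the previous layer's activations, and the resulting weights would then initialize the deep network before fine-tuning.

```python
import numpy as np

def train_autoencoder(X, n_hidden, steps=200, lr=0.1, seed=0):
    """Minimal one-hidden-layer autoencoder (tanh encoder, linear decoder)
    trained by batch gradient descent; a stand-in for the sparse autoencoder."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_hidden, n_in))   # encoder weights
    W2 = rng.normal(0, 0.1, (n_in, n_hidden))   # decoder weights
    for _ in range(steps):
        A = np.tanh(W1 @ X.T)                   # hidden activations
        err = (W2 @ A).T - X                    # reconstruction error
        dW2 = err.T @ A.T / len(X)
        dA = (W2.T @ err.T) * (1 - A ** 2)      # backprop through tanh
        dW1 = dA @ X / len(X)
        W1 -= lr * dW1
        W2 -= lr * dW2
    return W1                                    # keep only the encoder

def greedy_pretrain(X, layer_sizes):
    """Train layer k on the activations produced by the already-trained
    layers 1..k-1; return the stacked encoder weights."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(H, n_hidden)
        weights.append(W)
        H = np.tanh(W @ H.T).T                  # feed forward to the next layer
    return weights
```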
<br />
The success of greedy<br />
layer-wise training has been attributed to a number of factors:<br />
<br />
===Availability of data=== <br />
<br />
While labeled data can be expensive to obtain,<br />
unlabeled data is cheap and plentiful. The promise of self-taught learning is<br />
that by exploiting the massive amount of unlabeled data, we can learn much<br />
better models. By using unlabeled data to learn a good initial value for the<br />
weights in all the layers <math>\textstyle W^{(l)}</math> (except for the final<br />
classification layer that maps to the outputs/predictions), our algorithm is<br />
able to learn and discover patterns from vastly more data than<br />
purely supervised approaches. This often results in much better classifiers <br />
being learned. <br />
<br />
===Better local optima=== <br />
<br />
After having trained the network<br />
on the unlabeled data, the weights are now starting at a better location in<br />
parameter space than if they had been randomly initialized. We can then<br />
further fine-tune the weights starting from this location. Empirically, it<br />
turns out that gradient descent from this location is much more likely to<br />
lead to a good local minimum, because the unlabeled data has already provided<br />
a significant amount of "prior" information about what patterns there<br />
are in the input data. <br />
<br />
<br />
In the next section, we will describe the specific details of how to go about<br />
implementing greedy layer-wise training. <br />
<br />
<br />
<br />
<!--<br />
Specifically,<br />
since the weights of the layers have already been initialized to reasonable<br />
values, the final solution tends to be near the good initial solution, forming<br />
a useful "regularization" effect. (more details in Erhan et al., 2010).<br />
!--><br />
<br />
<!--<br />
== References ==<br />
<br />
Erhan et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning?.<br />
AISTATS 2010.<br />
[http://jmlr.csail.mit.edu/proceedings/papers/v9/erhan10a/erhan10a.pdf]<br />
!--><br />
{{CNN}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning_to_Deep_NetworksSelf-Taught Learning to Deep Networks2011-05-26T11:03:10Z<p>Watsuen: </p>
<hr />
<div>In the previous section, you used an autoencoder to learn features that were then fed as input <br />
to a softmax or logistic regression classifier. In that method, the features were learned using<br />
only unlabeled data. In this section, we describe how you can '''fine-tune''' and further improve <br />
the learned features using labeled data. When you have a large amount of labeled<br />
training data, this can significantly improve your classifier's performance.<br />
<br />
In self-taught learning, we first trained a sparse autoencoder on the unlabeled data. Then, <br />
given a new example <math>\textstyle x</math>, we used the hidden layer to extract <br />
features <math>\textstyle a</math>. This is illustrated in the following diagram: <br />
<br />
[[File:STL_SparseAE_Features.png|300px]]<br />
<br />
We are interested in solving a classification task, where our goal is to<br />
predict labels <math>\textstyle y</math>. We have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> labeled examples.<br />
We showed previously that we can replace the original features <math>\textstyle x^{(i)}</math> with features <math>\textstyle a^{(i)}</math><br />
computed by the sparse autoencoder (the "replacement" representation). This gives us a training set <math>\textstyle \{(a^{(1)},<br />
y^{(1)}), \ldots (a^{(m_l)}, y^{(m_l)}) \}</math>. Finally, we train a logistic<br />
classifier to map from the features <math>\textstyle a^{(i)}</math> to the classification label <math>\textstyle y^{(i)}</math>.<br />
To illustrate this step, similar to [[Neural Networks|our earlier notes]], we can draw our logistic regression unit (shown in orange) as follows:<br />
<br />
::::[[File:STL_Logistic_Classifier.png|380px]]<br />
<br />
Now, consider the overall classifier (i.e., the input-output mapping) that we have learned <br />
using this method. <br />
In particular, let us examine the function that our classifier uses to map from a new test example <br />
<math>\textstyle x</math> to a new prediction <math>p(y=1|x)</math>. <br />
We can draw a representation of this function by putting together the <br />
two pictures from above. In particular, the final classifier looks like this:<br />
<br />
[[File:STL_CombinedAE.png|500px]]<br />
<br />
The parameters of this model were trained in two stages: The first layer of weights <math>\textstyle W^{(1)}</math><br />
mapping from the input <math>\textstyle x</math> to the hidden unit activations <math>\textstyle a</math> were trained<br />
as part of the sparse autoencoder training process. The second layer<br />
of weights <math>\textstyle W^{(2)}</math> mapping from the activations <math>\textstyle a</math> to the output <math>\textstyle y</math> was<br />
trained using logistic regression (or softmax regression).<br />
<br />
But the form of our overall/final classifier is clearly just a whole big neural network. So,<br />
having trained up an initial set of parameters for our model (training the first layer using an <br />
autoencoder, and the second layer<br />
via logistic/softmax regression), we can further modify all the parameters in our model to try to <br />
further reduce the training error. In particular, we can '''fine-tune''' the parameters, meaning perform <br />
gradient descent (or use L-BFGS) from the current setting of the<br />
parameters to try to reduce the training error on our labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math>. <br />
<br />
When fine-tuning is used, sometimes the original unsupervised feature learning steps <br />
(i.e., training the autoencoder and the logistic classifier) are called '''pre-training.'''<br />
The effect of fine-tuning is that the labeled data can be used to modify the weights <math>W^{(1)}</math> as<br />
well, so that adjustments can be made to the features <math>a</math> extracted by the layer<br />
of hidden units. <br />
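As a sketch of what fine-tuning amounts to (illustrative NumPy, not the course's MATLAB code; <tt>fine_tune</tt> and its parameter names are hypothetical): backpropagate the logistic loss through both the classifier weights and the pretrained first-layer weights, and take gradient steps on all of them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune(W1, b1, w2, b2, X, y, lr=0.3, steps=150):
    """One hidden layer + logistic output; gradient descent updates ALL
    parameters, including the pretrained first layer W1, b1."""
    for _ in range(steps):
        A = sigmoid(X @ W1.T + b1)              # hidden features a
        p = sigmoid(A @ w2 + b2)                # P(y = 1 | x)
        delta_out = p - y                       # logistic-loss gradient at output
        grad_w2 = A.T @ delta_out / len(X)
        grad_b2 = delta_out.mean()
        delta_hid = np.outer(delta_out, w2) * A * (1 - A)
        grad_W1 = delta_hid.T @ X / len(X)
        grad_b1 = delta_hid.mean(axis=0)
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return W1, b1, w2, b2
```

Without fine-tuning, only <tt>w2</tt> and <tt>b2</tt> would be trained on the labeled data; with it, the labeled data also reshapes the features themselves.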
<br />
So far, we have described this process assuming that you used the "replacement" representation, where<br />
the training examples seen by the logistic classifier are of the form <math>(a^{(i)}, y^{(i)})</math>,<br />
rather than the "concatenation" representation, where the examples are of the form <math>((x^{(i)}, a^{(i)}), y^{(i)})</math>.<br />
It is also possible to perform fine-tuning using the "concatenation" representation. (This corresponds<br />
to a neural network where the input units <math>x_i</math> also feed directly to the logistic<br />
classifier in the output layer. You can draw this using a slightly different type of neural network<br />
diagram than the ones we have seen so far; in particular, you would have edges that go directly<br />
from the first layer input nodes to the third layer output node, "skipping over" the hidden layer.) <br />
However, so long as we are using fine-tuning, the "concatenation" representation <br />
usually has little advantage over the "replacement" representation. Thus, when using fine-tuning, we will usually do so<br />
with a network built using the replacement representation. (If you are not using fine-tuning however,<br />
then sometimes the concatenation representation can give much better performance.) <br />
<br />
When should we use fine-tuning? It is typically used only if you have a large labeled training <br />
set; in this setting, fine-tuning can significantly improve the performance of your classifier. <br />
However, if you<br />
have a large ''unlabeled'' dataset (for unsupervised feature learning/pre-training) and<br />
only a relatively small labeled training set, then fine-tuning is significantly less likely to<br />
help.<br />
<br />
<br />
{{CNN}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:Self-Taught_LearningExercise:Self-Taught Learning2011-05-26T11:02:52Z<p>Watsuen: </p>
<hr />
<div>===Overview===<br />
<br />
In this exercise, we will use the self-taught learning paradigm with the sparse autoencoder and softmax classifier to build a classifier for handwritten digits.<br />
<br />
You will be building upon your code from the earlier exercises. First, you will train your sparse autoencoder on an "unlabeled" training dataset of handwritten digits. This produces features that are penstroke-like. We then extract these learned features from a labeled dataset of handwritten digits. These features will then be used as inputs to the softmax classifier that you wrote in the previous exercise. <br />
<br />
Concretely, for each example in the labeled training dataset <math>\textstyle x_l</math>, we forward propagate the example to obtain the activation of the hidden units <math>\textstyle a^{(2)}</math>. We now represent this example using <math>\textstyle a^{(2)}</math> (the "replacement" representation), and use this as the new feature representation with which to train the softmax classifier. <br />
<br />
Finally, we also extract the same features from the test data to obtain predictions.<br />
<br />
In this exercise, our goal is to distinguish between the digits from 0 to 4. We will use the digits 5 to 9 as our <br />
"unlabeled" dataset with which to learn the features; we will then use a labeled dataset with the digits 0 to 4 with<br />
which to train the softmax classifier. <br />
<br />
In the starter code, we have provided a file '''<tt>stlExercise.m</tt>''' that will help walk you through the steps in this exercise.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/ MNIST Dataset]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
* [http://ufldl.stanford.edu/wiki/resources/stl_exercise.zip Starter Code (stl_exercise.zip)]<br />
<br />
You will also need your code from the following exercises:<br />
* [[Exercise:Sparse Autoencoder]]<br />
* [[Exercise:Vectorization]]<br />
* [[Exercise:Softmax Regression]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
===Step 1: Generate the input and test data sets===<br />
<br />
Download and decompress <tt>[http://ufldl.stanford.edu/wiki/resources/stl_exercise.zip stl_exercise.zip]</tt>, which contains starter code for this exercise. Additionally, you will need to download the datasets from the MNIST Handwritten Digit Database for this project.<br />
<br />
===Step 2: Train the sparse autoencoder===<br />
<br />
Next, use the unlabeled data (the digits from 5 to 9) to train a sparse autoencoder, using the same <tt>sparseAutoencoderCost.m</tt> function as you had written in the previous exercise. (From the earlier exercise, you should have a working and vectorized implementation of the sparse autoencoder.) For us, the training step took less than 25 minutes on a fast desktop. When training is complete, you should get a visualization of pen strokes like the image shown below: <br />
<br />
[[File:selfTaughtFeatures.png]]<br />
<br />
Informally, the features learned by the sparse autoencoder should correspond to penstrokes.<br />
<br />
===Step 3: Extracting features===<br />
<br />
After the sparse autoencoder is trained, you will use it to extract features from the handwritten digit images. <br />
<br />
Complete <tt>feedForwardAutoencoder.m</tt> to produce a matrix whose columns correspond to activations of the hidden layer for each example, i.e., the vector <math>a^{(2)}</math> corresponding to activation of layer 2. (Recall that we treat the inputs as layer 1).<br />
<br />
After completing this step, calling <tt>feedForwardAutoencoder.m</tt> should convert the raw image data to hidden unit activations <math>a^{(2)}</math>.<br />
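In NumPy terms, the computation is a sketch like the following (illustrative, not the MATLAB starter code; the one-column-per-example convention follows the exercise):

```python
import numpy as np

def feed_forward_autoencoder(W1, b1, data):
    """Compute the hidden activations a^(2) = sigmoid(W1 * data + b1).
    `data` has one example per column; the result has one column of
    hidden-unit activations per example."""
    return 1.0 / (1.0 + np.exp(-(W1 @ data + b1[:, None])))
```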
<br />
===Step 4: Training and testing the logistic regression model===<br />
<br />
Use your code from the softmax exercise (<tt>softmaxTrain.m</tt>) to train a softmax classifier using the training set features (<tt>trainFeatures</tt>) and labels (<tt>trainLabels</tt>).<br />
<br />
===Step 5: Classifying on the test set===<br />
<br />
Finally, complete the code to make predictions on the test set (<tt>testFeatures</tt>) and see how your learned features perform! If you've done all the steps correctly, you should get an accuracy of about '''98%'''. <br />
<br />
As a comparison, when ''raw pixels'' are used (instead of the learned features), we obtained a test accuracy of only around 96% (for the same train and test sets).<br />
<br />
[[Category:Exercises]]<br />
<br />
<br />
{{STL}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Self-Taught_LearningSelf-Taught Learning2011-05-26T11:02:37Z<p>Watsuen: </p>
<hr />
<div>== Overview ==<br />
<br />
Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable <br />
ways to get better performance is to give the algorithm more data. This has led to the <br />
aphorism that in<br />
machine learning, "sometimes it's not who has the best algorithm that wins; it's <br />
who has the most data." <br />
<br />
One can always try to get more labeled data, but this can be expensive. In<br />
particular, researchers have already gone to extraordinary lengths to use tools<br />
such as AMT (Amazon Mechanical Turk) to get large training sets. While having<br />
large numbers of people hand-label lots of data is probably a step forward<br />
compared to having large numbers of researchers hand-engineer features, it<br />
would be nice to do better. In particular, the promise of '''self-taught learning'''<br />
and '''unsupervised feature learning''' is that if we can get our algorithms to learn<br />
from ''unlabeled'' data, then we can easily obtain and learn from massive<br />
amounts of it. Even though a single unlabeled example is less informative than<br />
a single labeled example, if we can get tons of the former---for example, by downloading<br />
random unlabeled images/audio clips/text documents off the<br />
internet---and if our algorithms can exploit this unlabeled data effectively,<br />
then we might be able to achieve better performance than the massive<br />
hand-engineering and massive hand-labeling approaches.<br />
<br />
In self-taught learning and unsupervised feature learning, we will give our<br />
algorithms a large amount of unlabeled data with which to learn a good feature<br />
representation of the input. If we are trying to solve a specific<br />
classification task, then we take this learned feature representation and<br />
whatever (perhaps small amount of) labeled data we have for that classification task, and apply<br />
supervised learning on that labeled data to solve the classification task.<br />
<br />
These ideas probably have the most powerful effects in problems where we have a lot of<br />
unlabeled data, and a smaller amount of labeled data. However,<br />
they typically give good results even if we have only<br />
labeled data (in which case we usually perform the feature learning step using<br />
the labeled data, but ignoring the labels).<br />
<br />
== Learning features ==<br />
<br />
We have already seen how an autoencoder can be used to learn features from<br />
unlabeled data. Concretely, suppose we have an unlabeled<br />
training set <math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math> <br />
with <math>\textstyle m_u</math> unlabeled examples. (The subscript "u" stands for<br />
"unlabeled.") We can then train a sparse autoencoder on this data <br />
(perhaps with appropriate whitening or other pre-processing):<br />
<br />
[[File:STL_SparseAE.png|350px]]<br />
<br />
Having trained the parameters <math>\textstyle W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}</math> of this model,<br />
given any new input <math>\textstyle x</math>, we can now compute the corresponding vector of<br />
activations <math>\textstyle a</math> of the hidden units. As we saw previously, this often gives a<br />
better representation of the input than the original raw input <math>\textstyle x</math>. We can also<br />
visualize the algorithm for computing the features/activations <math>\textstyle a</math> as the following<br />
neural network:<br />
<br />
[[File:STL_SparseAE_Features.png|300px]]<br />
<br />
This is just the sparse autoencoder that we previously had, with the final<br />
layer removed. <br />
<br />
Now, suppose we have a labeled training set <math>\textstyle \{ (x_l^{(1)}, y^{(1)}),<br />
(x_l^{(2)}, y^{(2)}), \ldots (x_l^{(m_l)}, y^{(m_l)}) \}</math> of <math>\textstyle m_l</math> examples. <br />
(The subscript "l" stands for "labeled.") <br />
We can now find a better representation for the inputs. In particular, rather<br />
than representing the first training example as <math>\textstyle x_l^{(1)}</math>, we can feed<br />
<math>\textstyle x_l^{(1)}</math> as the input to our autoencoder, and obtain the corresponding<br />
vector of activations <math>\textstyle a_l^{(1)}</math>. To represent this example, we can either<br />
simply '''replace''' the original feature vector with <math>\textstyle a_l^{(1)}</math>,<br />
or '''concatenate''' the two feature vectors together,<br />
getting a representation <math>\textstyle (x_l^{(1)}, a_l^{(1)})</math>. <br />
<br />
Thus, our training set now becomes <br />
<math>\textstyle \{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots (a_l^{(m_l)}, y^{(m_l)})<br />
\}</math> (if we use the replacement representation, and use <math>\textstyle a_l^{(i)}</math> to represent the <br />
<math>\textstyle i</math>-th training example), or <math>\textstyle \{<br />
((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots, <br />
((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}</math> (if we use the concatenated<br />
representation). In practice, the concatenated representation often works<br />
better; but for memory or computational reasons, we will sometimes use<br />
the replacement representation as well. <br />
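<br />
As a concrete sketch of the two options, here they are in Python (illustrative only; the exercises themselves use Matlab, and the helper name below is ours):<br />

```python
def build_features(x_labeled, a_labeled, mode="concatenate"):
    """Build the new training inputs from the original features and the
    autoencoder activations.

    x_labeled, a_labeled : lists of feature vectors (lists), one per example
    mode                 : "replace" uses a_l^(i) alone;
                           "concatenate" uses (x_l^(i), a_l^(i))."""
    if mode == "replace":
        return [list(a) for a in a_labeled]
    elif mode == "concatenate":
        return [list(x) + list(a) for x, a in zip(x_labeled, a_labeled)]
    raise ValueError("unknown mode: %s" % mode)
```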
<br />
Finally, we can train a supervised learning algorithm such as an SVM, logistic<br />
regression, etc. to obtain a function that makes predictions on the <math>\textstyle y</math> values. <br />
Given a test example <math>\textstyle x_{\rm test}</math>, we would then follow the same procedure:<br />
First, feed it to the autoencoder to get <math>\textstyle a_{\rm test}</math>. Then, feed <br />
either <math>\textstyle a_{\rm test}</math> or <math>\textstyle (x_{\rm test}, a_{\rm test})</math> to the trained classifier to get a prediction.<br />
<br />
== On pre-processing the data == <br />
<br />
During the feature learning stage where we were learning from the unlabeled training set <br />
<math>\textstyle \{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)}\}</math>, we may have computed<br />
various pre-processing parameters. For example, one may have computed<br />
a mean value of the data and subtracted off this mean to perform mean normalization,<br />
or used PCA to compute a matrix <math>\textstyle U</math> to represent the data as <math>\textstyle U^Tx</math> (or used <br />
PCA <br />
whitening or ZCA whitening). If this is the case, then it is important to<br />
save away these preprocessing parameters, and to use the ''same'' parameters<br />
during the labeled training phase and the test phase, so as to make sure<br />
we are always transforming the data the same way to feed into the autoencoder. <br />
In particular, if we have computed a matrix <math>\textstyle U</math> using the unlabeled data and PCA,<br />
we should keep the ''same'' matrix <math>\textstyle U</math> and use it to preprocess the<br />
labeled examples and the test data. We should '''not''' re-estimate a<br />
different <math>\textstyle U</math> matrix (or data mean for mean normalization, etc.) using the<br />
labeled training set, since that might result in a dramatically different<br />
pre-processing transformation, which would make the input distribution to<br />
the autoencoder very different from what it was actually trained on.<br />
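<br />
To make the point concrete, here is a minimal Python sketch for mean normalization; the same pattern applies to a PCA or whitening matrix <math>\textstyle U</math>. The function names are illustrative, not part of the starter code:<br />

```python
def fit_mean(unlabeled):
    """Estimate the per-feature mean on the UNLABELED training set only."""
    m = len(unlabeled)
    n = len(unlabeled[0])
    return [sum(x[j] for x in unlabeled) / m for j in range(n)]

def apply_mean(data, mu):
    """Subtract the SAVED mean; applied identically to the unlabeled set,
    the labeled training set, and the test set."""
    return [[xj - mj for xj, mj in zip(x, mu)] for x in data]
```

The key point is that <tt>fit_mean</tt> is called once, on the unlabeled data, and its output is then reused everywhere; it is never re-estimated on the labeled training set or the test set.<br />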
<br />
== On the terminology of unsupervised feature learning == <br />
<br />
There are two common unsupervised feature learning settings, depending on what type of <br />
unlabeled data you have. The more general and powerful setting is the '''self-taught learning'''<br />
setting, which does not assume that your unlabeled data <math>x_u</math> has to<br />
be drawn from the same distribution as your labeled data <math>x_l</math>. The <br />
more restrictive setting where the unlabeled data comes from exactly the same <br />
distribution as the labeled data is sometimes called the '''semi-supervised learning''' <br />
setting. This distinction is best explained with an example, which we now give. <br />
<br />
Suppose your goal is a computer vision task where you'd like<br />
to distinguish between images of cars and images of motorcycles; so, each labeled<br />
example in your training set is either an image of a car or an image of a motorcycle. <br />
Where can we get lots of unlabeled data? The easiest way would be to obtain some<br />
random collection of images, perhaps downloaded off the internet. We could then <br />
train the autoencoder on this large collection of images, and obtain useful features<br />
from them. Because here the unlabeled data is drawn from a different distribution<br />
than the labeled data (i.e., perhaps some of our unlabeled images may contain<br />
cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we<br />
call this self-taught learning. <br />
<br />
In contrast, if we happen to have lots of unlabeled images lying around<br />
that are all images of ''either'' a car or a motorcycle, but where the data<br />
is just missing its label (so you don't know which ones are cars, and which<br />
ones are motorcycles), then we could use this form of unlabeled data to<br />
learn the features. This setting---where each unlabeled example is drawn from the same<br />
distribution as your labeled examples---is sometimes called the semi-supervised <br />
setting. In practice, we often do not have this sort of unlabeled data (where would you<br />
get a database of images where every image is either a car or a motorcycle, but<br />
just missing its label?), and so in the context of learning features from unlabeled<br />
data, the self-taught learning setting is more broadly applicable.<br />
<br />
<br />
{{STL}}</div>

Exercise:Softmax Regression (http://ufldl.stanford.edu/wiki/index.php/Exercise:Softmax_Regression, 2011-05-26)
<hr />
<div>== Softmax regression ==<br />
<br />
In this problem set, you will use [[softmax regression]] to classify MNIST images. The goal of this exercise is to build a softmax classifier that you will be able to reuse in the future exercises and also on other classification problems that you might encounter.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/softmax_exercise.zip softmax_exercise.zip]</tt>, we have provided some starter code. You should write your code in the places indicated by "YOUR CODE HERE" in the files. <br />
<br />
In the starter code, you will need to modify '''<tt>softmaxCost.m</tt>''' and '''<tt>softmaxPredict.m</tt>''' for this exercise.<br />
<br />
We have also provided '''<tt>softmaxExercise.m</tt>''' that will help walk you through the steps in this exercise.<br />
<br />
=== Dependencies ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/ MNIST Dataset]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
* [http://ufldl.stanford.edu/wiki/resources/softmax_exercise.zip Starter Code (softmax_exercise.zip)]<br />
<br />
You will also need:<br />
* <tt>computeNumericalGradient.m</tt> from [[Exercise:Sparse Autoencoder]]<br />
<br />
''If you have not completed the exercises listed above, we strongly suggest you complete them first.''<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
We've provided the code for this step in <tt>softmaxExercise.m</tt>.<br />
<br />
Two constants, <tt>inputSize</tt> and <tt>numClasses</tt>, corresponding to the size of each input vector and the number of class labels respectively, have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise. We also initialize <tt>lambda</tt>, the weight decay parameter, here.<br />
<br />
=== Step 1: Load data ===<br />
<br />
The starter code loads the MNIST images and labels into <tt>inputData</tt> and <tt>labels</tt> respectively. The images are pre-processed to scale the pixel values to the range <math>[0, 1]</math>, and the label 0 is remapped to 10 for convenience of implementation, so that the labels take values in <math>\{1, 2, \ldots, 10\}</math>. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.<br />
<br />
=== Step 2: Implement softmaxCost ===<br />
<br />
In <tt>softmaxCost.m</tt>, implement code to compute the softmax cost function <math>J(\theta)</math>. Remember to include the weight decay term in the cost as well. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later). <br />
<br />
It is important to vectorize your code so that it runs quickly. We also provide several implementation tips below:<br />
<br />
{{Quote|<br />
Note: In the provided starter code, <tt>theta</tt> is a matrix where the ''j<sup>th</sup> row'' is <math>\theta_j^T</math> <br />
}}<br />
<br />
'''Implementation Tip''': Computing the ground truth matrix - In your code, you may need to compute the ground truth matrix <tt>M</tt>, such that <tt>M(r, c)</tt> is 1 if <math>y^{(c)} = r</math> and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions <tt>sparse</tt> and <tt>full</tt>. Specifically, the command <tt>M = sparse(r, c, v)</tt> creates a sparse matrix such that <tt>M(r(i), c(i)) = v(i)</tt> for all <tt>i</tt>. That is, the vectors <tt>r</tt> and <tt>c</tt> give the positions of the elements whose values we wish to set, and <tt>v</tt> the corresponding values of those elements. Running <tt>full</tt> on a sparse matrix gives a "full" (dense) representation of the matrix (meaning that Matlab will no longer try to represent it as a sparse matrix in memory). The code for using <tt>sparse</tt> and <tt>full</tt> to compute the ground truth matrix has already been included in <tt>softmaxCost.m</tt>.<br />
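<br />
For readers not working in Matlab, the same matrix can be built directly. This hypothetical Python sketch mirrors <tt>full(sparse(labels, 1:numCases, 1))</tt>:<br />

```python
def ground_truth(labels, num_classes):
    """Build M with M[r][c] = 1 if labels[c] == r + 1, else 0.

    Labels are 1-indexed, as in the exercise, so class j is stored
    in row j - 1 of the returned num_classes x m matrix."""
    m = len(labels)
    M = [[0] * m for _ in range(num_classes)]
    for c, y in enumerate(labels):
        M[y - 1][c] = 1
    return M
```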
<br />
<br />
'''Implementation Tip:''' Preventing overflows - in softmax regression, you will have to compute the hypothesis<br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) = <br />
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_k^T x^{(i)} } \\<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
When the products <math>\theta_j^T x^{(i)}</math> are large, the exponential function <math>e^{\theta_j^T x^{(i)}}</math> will become very large and possibly overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution - observe that we can multiply the top and bottom of the hypothesis by some constant without changing the output: <br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) &=<br />
<br />
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_k^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ e^{-\alpha} }{ e^{-\alpha} \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_k^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ 1 }{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} - \alpha }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} - \alpha } \\<br />
e^{ \theta_2^T x^{(i)} - \alpha } \\<br />
\vdots \\<br />
e^{ \theta_k^T x^{(i)} - \alpha } \\<br />
\end{bmatrix} \\<br />
<br />
<br />
\end{align}<br />
<br />
</math><br />
<br />
Hence, to prevent overflow, simply subtract some large constant value from each of the <math>\theta_j^T x^{(i)}</math> terms before computing the exponential. In practice, for each example, you can use the maximum of the <math>\theta_j^T x^{(i)}</math> terms as the constant. Assuming you have a matrix <tt>M</tt> containing these terms such that <tt>M(r, c)</tt> is <math>\theta_r^T x^{(c)}</math>, then you can use the following code to accomplish this:<br />
<br />
% M is the matrix as described in the text<br />
M = bsxfun(@minus, M, max(M, [], 1));<br />
<br />
<tt>max(M, [], 1)</tt> yields a row vector whose elements are the maximum values of the corresponding columns of <tt>M</tt>. <tt>bsxfun</tt> (short for binary singleton expansion function) applies the <tt>minus</tt> operation to each row of <tt>M</tt>, hence subtracting each column's maximum from every element in that column. <br />
<br />
'''Implementation Tip: ''' Computing the predictions - you may also find <tt>bsxfun</tt> useful in computing your predictions - if you have a matrix <tt>M</tt> containing the <math>e^{\theta_j^T x^{(i)}}</math> terms, such that <tt>M(r, c)</tt> contains the <math>e^{\theta_r^T x^{(c)}}</math> term, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):<br />
<br />
% M is the matrix as described in the text<br />
M = bsxfun(@rdivide, M, sum(M))<br />
<br />
The operation of <tt>bsxfun</tt> in this case is analogous to the earlier example.<br />
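<br />
Putting the two tips together, the full numerically stable hypothesis computation looks like the following pure-Python sketch (illustrative only; the exercise itself uses the Matlab <tt>bsxfun</tt> lines above, with <tt>M(r, c)</tt> holding <math>\theta_r^T x^{(c)}</math>):<br />

```python
import math

def softmax_columns(M):
    """Column-wise softmax of a k x m score matrix M: subtract each column's
    maximum before exponentiating (the overflow trick), then divide each
    column by its sum to get probabilities."""
    k, m = len(M), len(M[0])
    out = [[0.0] * m for _ in range(k)]
    for c in range(m):
        col_max = max(M[r][c] for r in range(k))
        exps = [math.exp(M[r][c] - col_max) for r in range(k)]
        s = sum(exps)
        for r in range(k):
            out[r][c] = exps[r] / s
    return out
```

Even when the raw scores are as large as 1000, no overflow occurs, and each column of the result sums to 1.<br />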
<br />
=== Step 3: Gradient checking ===<br />
<br />
Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of <math>10^{-9}</math>. <br />
<br />
'''Implementation Tip:''' Faster gradient checking - when debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable <tt>DEBUG</tt> to true, as described in step 1 of the code.<br />
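<br />
If you do not have <tt>computeNumericalGradient.m</tt> handy, the underlying idea is just central differences. A Python sketch of the same idea (names are ours):<br />

```python
def numerical_gradient(J, theta, eps=1e-4):
    """Central-difference approximation to the gradient of a scalar cost J
    at the parameter vector theta (a flat list of floats)."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad
```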
<br />
=== Step 4: Learning parameters ===<br />
<br />
Now that you've verified that your gradients are correct, you can train your softmax model using the function <tt>softmaxTrain</tt> in <tt>softmaxTrain.m</tt>. <tt>softmaxTrain</tt> uses the L-BFGS algorithm, as implemented in <tt>minFunc</tt>. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, and take less than 5 minutes for 100 iterations.<br />
<br />
Factoring <tt>softmaxTrain</tt> out as a function means that you will be able to easily reuse it to train softmax models on other data sets in the future by invoking the function with different parameters.<br />
<br />
Use the following parameter when training your softmax classifier:<br />
<br />
lambda = 1e-4<br />
<br />
=== Step 5: Testing ===<br />
<br />
Now that you've trained your model, you will test it against the MNIST test set, comprising 10000 28x28 images. However, to do so, you will first need to complete the function <tt>softmaxPredict</tt> in <tt>softmaxPredict.m</tt>, a function which generates predictions for input data under a trained softmax model. <br />
<br />
Once that is done, you will be able to compute the accuracy (the proportion of correctly classified images) of your model using the code provided. Our implementation achieved an accuracy of '''92.6%'''. If your model's accuracy is significantly less (less than 91%), check your code, ensure that you are using the trained weights, and that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.<br />
<br />
[[Category:Exercises]]<br />
<br />
<br />
{{Softmax}}</div>

Softmax Regression (http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression, 2011-05-26)
<hr />
<div>== Introduction ==<br />
<br />
In these notes, we describe the '''Softmax regression''' model. This model generalizes logistic regression to<br />
classification problems where the class label <math>y</math> can take on more than two possible values.<br />
This will be useful for such problems as MNIST digit classification, where the goal is to distinguish between 10 different<br />
numerical digits. Softmax regression is a supervised learning algorithm, but we will later be<br />
using it in conjunction with our deep learning/unsupervised feature learning methods.<br />
<br />
Recall that in logistic regression, we had a training set<br />
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math><br />
of <math>m</math> labeled examples, where the input features are <math>x^{(i)} \in \Re^{n+1}</math>. <br />
(In this set of notes, we will use the notational convention of letting the feature vectors <math>x</math> be<br />
<math>n+1</math> dimensional, with <math>x_0 = 1</math> corresponding to the intercept term.) <br />
With logistic regression, we were in the binary classification setting, so the labels <br />
were <math>y^{(i)} \in \{0,1\}</math>. Our hypothesis took the form:<br />
<br />
<math>\begin{align}<br />
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},<br />
\end{align}</math><br />
<br />
and the model parameters <math>\theta</math> were trained to minimize<br />
the cost function<br />
<br />
<math><br />
\begin{align}<br />
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]<br />
\end{align}<br />
</math><br />
<br />
In the softmax regression setting, we are interested in multi-class<br />
classification (as opposed to only binary classification), and so the label<br />
<math>y</math> can take on <math>k</math> different values, rather than only<br />
two. Thus, in our training set<br />
<math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math>,<br />
we now have that <math>y^{(i)} \in \{1, 2, \ldots, k\}</math>. (Note that<br />
our convention will be to index the classes starting from 1, rather than from 0.) For example,<br />
in the MNIST digit recognition task, we would have <math>k=10</math> different classes.<br />
<br />
Given a test input <math>x</math>, we want our hypothesis to estimate<br />
the probability <math>p(y=j | x)</math> for each value of <math>j = 1, \ldots, k</math>.<br />
I.e., we want to estimate the probability of the class label taking<br />
on each of the <math>k</math> different possible values. Thus, our hypothesis<br />
will output a <math>k</math> dimensional vector (whose elements sum to 1) giving<br />
us our <math>k</math> estimated probabilities. Concretely, our hypothesis<br />
<math>h_{\theta}(x)</math> takes the form:<br />
<br />
<math><br />
\begin{align}<br />
h_\theta(x^{(i)}) =<br />
\begin{bmatrix}<br />
p(y^{(i)} = 1 | x^{(i)}; \theta) \\<br />
p(y^{(i)} = 2 | x^{(i)}; \theta) \\<br />
\vdots \\<br />
p(y^{(i)} = k | x^{(i)}; \theta)<br />
\end{bmatrix}<br />
=<br />
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix}<br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_k^T x^{(i)} } \\<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
Here <math>\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}</math> are the<br />
parameters of our model. <br />
Notice that<br />
the term <math>\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } </math><br />
normalizes the distribution, so that it sums to one. <br />
<br />
For convenience, we will also write <br />
<math>\theta</math> to denote all the<br />
parameters of our model. When you implement softmax regression, it is usually<br />
convenient to represent <math>\theta</math> as a <math>k</math>-by-<math>(n+1)</math> matrix obtained by<br />
stacking up <math>\theta_1, \theta_2, \ldots, \theta_k</math> in rows, so that<br />
<br />
<math><br />
\theta = \begin{bmatrix}<br />
\mbox{---} \theta_1^T \mbox{---} \\<br />
\mbox{---} \theta_2^T \mbox{---} \\<br />
\vdots \\<br />
\mbox{---} \theta_k^T \mbox{---} \\<br />
\end{bmatrix}<br />
</math><br />
<br />
<!--<br />
To extend the logistic regression framework which only outputs a single<br />
probability value, we consider a hypothesis that outputs K values (summing to<br />
1) that represent the predicted probability distribution. Formally, let us<br />
consider the classification problem where we have <math>m</math><br />
<math>k</math>-dimensional inputs <math>x^{(1)}, x^{(2)}, \ldots,<br />
x^{(m)}</math> with corresponding class labels <math>y^{(1)}, y^{(2)}, \ldots,<br />
y^{(m)}</math>, where <math>y^{(i)} \in \{1, 2, \ldots, n\}</math>, with<br />
<math>n</math> being the number of classes.<br />
<br />
where we trained the logistic regression weights to optimize the (conditional)<br />
log-likelihood of the dataset using <math> p(y|x) = h_\theta(x) </math>. In<br />
softmax regression, we are interested in multi-class problems where each<br />
example (input image) is assigned to one of <tt>k</tt> labels. One example of a<br />
multi-class classification problem would be classifying digits on the MNIST<br />
dataset where each example has label 1 of 10 possible labels (i.e., where it is<br />
the digit 0, 1, ... or 9).<br />
<br />
If you have seen (in a different course) with the justification of logistic regression as<br />
maximum likelihood estimation, you have have recognized this as<br />
maximizing <math> \sum_i \log p(y^{(i)}|x^{(i)};\theta) = - J(\theta)</math>.<br />
<br />
<br />
''Strictly speaking, we only need <math>n - 1</math> parameters for<br />
<math>n</math> classes, but for convenience, we use <math>n</math> parameters<br />
in our derivation.''<br />
<br />
Now, this hypothesis defines a predicted probability distribution given some<br />
<tt>x</tt>, <math>P(y | x^{(i)}) = h(x^{(i)}) </math>. Thus to train the<br />
model, a natural choice is to maximize the (conditional) log-likelihood of the<br />
data, <math>l(\theta; x, y) = \sum_{i=1}^{m} \ln { P(y^{(i)} | x^{(i)})<br />
}</math>.<br />
<br />
{{Quote| Motivation: One reason for selecting this form of hypotheses comes<br />
from connections to linear discriminant analysis. In particular, if one assumes<br />
a generative model for the data in the form <math>p(x,y) = p(y) \times p(x |<br />
y)</math> and selects for <math>p(x | y)</math> a member of the exponential<br />
family (which includes Gaussians, Poissons, etc.) it is possible to show that<br />
the conditional probability <math>p(y | x)</math> has the same form as our<br />
chosen hypotheses <math>h(x)</math>. For more details, see the<br />
[http://www.stanford.edu/class/cs229/notes/cs229-notes2.pdf CS 229 Lecture 2<br />
Notes]. }}<br />
!--><br />
<br />
== Cost Function ==<br />
<br />
We now describe the cost function that we'll use for softmax regression. In the equation below, <math>1\{\cdot\}</math> is<br />
the '''indicator function,''' so that <math>1\{\hbox{a true statement}\}=1</math>, and <math>1\{\hbox{a false statement}\}=0</math>.<br />
For example, <math>1\{2+2=4\}</math> evaluates to 1; whereas <math>1\{1+1=5\}</math> evaluates to 0. Our cost function will be:<br />
<br />
<math><br />
\begin{align}<br />
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }}\right]<br />
\end{align}<br />
</math><br />
<br />
Notice that this generalizes the logistic regression cost function, which could also have been written:<br />
<br />
<math><br />
\begin{align}<br />
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\<br />
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]<br />
\end{align}<br />
</math><br />
<br />
The softmax cost function is similar, except that we now sum over the <math>k</math> different possible values<br />
of the class label. Note also that in softmax regression, we have that<br />
<math><br />
p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }<br />
</math>.<br />
<br />
There is no known closed-form way to solve for the minimum of <math>J(\theta)</math>, and thus as usual we'll resort to an iterative<br />
optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:<br />
<br />
<math><br />
\begin{align}<br />
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] }<br />
\end{align}<br />
</math><br />
<br />
<!--<br />
where as usual<br />
<math>p(y^{(i)} = j | x^{(i)} ; \theta) = e^{\theta_j^T x^{(i)}}/(\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} )</math>. !--><br />
<br />
Recall the meaning of the "<math>\nabla_{\theta_j}</math>" notation. In particular, <math>\nabla_{\theta_j} J(\theta)</math><br />
is itself a vector, so that its <math>l</math>-th element is <math>\frac{\partial J(\theta)}{\partial \theta_{jl}}</math>,<br />
the partial derivative of <math>J(\theta)</math> with respect to the <math>l</math>-th element of <math>\theta_j</math>. <br />
<br />
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it<br />
minimize <math>J(\theta)</math>. For example, with the standard implementation of gradient descent, on each iteration<br />
we would perform the update <math>\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)</math> (for each <math>j=1,\ldots,k</math>).<br />
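<br />
The cost and gradient formulas above translate directly into code. Below is a compact, purely illustrative Python sketch (no weight decay, 1-indexed labels as in these notes; it is not a drop-in replacement for the Matlab exercise code):<br />

```python
import math

def softmax_probs(theta, x):
    """p(y = j | x; theta) for j = 1..k; theta is a list of k weight vectors."""
    scores = [sum(t * xi for t, xi in zip(tj, x)) for tj in theta]
    mx = max(scores)                       # subtract max for stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cost_and_grad(theta, X, y):
    """J(theta) = -(1/m) sum_i log p(y_i | x_i) and its gradient,
    grad[j][l] = -(1/m) sum_i x_l^(i) (1{y_i = j} - p(y_i = j | x_i))."""
    m, k, n = len(X), len(theta), len(theta[0])
    J = 0.0
    grad = [[0.0] * n for _ in range(k)]
    for x, yi in zip(X, y):
        p = softmax_probs(theta, x)
        J -= math.log(p[yi - 1])
        for j in range(k):
            ind = 1.0 if yi == j + 1 else 0.0
            for l in range(n):
                grad[j][l] -= x[l] * (ind - p[j]) / m
    return J / m, grad
```

A quick finite-difference check on a tiny problem confirms the analytic gradient matches the formula.<br />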
<br />
When implementing softmax regression, we will typically use a modified version of the cost function described above;<br />
specifically, one that incorporates weight decay. We describe the motivation and details below.<br />
<br />
== Properties of softmax regression parameterization ==<br />
<br />
Softmax regression has an unusual property: its parameterization is "redundant." To explain what this means, <br />
suppose we take each of our parameter vectors <math>\theta_j</math>, and subtract some fixed vector <math>\psi</math><br />
from it, so that every <math>\theta_j</math> is now replaced with <math>\theta_j - \psi</math> <br />
(for every <math>j=1, \ldots, k</math>). Our hypothesis<br />
now estimates the class label probabilities as<br />
<br />
<math><br />
\begin{align}<br />
p(y^{(i)} = j | x^{(i)} ; \theta)<br />
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\<br />
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^Tx^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^Tx^{(i)}}} \\<br />
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.<br />
\end{align}<br />
</math><br />
<br />
In other words, subtracting <math>\psi</math> from every <math>\theta_j</math><br />
does not affect our hypothesis' predictions at all! This shows that softmax<br />
regression's parameters are "redundant." More formally, we say that our<br />
softmax model is '''overparameterized,''' meaning that for any hypothesis we might<br />
fit to the data, there are multiple parameter settings that give rise to exactly<br />
the same hypothesis function <math>h_\theta</math> mapping from inputs <math>x</math><br />
to the predictions. <br />
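<br />
This invariance is easy to check numerically. A small Python sketch (illustrative only, with names of our choosing):<br />

```python
import math

def class_probs(theta, x):
    """Softmax class probabilities for a single example x."""
    scores = [sum(t * xi for t, xi in zip(tj, x)) for tj in theta]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def shift_all(theta, psi):
    """Replace every theta_j by theta_j - psi."""
    return [[t - p for t, p in zip(tj, psi)] for tj in theta]
```

Shifting every <math>\theta_j</math> by the same <math>\psi</math> leaves all class probabilities unchanged, which is exactly the redundancy described above.<br />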
<br />
Further, if the cost function <math>J(\theta)</math> is minimized by some<br />
setting of the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>,<br />
then it is also minimized by <math>(\theta_1 - \psi, \theta_2 - \psi,\ldots,<br />
\theta_k - \psi)</math> for any value of <math>\psi</math>. Thus, the<br />
minimizer of <math>J(\theta)</math> is not unique. (Interestingly, <br />
<math>J(\theta)</math> is still convex, and thus gradient descent will<br />
not run into local optima problems. But the Hessian is singular/non-invertible,<br />
which causes a straightforward implementation of Newton's method to run into<br />
numerical problems.) <br />
<br />
Notice also that by setting <math>\psi = \theta_1</math>, one can always<br />
replace <math>\theta_1</math> with <math>\theta_1 - \psi = \vec{0}</math> (the vector of all<br />
0's), without affecting the hypothesis. Thus, one could "eliminate" the vector<br />
of parameters <math>\theta_1</math> (or any other <math>\theta_j</math>, for<br />
any single value of <math>j</math>), without harming the representational power<br />
of our hypothesis. Indeed, rather than optimizing over the <math>k(n+1)</math><br />
parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math> (where<br />
<math>\theta_j \in \Re^{n+1}</math>), one could instead set <math>\theta_1 =<br />
\vec{0}</math> and optimize only with respect to the <math>(k-1)(n+1)</math><br />
remaining parameters, and this would work fine. <br />
<br />
In practice, however, it is often cleaner and simpler to implement the version which keeps<br />
all the parameters <math>(\theta_1, \theta_2,\ldots, \theta_k)</math>, without<br />
arbitrarily setting one of them to zero. But we will<br />
make one change to the cost function: Adding weight decay. This will take care of<br />
the numerical problems associated with softmax regression's overparameterized representation.<br />
<br />
== Weight Decay ==<br />
<br />
<!--<br />
Fortunately, this turns out to be a convex optimization problem, and thus algorithms such as<br />
gradient descent and L-BFGS are guaranteed to converge to the global minimum.<br />
!--><br />
<br />
We will modify the cost function by adding a weight decay term <br />
<math>\textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2</math><br />
which penalizes large values of the parameters. Our cost function is now<br />
<br />
<math><br />
\begin{align}<br />
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]<br />
+ \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2<br />
\end{align}<br />
</math><br />
<br />
With this weight decay term (for any <math>\lambda > 0</math>), the cost function<br />
<math>J(\theta)</math> is now strictly convex, and is guaranteed to have a<br />
unique solution. The Hessian is now invertible, and because <math>J(\theta)</math> is <br />
convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed<br />
to converge to the global minimum.<br />
<br />
To apply an optimization algorithm, we also need the derivative of this<br />
new definition of <math>J(\theta)</math>. One can show that the derivative is:<br />
<math><br />
\begin{align}<br />
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) ) \right] } + \lambda \theta_j<br />
\end{align}<br />
</math><br />
<br />
By minimizing <math>J(\theta)</math> with respect to <math>\theta</math>, we will have a working implementation of softmax regression.<br />
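As a sketch of what such an implementation might look like — in Python/NumPy rather than the tutorial's MATLAB, with a function name and interface of our own choosing — the weight-decayed cost and gradient above can be computed as:<br />

```python
import numpy as np

def softmax_cost_grad(theta, X, y, lam):
    """Weight-decayed softmax cost J(theta) and its gradient.

    theta: (k, n+1) parameters; X: (n+1, m) inputs, one example per column;
    y: (m,) labels in 0..k-1; lam: the weight decay parameter lambda.
    """
    m = X.shape[1]
    scores = theta @ X
    scores -= scores.max(axis=0)           # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=0)                     # P[j, i] = p(y^(i) = j | x^(i))
    ind = np.zeros_like(P)
    ind[y, np.arange(m)] = 1.0             # indicator 1{y^(i) = j}
    cost = -np.sum(ind * np.log(P)) / m + lam / 2.0 * np.sum(theta ** 2)
    grad = -(ind - P) @ X.T / m + lam * theta
    return cost, grad
```

A finite-difference check of the gradient (perturb one entry of <math>\theta</math> and compare against the analytic gradient) is a good way to validate such an implementation before handing it to L-BFGS.<br />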
<br />
<br />
<!--<br />
Expanding the log-likelihood expression, we find that:<br />
<br />
<math><br />
\begin{align}<br />
\ell(\theta) &= \ln L(\theta; x, y) \\<br />
&= \ln \prod_{i=1}^{m}{ P(y^{(i)} | x^{(i)}) } \\<br />
&= \sum_{i=1}^{m}{ \ln \frac{ e^{ \theta^T_{y^{(i)}} x^{(i)} } }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} } } \\<br />
&= \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]}<br />
\end{align}<br />
</math><br />
<br />
<br />
<math><br />
\begin{align}<br />
\frac{\partial \ell(\theta)}{\partial \theta_k} &= \frac{\partial}{\partial \theta_k} \sum_{i=1}^{m}{\left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }}\right]} \\<br />
<br />
&= \sum_{i=1}^{m}{ \left[ I_{ \{ y^{(i)} = k\} } x^{(i)} - \frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\cdot<br />
e^{ \theta_k^T x^{(i)} }<br />
\cdot<br />
x^{(i)} \right]}<br />
\qquad \text{(where } I_{ \{ y^{(i)} = k\} } \text{is 1 when } y^{(i)} = k \text{ and 0 otherwise) } \\<br />
<br />
&= \sum_{i=1}^{m}{ \left[ x^{(i)} ( 1\{ y^{(i)} = k\} - P(y^{(i)} = k | x^{(i)}) ) \right] }<br />
\end{align}<br />
</math><br />
<br />
With this, we can now find a set of parameters that maximizes <math>\ell(\theta)</math>, for instance by using L-BFGS with minFunc.<br />
!--><br />
<br />
== Relationship to Logistic Regression ==<br />
<br />
In the special case where <math>k = 2</math>, one can show that softmax regression reduces to logistic regression.<br />
This shows that softmax regression is a generalization of logistic regression. Concretely, when <math>k=2</math>,<br />
the softmax regression hypothesis outputs<br />
<br />
<math><br />
\begin{align}<br />
h_\theta(x) &=<br />
<br />
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }<br />
\begin{bmatrix}<br />
e^{ \theta_1^T x } \\<br />
e^{ \theta_2^T x }<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
Taking advantage of the fact that this hypothesis<br />
is overparameterized and setting <math>\psi = \theta_1</math>,<br />
we can subtract <math>\theta_1</math> from each of the two parameters, giving us<br />
<br />
<math><br />
\begin{align}<br />
h(x) &=<br />
<br />
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }<br />
\begin{bmatrix}<br />
e^{ \vec{0}^T x } \\<br />
e^{ (\theta_2-\theta_1)^T x }<br />
\end{bmatrix} \\<br />
<br />
<br />
&=<br />
\begin{bmatrix}<br />
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\<br />
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }<br />
\end{bmatrix} \\<br />
<br />
&=<br />
\begin{bmatrix}<br />
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\<br />
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
<br />
Thus, replacing <math>\theta_2-\theta_1</math> with a single parameter vector <math>\theta'</math>, we find<br />
that softmax regression predicts the probability of one of the classes as<br />
<math>\frac{1}{ 1 + e^{ (\theta')^T x } }</math>,<br />
and that of the other class as<br />
<math>1 - \frac{1}{ 1 + e^{ (\theta')^T x } }</math>,<br />
the same as logistic regression.<br />
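The <math>k=2</math> reduction above can be confirmed numerically; a small Python/NumPy sketch, with random parameters chosen only for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.normal(size=3)
theta2 = rng.normal(size=3)
x = rng.normal(size=3)

# Two-class softmax probability of the first class
p_softmax = np.exp(theta1 @ x) / (np.exp(theta1 @ x) + np.exp(theta2 @ x))

# Logistic regression with theta' = theta2 - theta1
p_logistic = 1.0 / (1.0 + np.exp((theta2 - theta1) @ x))

assert np.isclose(p_softmax, p_logistic)   # the two agree exactly
```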
<br />
== Softmax Regression vs. k Binary Classifiers ==<br />
<br />
Suppose you are working on a music classification application, and there are<br />
<math>k</math> types of music that you are trying to recognize. Should you use a<br />
softmax classifier, or should you build <math>k</math> separate binary classifiers using<br />
logistic regression?<br />
<br />
This will depend on whether your classes are ''mutually exclusive.'' For example,<br />
if your four classes are classical, country, rock, and jazz, then assuming each<br />
of your training examples is labeled with exactly one of these four class labels,<br />
you should build a softmax classifier with <math>k=4</math>.<br />
(If there are also some examples that belong to none of the above four classes,<br />
then you can set <math>k=5</math> in softmax regression, and also have a fifth, "none of the above," class.)<br />
<br />
If, however, your categories are has_vocals, dance, soundtrack, and pop, then the<br />
classes are not mutually exclusive; for example, there can be a piece of pop<br />
music that comes from a soundtrack and in addition has vocals. In this case, it<br />
would be more appropriate to build 4 binary logistic regression classifiers. <br />
This way, for each new musical piece, your algorithm can separately decide whether<br />
it falls into each of the four categories.<br />
<br />
Now, consider a computer vision example, where you're trying to classify images into<br />
three different classes. (i) Suppose that your classes are indoor_scene,<br />
outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression<br />
or three logistic regression classifiers? (ii) Now suppose your classes are<br />
indoor_scene, black_and_white_image, and image_has_people. Would you use softmax<br />
regression or multiple logistic regression classifiers?<br />
<br />
In the first case, the classes are mutually exclusive, so a softmax regression<br />
classifier would be appropriate. In the second case, it would be more appropriate to build<br />
three separate logistic regression classifiers.<br />
<br />
<br />
{{Softmax}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:PCA_and_WhiteningExercise:PCA and Whitening2011-05-26T11:01:35Z<p>Watsuen: </p>
<hr />
<div>== PCA and Whitening on natural images == <br />
<br />
In this exercise, you will implement PCA, PCA whitening and ZCA whitening, and apply them to image patches taken from natural images. <br />
<br />
You will build on the MATLAB starter code which we have provided in <tt>[http://ufldl.stanford.edu/wiki/resources/pca_exercise.zip pca_exercise.zip]</tt>. You need only write code at the places indicated by "YOUR CODE HERE" in the files. The only file you need to modify is <tt>pca_gen.m</tt>.<br />
<br />
=== Step 0: Prepare data ===<br />
==== Step 0a: Load data ====<br />
<br />
The starter code contains code to load a set of natural images and sample 12x12 patches from them. The raw patches will look something like this:<br />
<br />
[[File:raw_images.png|240px|alt=Raw patches|Raw patches]]<br />
<br />
These patches are stored as column vectors <math>x^{(i)} \in \mathbb{R}^{144}</math> in the <math>144 \times 10000</math> matrix <math>x</math>.<br />
<br />
==== Step 0b: Zero mean the data ====<br />
<br />
First, for each image patch, compute the mean pixel value and subtract it from that patch, thus centering it around zero. You should compute a different mean value for each image patch.<br />
<br />
=== Step 1: Implement PCA ===<br />
<br />
==== Step 1a: Implement PCA ====<br />
<br />
In this step, you will implement PCA to obtain <math>x_{\rm rot}</math>, the matrix in which the data is "rotated" to the basis comprising the principal components (i.e. the eigenvectors of <math>\Sigma</math>). Note that in this part of the exercise, you should ''not'' whiten the data.<br />
<br />
==== Step 1b: Check covariance ====<br />
<br />
To verify that your implementation of PCA is correct, you should check the covariance matrix of the rotated data <math>x_{\rm rot}</math>. PCA guarantees that this covariance matrix is diagonal (non-zero entries only along the main diagonal). Implement code to compute the covariance matrix and verify this property. One way to do this is to compute the covariance matrix and visualise it using the MATLAB command <tt>imagesc</tt>. The image should show a coloured diagonal line against a blue background. For this dataset, because of the range of the diagonal entries, the diagonal line may not be apparent, so you might get a figure like the one shown below, but this trick of visualising with <tt>imagesc</tt> will come in handy later in this exercise. <br />
<br />
[[File:pca_covar.png|360px]]<br />
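If you prefer to check the property programmatically rather than visually, the computation can be sketched as follows (Python/NumPy, with random data standing in for the image patches; the starter code itself is MATLAB):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 144, 10000
x = rng.normal(size=(n, m))            # stand-in for the 12x12 image patches
x -= x.mean(axis=0)                    # zero-mean each patch (column)

sigma = x @ x.T / m                    # covariance matrix of the data
U, S, _ = np.linalg.svd(sigma)         # columns of U: eigenvectors of sigma
x_rot = U.T @ x                        # data rotated into the PCA basis

cov_rot = x_rot @ x_rot.T / m
off_diag = cov_rot - np.diag(np.diag(cov_rot))
assert np.abs(off_diag).max() < 1e-8   # covariance is (numerically) diagonal
```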
<br />
=== Step 2: Find number of components to retain ===<br />
<br />
Next, choose <math>k</math>, the number of principal components to retain. Pick <math>k</math> to be as small as possible, but so that at least 99% of the variance is retained. In the step after this, you will discard all but the top <math>k</math> principal components, reducing the dimension of the original data to <math>k</math>.<br />
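One possible way to pick <math>k</math> is from the cumulative sum of the eigenvalues; a Python/NumPy sketch with a hypothetical helper name:<br />

```python
import numpy as np

def smallest_k(eigenvalues, target=0.99):
    """Smallest k whose top-k components retain at least `target` variance.

    `eigenvalues` must be sorted in decreasing order (e.g. diag(S) from svd).
    """
    frac = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(frac, target) + 1)

# Toy example: the top two of [5, 3, 1, 1] retain 8/10 = 80% of the variance
assert smallest_k(np.array([5.0, 3.0, 1.0, 1.0]), target=0.8) == 2
```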
<br />
=== Step 3: PCA with dimension reduction ===<br />
<br />
Now that you have found <math>k</math>, compute <math>\tilde{x}</math>, the reduced-dimension representation of the data. This gives you a representation of each image patch as a <math>k</math> dimensional vector instead of a 144 dimensional vector. If you are training a sparse autoencoder or other algorithm on this reduced-dimensional data, it will run faster than if you were training on the original 144 dimensional data. <br />
<br />
To see the effect of dimension reduction, go back from <math>\tilde{x}</math> to produce the matrix <math>\hat{x}</math>, the dimension-reduced data but expressed in the original 144 dimensional space of image patches. Visualise <math>\hat{x}</math> and compare it to the raw data, <math>x</math>. You will observe that there is little loss due to throwing away the principal components that correspond to dimensions with low variation. For comparison, you may also wish to generate and visualise <math>\hat{x}</math> for when only 90% of the variance is retained. <br />
<br />
<table><br />
<tr><br />
<td>[[File:raw_images.png|240px|alt=Raw images|Raw images]]</td> <br />
<td>[[File:pca_images.png|240px|alt=PCA dimension-reduced images (99% variance)|PCA dimension-reduced images (99% variance)]]</td><br />
<td>[[File:pca_images_90.png|240px|alt=PCA dimension-reduced images (90% variance)|PCA dimension-reduced images (90% variance)]]</td><br />
</tr><br />
<tr><br />
<td>Raw images <br /> &nbsp; </td><br />
<td>PCA dimension-reduced images<br /> (99% variance)</td><br />
<td>PCA dimension-reduced images<br /> (90% variance)</td><br />
</tr><br />
</table><br />
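The reduction and reconstruction steps can be sketched as follows (Python/NumPy with random stand-in data; in the exercise you would use the image patches and the <math>k</math> found in Step 2):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 144, 10000, 60                # k as chosen in the previous step
x = rng.normal(size=(n, m))
x -= x.mean(axis=0)                     # zero-mean each column (patch)

sigma = x @ x.T / m
U, S, _ = np.linalg.svd(sigma)

x_tilde = U[:, :k].T @ x                # (k, m) reduced representation
x_hat = U[:, :k] @ x_tilde              # back in the original 144-dim space

# The average squared reconstruction error equals the discarded
# eigenvalue mass, so little is lost when the tail eigenvalues are small
err = np.sum((x - x_hat) ** 2) / m
assert np.isclose(err, np.sum(S[k:]), rtol=1e-6)
```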
<br />
=== Step 4: PCA with whitening and regularization ===<br />
<br />
==== Step 4a: Implement PCA with whitening and regularization ====<br />
<br />
Now implement PCA with whitening and regularization to produce the matrix <math>x_{\rm PCAwhite}</math>. Use the following parameter value:<br />
<br />
epsilon = 0.1<br />
<br />
==== Step 4b: Check covariance ====<br />
<br />
Similar to using PCA alone, PCA with whitening also results in processed data that has a diagonal covariance matrix. However, unlike PCA alone, whitening additionally ensures that the diagonal entries are equal to 1, i.e. that the covariance matrix is the identity matrix. <br />
<br />
That would be the case if you were whitening alone, with no regularization. However, in this case you are whitening with regularization, to avoid numerical problems associated with small eigenvalues. As a result, some of the diagonal entries of the covariance of your <math>x_{\rm PCAwhite}</math> will be smaller than 1. <br />
<br />
To verify that your implementation of PCA whitening with and without regularization is correct, you can check these properties. Implement code to compute the covariance matrix and verify this property. (To check the result of PCA without whitening, simply set epsilon to 0, or close to 0, say <tt>1e-10</tt>). As earlier, you can visualise the covariance matrix with <tt>imagesc</tt>. When visualised as an image, for PCA whitening without regularization you should see a red line across the diagonal (corresponding to the one entries) against a blue background (corresponding to the zero entries); for PCA whitening with regularization you should see a red line that slowly turns blue across the diagonal (corresponding to the 1 entries slowly becoming smaller). <br />
<br />
<table><br />
<tr><br />
<td>[[File:pca_whitened_covar.png|360px|alt=Covariance for PCA whitening with regularization|Covariance for PCA whitening with regularization]]</td><br />
<td>[[File:pca_whitened_unregularised_covar.png|360px|alt=Covariance for PCA whitening with regularization|Covariance for PCA whitening without regularization]]</td><br />
</tr><br />
<tr><br />
<td><center>Covariance for PCA whitening with regularization</center></td><br />
<td><center>Covariance for PCA whitening without regularization</center></td><br />
</tr><br />
</table><br />
<br />
=== Step 5: ZCA whitening ===<br />
<br />
Now implement ZCA whitening to produce the matrix <math>x_{\rm ZCAwhite}</math>. Visualize <math>x_{\rm ZCAwhite}</math> and compare it to the raw data, <math>x</math>. You should observe that whitening results in, among other things, enhanced edges. Try repeating this with <tt>epsilon</tt> set to 1, 0.1, and 0.01, and see what you obtain. The example shown below (left image) was obtained with <tt>epsilon</tt> = 0.1. <br />
<br />
<table><br />
<tr><br />
<td><br />
[[File:zca_whitened_images.png|240px|alt=ZCA whitened images|ZCA whitened images]] <br />
</td><td><br />
[[File:raw_images.png|240px|alt=Raw images|Raw images]]<br />
</td><br />
</tr><br />
<tr><br />
<td>ZCA whitened images</td><br />
<td>Raw images</td><br />
</tr><br />
</table><br />
<br />
[[Category:Exercises]]<br />
<br />
<br />
{{PCA}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:PCA_in_2DExercise:PCA in 2D2011-05-26T11:01:21Z<p>Watsuen: </p>
<hr />
<div>== PCA, PCA whitening and ZCA whitening in 2D == <br />
<br />
In this exercise you will implement PCA, PCA whitening and ZCA whitening, as described in the earlier sections of this tutorial, and generate the images shown in the earlier sections yourself. You will build on the starter code that has been provided at [http://ufldl.stanford.edu/wiki/resources/pca_2d.zip pca_2d.zip]. You need only write code at the places indicated by "YOUR CODE HERE" in the files. The only file you need to modify is <tt>pca_2d.m</tt>. Implementing this exercise will make the next exercise significantly easier to understand and complete.<br />
<br />
=== Step 0: Load data ===<br />
<br />
The starter code contains code to load 45 2D data points. When plotted using the <tt>scatter</tt> function, the results should look like the following:<br />
<br />
[[File:raw_images_2d.png|400px|alt=Raw images|Raw images]]<br />
<br />
=== Step 1: Implement PCA ===<br />
<br />
In this step, you will implement PCA to obtain <math>x_{rot}</math>, the matrix in which the data is "rotated" to the basis comprising <math>\textstyle u_1, \ldots, u_n</math> made up of the principal components. As mentioned in the implementation notes, you should make use of MATLAB's <tt>svd</tt> function here.<br />
<br />
==== Step 1a: Finding the PCA basis ====<br />
<br />
Find <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, and draw two lines in your figure to show the resulting basis on top of the given data points. You may find it useful to use MATLAB's <tt>hold on</tt> and <tt>hold off</tt> functions. (After calling <tt>hold on</tt>, plotting functions such as <tt>plot</tt> will draw the new data on top of the previously existing figure rather than erasing and replacing it; and <tt>hold off</tt> turns this off.) You can use <tt>plot([x1,x2], [y1,y2], '-')</tt> to draw a line between <tt>(x1,y1)</tt> and <tt>(x2,y2)</tt>. Your figure should look like this: <br />
<br />
<br />
[[File:pca_2d_basis.png | 400px]]<br />
<br />
If you are doing this in Matlab, you will probably get a plot that's identical to ours. However, eigenvectors are defined only up to a sign. I.e., instead of returning <math>\textstyle u_1</math> as the first eigenvector, Matlab/Octave could just as easily have returned <math>\textstyle -u_1</math>, and similarly instead of <math>\textstyle u_2</math> Matlab/Octave could have returned <math>\textstyle -u_2</math>. So if you wound up with one or both of the eigenvectors pointing in a direction opposite (180 degrees difference) from what's shown above, that's okay too. <br />
<br />
==== Step 1b: Check xRot ====<br />
<br />
Compute <tt>xRot</tt>, and use the <tt>scatter</tt> function to check that <tt>xRot</tt> looks as it should, which should be something like the following:<br />
<br />
[[File:pca_xrot_2d.png|360px]]<br />
<br />
Because Matlab/Octave could have returned <math>\textstyle -u_1</math> and/or <math>\textstyle -u_2</math> instead of <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, it's also possible that you might have gotten a figure which is "flipped" or "reflected" along the <math>\textstyle x</math>- and/or <math>\textstyle y</math>-axis; a flipped/reflected version of this figure is also a completely correct result.<br />
<br />
=== Step 2: Dimension reduce and replot ===<br />
<br />
In the next step, set <math>k</math>, the number of components to retain, to be 1 (we have already done this for you). Compute the resulting <tt>xHat</tt> and plot the results. You should get the following (this figure should '''not''' be flipped along the <math>\textstyle x</math>- or <math>\textstyle y</math>-axis): <br />
<br />
[[File:pca_xhat_2d.png|400px]]<br />
<br />
=== Step 3: PCA Whitening ===<br />
Implement PCA whitening using the formula from the notes. Plot <tt>xPCAWhite</tt>, and verify that it looks like the following (a figure that is flipped/reflected on either/both axes is also correct): <br />
<br />
[[File:pca_white_2d.png|400px]]<br />
<br />
=== Step 4: ZCA Whitening ===<br />
Implement ZCA whitening and plot the results. The results should look like the following (this should not be flipped/reflected along the <math>\textstyle x</math>- or <math>\textstyle y</math>-axis):<br />
<br />
[[File:zca_white_2d.png|400px]]<br />
<br />
[[Category:Exercises]]<br />
<br />
<br />
{{PCA}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Implementing_PCA/WhiteningImplementing PCA/Whitening2011-05-26T11:01:05Z<p>Watsuen: </p>
<hr />
<div>In this section, we summarize the PCA, PCA whitening and ZCA whitening algorithms,<br />
and also describe how you can implement them using efficient linear algebra libraries.<br />
<br />
First, we need to ensure that the data has (approximately) zero mean. For natural images, we achieve this (approximately) by computing the mean value of each image patch and subtracting it from that patch. In Matlab, we can do this by using<br />
<br />
avg = mean(x, 1); % Compute the mean pixel intensity value separately for each patch. <br />
x = x - repmat(avg, size(x, 1), 1);<br />
<br />
Next, we need to compute <math>\textstyle \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T</math>. If you're implementing this in Matlab (or even if you're implementing this in C++, Java, etc., but have access to an efficient linear algebra library), doing it as an explicit sum is inefficient. Instead, we can compute this in one fell swoop as <br />
<br />
sigma = x * x' / size(x, 2);<br />
<br />
(Check the math yourself for correctness.) <br />
Here, we assume that <math>x</math> is a data structure that contains one training example per column (so, <math>x</math> is a <math>\textstyle n</math>-by-<math>\textstyle m</math> matrix). <br />
<br />
Next, PCA computes the eigenvectors of <math>\Sigma</math>. One could do this using the Matlab <tt>eig</tt> function. However, because <math>\Sigma</math> is a symmetric positive semi-definite matrix, it is more numerically reliable to do this using the <tt>svd</tt> function. Concretely, if you implement <br />
<br />
[U,S,V] = svd(sigma);<br />
<br />
then the matrix <math>U</math> will contain the eigenvectors of <math>\Sigma</math> (one eigenvector per column, sorted from the top eigenvector down), and the diagonal entries of the matrix <math>S</math> will contain the corresponding eigenvalues (also sorted in decreasing order). Because <math>\Sigma</math> is symmetric, the matrix <math>V</math> will be equal to <math>U</math>, and can be safely ignored.<br />
<br />
(Note: The <tt>svd</tt> function actually computes the singular vectors and singular values of a matrix, which for the special case of a symmetric positive semi-definite matrix---which is all that we're concerned with here---is equal to its eigenvectors and eigenvalues. A full discussion of singular vectors vs. eigenvectors is beyond the scope of these notes.)<br />
<br />
Finally, you can compute <math>\textstyle x_{\rm rot}</math> and <math>\textstyle \tilde{x}</math> as follows:<br />
<br />
xRot = U' * x; % rotated version of the data. <br />
xTilde = U(:,1:k)' * x; % reduced dimension representation of the data, <br />
% where k is the number of eigenvectors to keep<br />
<br />
This gives your PCA representation of the data in terms of <math>\textstyle \tilde{x} \in \Re^k</math>. <br />
Incidentally, if <math>x</math> is a <math>\textstyle n</math>-by-<math>\textstyle m</math> matrix containing all your training data, this is a vectorized<br />
implementation, and the expressions<br />
above work too for computing <math>x_{\rm rot}</math> and <math>\tilde{x}</math> for your entire training set<br />
all in one go. The resulting <br />
<math>x_{\rm rot}</math> and <math>\tilde{x}</math> will have one column corresponding to each training example. <br />
<br />
To compute the PCA whitened data <math>\textstyle x_{\rm PCAwhite}</math>, use <br />
<br />
xPCAwhite = diag(1./sqrt(diag(S) + epsilon)) * U' * x;<br />
<br />
Since <math>S</math>'s diagonal contains the eigenvalues <math>\textstyle \lambda_i</math>, <br />
this turns out to be a compact way <br />
of computing <math>\textstyle x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i} }{\sqrt{\lambda_i}}</math><br />
simultaneously for all <math>\textstyle i</math>. <br />
<br />
Finally, you can also compute the ZCA whitened data <math>\textstyle x_{\rm ZCAwhite}</math> as:<br />
<br />
xZCAwhite = U * diag(1./sqrt(diag(S) + epsilon)) * U' * x;<br />
<br />
<br />
{{PCA}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/WhiteningWhitening2011-05-26T11:00:52Z<p>Watsuen: </p>
<hr />
<div>== Introduction ==<br />
<br />
We have used PCA to reduce the dimension of the data. There is a closely related<br />
preprocessing step called '''whitening''' (or, elsewhere in the literature, '''sphering''')<br />
which is needed for some algorithms. If we are training on images,<br />
the raw input is redundant, since adjacent pixel values<br />
are highly correlated. The goal of whitening is to make the input less redundant; more formally,<br />
our desiderata are that our learning algorithm sees a training input where (i) the features are less<br />
correlated with each other, and (ii) the features all have the same variance.<br />
<br />
== 2D example ==<br />
<br />
We will first describe whitening using our previous 2D example. We will then <br />
describe how this can be combined with smoothing, and finally how to combine<br />
this with PCA. <br />
<br />
How can we make our input features uncorrelated with each other? We had<br />
already done this when computing <math>\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}</math>. <br />
Repeating our previous figure, our plot for <math>\textstyle x_{\rm rot}</math> was:<br />
<br />
[[File:PCA-rotated.png | 600px]]<br />
<br />
The covariance matrix of this data is given by:<br />
<br />
<math>\begin{align}<br />
\begin{bmatrix}<br />
7.29 & 0 \\<br />
0 & 0.69<br />
\end{bmatrix}.<br />
\end{align}</math><br />
<br />
(Note: Technically, many of the<br />
statements in this section about the "covariance" will be true only if the data<br />
has zero mean. In the rest of this section, we will take this assumption as<br />
implicit in our statements. However, even if the data's mean isn't exactly zero, <br />
the intuitions we're presenting here still hold true, and so this isn't something<br />
that you should worry about.)<br />
<br />
It is no accident that the diagonal values are <math>\textstyle \lambda_1</math> and <math>\textstyle \lambda_2</math>. <br />
Further, <br />
the off-diagonal entries are zero; thus, <br />
<math>\textstyle x_{{\rm rot},1}</math> and <math>\textstyle x_{{\rm rot},2}</math> are uncorrelated, satisfying one of our desiderata <br />
for whitened data (that the features be less correlated).<br />
<br />
To make each of our input features have unit variance, we can simply rescale<br />
each feature <math>\textstyle x_{{\rm rot},i}</math> by <math>\textstyle 1/\sqrt{\lambda_i}</math>. Concretely, we define<br />
our whitened data <math>\textstyle x_{{\rm PCAwhite}} \in \Re^n</math> as follows: <br />
:<math>\begin{align}<br />
x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i} }{\sqrt{\lambda_i}}. <br />
\end{align}</math><br />
Plotting <math>\textstyle x_{{\rm PCAwhite}}</math>, we get:<br />
<br />
[[File:PCA-whitened.png | 600px]]<br />
<br />
This data now has covariance equal to the identity matrix <math>\textstyle I</math>. We say that<br />
<math>\textstyle x_{{\rm PCAwhite}}</math> is our '''PCA whitened''' version of the data: The <br />
different components of <math>\textstyle x_{{\rm PCAwhite}}</math> are uncorrelated and have<br />
unit variance. <br />
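This claim is easy to check numerically; a Python/NumPy sketch with synthetic correlated 2D data (the tutorial's own examples use MATLAB):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2D data: one example per column, zero mean
L = np.linalg.cholesky(np.array([[2.0, 1.2], [1.2, 1.0]]))
x = L @ rng.normal(size=(2, 100000))
x -= x.mean(axis=1, keepdims=True)

m = x.shape[1]
sigma = x @ x.T / m
U, lam, _ = np.linalg.svd(sigma)              # lam: lambda_1 >= lambda_2
x_rot = U.T @ x                               # decorrelated features
x_pca_white = x_rot / np.sqrt(lam)[:, None]   # rescale by 1/sqrt(lambda_i)

cov = x_pca_white @ x_pca_white.T / m
assert np.allclose(cov, np.eye(2), atol=1e-8)  # identity covariance
```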
<br />
'''Whitening combined with dimensionality reduction.''' <br />
If you want to have data that is whitened and which is lower dimensional than<br />
the original input, you can also optionally keep only the top <math>\textstyle k</math> components of<br />
<math>\textstyle x_{{\rm PCAwhite}}</math>. When we combine PCA whitening with regularization<br />
(described later), the last few components of <math>\textstyle x_{{\rm PCAwhite}}</math> will be<br />
nearly zero anyway, and thus can safely be dropped.<br />
<br />
== ZCA Whitening == <br />
Finally, it turns out that this way of getting the <br />
data to have covariance identity <math>\textstyle I</math> isn't unique. <br />
Concretely, if <br />
<math>\textstyle R</math> is any orthogonal matrix, so that it satisfies <math>\textstyle RR^T = R^TR = I</math> (less formally,<br />
if <math>\textstyle R</math> is a rotation/reflection matrix),<br />
then <math>\textstyle R \,x_{\rm PCAwhite}</math> will also have identity covariance. <br />
In '''ZCA whitening''',<br />
we choose <math>\textstyle R = U</math>. We define <br />
:<math>\begin{align}<br />
x_{\rm ZCAwhite} = U x_{\rm PCAwhite}<br />
\end{align}</math><br />
Plotting <math>\textstyle x_{\rm ZCAwhite}</math>, we get: <br />
<br />
[[File:ZCA-whitened.png | 600px]]<br />
<br />
It can be shown that out of all possible choices for <math>\textstyle R</math>, <br />
this choice of rotation causes <math>\textstyle x_{\rm ZCAwhite}</math> to be as close as possible to the <br />
original input data <math>\textstyle x</math>. <br />
<br />
When using ZCA whitening (unlike PCA whitening), we usually keep all <math>\textstyle n</math> dimensions<br />
of the data, and do not try to reduce its dimension.<br />
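A quick numerical check (Python/NumPy sketch with synthetic data) that both PCA whitening and ZCA whitening produce identity covariance:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 50000))
x[1] += 0.8 * x[0]                      # correlate the two features
x -= x.mean(axis=1, keepdims=True)      # zero-mean each feature

m = x.shape[1]
sigma = x @ x.T / m
U, lam, _ = np.linalg.svd(sigma)
x_pca_white = (U.T @ x) / np.sqrt(lam)[:, None]
x_zca_white = U @ x_pca_white           # ZCA: rotate back with R = U

for xw in (x_pca_white, x_zca_white):   # both have identity covariance
    assert np.allclose(xw @ xw.T / m, np.eye(2), atol=1e-10)
```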
<br />
== Regularization ==<br />
When implementing PCA whitening or ZCA whitening in practice, sometimes some<br />
of the eigenvalues <math>\textstyle \lambda_i</math> will be numerically close to 0, and thus the scaling<br />
step where we divide by <math>\sqrt{\lambda_i}</math> would involve dividing by a value close to zero; this <br />
may cause the data to blow up (take on large values) or otherwise be numerically unstable. In practice, we <br />
therefore implement this scaling step using <br />
a small amount of regularization, and add a small constant <math>\textstyle \epsilon</math> <br />
to the eigenvalues before taking their square root and inverse:<br />
:<math>\begin{align}<br />
x_{{\rm PCAwhite},i} = \frac{x_{{\rm rot},i} }{\sqrt{\lambda_i + \epsilon}}.<br />
\end{align}</math><br />
When <math>\textstyle x</math> takes values around <math>\textstyle [-1,1]</math>, a value of <math>\textstyle \epsilon \approx 10^{-5}</math><br />
might be typical. <br />
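The effect of the regularization can be seen on a nearly rank-deficient dataset; a Python/NumPy sketch (the data here is synthetic and chosen only to make the small-eigenvalue problem visible):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# The second feature is almost an exact copy of the first, so the second
# eigenvalue of the covariance is tiny (~1e-12)
x0 = rng.normal(size=(1, 20000))
x = np.vstack([x0, x0 + 1e-6 * rng.normal(size=x0.shape)])
x -= x.mean(axis=1, keepdims=True)

sigma = x @ x.T / x.shape[1]
U, lam, _ = np.linalg.svd(sigma)

# Without regularization, the near-zero direction is amplified by roughly
# 1/sqrt(lambda_2) ~ 1e6, inflating what is essentially noise
unreg = (U.T @ x) / np.sqrt(lam)[:, None]
# With epsilon, that direction stays near zero instead
reg = (U.T @ x) / np.sqrt(lam + 1e-5)[:, None]

assert np.abs(unreg[1]).max() > 1e3 * np.abs(reg[1]).max()
```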
<br />
For the case of images, adding <math>\textstyle \epsilon</math> here also has the effect of slightly smoothing (or low-pass<br />
filtering) the input image. This also has a desirable effect of removing aliasing artifacts<br />
caused by the way pixels are laid out in an image, and can improve the features learned <br />
(details are beyond the scope of these notes). <br />
<br />
ZCA whitening is a form of pre-processing of the data that maps it from <math>\textstyle x</math> to<br />
<math>\textstyle x_{\rm ZCAwhite}</math>. It turns out that this is also a rough model of how the<br />
biological eye (the retina) processes images. Specifically, as your eye<br />
perceives images, most adjacent "pixels" in your eye will perceive very<br />
similar values, since adjacent parts of an image tend to be highly correlated<br />
in intensity. It is thus wasteful for your eye to have to transmit every pixel<br />
separately (via your optic nerve) to your brain. Instead, your retina performs<br />
a decorrelation operation (this is done via retinal neurons that compute a function<br />
called "on center, off surround/off center, on surround") which is similar to that<br />
performed by ZCA. This results in a less redundant representation of the input<br />
image, which is then transmitted to your brain.<br />
<br />
<br />
{{PCA}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/PCAPCA2011-05-26T11:00:36Z<p>Watsuen: </p>
<hr />
<div>== Introduction ==<br />
Principal Components Analysis (PCA) is a dimensionality reduction algorithm<br />
that can be used to significantly speed up your unsupervised feature learning<br />
algorithm. More importantly, understanding PCA will enable us to later<br />
implement '''whitening''', which is an important pre-processing step for many<br />
algorithms. <br />
<br />
Suppose you are training your algorithm on images. Then the input will be<br />
somewhat redundant, because the values of adjacent pixels in an image are<br />
highly correlated. Concretely, suppose we are training on 16x16 grayscale<br />
image patches. Then each input <math>\textstyle x \in \Re^{256}</math> is a 256 dimensional vector, with one<br />
feature <math>\textstyle x_j</math> corresponding to the intensity of each pixel. Because of the<br />
correlation between adjacent pixels, PCA will allow us to approximate the input with<br />
a much lower dimensional one, while incurring very little error.<br />
<br />
== Example and Mathematical Background ==<br />
<br />
For our running example, we will use a dataset <br />
<math>\textstyle \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}</math> with <br />
<math>\textstyle n=2</math> dimensional inputs, so that <br />
<math>\textstyle x^{(i)} \in \Re^2</math>.<br />
Suppose we want to reduce the data <br />
from 2 dimensions to 1. (In practice, we might want to reduce data<br />
from 256 to 50 dimensions, say; but using lower dimensional data in our example<br />
allows us to visualize the algorithms better.) Here is our dataset:<br />
<br />
[[File:PCA-rawdata.png|600px]]<br />
<br />
This data has already been pre-processed so that each of the features <math>\textstyle x_1</math> and <math>\textstyle x_2</math><br />
have about the same mean (zero) and variance. <br />
<br />
For the purpose of illustration, we have also colored each of the points one of<br />
three colors, depending on their <math>\textstyle x_1</math> value; these colors are not used by the<br />
algorithm, and are for illustration only.<br />
<br />
PCA will find a lower-dimensional subspace onto which to project our data. <br />
From visually examining the data, it appears that <math>\textstyle u_1</math> is the principal direction of <br />
variation of the data, and <math>\textstyle u_2</math> the secondary direction of variation:<br />
<br />
[[File:PCA-u1.png | 600px]]<br />
<br />
I.e., the data varies much more in the direction <math>\textstyle u_1</math> than <math>\textstyle u_2</math>. <br />
To more formally find the directions <math>\textstyle u_1</math> and <math>\textstyle u_2</math>, we first compute the matrix <math>\textstyle \Sigma</math><br />
as follows:<br />
:<math>\begin{align}<br />
\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T. <br />
\end{align}</math><br />
If <math>\textstyle x</math> has zero mean, then <math>\textstyle \Sigma</math> is exactly the covariance matrix of <math>\textstyle x</math>. (The symbol "<math>\textstyle \Sigma</math>", pronounced "Sigma", is the standard notation for denoting the covariance matrix. Unfortunately it looks just like the summation symbol, as in <math>\sum_{i=1}^n i</math>; but these are two different things.) <br />
<br />
It can then be shown that <math>\textstyle u_1</math>---the principal direction of variation of the data---is <br />
the top (principal) eigenvector of <math>\textstyle \Sigma</math>, and <math>\textstyle u_2</math> is<br />
the second eigenvector.<br />
<br />
Note: If you are interested in seeing a more formal mathematical derivation/justification of this result, see the CS229 (Machine Learning) lecture notes on PCA (link at bottom of this page). You won't need to do so to follow along this course, however. <br />
<br />
You can use standard numerical linear algebra software to find these eigenvectors (see Implementation Notes).<br />
Concretely, let us compute the eigenvectors of <math>\textstyle \Sigma</math>, and stack<br />
the eigenvectors in columns to form the matrix <math>\textstyle U</math>:<br />
:<math>\begin{align}<br />
U = <br />
\begin{bmatrix} <br />
| & | & & | \\<br />
u_1 & u_2 & \cdots & u_n \\<br />
| & | & & | <br />
\end{bmatrix} <br />
\end{align}</math><br />
Here, <math>\textstyle u_1</math> is the principal eigenvector (corresponding to the largest eigenvalue),<br />
<math>\textstyle u_2</math> is the second eigenvector, and so on. <br />
Also, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the corresponding eigenvalues. <br />
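<br />
In Matlab/Octave, <math>\textstyle \Sigma</math> and its eigenvectors can be computed as follows (a sketch, assuming the variable <tt>x</tt> is an <math>\textstyle n \times m</math> matrix whose columns are the zero-mean training examples). Since <math>\textstyle \Sigma</math> is symmetric positive semi-definite, <tt>svd</tt> returns its eigenvectors sorted by decreasing eigenvalue:<br />
<syntaxhighlight><br />
sigma = x * x' / size(x, 2);   % covariance matrix<br />
[U, S, V] = svd(sigma);        % columns of U are u_1, ..., u_n<br />
lambda = diag(S);              % eigenvalues, lambda_1 >= ... >= lambda_n<br />
</syntaxhighlight><br />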
<br />
The vectors <math>\textstyle u_1</math> and <math>\textstyle u_2</math> in our example form a new basis in which we <br />
can represent the data. Concretely, let <math>\textstyle x \in \Re^2</math> be some training example. Then <math>\textstyle u_1^Tx</math><br />
is the length (magnitude) of the projection of <math>\textstyle x</math> onto the vector <math>\textstyle u_1</math>. <br />
<br />
Similarly, <math>\textstyle u_2^Tx</math> is the magnitude of <math>\textstyle x</math> projected onto the vector <math>\textstyle u_2</math>.<br />
<br />
== Rotating the Data ==<br />
<br />
Thus, we can represent <math>\textstyle x</math> in the <math>\textstyle (u_1, u_2)</math>-basis by computing<br />
:<math>\begin{align}<br />
x_{\rm rot} = U^Tx = \begin{bmatrix} u_1^Tx \\ u_2^Tx \end{bmatrix} <br />
\end{align}</math><br />
(The subscript "rot" comes from the observation that this corresponds to<br />
a rotation (and possibly reflection) of the original data.)<br />
Let's take the entire training set and compute <br />
<math>\textstyle x_{\rm rot}^{(i)} = U^Tx^{(i)}</math> for every <math>\textstyle i</math>. Plotting this transformed data <br />
<math>\textstyle x_{\rm rot}</math>, we get: <br />
<br />
[[File:PCA-rotated.png|600px]]<br />
<br />
This is the training set rotated into the <math>\textstyle u_1</math>,<math>\textstyle u_2</math> basis. In the general<br />
case, <math>\textstyle U^Tx</math> will be the training set rotated into the basis <br />
<math>\textstyle u_1</math>,<math>\textstyle u_2</math>, ...,<math>\textstyle u_n</math>. <br />
<br />
One of the properties of <math>\textstyle U</math> is that it is an "orthogonal" matrix, which means<br />
that it satisfies <math>\textstyle U^TU = UU^T = I</math>. <br />
So if you ever need to go from the rotated vectors <math>\textstyle x_{\rm rot}</math> back to the <br />
original data <math>\textstyle x</math>, you can compute <br />
:<math>\begin{align}<br />
x = U x_{\rm rot} ,<br />
\end{align}</math><br />
because <math>\textstyle U x_{\rm rot} = UU^T x = x</math>.<br />
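<br />
Over the whole training set, the rotation and its inverse are one matrix multiplication each (a sketch, with <tt>x</tt> an <math>\textstyle n \times m</math> matrix of training examples and <tt>U</tt> as defined above):<br />
<syntaxhighlight><br />
xRot = U' * x;          % rotate every example into the u_1, ..., u_n basis<br />
xRecovered = U * xRot;  % rotate back; equals x exactly, since U*U' = I<br />
</syntaxhighlight><br />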
<br />
== Reducing the Data Dimension ==<br />
<br />
We see that the principal direction of variation of the data is the first<br />
dimension <math>\textstyle x_{{\rm rot},1}</math> of this rotated data. Thus, if we want to<br />
reduce this data to one dimension, we can set <br />
:<math>\begin{align}<br />
\tilde{x}^{(i)} = x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)} \in \Re.<br />
\end{align}</math><br />
More generally, if <math>\textstyle x \in \Re^n</math> and we want to reduce it to <br />
a <math>\textstyle k</math> dimensional representation <math>\textstyle \tilde{x} \in \Re^k</math> (where <math>\textstyle k < n</math>), we would<br />
take the first <math>\textstyle k</math> components of <math>\textstyle x_{\rm rot}</math>, which correspond to<br />
the top <math>\textstyle k</math> directions of variation. <br />
<br />
Another way of explaining PCA is that <math>\textstyle x_{\rm rot}</math> is an <math>\textstyle n</math> dimensional<br />
vector, where the first few components are likely to <br />
be large (e.g., in our example, we saw that <math>\textstyle x_{{\rm rot},1}^{(i)} = u_1^Tx^{(i)}</math> takes<br />
reasonably large values for most examples <math>\textstyle i</math>), and<br />
the later components are likely to be small (e.g., in our example, <br />
<math>\textstyle x_{{\rm rot},2}^{(i)} = u_2^Tx^{(i)}</math> was more likely to be small). What<br />
PCA does is <br />
drop the later (smaller) components of <math>\textstyle x_{\rm rot}</math>, and<br />
just approximate them with 0's. Concretely, our definition of <br />
<math>\textstyle \tilde{x}</math> can also be arrived at by using an approximation to<br />
<math>\textstyle x_{{\rm rot}}</math> where <br />
all but the first<br />
<math>\textstyle k</math> components are zeros. In other words, we have: <br />
:<math>\begin{align}<br />
\tilde{x} = <br />
\begin{bmatrix} <br />
x_{{\rm rot},1} \\<br />
\vdots \\ <br />
x_{{\rm rot},k} \\<br />
0 \\ <br />
\vdots \\ <br />
0 \\ <br />
\end{bmatrix}<br />
\approx <br />
\begin{bmatrix} <br />
x_{{\rm rot},1} \\<br />
\vdots \\ <br />
x_{{\rm rot},k} \\<br />
x_{{\rm rot},k+1} \\<br />
\vdots \\ <br />
x_{{\rm rot},n} <br />
\end{bmatrix}<br />
= x_{\rm rot} <br />
\end{align}</math><br />
In our example, this gives us the following plot of <math>\textstyle \tilde{x}</math> (using <math>\textstyle n=2, k=1</math>):<br />
<br />
[[File:PCA-xtilde.png | 600px]]<br />
<br />
However, since the final <math>\textstyle n-k</math> components of <math>\textstyle \tilde{x}</math> as defined above would<br />
always be zero, there is no need to keep these zeros around, and so we<br />
define <math>\textstyle \tilde{x}</math> as a <math>\textstyle k</math>-dimensional vector with just the first <math>\textstyle k</math> (non-zero) components. <br />
<br />
This also explains why we wanted to express our data in the <math>\textstyle u_1, u_2, \ldots, u_n</math> basis:<br />
Deciding which components to keep becomes just keeping the top <math>\textstyle k</math> components. When we<br />
do this, we also say that we are "retaining the top <math>\textstyle k</math> PCA (or principal) components."<br />
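<br />
In code, retaining the top <math>\textstyle k</math> components amounts to multiplying by the first <math>\textstyle k</math> columns of <math>\textstyle U</math> (a sketch, with <tt>x</tt> an <math>\textstyle n \times m</math> data matrix and <tt>U</tt> as defined earlier):<br />
<syntaxhighlight><br />
xTilde = U(:, 1:k)' * x;   % k-by-m matrix: top k components of each example<br />
</syntaxhighlight><br />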
<br />
== Recovering an Approximation of the Data ==<br />
<br />
Now, <math>\textstyle \tilde{x} \in \Re^k</math> is a lower-dimensional, "compressed" representation<br />
of the original <math>\textstyle x \in \Re^n</math>. Given <math>\textstyle \tilde{x}</math>, how can we recover an approximation <math>\textstyle \hat{x}</math> to <br />
the original value of <math>\textstyle x</math>? From an [[#Rotating the Data|earlier section]], we know that <math>\textstyle x = U x_{\rm rot}</math>. Further, <br />
we can think of <math>\textstyle \tilde{x}</math> as an approximation to <math>\textstyle x_{\rm rot}</math>, where we have<br />
set the last <math>\textstyle n-k</math> components to zeros. Thus, given <math>\textstyle \tilde{x} \in \Re^k</math>, we can <br />
pad it out with <math>\textstyle n-k</math> zeros to get our approximation to <math>\textstyle x_{\rm rot} \in \Re^n</math>. Finally, we pre-multiply<br />
by <math>\textstyle U</math> to get our approximation to <math>\textstyle x</math>. Concretely, we get <br />
:<math>\begin{align}<br />
\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix} <br />
= \sum_{i=1}^k u_i \tilde{x}_i.<br />
\end{align}</math><br />
The final equality above comes from the definition of <math>\textstyle U</math> [[#Example and Mathematical Background|given earlier]].<br />
(In a practical implementation, we wouldn't actually zero pad <math>\textstyle \tilde{x}</math> and then multiply<br />
by <math>\textstyle U</math>, since that would mean multiplying a lot of things by zeros; instead, we'd just <br />
multiply <math>\textstyle \tilde{x} \in \Re^k</math> with the first <math>\textstyle k</math> columns of <math>\textstyle U</math> as in the final expression above.)<br />
Applying this to our dataset, we get the following plot for <math>\textstyle \hat{x}</math>:<br />
<br />
[[File:PCA-xhat.png | 600px]]<br />
<br />
We are thus using a 1 dimensional approximation to the original dataset. <br />
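<br />
In Matlab/Octave, this recovery step is again a single multiplication by the first <math>\textstyle k</math> columns of <math>\textstyle U</math> (a sketch, assuming <tt>xTilde</tt> is the <math>\textstyle k \times m</math> matrix of compressed representations):<br />
<syntaxhighlight><br />
xHat = U(:, 1:k) * xTilde;   % n-by-m approximation to the original data<br />
</syntaxhighlight><br />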
<br />
If you are training an autoencoder or other unsupervised feature learning algorithm,<br />
the running time of your algorithm will depend on the dimension of the input. If you feed <math>\textstyle \tilde{x} \in \Re^k</math><br />
into your learning algorithm instead of <math>\textstyle x</math>, then you'll be training on a lower-dimensional<br />
input, and thus your algorithm might run significantly faster. For many datasets,<br />
the lower dimensional <math>\textstyle \tilde{x}</math> representation can be an extremely good approximation <br />
to the original, and using PCA this way can significantly speed up your algorithm while<br />
introducing very little approximation error.<br />
<br />
== Number of components to retain ==<br />
<br />
How do we set <math>\textstyle k</math>; i.e., how many PCA components should we retain? In our<br />
simple 2 dimensional example, it seemed natural to retain 1 out of the 2<br />
components, but for higher dimensional data, this decision is less trivial. If <math>\textstyle k</math> is<br />
too large, then we won't be compressing the data much; in the limit of <math>\textstyle k=n</math>,<br />
then we're just using the original data (but rotated into a different basis).<br />
Conversely, if <math>\textstyle k</math> is too small, then we might be using a very bad<br />
approximation to the data. <br />
<br />
To decide how to set <math>\textstyle k</math>, we will usually look at the '''percentage of variance retained''' <br />
for different values of <math>\textstyle k</math>. Concretely, if <math>\textstyle k=n</math>, then we have<br />
an exact representation of the data, and we say that 100% of the variance is<br />
retained. I.e., all of the variation of the original data is retained. <br />
Conversely, if <math>\textstyle k=0</math>, then we are approximating all the data with the zero vector,<br />
and thus 0% of the variance is retained. <br />
<br />
More generally, let <math>\textstyle \lambda_1, \lambda_2, \ldots, \lambda_n</math> be the eigenvalues <br />
of <math>\textstyle \Sigma</math> (sorted in decreasing order), so that <math>\textstyle \lambda_j</math> is the eigenvalue<br />
corresponding to the eigenvector <math>\textstyle u_j</math>. Then if we retain <math>\textstyle k</math> principal components, <br />
the percentage of variance retained is given by:<br />
:<math>\begin{align}<br />
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j}.<br />
\end{align}</math><br />
In our simple 2D example above, <math>\textstyle \lambda_1 = 7.29</math>, and <math>\textstyle \lambda_2 = 0.69</math>. Thus,<br />
by keeping only <math>\textstyle k=1</math> principal components, we retained <math>\textstyle 7.29/(7.29+0.69) = 0.913</math>,<br />
or 91.3% of the variance.<br />
<br />
A more formal definition of percentage of variance retained is beyond the scope<br />
of these notes. However, it is possible to show that <math>\textstyle \lambda_j =<br />
\frac{1}{m} \sum_{i=1}^m \left( x_{{\rm rot},j}^{(i)} \right)^2</math>. Thus, if <math>\textstyle \lambda_j \approx 0</math>, that shows that<br />
<math>\textstyle x_{{\rm rot},j}</math> is usually near 0 anyway, and we lose relatively little by<br />
approximating it with a constant 0. This also explains why we retain the top principal<br />
components (corresponding to the larger values of <math>\textstyle \lambda_j</math>) instead of the bottom<br />
ones. The top principal components <br />
<math>\textstyle x_{{\rm rot},j}</math> are the ones that are more variable and take on larger values, <br />
and for which we would incur a greater approximation error if we were to set them to zero. <br />
<br />
In the case of images, one common heuristic is to choose <math>\textstyle k</math> so as to retain 99% of<br />
the variance. In other words, we pick the smallest value of <math>\textstyle k</math> that satisfies <br />
:<math>\begin{align}<br />
\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^n \lambda_j} \geq 0.99. <br />
\end{align}</math><br />
Depending on the application, if you are willing to incur some <br />
additional error, values in the 90-98% range are also sometimes used. When you<br />
describe to others how you applied PCA, saying that you chose <math>\textstyle k</math> to retain 95% of<br />
the variance will also be a much more easily interpretable description than saying<br />
that you retained 120 (or whatever other number of) components.<br />
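<br />
This heuristic takes only a couple of lines to implement (a sketch, assuming <tt>lambda</tt> is the vector of eigenvalues of <math>\textstyle \Sigma</math> sorted in decreasing order):<br />
<syntaxhighlight><br />
% Smallest k that retains at least 99% of the variance<br />
varRetained = cumsum(lambda) / sum(lambda);<br />
k = find(varRetained >= 0.99, 1);<br />
</syntaxhighlight><br />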
<br />
== PCA on Images ==<br />
For PCA to work, usually we want each of the features <math>\textstyle x_1, x_2, \ldots, x_n</math><br />
to have a similar range of values to the others (and to have a mean close to<br />
zero). If you've used PCA on other applications before, you may therefore have<br />
separately pre-processed each feature to have zero mean and unit variance, by<br />
separately estimating the mean and variance of each feature <math>\textstyle x_j</math>. However,<br />
this isn't the pre-processing that we will apply to most types of images. Specifically,<br />
suppose we are training our algorithm on '''natural images''', so that <math>\textstyle x_j</math> is<br />
the value of pixel <math>\textstyle j</math>. By "natural images," we informally mean the type of image that<br />
a typical animal or person might see over their lifetime.<br />
<br />
Note: Usually we use images of outdoor scenes with grass, trees, etc., and cut out small (say 16x16) image patches randomly from these to train the algorithm. But in practice most feature learning algorithms are extremely robust to the exact type of image they are trained on, so most images taken with a normal camera, so long as they aren't excessively blurry and don't have strange artifacts, should work. <br />
<br />
When training on natural images, it makes little sense to estimate a separate mean and<br />
variance for each pixel, because the statistics in one part<br />
of the image should (theoretically) be the same as any other. <br />
This property of images is called '''stationarity.''' <br />
<br />
In detail, in order for PCA to work well, informally we require that (i) The<br />
features have approximately zero mean, and (ii) The different features have<br />
similar variances to each other. With natural images, (ii) is already<br />
satisfied even without variance normalization, and so we won't perform any <br />
variance normalization. <br />
(If you are training on audio data---say, on<br />
spectrograms---or on text data---say, bag-of-word vectors---we will usually not perform<br />
variance normalization either.) <br />
In fact, PCA is invariant to the scaling of<br />
the data, and will return the same eigenvectors regardless of the scaling of<br />
the input. More formally, if you multiply each feature vector <math>\textstyle x</math> by some<br />
positive number (thus scaling every feature in every training example by the<br />
same number), PCA's output eigenvectors will not change. <br />
<br />
So, we won't use variance normalization. The only normalization we need to<br />
perform then is mean normalization, to ensure that the features have a mean<br />
around zero. Depending on the application, very often we are not interested<br />
in how bright the overall input image is. For example, in object recognition<br />
tasks, the overall brightness of the image doesn't affect what objects<br />
there are in the image. More formally, we are not interested in the<br />
mean intensity value of an image patch; thus, we can subtract out this value,<br />
as a form of mean normalization. <br />
<br />
Concretely, if <math>\textstyle x^{(i)} \in \Re^{n}</math> are the (grayscale) intensity values of<br />
a 16x16 image patch (<math>\textstyle n=256</math>), we might normalize the intensity of each image<br />
<math>\textstyle x^{(i)}</math> as follows: <br />
<br />
<math>\mu^{(i)} := \frac{1}{n} \sum_{j=1}^n x^{(i)}_j</math><br />
<br />
<math>x^{(i)}_j := x^{(i)}_j - \mu^{(i)}</math>, for all <math>\textstyle j</math><br />
<br />
Note that the two steps above are done separately for each image <math>\textstyle x^{(i)}</math>,<br />
and that <math>\textstyle \mu^{(i)}</math> here is the mean intensity of the image <math>\textstyle x^{(i)}</math>. In particular,<br />
this is not the same thing as estimating a mean value separately for each pixel <math>\textstyle x_j</math>.<br />
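<br />
As a Matlab/Octave sketch (assuming the variable <tt>x</tt> holds one image patch per column), per-image mean normalization is:<br />
<syntaxhighlight><br />
% Subtract each image's own mean intensity (one mean per column)<br />
x = bsxfun(@minus, x, mean(x, 1));<br />
</syntaxhighlight><br />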
<br />
If you are training your algorithm on images other than natural images (for example, images of handwritten characters, or images of single isolated objects centered against a white background), other types of normalization might be worth considering, and the best choice may be application dependent. But when training on natural images, using the per-image mean normalization method as given in the equations above would be a reasonable default.<br />
<br />
== References ==<br />
http://cs229.stanford.edu<br />
<br />
<br />
{{PCA}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:VectorizationExercise:Vectorization2011-05-26T11:00:11Z<p>Watsuen: </p>
<hr />
<div>== Vectorization ==<br />
<br />
In the previous problem set, we implemented a sparse autoencoder for patches taken from natural images. In this problem set, you will vectorize your code to make it run much faster, and further adapt your sparse autoencoder to work on images of handwritten digits. Your network for learning from handwritten digits will be much larger than the one you'd trained on the natural images, and so using the original implementation would have been painfully slow. But with a vectorized implementation of the autoencoder, you will be able to get this to run in a reasonable amount of computation time. <br />
<br />
=== Support Code/Data ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz MNIST Dataset (Training Images)]<br />
* [http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz MNIST Dataset (Training Labels)]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
<br />
=== Step 1: Vectorize your Sparse Autoencoder Implementation ===<br />
<br />
Using the ideas from [[Vectorization]] and [[Neural Network Vectorization]], vectorize your implementation of <tt>sparseAutoencoderCost.m</tt>. In our implementation, we were able to remove all for-loops with the use of matrix operations and <tt>repmat</tt>. (If you want to play with more advanced vectorization ideas, also type <tt>help bsxfun</tt>. The <tt>bsxfun</tt> function provides an alternative to <tt>repmat</tt> for some of the vectorization steps, but is not necessary for this exercise). A vectorized version of our sparse autoencoder code ran in under one minute on a fast computer (for learning 25 features from 10000 8x8 image patches). <br />
<br />
(Note that you do not need to vectorize the code in the other files.)<br />
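<br />
For example, a bias vector can be added to every column of a matrix without a loop (a sketch; the variable names <tt>W1</tt>, <tt>b1</tt>, and <tt>x</tt> follow the earlier sparse autoencoder exercise):<br />
<syntaxhighlight><br />
% Two equivalent vectorized ways to add the bias b1 to every column of W1*x<br />
z2 = W1 * x + repmat(b1, 1, size(x, 2));<br />
z2 = bsxfun(@plus, W1 * x, b1);<br />
</syntaxhighlight><br />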
<br />
=== Step 2: Learn features for handwritten digits ===<br />
<br />
Now that you have vectorized the code, it is easy to learn larger sets of features on medium sized images. In this part of the exercise, you will use your sparse autoencoder to learn features for handwritten digits from the MNIST dataset. <br />
<br />
The MNIST data is available at [http://yann.lecun.com/exdb/mnist/]. Download the file <tt>train-images-idx3-ubyte.gz</tt> and decompress it. After obtaining the source images, you should use [[Using the MNIST Dataset | helper functions that we provide]] to load the data into Matlab as matrices. While the helper functions that we provide will load both the input examples <math>x</math> and the class labels <math>y</math>, for this assignment, you will only need the input examples <math>x</math> since the sparse autoencoder is an ''unsupervised'' learning algorithm. (In a later assignment, we will use the labels <math>y</math> as well.) <br />
<br />
The following set of parameters worked well for us to learn good features on the MNIST dataset:<br />
<br />
visibleSize = 28*28<br />
hiddenSize = 196<br />
sparsityParam = 0.1<br />
lambda = 3e-3<br />
beta = 3<br />
patches = first 10000 images from the MNIST dataset<br />
<br />
After 400 iterations of updates using minFunc, your autoencoder should have learned features that resemble pen strokes. In other words, the autoencoder has learned to represent handwritten characters in terms of the pen strokes that appear in an image. Our implementation takes around 15-20 minutes on a fast machine. Visualized, the features should look like the following image: <br />
<br />
[[File:mnistVectorizationEx.png|400px]]<br />
<br />
If your parameters are improperly tuned, or if your implementation of the autoencoder is buggy, you may get one of the following images instead:<br />
<br />
<table><br />
<tr><td>[[File:MNIST-false-bad-1.png|240px]]</td><td>[[File:MNIST-false-bad-2.png|240px]]</td></tr><br />
</table><br />
<br />
If your image looks like one of the above images, check your code and parameters again. Learning these features is a prelude to the later exercises, where we shall see how they will be useful for classification.<br />
<br />
<!-- === Natural images ===<br />
<br />
Use the following parameters for the natural images dataset:<br />
<br />
visibleSize = 14*14;<br />
hiddenSize = 196;<br />
sparsityParam = 0.035;<br />
lambda = 0.0003; <br />
beta = 5; <br />
<br />
As with the first problem, the autoencoder should learn edge features. Your code should run in under 10 minutes on a reasonably fast machine. If it takes significantly longer, check your code and ensure that it is vectorized.<br />
<br />
[[Category:Exercises]] --><br />
<br />
<br />
{{Vectorized Implementation}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Neural_Network_VectorizationNeural Network Vectorization2011-05-26T10:57:12Z<p>Watsuen: </p>
<hr />
<div>In this section, we derive a vectorized version of our neural network. In our earlier description of [[Neural Networks]], we had already given a partially vectorized implementation that is quite efficient if we are working with only a single example at a time. We now describe how to implement the algorithm so that it simultaneously processes multiple training examples. Specifically, we will do this for the forward propagation and backpropagation steps, as well as for learning a sparse set of features. <br />
<br />
== Forward propagation ==<br />
<br />
Consider a 3 layer neural network (with one input, one hidden, and one output layer), and suppose <tt>x</tt> is a column vector containing a single training example <math>x^{(i)} \in \Re^{n}</math>. Then the forward propagation step is given by: <br />
:<math>\begin{align}<br />
z^{(2)} &= W^{(1)} x + b^{(1)} \\<br />
a^{(2)} &= f(z^{(2)}) \\<br />
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\<br />
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})<br />
\end{align}</math><br />
This is a fairly efficient implementation for a single example. If we have <math>m</math> examples, then we would wrap a <tt>for</tt> loop around this. <br />
<br />
Concretely, following the [[Logistic Regression Vectorization Example]], let the Matlab/Octave variable <tt>x</tt> be a matrix containing the training inputs, so that <tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example. We can then implement forward propagation as: <br />
<syntaxhighlight> <br />
% Unvectorized implementation<br />
for i=1:m, <br />
z2 = W1 * x(:,i) + b1;<br />
a2 = f(z2);<br />
z3 = W2 * a2 + b2;<br />
h(:,i) = f(z3);<br />
end;<br />
</syntaxhighlight><br />
<br />
Can we get rid of the <tt>for</tt> loop? For many algorithms, we will represent intermediate stages of computation via vectors. For example, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> here are all column vectors that are used to compute the activations of the hidden and output layers. In order to take better advantage of parallelism and efficient matrix operations, we would like to ''have our algorithm operate simultaneously on many training examples''. Let us temporarily ignore <tt>b1</tt> and <tt>b2</tt> (say, set them to zero for now). We can then implement the following:<br />
<syntaxhighlight><br />
% Vectorized implementation (ignoring b1, b2)<br />
z2 = W1 * x;<br />
a2 = f(z2);<br />
z3 = W2 * a2;<br />
h = f(z3);<br />
</syntaxhighlight><br />
<br />
In this implementation, <tt>z2</tt>, <tt>a2</tt>, and <tt>z3</tt> are all matrices, with one column per training example. A common design pattern in vectorizing across training examples is this: whereas previously we had one column vector (such as <tt>z2</tt>) per training example, we now stack all of these column vectors together to form a single matrix. Concretely, in this example, <tt>a2</tt> becomes an <math>s_2</math> by <math>m</math> matrix (where <math>s_2</math> is the number of units in layer 2 of the network, and <math>m</math> is the number of training examples). The <math>i</math>-th column of <tt>a2</tt> then contains the activations of the hidden units (layer 2 of the network) when the <math>i</math>-th training example <tt>x(:,i)</tt> is input to the network. <br />
<br />
In the implementation above, we have assumed that the activation function <tt>f(z)</tt> takes as input a matrix <tt>z</tt>, and applies the activation function component-wise to the input. Note that your implementation of <tt>f(z)</tt> should use Matlab/Octave's matrix operations as much as possible, and avoid <tt>for</tt> loops as well. We illustrate this below, assuming that <tt>f(z)</tt> is the sigmoid activation function:<br />
<br />
<syntaxhighlight><br />
% Inefficient, unvectorized implementation of the activation function<br />
function output = unvectorized_f(z)<br />
output = zeros(size(z));<br />
for i=1:size(z,1), <br />
for j=1:size(z,2),<br />
output(i,j) = 1/(1+exp(-z(i,j)));<br />
end; <br />
end;<br />
end<br />
<br />
% Efficient, vectorized implementation of the activation function<br />
function output = vectorized_f(z)<br />
output = 1./(1+exp(-z)); % "./" is Matlab/Octave's element-wise division operator. <br />
end<br />
</syntaxhighlight><br />
<br />
Finally, our vectorized implementation of forward propagation above had ignored <tt>b1</tt> and <tt>b2</tt>. To incorporate those back in, we will use Matlab/Octave's built-in <tt>repmat</tt> function. We have:<br />
<syntaxhighlight><br />
% Vectorized implementation of forward propagation<br />
z2 = W1 * x + repmat(b1,1,m);<br />
a2 = f(z2);<br />
z3 = W2 * a2 + repmat(b2,1,m);<br />
h = f(z3);<br />
</syntaxhighlight><br />
The result of <tt>repmat(b1,1,m)</tt> is a matrix formed by taking the column vector <tt>b1</tt> and stacking <math>m</math> copies of it side by side, as follows:<br />
::<math><br />
\begin{bmatrix}<br />
| & | & & | \\<br />
{\rm b1} & {\rm b1} & \cdots & {\rm b1} \\<br />
| & | & & | <br />
\end{bmatrix}.<br />
</math><br />
This forms an <math>s_2</math> by <math>m</math> matrix. <br />
Thus, the result of adding this to <tt>W1 * x</tt> is that each column of the matrix gets <tt>b1</tt> added to it, as desired.<br />
See Matlab/Octave's documentation (type "<tt>help repmat</tt>") for more information. As a Matlab/Octave built-in function, <tt>repmat</tt> is very efficient as well, and runs much faster than if you were to implement the same thing yourself using a <tt>for</tt> loop. <br />
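For readers following along outside Matlab/Octave, the same forward pass can be sketched in NumPy, where array broadcasting plays the role of <tt>repmat</tt>; all sizes below are made-up illustrative values, not taken from any exercise:

```python
import numpy as np

def f(z):
    # Element-wise sigmoid; operates on a whole matrix at once.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n inputs, s2 hidden units, s3 outputs, m examples.
n, s2, s3, m = 4, 3, 2, 5
rng = np.random.default_rng(0)
W1 = rng.standard_normal((s2, n))
b1 = rng.standard_normal((s2, 1))
W2 = rng.standard_normal((s3, s2))
b2 = rng.standard_normal((s3, 1))
x = rng.standard_normal((n, m))    # one training example per column

# Vectorized forward propagation: adding the (s2,1) vector b1 to the (s2,m)
# matrix W1 @ x broadcasts it across all m columns, like repmat(b1,1,m).
z2 = W1 @ x + b1
a2 = f(z2)
z3 = W2 @ a2 + b2
h = f(z3)
```

As with the Matlab version, every column of <tt>h</tt> contains the network output for one training example.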
<br />
<br />
== Backpropagation ==<br />
<br />
We now describe the main ideas behind vectorizing backpropagation. Before reading this section, we strongly encourage you to carefully step through all the forward propagation code examples above to make sure you fully understand them. In this text, we'll only sketch the details of how to vectorize backpropagation, and leave you to derive the details in the [[Exercise:Vectorization|Vectorization exercise]]. <br />
<br />
We are in a supervised learning setting, so that we have a training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples. (For the autoencoder, we simply set <math>y^{(i)} = x^{(i)}</math>, but our derivation here will consider this more general setting.)<br />
<br />
Suppose we have <math>s_3</math>-dimensional outputs, so that our target labels are <math>y^{(i)} \in \Re^{s_3}</math>. In our Matlab/Octave data structure, we will stack these in columns to form a Matlab/Octave variable <tt>y</tt>, so that the <math>i</math>-th column <tt>y(:,i)</tt> is <math>y^{(i)}</math>. <br />
<br />
We now want to compute the gradient terms <br />
<math>\nabla_{W^{(l)}} J(W,b)</math> and <math>\nabla_{b^{(l)}} J(W,b)</math>. Consider the first of<br />
these terms. Following our earlier description of the [[Backpropagation Algorithm]], we had that for a single training example <math>(x,y)</math>, we can compute the derivatives as<br />
::<math><br />
\begin{align}<br />
\delta^{(3)} &= - (y - a^{(3)}) \bullet f'(z^{(3)}), \\<br />
\delta^{(2)} &= ((W^{(2)})^T\delta^{(3)}) \bullet f'(z^{(2)}), \\<br />
\nabla_{W^{(2)}} J(W,b;x,y) &= \delta^{(3)} (a^{(2)})^T, \\<br />
\nabla_{W^{(1)}} J(W,b;x,y) &= \delta^{(2)} (a^{(1)})^T. <br />
\end{align} <br />
</math><br />
Here, <math>\bullet</math> denotes element-wise product. For simplicity, our description here will ignore the derivatives with respect to <math>b^{(l)}</math>, though your implementation of backpropagation will have to compute those derivatives too. <br />
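As a sanity check on these four equations, here is a small NumPy sketch (all sizes are made up, and bias terms are ignored as in the text) that computes the deltas and weight gradients for a single example under the squared-error cost, and that can be verified against a numerical gradient:

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprime(z):
    return f(z) * (1.0 - f(z))    # derivative of the sigmoid

rng = np.random.default_rng(1)
n, s2 = 3, 4                      # made-up sizes; the output layer has n units
W1 = rng.standard_normal((s2, n))
W2 = rng.standard_normal((n, s2))
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def forward(W1_, W2_):
    z2 = W1_ @ x; a2 = f(z2); z3 = W2_ @ a2; a3 = f(z3)
    return z2, a2, z3, a3

# The four backpropagation equations for one example (biases ignored,
# as in the text), for the squared-error cost J = 0.5*||y - a3||^2.
z2, a2, z3, a3 = forward(W1, W2)
delta3 = -(y - a3) * fprime(z3)
delta2 = (W2.T @ delta3) * fprime(z2)
gradW2 = np.outer(delta3, a2)     # delta^(3) (a^(2))^T
gradW1 = np.outer(delta2, x)      # delta^(2) (a^(1))^T, with a^(1) = x
```

Perturbing a single entry of <tt>W1</tt> and re-running the forward pass reproduces the corresponding entry of <tt>gradW1</tt> via finite differences.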
<br />
Suppose we have already implemented the vectorized forward propagation method, so that the matrix-valued <tt>z2</tt>, <tt>a2</tt>, <tt>z3</tt> and <tt>h</tt> are computed as described above. We can then implement an ''unvectorized'' version of backpropagation as follows:<br />
<syntaxhighlight><br />
gradW1 = zeros(size(W1));<br />
gradW2 = zeros(size(W2)); <br />
for i=1:m,<br />
delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i)); <br />
delta2 = W2'*delta3 .* fprime(z2(:,i));<br />
<br />
gradW2 = gradW2 + delta3*a2(:,i)';<br />
gradW1 = gradW1 + delta2*x(:,i)';  % a1 = x, the input-layer activations<br />
end; <br />
</syntaxhighlight> <br />
<br />
This implementation has a <tt>for</tt> loop. We would like to come up with an implementation that simultaneously performs backpropagation on all the examples, and eliminates this <tt>for</tt> loop. <br />
<br />
To do so, we will replace the vectors <tt>delta3</tt> and <tt>delta2</tt> with matrices, where one column of each matrix corresponds to each training example. We will also implement a function <tt>fprime(z)</tt> that takes as input a matrix <tt>z</tt>, and applies <math>f'(\cdot)</math> element-wise. Each of the four lines of Matlab in the <tt>for</tt> loop above can then be vectorized and replaced with a single line of Matlab code (without a surrounding <tt>for</tt> loop). <br />
<br />
In the [[Exercise:Vectorization|Vectorization exercise]], we ask you to derive the vectorized version of this algorithm by yourself. If you are able to do it from this description, we strongly encourage you to do so. Here also are some [[Backpropagation vectorization hints]]; however, we encourage you to try to carry out the vectorization yourself without looking at the hints. <br />
<br />
<br />
== Sparse autoencoder ==<br />
<br />
The [[Autoencoders_and_Sparsity|sparse autoencoder]] neural network has an additional sparsity penalty that constrains neurons' average firing rate to be close to some target activation <math>\rho</math>. When performing backpropagation on a single training example, we took the sparsity penalty into account by computing the following:<br />
<br />
:<math>\begin{align}<br />
\delta^{(2)}_i =<br />
\left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)<br />
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .<br />
\end{align}</math><br />
<br />
In the ''unvectorized'' case, this was computed as:<br />
<br />
<syntaxhighlight><br />
% Sparsity Penalty Delta<br />
sparsity_delta = - rho ./ rho_hat + (1 - rho) ./ (1 - rho_hat);<br />
for i=1:m,<br />
...<br />
delta2 = (W2'*delta3(:,i) + beta*sparsity_delta).* fprime(z2(:,i)); <br />
...<br />
end;<br />
</syntaxhighlight> <br />
<br />
The code above still had a <tt>for</tt> loop over the training set, and <tt>delta2</tt> was a column vector. <br />
<br />
In contrast, recall that in the vectorized case, <tt>delta2</tt> is now a matrix with <math>m</math> columns, one per training example. Now, notice that the <tt>sparsity_delta</tt> term is the same regardless of which training example we are processing. Thus, to vectorize the computation above, we can simply add <tt>sparsity_delta</tt> (e.g., using <tt>repmat</tt>) to each column when constructing the <tt>delta2</tt> matrix.<br />
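For readers working in NumPy rather than Matlab/Octave, the same trick falls out of broadcasting; the arrays below are random made-up stand-ins for the quantities named in the code above:

```python
import numpy as np

# Hypothetical sizes: s2 hidden units, m training examples.
s2, m = 4, 6
rng = np.random.default_rng(2)
W2t_delta3 = rng.standard_normal((s2, m))      # stands in for W2' * delta3
fprime_z2 = rng.standard_normal((s2, m))       # stands in for fprime(z2)
sparsity_delta = rng.standard_normal((s2, 1))  # one value per hidden unit
beta = 3.0

# Adding the (s2,1) column to the (s2,m) matrix adds the same value to every
# column -- the broadcasting analogue of repmat(sparsity_delta, 1, m).
delta2 = (W2t_delta3 + beta * sparsity_delta) * fprime_z2
```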
<br />
<br />
{{Vectorized Implementation}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Logistic_Regression_Vectorization_ExampleLogistic Regression Vectorization Example2011-05-26T10:56:58Z<p>Watsuen: </p>
<hr />
<div>Consider training a logistic regression model using batch gradient ascent.<br />
Suppose our hypothesis is<br />
:<math>\begin{align}<br />
h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)},<br />
\end{align}</math><br />
where (following the notational convention from the OpenClassroom videos and from CS229) we let <math>\textstyle x_0=1</math>, so that <math>\textstyle x \in \Re^{n+1}</math> <br />
and <math>\textstyle \theta \in \Re^{n+1}</math>, and <math>\textstyle \theta_0</math> is our intercept term. We have a training set<br />
<math>\textstyle \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}</math> of <math>\textstyle m</math> examples, and the batch gradient<br />
ascent update rule is <math>\textstyle \theta := \theta + \alpha \nabla_\theta \ell(\theta)</math>, where <math>\textstyle \ell(\theta)</math><br />
is the log likelihood and <math>\textstyle \nabla_\theta \ell(\theta)</math> is its derivative.<br />
<br />
[Note: Most of the notation below follows that defined in the OpenClassroom videos or in the class <br />
CS229: Machine Learning. For details, see either the [http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning OpenClassroom videos] or Lecture Notes #1 of http://cs229.stanford.edu/ .]<br />
<br />
We thus need to compute the gradient:<br />
:<math>\begin{align}<br />
\nabla_\theta \ell(\theta) = \sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}.<br />
\end{align}</math><br />
<br />
Suppose that the Matlab/Octave variable <tt>x</tt> is a matrix containing the training inputs, so that<br />
<tt>x(:,i)</tt> is the <math>\textstyle i</math>-th training example <math>\textstyle x^{(i)}</math>, and <tt>x(j,i)</tt> is <math>\textstyle x^{(i)}_j</math>. <br />
Further, suppose the Matlab/Octave variable <tt>y</tt> is a ''row'' vector of the labels in the<br />
training set, so that the variable <tt>y(i)</tt> is <math>\textstyle y^{(i)} \in \{0,1\}</math>. (Here we differ from the <br />
OpenClassroom/CS229 notation. Specifically, in the matrix-valued <tt>x</tt> we stack the training inputs in columns rather than in rows;<br />
and <tt>y</tt><math>\in \Re^{1\times m}</math> is a row vector rather than a column vector.) <br />
<br />
Here's a truly horrible, extremely slow implementation of the gradient computation:<br />
<syntaxhighlight lang="matlab"><br />
% Implementation 1<br />
grad = zeros(n+1,1);<br />
for i=1:m,<br />
h = sigmoid(theta'*x(:,i));<br />
temp = y(i) - h; <br />
for j=1:n+1,<br />
grad(j) = grad(j) + temp * x(j,i); <br />
end;<br />
end;<br />
</syntaxhighlight><br />
The two nested for-loops make this very slow. Here's a more typical implementation<br />
that partially vectorizes the algorithm and gets better performance: <br />
<syntaxhighlight lang="matlab"><br />
% Implementation 2 <br />
grad = zeros(n+1,1);<br />
for i=1:m,<br />
grad = grad + (y(i) - sigmoid(theta'*x(:,i)))* x(:,i);<br />
end; <br />
</syntaxhighlight><br />
<br />
However, it turns out to be possible to vectorize this even further. If we can get rid of the for-loop, we can significantly speed up the implementation. In particular, suppose <tt>b</tt> is a column vector, and <tt>A</tt> is a matrix. Consider the following two ways of computing <tt>A * b</tt>: <br />
<syntaxhighlight lang="matlab"><br />
% Slow implementation of matrix-vector multiply<br />
grad = zeros(n+1,1);<br />
for i=1:m,<br />
grad = grad + b(i) * A(:,i); % more commonly written A(:,i)*b(i)<br />
end;<br />
<br />
% Fast implementation of matrix-vector multiply<br />
grad = A*b;<br />
</syntaxhighlight><br />
<br />
We recognize that Implementation 2 of our gradient ascent calculation above is using the slow version with a for-loop, with<br />
<tt>b(i)</tt> playing the role of <tt>(y(i) - sigmoid(theta'*x(:,i)))</tt>, and <tt>A</tt> playing the role of <tt>x</tt>. We can derive a fast implementation as follows: <br />
<syntaxhighlight lang="matlab"><br />
% Implementation 3<br />
grad = x * (y - sigmoid(theta'*x))';<br />
</syntaxhighlight><br />
Here, we assume that the Matlab/Octave <tt>sigmoid(z)</tt> takes as input a vector <tt>z</tt>, applies the sigmoid function component-wise to the input, and returns the result. The output of <tt>sigmoid(z)</tt> is therefore itself also a vector, of the same dimension as the input <tt>z</tt>.<br />
<br />
When the training set is large, this final implementation takes the greatest advantage of Matlab/Octave's highly optimized numerical linear algebra libraries to carry out the matrix-vector operations, and so this is far more efficient than the earlier implementations. <br />
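The equivalence of Implementations 2 and 3 can also be checked numerically; here is a NumPy sketch with random made-up data in the same column-per-example layout:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, m = 3, 50
x = rng.standard_normal((n + 1, m))           # examples stacked in columns
x[0, :] = 1.0                                 # intercept feature x_0 = 1
y = rng.integers(0, 2, size=m).astype(float)  # row vector of 0/1 labels
theta = rng.standard_normal(n + 1)

# Implementation 2: partially vectorized, with a loop over examples.
grad_loop = np.zeros(n + 1)
for i in range(m):
    grad_loop += (y[i] - sigmoid(theta @ x[:, i])) * x[:, i]

# Implementation 3: fully vectorized, one matrix-vector product.
grad_vec = x @ (y - sigmoid(theta @ x))
```

Both produce the same <math>\textstyle (n+1)</math>-dimensional gradient; only the vectorized version hands the whole computation to the underlying linear algebra library.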
<br />
Coming up with vectorized implementations isn't always easy, and sometimes requires careful thought. But as you gain familiarity with vectorized operations, you'll find that there are design patterns (i.e., a small number of ways of vectorizing) that apply to many different pieces of code.<br />
<br />
<br />
{{Vectorized Implementation}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/VectorizationVectorization2011-05-26T10:56:47Z<p>Watsuen: </p>
<hr />
<div>When working with learning algorithms, having a faster piece of code often <br />
means that you'll make progress faster on your project. For example, if your<br />
learning algorithm takes 20 minutes to run to completion, that means you can<br />
"try" up to 3 new ideas per hour. But if your code takes 20 hours to<br />
run, that means you can "try" only one idea a day, since that's<br />
how long you have to wait to get feedback from your program. In this latter<br />
case, if you can speed up your code so that it takes only 10 hours to run, <br />
that can literally double your personal productivity! <br />
<br />
'''Vectorization''' refers to a powerful way to speed up your algorithms. <br />
Numerical computing and parallel computing researchers have put decades of work<br />
into making certain numerical operations (such as matrix-matrix multiplication,<br />
matrix-matrix addition, matrix-vector multiplication) fast. The idea of<br />
vectorization is that we would like to express our learning algorithms<br />
in terms of these highly optimized operations. <br />
<br />
For example, if <math>x \in \Re^{n+1}</math> and <math>\textstyle \theta \in \Re^{n+1}</math> are vectors <br />
and you need to compute <math>\textstyle z = \theta^Tx</math>,<br />
you can implement (in Matlab): <br />
<syntaxhighlight lang="matlab"><br />
z = 0;<br />
for i=1:(n+1),<br />
z = z + theta(i) * x(i);<br />
end;<br />
</syntaxhighlight><br />
or you can more simply implement<br />
<syntaxhighlight lang="matlab"><br />
z = theta' * x;<br />
</syntaxhighlight><br />
The second piece of code is not only simpler, but it will also run ''much'' faster. <br />
<br />
More generally, a good rule-of-thumb for coding Matlab/Octave is:<br />
<br />
::'''Whenever possible, avoid using explicit for-loops in your code.'''<br />
<br />
In particular, the first code example used an explicit <tt>for</tt> loop. By <br />
implementing the same functionality without the <tt>for</tt> loop, we sped<br />
it up significantly. A large part<br />
of vectorizing our Matlab/Octave code will focus on getting rid of <tt>for</tt> loops,<br />
since this lets Matlab/Octave extract more parallelism from your code, while<br />
also incurring less computational overhead from the interpreter. <br />
<br />
In terms of a strategy for writing your code, initially you may find that vectorized code is harder to write, read, and/or debug,<br />
and that there may be a tradeoff in ease of programming/debugging vs. running<br />
time. Thus, for your first few programs, you might choose to first implement <br />
your algorithm without too many vectorization tricks, and verify that it is working correctly<br />
(perhaps by running on a small problem). Then only after it is working, you<br />
can vectorize your code one piece at a time, pausing after each piece to verify<br />
that your code is still computing the same result as before. At the end, you'll<br />
then hopefully have a correct, debugged, and vectorized/efficient piece of code.<br />
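Using the inner-product example from earlier, this workflow might be sketched as follows in NumPy (a stand-in here for Matlab/Octave):

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.standard_normal(10)
x = rng.standard_normal(10)

# Step 1: unvectorized reference implementation, written and verified first.
z_loop = 0.0
for i in range(len(theta)):
    z_loop += theta[i] * x[i]

# Step 2: vectorized replacement, checked against the reference
# before moving on to vectorize the next piece of the program.
z_vec = theta @ x
assert np.isclose(z_loop, z_vec), "vectorized piece disagrees with the loop"
```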
<br />
After you become familiar with the most common vectorization methods and tricks, <br />
you'll find that it usually isn't much effort to vectorize your code. Doing <br />
so will make your code run much faster and, in some cases, simplify it too.<br />
<br />
<br />
{{Vectorized Implementation}}</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:CNNTemplate:CNN2011-05-26T10:56:11Z<p>Watsuen: Created page with "---- <div style="text-align: center;font-size:x-small;"> From Self-Taught Learning to Deep Networks | Deep Networks: Overview | [..."</p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Self-Taught Learning to Deep Networks | From Self-Taught Learning to Deep Networks]] | [[Deep Networks: Overview]] | [[Stacked Autoencoders]] | [[Fine-tuning Stacked AEs]] | [[Exercise: Implement deep networks for digit classification]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:STLTemplate:STL2011-05-26T10:55:27Z<p>Watsuen: Created page with "---- <div style="text-align: center;font-size:x-small;"> Self-Taught Learning | Exercise:Self-Taught Learning </div>"</p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Self-Taught Learning]] | [[Exercise:Self-Taught Learning]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:SoftmaxTemplate:Softmax2011-05-26T10:54:55Z<p>Watsuen: </p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Softmax Regression]] | [[Exercise:Softmax Regression]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:SoftmaxTemplate:Softmax2011-05-26T10:54:30Z<p>Watsuen: </p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Self-Taught Learning]] | [[Exercise:Self-Taught Learning]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:SoftmaxTemplate:Softmax2011-05-26T10:54:02Z<p>Watsuen: Created page with "---- <div style="text-align: center;font-size:x-small;"> Softmax Regression | Exercise:Softmax Regression </div>"</p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Softmax Regression]] | [[Exercise:Softmax Regression]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:PCATemplate:PCA2011-05-26T10:53:33Z<p>Watsuen: Created page with "---- <div style="text-align: center;font-size:x-small;"> PCA | Whitening | Implementing PCA/Whitening | Exercise:PCA in 2D | Exercise:PCA and Whitening </div>"</p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[PCA]] | [[Whitening]] | [[Implementing PCA/Whitening]] | [[Exercise:PCA in 2D]] | [[Exercise:PCA and Whitening]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Template:Vectorized_ImplementationTemplate:Vectorized Implementation2011-05-26T10:52:40Z<p>Watsuen: Created page with "---- <div style="text-align: center;font-size:x-small;"> Vectorization | Logistic Regression Vectorization Example | Neural Network Vectorization | [[Exercise:Vector..."</p>
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Vectorization]] | [[Logistic Regression Vectorization Example]] | [[Neural Network Vectorization]] | [[Exercise:Vectorization]]<br />
</div></div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:Sparse_AutoencoderExercise:Sparse Autoencoder2011-05-26T10:51:04Z<p>Watsuen: </p>
<hr />
<div>==Sparse autoencoder implementation==<br />
<br />
In this problem set, you will implement the sparse autoencoder<br />
algorithm, and show how it discovers that edges are a good<br />
representation for natural images. (Images provided by<br />
Bruno Olshausen.) The sparse autoencoder algorithm is described in<br />
the lecture notes found on the course website.<br />
<br />
In the file [http://ufldl.stanford.edu/wiki/resources/sparseae_exercise.zip sparseae_exercise.zip], we have provided some starter code in<br />
Matlab. You should write your code at the places indicated<br />
in the files ("<tt>YOUR CODE HERE</tt>"). You have to complete the following files:<br />
<tt>sampleIMAGES.m, sparseAutoencoderCost.m, computeNumericalGradient.m</tt>. <br />
The starter code in <tt>train.m</tt> shows how these functions are used.<br />
<br />
Specifically, in this exercise you will implement a sparse autoencoder, <br />
trained with 8&times;8 image patches using the L-BFGS optimization algorithm.<br />
<br />
'''A note on the software:''' The provided .zip file includes a subdirectory<br />
<tt>minFunc</tt> with third-party software implementing L-BFGS that<br />
is licensed under a Creative Commons Attribution Non-Commercial license.<br />
If you need to use this software for commercial purposes, you can <br />
download and use a different function (fminlbfgs) that can serve the same<br />
purpose, but runs ~3x slower for this exercise (and thus is less recommended). <br />
You can read more about this in the [[Fminlbfgs_Details]] page. <br />
<br />
<br />
<br />
===Step 1: Generate training set===<br />
<br />
The first step is to generate a training set. To get a single training <br />
example <math>x</math>, randomly pick one of the 10 images, then randomly sample <br />
an 8&times;8 image patch from the selected image, and convert the image patch (either <br />
in row-major order or column-major order; it doesn't matter) into a 64-dimensional <br />
vector to get a training example <math>x \in \Re^{64}.</math><br />
<br />
Complete the code in <tt>sampleIMAGES.m</tt>. Your code should sample 10000 image <br />
patches and concatenate them into a 64&times;10000 matrix. <br />
<br />
To make sure your implementation is working, run the code in "Step 1" of <tt>train.m</tt>.<br />
This should result in a plot of a random sample of 200 patches from the dataset. <br />
<br />
'''Implementational tip:''' When we run our implemented <tt><br />
sampleImages()</tt>, it takes under 5 seconds. If your implementation<br />
takes over 30 seconds, it may be because you are accidentally making a<br />
copy of an entire 512&times;512 image each time you're picking a random<br />
image. By copying a 512&times;512 image 10000 times, this can make your<br />
implementation much less efficient. While this doesn't slow down your<br />
code significantly for this exercise (because we have only 10000<br />
examples), when we scale to much larger problems later this quarter<br />
with <math>10^6</math> or more examples, this will significantly slow down your<br />
code. Please implement <tt>sampleIMAGES</tt> so that you aren't making a<br />
copy of an entire 512&times;512 image each time you need to cut out an 8x8<br />
image patch.<br />
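For illustration, here is a NumPy sketch of this kind of patch sampling on a synthetic stand-in for the image data (the exercise's actual <tt>IMAGES</tt> array and the <tt>sampleIMAGES.m</tt> interface are not assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic stand-in for the exercise's data: 10 images of 512x512 pixels.
images = rng.standard_normal((512, 512, 10))
patch_size, num_patches = 8, 10000

patches = np.zeros((patch_size ** 2, num_patches))
for k in range(num_patches):
    img = rng.integers(0, images.shape[2])
    r = rng.integers(0, 512 - patch_size + 1)
    c = rng.integers(0, 512 - patch_size + 1)
    # Slice directly into the stored array: only the 8x8 patch is
    # materialized, never a copy of the full 512x512 image.
    patches[:, k] = images[r:r + patch_size, c:c + patch_size, img].ravel()
```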
<br />
===Step 2: Sparse autoencoder objective===<br />
<br />
Implement code to compute the sparse autoencoder cost function <math>J_{\rm sparse}(W,b)</math> <br />
(Section 3 of the lecture notes)<br />
and the corresponding derivatives of <math>J_{\rm sparse}</math> with respect to <br />
the different parameters. Use the sigmoid function for the activation function, <br />
<math>f(z) = \frac{1}{{1+e^{-z}}}</math>. <br />
In particular, complete the code in <tt>sparseAutoencoderCost.m</tt>.<br />
<br />
The sparse autoencoder is parameterized by matrices <br />
<math>W^{(1)} \in \Re^{s_2\times s_1}</math>,<br />
<math>W^{(2)} \in \Re^{s_3\times s_2}</math>, and<br />
vectors<br />
<math>b^{(1)} \in \Re^{s_2}</math>, <br />
<math>b^{(2)} \in \Re^{s_3}</math>.<br />
However, for subsequent notational convenience, we will "unroll" all of these parameters<br />
into a very long parameter vector <math>\theta</math> with <math>s_1s_2 + s_2s_3 + s_2 + s_3</math> elements. The<br />
code for converting between the <math>(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})</math> and the <math>\theta</math> parameterization <br />
is already provided in the starter code.<br />
<br />
'''Implementational tip:''' The objective <math>J_{\rm sparse}(W,b)</math> contains 3 terms, corresponding<br />
to the squared error term, the weight decay term, and the sparsity penalty. You're welcome<br />
to implement this however you want, but for ease of debugging,<br />
you might implement the cost function and derivative computation (backpropagation) only for the <br />
squared error term first (this corresponds to setting <math>\lambda = \beta = 0</math>), and implement <br />
the gradient checking method in the next section to first verify that this code is correct. Then only<br />
after you have verified that the objective and derivative calculations corresponding to the squared error <br />
term are working, add in code to compute the weight decay and sparsity penalty terms and their corresponding derivatives. <br />
<br />
===Step 3: Gradient checking===<br />
<br />
Following Section 2.3 of the lecture notes, implement code for gradient checking. <br />
Specifically, complete the code in <tt>computeNumericalGradient.m</tt>. Please <br />
use <tt>EPSILON</tt> = 10<sup>-4</sup> as described in the lecture notes. <br />
<br />
We've also provided code in <tt>checkNumericalGradient.m</tt> for you to test your code. <br />
This code defines a simple quadratic function <math>h: \Re^2 \mapsto \Re</math> given by <br />
<math>h(x) = x_1^2 + 3x_1 x_2</math>, and evaluates it at the point <math>x = (4, 10)^T</math>. It allows you<br />
to verify that your numerically evaluated gradient is very close to the true (analytically<br />
computed) gradient. <br />
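As an illustration of what <tt>checkNumericalGradient.m</tt> verifies, here is a NumPy sketch (our own code, not the starter code) that numerically differentiates the same quadratic <math>h</math> at <math>x = (4, 10)^T</math> using <tt>EPSILON</tt> = 10<sup>-4</sup>; the analytic gradient there is <math>(2x_1 + 3x_2,\, 3x_1) = (38, 12)</math>:

```python
import numpy as np

EPSILON = 1e-4

def h(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def numerical_gradient(func, x):
    # Central differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = EPSILON
        grad[i] = (func(x + e) - func(x - e)) / (2.0 * EPSILON)
    return grad

x0 = np.array([4.0, 10.0])
num_grad = numerical_gradient(h, x0)
analytic_grad = np.array([2 * x0[0] + 3 * x0[1], 3 * x0[0]])  # (38, 12)
```

Because <math>h</math> is quadratic, the central-difference estimate matches the analytic gradient up to floating-point precision.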
<br />
After using <tt>checkNumericalGradient.m</tt> to make sure your implementation is correct, <br />
next use <tt>computeNumericalGradient.m</tt> to make sure that your <tt>sparseAutoencoderCost.m</tt><br />
is computing derivatives correctly. For details, see Step 3 in <tt>train.m</tt>. We strongly<br />
encourage you not to proceed to the next step until you've verified that your derivative<br />
computations are correct. <br />
<br />
'''Implementational tip:''' If you are debugging your code, performing gradient checking on smaller models <br />
and smaller training sets (e.g., using only 10 training examples and 1-2 hidden <br />
units) may speed things up.<br />
<br />
===Step 4: Train the sparse autoencoder===<br />
<br />
Now that you have code that computes <br />
<math>J_{\rm sparse}</math> and its derivatives, we're ready to minimize <br />
<math>J_{\rm sparse}</math> with respect to its parameters, and thereby train our<br />
sparse autoencoder.<br />
<br />
We will use the L-BFGS algorithm. This is provided to you in a function called<br />
<tt>minFunc</tt> (code provided by Mark Schmidt) included in the starter code. (For the purpose of this<br />
assignment, you only need to call minFunc with the default parameters. You do<br />
not need to know how L-BFGS works.) We have already provided code in <tt>train.m</tt><br />
(Step 4) to call <tt>minFunc</tt>. The <tt>minFunc</tt> code assumes that the parameters<br />
to be optimized are a long parameter vector; so we will use the "<math>\theta</math>" parameterization<br />
rather than the "<math>(W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)})</math>" parameterization when passing our parameters<br />
to it.<br />
<br />
Train a sparse autoencoder with 64 input units, 25 hidden units, and 64 output units.<br />
In our starter code, we have provided a function for initializing the parameters.<br />
We initialize the biases <math>b^{(l)}_i</math> to zero, and the weights <math>W^{(l)}_{ij}</math><br />
to random numbers drawn uniformly from the interval <br />
<math>\left[-\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}},\sqrt{\frac{6}{n_{\rm in}+n_{\rm out}+1}}\,\right]</math>, where <math>n_{\rm in}</math> is the fan-in<br />
(the number of inputs feeding into a node) and <math>n_{\rm out}</math> is the fan-out (the number of<br />
units that a node feeds into).<br />
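This initialization scheme can be sketched in NumPy as follows (the function name and random-number interface here are our own, not the starter code's):

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Weights drawn uniformly from [-r, r] with r = sqrt(6/(n_in+n_out+1));
    # biases start at zero, as described for the exercise.
    r = np.sqrt(6.0 / (n_in + n_out + 1))
    W = rng.uniform(-r, r, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(6)
W1, b1 = init_layer(64, 25, rng)   # 64 input units -> 25 hidden units
W2, b2 = init_layer(25, 64, rng)   # 25 hidden units -> 64 output units
```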
<br />
The values we provided for the various parameters (<math>\lambda, \beta, \rho</math>, etc.)<br />
should work, but feel free to play with different settings of the parameters as<br />
well.<br />
<br />
'''Implementational tip:''' Once your backpropagation implementation is correctly computing the derivatives (as verified using gradient checking in Step 3), and you are using it with L-BFGS to optimize <math>J_{\rm sparse}(W,b)</math>, make sure you're not doing gradient checking on every step. Backpropagation computes the derivatives of <math>J_{\rm sparse}(W,b)</math> fairly efficiently, and if you were additionally computing the gradient numerically on every step, this would slow down your program significantly. <br />
<br />
<br />
===Step 5: Visualization===<br />
<br />
After training the autoencoder, use <tt>display_network.m</tt> to visualize the learned<br />
weights. (See <tt>train.m</tt>, Step 5.) Run "<tt>print -djpeg weights.jpg</tt>" to save<br />
the visualization to a file "<tt>weights.jpg</tt>" (which you will submit together with<br />
your code). <br />
<br />
==Results==<br />
<br />
To successfully complete this assignment, you should demonstrate your sparse<br />
autoencoder algorithm learning a set of edge detectors. For example, this<br />
was the visualization we obtained: <br />
<br />
<br />
[[File:Gabor.jpg]]<br />
<br />
<br />
Our implementation took around 5 minutes to run on a fast computer.<br />
In case you end up needing to try out multiple implementations or <br />
different parameter values, be sure to budget enough time for debugging <br />
and to run the experiments you'll need. <br />
<br />
Also, by way of comparison, here are some visualizations from implementations<br />
that we do not consider successful (either a buggy implementation, or where<br />
the parameters were poorly tuned):<br />
<br />
<br />
[[File:badfilter1.jpg|240 px]] [[File:badfilter2.jpg|240 px]] [[File:badfilter3.jpg|240 px]]<br />
<br />
[[File:badfilter4.jpg|240 px]] [[File:badfilter5.jpg|240 px]] [[File:badfilter6.jpg|240 px]]<br />
<br />
<br />
[[Category:Exercises]]<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>

<hr />

= Sparse Autoencoder Notation Summary =
<div>Here is a summary of the symbols used in our derivation of the sparse autoencoder: <br />
<br />
{| class="wikitable"<br />
|-<br />
! Symbol<br />
! Meaning<br />
|-<br />
| <math>\textstyle x</math><br />
| Input features for a training example, <math>\textstyle x \in \Re^{n}</math>.<br />
|-<br />
| <math>\textstyle y</math><br />
| Output/target values. Here, <math>\textstyle y</math> can be vector valued. In the case of an autoencoder, <math>\textstyle y=x</math>.<br />
|-<br />
| <math>\textstyle (x^{(i)}, y^{(i)})</math><br />
| The <math>\textstyle i</math>-th training example<br />
|-<br />
| <math>\textstyle h_{W,b}(x)</math><br />
| Output of our hypothesis on input <math>\textstyle x</math>, using parameters <math>\textstyle W,b</math>. This should be a vector of<br />
the same dimension as the target value <math>\textstyle y</math>.<br />
|-<br />
| <math>\textstyle W^{(l)}_{ij}</math><br />
| The parameter associated with the connection between unit <math>\textstyle j</math> in layer <math>\textstyle l</math>, and<br />
unit <math>\textstyle i</math> in layer <math>\textstyle l+1</math>.<br />
|-<br />
| <math>\textstyle b^{(l)}_{i}</math><br />
| The bias term associated with unit <math>\textstyle i</math> in layer <math>\textstyle l+1</math>. Can also be thought of as the parameter associated with the connection between the bias unit in layer <math>\textstyle l</math> and unit <math>\textstyle i</math> in layer <math>\textstyle l+1</math>.<br />
|-<br />
| <math>\textstyle \theta</math><br />
| Our parameter vector. It is useful to think of this as the result of taking the parameters <math>\textstyle W,b</math> and "unrolling" them into a long column vector.<br />
|-<br />
| <math>\textstyle a^{(l)}_i</math><br />
| Activation (output) of unit <math>\textstyle i</math> in layer <math>\textstyle l</math> of the network.<br />
In addition, since layer <math>\textstyle L_1</math> is the input layer, we also have <math>\textstyle a^{(1)}_i = x_i</math>.<br />
|-<br />
| <math>\textstyle f(\cdot)</math><br />
| The activation function. Throughout these notes, we used the sigmoid function <math>\textstyle f(z) = \frac{1}{1+\exp(-z)}</math>.<br />
|-<br />
| <math>\textstyle z^{(l)}_i</math><br />
| Total weighted sum of inputs to unit <math>\textstyle i</math> in layer <math>\textstyle l</math>. Thus, <math>\textstyle a^{(l)}_i = f(z^{(l)}_i)</math>.<br />
|-<br />
| <math>\textstyle \alpha</math><br />
| Learning rate parameter<br />
|-<br />
| <math>\textstyle s_l</math><br />
| Number of units in layer <math>\textstyle l</math> (not counting the bias unit).<br />
|-<br />
| <math>\textstyle n_l</math><br />
| Number of layers in the network. Layer <math>\textstyle L_1</math> is usually the input layer, and layer <math>\textstyle L_{n_l}</math> the output layer.<br />
|-<br />
| <math>\textstyle \lambda</math><br />
| Weight decay parameter.<br />
|-<br />
| <math>\textstyle \hat{x}</math><br />
| For an autoencoder, its output; i.e., its reconstruction of the input <math>\textstyle x</math>. Same meaning as <math>\textstyle h_{W,b}(x)</math>.<br />
|-<br />
| <math>\textstyle \rho</math><br />
| Sparsity parameter, which specifies our desired level of sparsity<br />
|-<br />
| <math>\textstyle \hat\rho_i</math><br />
| The average activation of hidden unit <math>\textstyle i</math> (in the sparse autoencoder).<br />
|-<br />
| <math>\textstyle \beta</math> <br />
| Weight of the sparsity penalty term (in the sparse autoencoder objective).<br />
|}<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>

<hr />

= Visualizing a Trained Autoencoder =
<div>Having trained a (sparse) autoencoder, we would now like to visualize the function<br />
learned by the algorithm, to try to understand what it has learned.<br />
Consider the case of training an autoencoder on <math>\textstyle 10 \times 10</math> images, so that <math>\textstyle n = 100</math>.<br />
Each hidden unit <math>\textstyle i</math> computes a function of the input:<br />
:<math>\begin{align}<br />
a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).<br />
\end{align}</math><br />
<!-- This is the activation function <math>\textstyle g(\cdot)</math> applied to an affine function of the input.!--><br />
We will visualize the function computed by hidden unit <math>\textstyle i</math>---which depends on the<br />
parameters <math>\textstyle W^{(1)}_{ij}</math> (ignoring<br />
the bias term for now)---using a 2D image. In particular, we think of<br />
<math>\textstyle a^{(2)}_i</math> as some non-linear feature of the input <math>\textstyle x</math>.<br />
We ask:<br />
What input image <math>\textstyle x</math> would cause<br />
<math>\textstyle a^{(2)}_i</math> to be maximally activated?<br />
(Less formally, what is the feature that hidden unit <math>\textstyle i</math> is looking for?)<br />
For this question to have a non-trivial answer,<br />
we must impose some constraints on <math>\textstyle x</math>. If we suppose that<br />
the input is<br />
norm constrained by <math>\textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1</math>, then one can<br />
show (try doing this yourself)<br />
that the input which maximally activates hidden unit <math>\textstyle i</math> is given<br />
by setting pixel <math>\textstyle x_j</math> (for all 100 pixels, <math>\textstyle j=1,\ldots, 100</math>) to<br />
:<math>\begin{align}<br />
x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.<br />
\end{align}</math><br />
By displaying the image formed by these pixel intensity values, we can begin<br />
to understand what feature hidden unit <math>\textstyle i</math> is looking for.<br />
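This computation is only a few lines in practice. The NumPy sketch below (random weights standing in for a trained <math>\textstyle W^{(1)}</math>) normalizes each hidden unit's weight vector and reshapes it into the 10x10 image that would be displayed:

```python
import numpy as np

# W1 plays the role of W^(1): one row of weights per hidden unit, here with
# 5 hypothetical hidden units over a 100-pixel (10x10) input.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 100))

# The norm-bounded input that maximally activates hidden unit i is its weight
# vector rescaled to unit norm: x_j = W1[i, j] / sqrt(sum_j W1[i, j]^2).
X_max = W1 / np.linalg.norm(W1, axis=1, keepdims=True)

# Reshaping each unit-norm row to 10x10 gives the image to display,
# one per hidden unit.
images = X_max.reshape(5, 10, 10)
```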
<br />
If we have an autoencoder with 100 hidden units (say), then our<br />
visualization will have 100 such images---one per hidden unit. By examining<br />
these 100 images, we can try to understand what the ensemble of hidden units is<br />
learning.<br />
<br />
When we do this for a sparse autoencoder (trained with 100 hidden units on<br />
10x10 pixel inputs)<sup>1</sup>, we get the following result:<br />
<br />
[[Image:ExampleSparseAutoencoderWeights.png|thumb|400px|center]]<br />
<br />
Each square in the figure above shows the (norm bounded) input image <math>\textstyle x</math> that<br />
maximally activates one of 100 hidden units. We see that the different hidden<br />
units have learned to detect edges at different positions and orientations in<br />
the image.<br />
<br />
These features are, not surprisingly, useful for such tasks as object<br />
recognition and other vision tasks. When applied to other input domains (such<br />
as audio), this algorithm also learns useful representations/features for those<br />
domains too.<br />
<br />
----<br />
<br />
<sup>1</sup> ''The learned features were obtained by training on '''whitened''' natural images. Whitening is a preprocessing step which removes redundancy in the input, by causing adjacent pixels to become less correlated.''<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>

<hr />

= Autoencoders and Sparsity =
<div>So far, we have described the application of neural networks to supervised learning, in which we have labeled<br />
training examples. Now suppose we have only a set of unlabeled training examples <math>\textstyle \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}</math>,<br />
where <math>\textstyle x^{(i)} \in \Re^{n}</math>. An<br />
'''autoencoder''' neural network is an unsupervised learning algorithm that applies backpropagation,<br />
setting the target values to be equal to the inputs. I.e., it uses <math>\textstyle y^{(i)} = x^{(i)}</math>.<br />
<br />
Here is an autoencoder:<br />
<br />
[[Image:Autoencoder636.png|400px|center]]<br />
<br />
The autoencoder tries to learn a function <math>\textstyle h_{W,b}(x) \approx x</math>. In other<br />
words, it is trying to learn an approximation to the identity function, so as<br />
to output <math>\textstyle \hat{x}</math> that is similar to <math>\textstyle x</math>. The identity function seems a<br />
particularly trivial function to be trying to learn; but by placing constraints<br />
on the network, such as by limiting the number of hidden units, we can discover<br />
interesting structure about the data. As a concrete example, suppose the<br />
inputs <math>\textstyle x</math> are the pixel intensity values from a <math>\textstyle 10 \times 10</math> image (100<br />
pixels) so <math>\textstyle n=100</math>, and there are <math>\textstyle s_2=50</math> hidden units in layer <math>\textstyle L_2</math>. Note that<br />
we also have <math>\textstyle y \in \Re^{100}</math>. Since there are only 50 hidden units, the<br />
network is forced to learn a ''compressed'' representation of the input.<br />
I.e., given only the vector of hidden unit activations <math>\textstyle a^{(2)} \in \Re^{50}</math>,<br />
it must try to '''reconstruct''' the 100-pixel input <math>\textstyle x</math>. If the input were completely<br />
random---say, each <math>\textstyle x_i</math> comes from an IID Gaussian independent of the other<br />
features---then this compression task would be very difficult. But if there is<br />
structure in the data, for example, if some of the input features are correlated,<br />
then this algorithm will be able to discover some of those correlations. In fact,<br />
this simple autoencoder often ends up learning a low-dimensional representation very similar<br />
to the one learned by PCA.<br />
<br />
Our argument above relied on the number of hidden units <math>\textstyle s_2</math> being small. But<br />
even when the number of hidden units is large (perhaps even greater than the<br />
number of input pixels), we can still discover interesting structure, by<br />
imposing other constraints on the network. In particular, if we impose a<br />
'''sparsity''' constraint on the hidden units, then the autoencoder will still<br />
discover interesting structure in the data, even if the number of hidden units<br />
is large.<br />
<br />
Informally, we will think of a neuron as being "active" (or as "firing") if<br />
its output value is close to 1, or as being "inactive" if its output value is<br />
close to 0. We would like to constrain the neurons to be inactive most of the<br />
time. This discussion assumes a sigmoid activation function. If you are<br />
using a tanh activation function, then we think of a neuron as being inactive<br />
when it outputs values close to -1.<br />
<br />
Recall that <math>\textstyle a^{(2)}_j</math> denotes the activation of hidden unit <math>\textstyle j</math> in the<br />
autoencoder. However, this notation doesn't make explicit the input <math>\textstyle x</math><br />
that led to that activation. Thus, we will write <math>\textstyle a^{(2)}_j(x)</math> to denote the activation<br />
of this hidden unit when the network is given a specific input <math>\textstyle x</math>. Further, let<br />
:<math>\begin{align}<br />
\hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]<br />
\end{align}</math><br />
be the average activation of hidden unit <math>\textstyle j</math> (averaged over the training set).<br />
We would like to (approximately) enforce the constraint<br />
:<math>\begin{align}<br />
\hat\rho_j = \rho,<br />
\end{align}</math><br />
where <math>\textstyle \rho</math> is a '''sparsity parameter''', typically a small value close to zero<br />
(say <math>\textstyle \rho = 0.05</math>). In other words, we would like the average activation<br />
of each hidden neuron <math>\textstyle j</math> to be close to 0.05 (say). To satisfy this<br />
constraint, the hidden unit's activations must mostly be near 0.<br />
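The average activations are cheap to compute. The NumPy sketch below (made-up dimensions and random weights) shows that with small random initial weights, the hidden units start out with <math>\textstyle \hat\rho_j</math> near 0.5, far from a target like <math>\textstyle \rho = 0.05</math>, which is why a penalty term is needed to push them down:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy forward pass to the hidden layer: m examples, n inputs, s2 hidden units.
rng = np.random.default_rng(0)
m, n, s2 = 200, 8, 4
X = rng.standard_normal((m, n))
W1 = rng.standard_normal((s2, n)) * 0.01   # small initial weights
b1 = np.zeros(s2)

A2 = sigmoid(X @ W1.T + b1)    # hidden activations a^(2), shape (m, s2)
rho_hat = A2.mean(axis=0)      # \hat\rho_j: average activation per hidden unit
```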
<br />
<br />
To achieve this, we will add an extra penalty term to our optimization objective that<br />
penalizes <math>\textstyle \hat\rho_j</math> deviating significantly from <math>\textstyle \rho</math>. Many choices of the penalty<br />
term will give reasonable results. We will choose the following:<br />
:<math>\begin{align}<br />
\sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.<br />
\end{align}</math><br />
Here, <math>\textstyle s_2</math> is the number of neurons in the hidden layer, and the index <math>\textstyle j</math> sums<br />
over the hidden units in our network. If you are<br />
familiar with the concept of KL divergence, this penalty term is based on<br />
it, and can also be written<br />
:<math>\begin{align}<br />
\sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),<br />
\end{align}</math><br />
where <math>\textstyle {\rm KL}(\rho || \hat\rho_j)<br />
= \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}</math><br />
is the Kullback-Leibler (KL) divergence between<br />
a Bernoulli random variable with mean <math>\textstyle \rho</math> and a Bernoulli random variable with mean <math>\textstyle \hat\rho_j</math>.<br />
KL-divergence is a standard function for measuring how different two probability<br />
distributions are. (If you've not seen KL-divergence before, don't worry about<br />
it; everything you need to know about it is contained in these notes.)<br />
<br />
This penalty function has the property that <math>\textstyle {\rm KL}(\rho || \hat\rho_j) = 0</math> if <math>\textstyle \hat\rho_j = \rho</math>,<br />
and otherwise it increases monotonically as <math>\textstyle \hat\rho_j</math> diverges from <math>\textstyle \rho</math>. For example, in the<br />
figure below, we have set <math>\textstyle \rho = 0.2</math>, and plotted<br />
<math>\textstyle {\rm KL}(\rho || \hat\rho_j)</math> for a range of values of <math>\textstyle \hat\rho_j</math>:<br />
<br />
[[Image:KLPenaltyExample.png|400px|center]]<br />
<br />
We see that the KL-divergence reaches its minimum of 0 at<br />
<math>\textstyle \hat\rho_j = \rho</math>, and blows up (it actually approaches <math>\textstyle \infty</math>) as <math>\textstyle \hat\rho_j</math><br />
approaches 0 or 1. Thus, minimizing<br />
this penalty term has the effect of causing <math>\textstyle \hat\rho_j</math> to be close to <math>\textstyle \rho</math>.<br />
<br />
Our overall cost function is now<br />
:<math>\begin{align}<br />
J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),<br />
\end{align}</math><br />
where <math>\textstyle J(W,b)</math> is as defined previously, and <math>\textstyle \beta</math> controls the weight of<br />
the sparsity penalty term. The term <math>\textstyle \hat\rho_j</math> (implicitly) depends on <math>\textstyle W,b</math> also,<br />
because it is the average activation of hidden unit <math>\textstyle j</math>, and the activation of a hidden<br />
unit depends on the parameters <math>\textstyle W,b</math>.<br />
<br />
To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement<br />
trick involving only a small change to your code. Specifically, where previously for<br />
the second layer (<math>\textstyle l=2</math>), during backpropagation you would have computed<br />
:<math>\begin{align}<br />
\delta^{(2)}_i = \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),<br />
\end{align}</math><br />
now instead compute<br />
:<math>\begin{align}<br />
\delta^{(2)}_i =<br />
\left( \left( \sum_{j=1}^{s_{3}} W^{(2)}_{ji} \delta^{(3)}_j \right)<br />
+ \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i) .<br />
\end{align}</math><br />
One subtlety is that you'll need to know <math>\textstyle \hat\rho_i</math> to compute this term. Thus, you'll need<br />
to compute a forward pass on all the training examples first to compute the average<br />
activations on the training set, before computing backpropagation on any example. If your<br />
training set is small enough to fit comfortably in computer memory (this will be the case for the programming<br />
assignment), you can compute forward passes on all your examples and keep the resulting activations<br />
in memory and compute the <math>\textstyle \hat\rho_i</math>s. Then you can use your precomputed activations to<br />
perform backpropagation on all your examples. If your data is too large to fit in memory, you<br />
may have to scan through your examples computing a forward pass on each to accumulate (sum up) the<br />
activations and compute <math>\textstyle \hat\rho_i</math> (discarding the result of each forward pass after you<br />
have taken its activations <math>\textstyle a^{(2)}_i</math> into account for computing <math>\textstyle \hat\rho_i</math>). Then after<br />
having computed <math>\textstyle \hat\rho_i</math>, you'd have to redo the forward pass for each example so that you<br />
can do backpropagation on that example. In this latter case, you would end up computing a forward<br />
pass twice on each example in your training set, making it computationally less efficient.<br />
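The streaming variant can be sketched as follows (NumPy, toy sizes); the point is that accumulating activations one example at a time, discarding each forward pass, gives exactly the same <math>\textstyle \hat\rho</math> as the all-in-memory computation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, s2 = 50, 6, 3
X = rng.standard_normal((m, n))
W1 = rng.standard_normal((s2, n)) * 0.1
b1 = np.zeros(s2)

# Pass 1: stream through the examples, summing hidden activations and
# discarding each forward pass (for data too large to hold in memory).
acc = np.zeros(s2)
for i in range(m):
    acc += sigmoid(W1 @ X[i] + b1)
rho_hat = acc / m

# Pass 2 would then redo the forward pass per example and run
# backpropagation using rho_hat in the sparsity term of delta^(2).

# For comparison: the all-in-memory computation of the same quantity.
rho_hat_batch = sigmoid(X @ W1.T + b1).mean(axis=0)
```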
<br />
<br />
The full derivation showing that the algorithm above results in gradient descent is beyond the scope<br />
of these notes. But if you implement the autoencoder using backpropagation modified this way,<br />
you will be performing gradient descent exactly on the objective<br />
<math>\textstyle J_{\rm sparse}(W,b)</math>. Using the derivative checking method, you will be able to verify<br />
this for yourself as well.<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>

<hr />

= Neural Networks =
<div>Consider a supervised learning problem where we have access to labeled training<br />
examples <math>(x^{(i)}, y^{(i)})</math>. Neural networks give a way of defining a complex,<br />
non-linear form of hypotheses <math>h_{W,b}(x)</math>, with parameters <math>W,b</math> that we can<br />
fit to our data.<br />
<br />
To describe neural networks, we will begin by describing the simplest possible<br />
neural network, one which comprises a single "neuron." We will use the following<br />
diagram to denote a single neuron:<br />
<br />
[[Image:SingleNeuron.png|300px|center]]<br />
<br />
This "neuron" is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and<br />
outputs <math>\textstyle h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is<br />
called the '''activation function'''. In these notes, we will choose<br />
<math>f(\cdot)</math> to be the sigmoid function:<br />
<br />
:<math><br />
f(z) = \frac{1}{1+\exp(-z)}.<br />
</math><br />
<br />
Thus, our single<br />
neuron corresponds exactly to the input-output mapping defined by logistic regression.<br />
<br />
Although these notes will use the sigmoid function, it is worth noting that<br />
another common choice for <math>f</math> is the hyperbolic tangent, or tanh, function:<br />
:<math><br />
f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.<br />
</math><br />
Here are plots of the sigmoid and <math>\tanh</math> functions:<br />
<br />
<br />
<br />
<div align=center><br />
[[Image:Sigmoid_Function.png|400px|top|Sigmoid activation function.]]<br />
[[Image:Tanh_Function.png|400px|top|Tanh activation function.]]<br />
</div><br />
<br />
The <math>\tanh(z)</math> function is a rescaled version of the sigmoid, and its output range is<br />
<math>[-1,1]</math> instead of <math>[0,1]</math>.<br />
<br />
Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention<br />
here of <math>x_0=1</math>. Instead, the intercept term is handled separately by the parameter <math>b</math>.<br />
<br />
Finally, one identity that'll be useful later: If <math>f(z) = 1/(1+\exp(-z))</math> is the sigmoid<br />
function, then its derivative is given by <math>f'(z) = f(z) (1-f(z))</math>.<br />
(If <math>f</math> is the tanh function, then its derivative is given by<br />
<math>f'(z) = 1- (f(z))^2</math>.) You can derive this yourself using the definition of<br />
the sigmoid (or tanh) function.<br />
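Both identities are easy to confirm numerically with a centered difference (a short Python sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare f'(z) = f(z)(1 - f(z)) for the sigmoid, and f'(z) = 1 - f(z)^2
# for tanh, against a centered finite-difference approximation.
z = np.linspace(-3, 3, 7)
eps = 1e-6

num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
```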
<br />
<br />
<br />
== Neural Network model == <br />
<br />
A neural network is put together by hooking together many of our simple<br />
"neurons," so that the output of a neuron can be the input of another. For<br />
example, here is a small neural network:<br />
<br />
[[Image:Network331.png|400px|center]]<br />
<br />
In this figure, we have used circles to also denote the inputs to the network. The circles<br />
labeled "+1" are called '''bias units''', and correspond to the intercept term.<br />
The leftmost layer of the network is called the '''input layer''', and the<br />
rightmost layer the '''output layer''' (which, in this example, has only one<br />
node). The middle layer of nodes is called the '''hidden layer''', because its<br />
values are not observed in the training set. We also say that our example<br />
neural network has 3 '''input units''' (not counting the bias unit), 3 <br />
'''hidden units''', and 1 '''output unit'''.<br />
<br />
We will let <math>n_l</math><br />
denote the number of layers in our network; thus <math>n_l=3</math> in our example. We label layer <math>l</math> as<br />
<math>L_l</math>, so layer <math>L_1</math> is the input layer, and layer <math>L_{n_l}</math> the output layer.<br />
Our neural network has parameters <math>(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})</math>, where<br />
we write<br />
<math>W^{(l)}_{ij}</math> to denote the parameter (or weight) associated with the connection<br />
between unit <math>j</math> in layer <math>l</math>, and unit <math>i</math> in layer <math>l+1</math>. (Note the order of the indices.)<br />
Also, <math>b^{(l)}_i</math> is the bias associated with unit <math>i</math> in layer <math>l+1</math>.<br />
Thus, in our example, we have <math>W^{(1)} \in \Re^{3\times 3}</math>, and <math>W^{(2)} \in \Re^{1\times 3}</math>.<br />
Note that bias units don't have inputs or connections going into them, since they always output<br />
the value +1. We also let <math>s_l</math> denote the number of nodes in layer <math>l</math> (not counting the bias unit).<br />
<br />
We will write <math>a^{(l)}_i</math> to denote the '''activation''' (meaning output value) of<br />
unit <math>i</math> in layer <math>l</math>. For <math>l=1</math>, we also use <math>a^{(1)}_i = x_i</math> to denote the <math>i</math>-th input.<br />
Given a fixed setting of<br />
the parameters <math>W,b</math>, our neural<br />
network defines a hypothesis <math>h_{W,b}(x)</math> that outputs a real number. Specifically, the<br />
computation that this neural network represents is given by:<br />
:<math><br />
\begin{align}<br />
a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\<br />
a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\<br />
a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\<br />
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) <br />
\end{align}<br />
</math><br />
<br />
In the sequel, we also let <math>z^{(l)}_i</math> denote the total weighted sum of inputs to unit <math>i</math> in layer <math>l</math>,<br />
including the bias term (e.g., <math>\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i</math>), so that<br />
<math>a^{(l)}_i = f(z^{(l)}_i)</math>.<br />
<br />
Note that this easily lends itself to a more compact notation. Specifically, if we extend the<br />
activation function <math>f(\cdot)</math><br />
to apply to vectors in an element-wise fashion (i.e.,<br />
<math>f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]</math>), then we can write<br />
the equations above more<br />
compactly as:<br />
:<math>\begin{align}<br />
z^{(2)} &= W^{(1)} x + b^{(1)} \\<br />
a^{(2)} &= f(z^{(2)}) \\<br />
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\<br />
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})<br />
\end{align}</math><br />
We call this step '''forward propagation.''' More generally, recalling that we also use <math>a^{(1)} = x</math> to denote the values from the input layer,<br />
then given layer <math>l</math>'s activations <math>a^{(l)}</math>, we can compute layer <math>l+1</math>'s activations <math>a^{(l+1)}</math> as:<br />
:<math>\begin{align}<br />
z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\<br />
a^{(l+1)} &= f(z^{(l+1)})<br />
\end{align}</math><br />
By organizing our parameters in matrices and using matrix-vector operations, we can take<br />
advantage of fast linear algebra routines to quickly perform calculations in our network.<br />
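The general forward propagation step translates directly into matrix-vector code. This NumPy sketch (random placeholder weights) implements <math>\textstyle a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})</math> for the 3-3-1 example network above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Forward propagation: repeatedly apply a^(l+1) = f(W^(l) a^(l) + b^(l))."""
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return a

# A 3-3-1 network like the example above (weights are placeholders).
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)

x = np.array([1.0, -1.0, 0.5])
h = forward(x, [W1, W2], [b1, b2])   # h_{W,b}(x), a single sigmoid output
```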
<br />
<br />
We have so far focused on one example neural network, but one can also build neural<br />
networks with other '''architectures''' (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.<br />
The most common choice is an <math>\textstyle n_l</math>-layered network<br />
where layer <math>\textstyle 1</math> is the input layer, layer <math>\textstyle n_l</math> is the output layer, and each<br />
layer <math>\textstyle l</math> is densely connected to layer <math>\textstyle l+1</math>. In this setting, to compute the<br />
output of the network, we can successively compute all the activations in layer<br />
<math>\textstyle L_2</math>, then layer <math>\textstyle L_3</math>, and so on, up to layer <math>\textstyle L_{n_l}</math>, using the equations above that describe the forward propagation step. This is one<br />
example of a '''feedforward''' neural network, since the connectivity graph<br />
does not have any directed loops or cycles.<br />
<br />
<br />
Neural networks can also have multiple output units. For example, here is a network<br />
with two hidden layers <math>L_2</math> and <math>L_3</math>, and two output units in layer <math>L_4</math>:<br />
<br />
[[Image:Network3322.png|500px|center]]<br />
<br />
To train this network, we would need training examples <math>(x^{(i)}, y^{(i)})</math><br />
where <math>y^{(i)} \in \Re^2</math>. This sort of network is useful if there are multiple<br />
outputs that you're interested in predicting. (For example, in a medical<br />
diagnosis application, the vector <math>x</math> might give the input features of a<br />
patient, and the different outputs <math>y_i</math>'s might indicate presence or absence<br />
of different diseases.)<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>

<hr />

= Gradient checking and advanced optimization =
<div>Backpropagation is a notoriously difficult algorithm to debug and get right,<br />
especially since many subtly buggy implementations of it&mdash;for example, one<br />
that has an off-by-one error in the indices and that thus only trains some of<br />
the layers of weights, or an implementation that omits the bias term&mdash;will<br />
manage to learn something that can look surprisingly reasonable<br />
(while performing less well than a correct implementation). Thus, even with a<br />
buggy implementation, it may not at all be apparent that anything is amiss.<br />
In this section, we describe a method for numerically checking the derivatives computed<br />
by your code to make sure that your implementation is correct. Carrying out the<br />
derivative checking procedure described here will significantly increase<br />
your confidence in the correctness of your code.<br />
<br />
Suppose we want to minimize <math>\textstyle J(\theta)</math> as a function of <math>\textstyle \theta</math>.<br />
For this example, suppose <math>\textstyle J : \Re \mapsto \Re</math>, so that <math>\textstyle \theta \in \Re</math>.<br />
In this 1-dimensional case, one iteration of gradient descent is given by<br />
:<math>\begin{align}<br />
\theta := \theta - \alpha \frac{d}{d\theta}J(\theta).<br />
\end{align}</math><br />
Suppose also that we have implemented some function <math>\textstyle g(\theta)</math> that purportedly<br />
computes <math>\textstyle \frac{d}{d\theta}J(\theta)</math>, so that we implement gradient descent<br />
using the update <math>\textstyle \theta := \theta - \alpha g(\theta)</math>. How can we check if our implementation of<br />
<math>\textstyle g</math> is correct?<br />
<br />
Recall the mathematical definition of the derivative as<br />
:<math>\begin{align}<br />
\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0}<br />
\frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}.<br />
\end{align}</math><br />
Thus, at any specific value of <math>\textstyle \theta</math>, we can numerically approximate the derivative<br />
as follows:<br />
:<math>\begin{align}<br />
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}<br />
\end{align}</math><br />
In practice, we set <math>{\rm EPSILON}</math> to a small constant, say around <math>\textstyle 10^{-4}</math>.<br />
(There's a large range of values of <math>{\rm EPSILON}</math> that should work well, but<br />
we don't set <math>{\rm EPSILON}</math> to be "extremely" small, say <math>\textstyle 10^{-20}</math>,<br />
as that would lead to numerical roundoff errors.)<br />
<br />
Thus, given a function <math>\textstyle g(\theta)</math> that is supposedly computing<br />
<math>\textstyle \frac{d}{d\theta}J(\theta)</math>, we can now numerically verify its correctness<br />
by checking that<br />
:<math>\begin{align}<br />
g(\theta) \approx<br />
\frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}.<br />
\end{align}</math><br />
The degree to which these two values should approximate each other<br />
will depend on the details of <math>\textstyle J</math>. But assuming <math>\textstyle {\rm EPSILON} = 10^{-4}</math>,<br />
you'll usually find that the left- and right-hand sides of the above will agree<br />
to at least 4 significant digits (and often many more).<br />
<br />
Now, consider the case where <math>\textstyle \theta \in \Re^n</math> is a vector rather than a single real<br />
number (so that we have <math>\textstyle n</math> parameters that we want to learn), and <math>\textstyle J: \Re^n \mapsto \Re</math>. In<br />
our neural network example we used "<math>\textstyle J(W,b)</math>," but one can imagine "unrolling"<br />
the parameters <math>\textstyle W,b</math> into a long vector <math>\textstyle \theta</math>. We now generalize our derivative<br />
checking procedure to the case where <math>\textstyle \theta</math> may be a vector.<br />
<br />
<br />
<br />
Suppose we have a function <math>\textstyle g_i(\theta)</math> that purportedly computes<br />
<math>\textstyle \frac{\partial}{\partial \theta_i} J(\theta)</math>; we'd like to check if <math>\textstyle g_i</math><br />
is outputting correct derivative values. Let <math>\textstyle \theta^{(i+)} = \theta +<br />
{\rm EPSILON} \times \vec{e}_i</math>, where<br />
:<math>\begin{align}<br />
\vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix}<br />
\end{align}</math><br />
is the <math>\textstyle i</math>-th basis vector (a<br />
vector of the same dimension as <math>\textstyle \theta</math>, with a "1" in the <math>\textstyle i</math>-th position<br />
and "0"s everywhere else). So,<br />
<math>\textstyle \theta^{(i+)}</math> is the same as <math>\textstyle \theta</math>, except its <math>\textstyle i</math>-th element has been incremented<br />
by <math>{\rm EPSILON}</math>. Similarly, let <math>\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i</math> be the<br />
corresponding vector with the <math>\textstyle i</math>-th element decreased by <math>{\rm EPSILON}</math>.<br />
We can now numerically verify <math>\textstyle g_i(\theta)</math>'s correctness by checking, for each <math>\textstyle i</math>,<br />
that:<br />
:<math>\begin{align}<br />
g_i(\theta) \approx<br />
\frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.<br />
\end{align}</math><br />
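In code, this vector-valued check loops over the coordinates, perturbing one component at a time. A minimal NumPy sketch (the exercises themselves are in Matlab/Octave, and this routine is our own, not starter code):<br />

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    # grad[i] approximates dJ/dtheta_i using theta +/- epsilon * e_i.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0                # the i-th basis vector
        grad[i] = (J(theta + epsilon * e_i) - J(theta - epsilon * e_i)) / (2 * epsilon)
    return grad

# Check the analytic gradient g(theta) = theta of J(theta) = ||theta||^2 / 2.
J = lambda t: 0.5 * np.dot(t, t)
theta = np.array([0.5, -1.0, 2.0])
diff = np.linalg.norm(numerical_gradient(J, theta) - theta)
```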
<br />
<br />
When implementing backpropagation to train a neural network, in a correct implementation<br />
we will have that<br />
:<math>\begin{align}<br />
\nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\<br />
\nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}.<br />
\end{align}</math><br />
This result shows that the final block of pseudo-code in [[Backpropagation Algorithm]] is indeed<br />
implementing gradient descent.<br />
To make sure your implementation of gradient descent is correct, it is<br />
usually very helpful to use the method described above to<br />
numerically compute the derivatives of <math>\textstyle J(W,b)</math>, and thereby verify that<br />
your computations of <math>\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}</math> and <math>\textstyle \frac{1}{m}\Delta b^{(l)}</math> are<br />
indeed giving the derivatives you want.<br />
<br />
Finally, so far our discussion has centered on using gradient descent to minimize <math>\textstyle J(\theta)</math>. If you have<br />
implemented a function that computes <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>, it turns out there are more<br />
sophisticated algorithms than gradient descent for trying to minimize <math>\textstyle J(\theta)</math>. For example, one can envision<br />
an algorithm that uses gradient descent, but automatically tunes the learning rate <math>\textstyle \alpha</math> so as to try to<br />
use a step-size that causes <math>\textstyle \theta</math> to approach a local optimum as quickly as possible.<br />
There are other algorithms that are even more<br />
sophisticated than this; for example, there are algorithms that try to find an approximation to the<br />
Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these<br />
algorithms is beyond the scope of these notes, but one example is<br />
the '''L-BFGS''' algorithm. (Another example is the '''conjugate gradient''' algorithm.) You will use one of<br />
these algorithms in the programming exercise.<br />
The main thing you need to provide to these advanced optimization algorithms is the ability, for any <math>\textstyle \theta</math>,<br />
to compute <math>\textstyle J(\theta)</math> and <math>\textstyle \nabla_\theta J(\theta)</math>. These optimization algorithms will then do their own<br />
internal tuning of the learning rate/step-size <math>\textstyle \alpha</math> (and compute their own approximation to the Hessian, etc.)<br />
to automatically search for a value of <math>\textstyle \theta</math> that minimizes <math>\textstyle J(\theta)</math>. Algorithms<br />
such as L-BFGS and conjugate gradient can often be much faster than gradient descent.<br />
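These advanced solvers are best used through an existing library, but the idea of automatically choosing a step size can be sketched with a simple backtracking line search. This is a toy stand-in for illustration only, not L-BFGS or conjugate gradient themselves:<br />

```python
import numpy as np

def gd_backtracking(J, grad, theta, iters=50):
    # Each iteration: start from alpha = 1 and halve it until the step
    # gives a sufficient decrease in J (Armijo condition).
    for _ in range(iters):
        g = grad(theta)
        alpha = 1.0
        while J(theta - alpha * g) > J(theta) - 0.5 * alpha * np.dot(g, g):
            alpha *= 0.5
        theta = theta - alpha * g
    return theta

# Minimize J(theta) = ||theta - c||^2 / 2; the unique optimum is theta = c.
c = np.array([1.0, -2.0, 3.0])
theta_opt = gd_backtracking(lambda t: 0.5 * np.dot(t - c, t - c),
                            lambda t: t - c,
                            np.zeros(3))
```

Real L-BFGS implementations combine a similar line search with a low-rank approximation to the Hessian, which is why they typically converge in far fewer iterations than plain gradient descent.<br />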
<br />
<br />
{{Sparse_Autoencoder}}</div>
Backpropagation Algorithm
<hr />
<div>Suppose we have a fixed training set <math>\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}</math> of <math>m</math> training examples. We can train our neural network using batch gradient descent. In detail, for a single training example <math>(x,y)</math>, we define the cost function with respect to that single example to be:<br />
:<math><br />
\begin{align}<br />
J(W,b; x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2.<br />
\end{align}<br />
</math><br />
This is a (one-half) squared-error cost function. Given a training set of <math>m</math> examples, we then define the overall cost function to be: <br />
:<math><br />
\begin{align}<br />
J(W,b)<br />
&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]<br />
+ \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2<br />
\\<br />
&= \left[ \frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right]<br />
+ \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \; \sum_{i=1}^{s_l} \; \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2<br />
\end{align}<br />
</math><br />
<br />
The first term in the definition of <math>J(W,b)</math> is an average sum-of-squares error term. The second term is a regularization term (also called a '''weight decay''' term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.<br />
<br />
[Note: Usually weight decay is not applied to the bias terms <math>b^{(l)}_i</math>, as reflected in our definition for <math>J(W, b)</math>. Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you've taken CS229 (Machine Learning) at Stanford or watched the course's videos on YouTube, you may also recognize this weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.]<br />
<br />
The '''weight decay parameter''' <math>\lambda</math> controls the relative importance of the two terms. Note also the slightly overloaded notation: <math>J(W,b;x,y)</math> is the squared error cost with respect to a single example; <math>J(W,b)</math> is the overall cost function, which includes the weight decay term.<br />
<br />
The cost function above is often used both for classification and for regression problems. For classification, we let <math>y = 0</math> or <math>1</math> represent the two class labels (recall that the sigmoid activation function outputs values in <math>[0,1]</math>; if we were using a tanh activation function, we would instead use -1 and +1 to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the <math>[0,1]</math> range (or if we were using a tanh activation function, then the <math>[-1,1]</math> range).<br />
<br />
Our goal is to minimize <math>J(W,b)</math> as a function of <math>W</math> and <math>b</math>. To train our neural network, we will initialize each parameter <math>W^{(l)}_{ij}</math> and each <math>b^{(l)}_i</math> to a small random value near zero (say according to a <math>{\rm Normal}(0,\epsilon^2)</math> distribution for some small <math>\epsilon</math>, say <math>0.01</math>), and then apply an optimization algorithm such as batch gradient descent. Since <math>J(W, b)</math> is a non-convex function,<br />
gradient descent is susceptible to local optima; however, in practice gradient descent<br />
usually works fairly well. Finally, note that it is important to initialize<br />
the parameters randomly, rather than to all 0's. If all the parameters start off<br />
at identical values, then all the hidden layer units will end up learning the same<br />
function of the input (more formally, <math>W^{(1)}_{ij}</math> will be the same for all values of <math>i</math>, so that <math>a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots</math> for any input <math>x</math>). The random initialization serves the purpose of '''symmetry breaking'''.<br />
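A hypothetical initialization routine for the 3-3-1 example network, drawing each weight from <math>{\rm Normal}(0,\epsilon^2)</math> with a small <math>\epsilon</math> (a Python/NumPy sketch with our own function names; the exercises use Matlab/Octave):<br />

```python
import numpy as np

def init_params(layer_sizes, eps=0.01, seed=0):
    # Draw each W^(l)_{ij} from Normal(0, eps^2); biases start at zero.
    # Random (rather than all-zero) weights break the symmetry between
    # hidden units, so they do not all learn the same function.
    rng = np.random.default_rng(seed)
    W = [rng.normal(0.0, eps, (layer_sizes[l + 1], layer_sizes[l]))
         for l in range(len(layer_sizes) - 1)]
    b = [np.zeros(layer_sizes[l + 1]) for l in range(len(layer_sizes) - 1)]
    return W, b

W, b = init_params([3, 3, 1])   # the 3-3-1 example network
```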
<br />
One iteration of gradient descent updates the parameters <math>W,b</math> as follows:<br />
:<math><br />
\begin{align}<br />
W_{ij}^{(l)} &= W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) \\<br />
b_{i}^{(l)} &= b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)<br />
\end{align}<br />
</math><br />
where <math>\alpha</math> is the learning rate. The key step is computing the partial derivatives above. We will now describe the '''backpropagation''' algorithm, which gives an<br />
efficient way to compute these partial derivatives.<br />
<br />
We will first describe how backpropagation can be used to compute <math>\textstyle \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y)</math> and <math>\textstyle \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y)</math>, the partial derivatives of the cost function <math>J(W,b;x,y)</math> defined with respect to a single example <math>(x,y)</math>. Once we can compute these, we see that the derivative of the overall cost function <math>J(W,b)</math> can be computed as:<br />
:<math><br />
\begin{align}<br />
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) &=<br />
\left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x^{(i)}, y^{(i)}) \right] + \lambda W_{ij}^{(l)} \\<br />
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b) &=<br />
\frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x^{(i)}, y^{(i)})<br />
\end{align}<br />
</math><br />
The two lines above differ slightly because weight decay is applied to <math>W</math> but not <math>b</math>.<br />
<br />
The intuition behind the backpropagation algorithm is as follows. Given a training example <math>(x,y)</math>, we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis <math>h_{W,b}(x)</math>. Then, for each node <math>i</math> in layer <math>l</math>, we would like to compute an "error term" <math>\delta^{(l)}_i</math> that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define <math>\delta^{(n_l)}_i</math> (where layer <math>n_l</math> is the output layer). How about hidden units? For those, we will compute <math>\delta^{(l)}_i</math> based on a weighted average of the error terms of the nodes that use <math>a^{(l)}_i</math> as an input. In detail, here is the backpropagation algorithm:<br />
<br />
<ol><br />
<li>Perform a feedforward pass, computing the activations for layers <math>L_2</math>, <math>L_3</math>, and so on up to the output layer <math>L_{n_l}</math>.<br />
<li>For each output unit <math>i</math> in layer <math>n_l</math> (the output layer), set<br />
:<math><br />
\begin{align}<br />
\delta^{(n_l)}_i<br />
= \frac{\partial}{\partial z^{(n_l)}_i} \;\;<br />
\frac{1}{2} \left\|y - h_{W,b}(x)\right\|^2 = - (y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)<br />
\end{align}<br />
</math><br />
<li>For <math>l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:For each node <math>i</math> in layer <math>l</math>, set<br />
::<math><br />
\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)<br />
</math><br />
<li>Compute the desired partial derivatives, which are given as: <br />
:<math><br />
\begin{align}<br />
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b; x, y) &= a^{(l)}_j \delta_i^{(l+1)} \\<br />
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b; x, y) &= \delta_i^{(l+1)}.<br />
\end{align}<br />
</math><br />
</ol><br />
<br />
Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "<math>\textstyle \bullet</math>" to denote the element-wise product operator (denoted "<tt>.*</tt>" in Matlab or Octave, and also called the Hadamard product), so that if <math>\textstyle a = b \bullet c</math>, then <math>\textstyle a_i = b_ic_i</math>. Similar to how we extended the definition of <math>\textstyle f(\cdot)</math> to apply element-wise to vectors, we also do the same for <math>\textstyle f'(\cdot)</math> (so that <math>\textstyle f'([z_1, z_2, z_3]) =<br />
[f'(z_1),<br />
f'(z_2),<br />
f'(z_3)]</math>).<br />
<br />
The algorithm can then be written:<br />
<br />
<ol><br />
<li>Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using the equations defining the forward propagation steps<br />
<li>For the output layer (layer <math>\textstyle n_l</math>), set <br />
:<math>\begin{align}<br />
\delta^{(n_l)}<br />
= - (y - a^{(n_l)}) \bullet f'(z^{(n_l)})<br />
\end{align}</math><br />
<li>For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
:Set<br />
::<math>\begin{align}<br />
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})<br />
\end{align}</math><br />
<li>Compute the desired partial derivatives: <br />
:<math>\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.<br />
\end{align}</math><br />
</ol><br />
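The four vectorized steps above can be sketched as follows for a fully-connected sigmoid network (a Python/NumPy illustration with our own function names; weight decay is omitted because these are per-example gradients):<br />

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def backprop_single(W, b, x, y):
    # Step 1: feedforward pass, storing every activation a^(l).
    a = [x]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))
    fprime = [al * (1 - al) for al in a[1:]]      # f'(z) = a(1-a) for sigmoid
    # Step 2: output-layer error term delta^(n_l).
    delta = [-(y - a[-1]) * fprime[-1]]
    # Step 3: propagate the error terms backwards through the hidden layers.
    for l in range(len(W) - 1, 0, -1):
        delta.insert(0, (W[l].T @ delta[0]) * fprime[l - 1])
    # Step 4: per-example gradients nabla_W = delta^(l+1) (a^(l))^T.
    nabla_W = [np.outer(delta[l], a[l]) for l in range(len(W))]
    return nabla_W, delta

rng = np.random.default_rng(0)
W = [rng.normal(0.0, 0.5, (3, 3)), rng.normal(0.0, 0.5, (1, 3))]
b = [np.zeros(3), np.zeros(1)]
nabla_W, nabla_b = backprop_single(W, b, np.array([0.2, -0.4, 0.1]), np.array([0.7]))
```

A useful sanity check is to compare these gradients against the numerical derivatives described in the previous section.<br />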
<br />
<br />
'''Implementation note:''' In steps 2 and 3 above, we need to compute <math>\textstyle f'(z^{(l)}_i)</math> for each value of <math>\textstyle i</math>. Assuming <math>\textstyle f(z)</math> is the sigmoid activation function, we would already have <math>\textstyle a^{(l)}_i</math> stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for <math>\textstyle f'(z)</math>, <br />
we can compute this as <math>\textstyle f'(z^{(l)}_i) = a^{(l)}_i (1- a^{(l)}_i)</math>. <br />
<br />
Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code<br />
below, <math>\textstyle \Delta W^{(l)}</math> is a matrix (of the same dimension as <math>\textstyle W^{(l)}</math>), and <math>\textstyle \Delta b^{(l)}</math> is a vector (of the same dimension as <math>\textstyle b^{(l)}</math>). Note that in this notation, <br />
"<math>\textstyle \Delta W^{(l)}</math>" is a matrix, and in particular it isn't "<math>\textstyle \Delta</math> times <math>\textstyle W^{(l)}</math>." We implement one iteration of batch gradient descent as follows:<br />
<br />
<ol><br />
<li>Set <math>\textstyle \Delta W^{(l)} := 0</math>, <math>\textstyle \Delta b^{(l)} := 0</math> (matrix/vector of zeros) for all <math>\textstyle l</math>.<br />
<li>For <math>\textstyle i = 1</math> to <math>\textstyle m</math>, <br />
<ol type="a"><br />
<li>Use backpropagation to compute <math>\textstyle \nabla_{W^{(l)}} J(W,b;x,y)</math> and <br />
<math>\textstyle \nabla_{b^{(l)}} J(W,b;x,y)</math>.<br />
<li>Set <math>\textstyle \Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)</math>. <br />
<li>Set <math>\textstyle \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)</math>. <br />
</ol><br />
<br />
<li>Update the parameters:<br />
:<math>\begin{align}<br />
W^{(l)} &= W^{(l)} - \alpha \left[ \left(\frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}\right] \\<br />
b^{(l)} &= b^{(l)} - \alpha \left[\frac{1}{m} \Delta b^{(l)}\right]<br />
\end{align}</math><br />
</ol><br />
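One iteration of this batch procedure, sketched in Python/NumPy with the per-example backpropagation routine passed in as a function (the names and default values here are illustrative, not starter code; <tt>lam</tt> stands in for <math>\lambda</math>):<br />

```python
import numpy as np

def batch_gd_step(W, b, X, Y, backprop, alpha=0.1, lam=1e-4):
    # Accumulate Delta W, Delta b over all m examples, then update with
    # weight decay applied to W but not to b.
    m = len(X)
    dW = [np.zeros_like(Wl) for Wl in W]
    db = [np.zeros_like(bl) for bl in b]
    for x, y in zip(X, Y):
        nW, nb = backprop(W, b, x, y)      # per-example gradients
        dW = [d + n for d, n in zip(dW, nW)]
        db = [d + n for d, n in zip(db, nb)]
    W = [Wl - alpha * (dWl / m + lam * Wl) for Wl, dWl in zip(W, dW)]
    b = [bl - alpha * (dbl / m) for bl, dbl in zip(b, db)]
    return W, b
```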
<br />
To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function <math>\textstyle J(W,b)</math>.<br />
<br />
<br />
{{Sparse_Autoencoder}}</div>
Template:Sparse Autoencoder
<hr />
<div>----<br />
<br />
<div style="text-align: center;font-size:x-small;"><br />
[[Neural Networks]] | [[Backpropagation Algorithm]] | [[Gradient checking and advanced optimization]] | [[Autoencoders and Sparsity]] | [[Visualizing a Trained Autoencoder]] | [[Sparse Autoencoder Notation Summary]] | [[Exercise:Sparse Autoencoder]]<br />
</div></div>
Neural Networks
<hr />
<div>Consider a supervised learning problem where we have access to labeled training<br />
examples <math>(x^{(i)}, y^{(i)})</math>. Neural networks give a way of defining a complex,<br />
non-linear form of hypotheses <math>h_{W,b}(x)</math>, with parameters <math>W,b</math> that we can<br />
fit to our data.<br />
<br />
To describe neural networks, we will begin by describing the simplest possible<br />
neural network, one which comprises a single "neuron." We will use the following<br />
diagram to denote a single neuron:<br />
<br />
[[Image:SingleNeuron.png|300px|center]]<br />
<br />
This "neuron" is a computational unit that takes as input <math>x_1, x_2, x_3</math> (and a +1 intercept term), and<br />
outputs <math>\textstyle h_{W,b}(x) = f(W^Tx) = f(\sum_{i=1}^3 W_{i}x_i +b)</math>, where <math>f : \Re \mapsto \Re</math> is<br />
called the '''activation function'''. In these notes, we will choose<br />
<math>f(\cdot)</math> to be the sigmoid function:<br />
<br />
:<math><br />
f(z) = \frac{1}{1+\exp(-z)}.<br />
</math><br />
<br />
Thus, our single<br />
neuron corresponds exactly to the input-output mapping defined by logistic regression.<br />
<br />
Although these notes will use the sigmoid function, it is worth noting that<br />
another common choice for <math>f</math> is the hyperbolic tangent, or tanh, function:<br />
:<math><br />
f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.<br />
</math><br />
Here are plots of the sigmoid and <math>\tanh</math> functions:<br />
<br />
<br />
<br />
<div align=center><br />
[[Image:Sigmoid_Function.png|400px|top|Sigmoid activation function.]]<br />
[[Image:Tanh_Function.png|400px|top|Tanh activation function.]]<br />
</div><br />
<br />
The <math>\tanh(z)</math> function is a rescaled version of the sigmoid, and its output range is<br />
<math>[-1,1]</math> instead of <math>[0,1]</math>.<br />
<br />
Note that unlike some other venues (including the OpenClassroom videos, and parts of CS229), we are not using the convention<br />
here of <math>x_0=1</math>. Instead, the intercept term is handled separately by the parameter <math>b</math>.<br />
<br />
Finally, one identity that'll be useful later: If <math>f(z) = 1/(1+\exp(-z))</math> is the sigmoid<br />
function, then its derivative is given by <math>f'(z) = f(z) (1-f(z))</math>.<br />
(If <math>f</math> is the tanh function, then its derivative is given by<br />
<math>f'(z) = 1- (f(z))^2</math>.) You can derive this yourself using the definition of<br />
the sigmoid (or tanh) function.<br />
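Both identities are easy to confirm numerically, for example by comparing against a centered difference (a Python/NumPy illustration):<br />

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
sig_identity = f(z) * (1 - f(z))                      # claimed f'(z)
sig_numeric = (f(z + eps) - f(z - eps)) / (2 * eps)   # centered difference
# tanh analogue: d/dz tanh(z) = 1 - tanh(z)^2
tanh_identity = 1 - np.tanh(z) ** 2
tanh_numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
```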
<br />
<br />
<br />
== Neural Network model == <br />
<br />
A neural network is put together by hooking together many of our simple<br />
"neurons," so that the output of a neuron can be the input of another. For<br />
example, here is a small neural network:<br />
<br />
[[Image:Network331.png|400px|center]]<br />
<br />
In this figure, we have used circles to also denote the inputs to the network. The circles<br />
labeled "+1" are called '''bias units''', and correspond to the intercept term.<br />
The leftmost layer of the network is called the '''input layer''', and the<br />
rightmost layer the '''output layer''' (which, in this example, has only one<br />
node). The middle layer of nodes is called the '''hidden layer''', because its<br />
values are not observed in the training set. We also say that our example<br />
neural network has 3 '''input units''' (not counting the bias unit), 3 <br />
'''hidden units''', and 1 '''output unit'''.<br />
<br />
We will let <math>n_l</math><br />
denote the number of layers in our network; thus <math>n_l=3</math> in our example. We label layer <math>l</math> as<br />
<math>L_l</math>, so layer <math>L_1</math> is the input layer, and layer <math>L_{n_l}</math> the output layer.<br />
Our neural network has parameters <math>(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})</math>, where<br />
we write<br />
<math>W^{(l)}_{ij}</math> to denote the parameter (or weight) associated with the connection<br />
between unit <math>j</math> in layer <math>l</math>, and unit <math>i</math> in layer <math>l+1</math>. (Note the order of the indices.)<br />
Also, <math>b^{(l)}_i</math> is the bias associated with unit <math>i</math> in layer <math>l+1</math>.<br />
Thus, in our example, we have <math>W^{(1)} \in \Re^{3\times 3}</math>, and <math>W^{(2)} \in \Re^{1\times 3}</math>.<br />
Note that bias units don't have inputs or connections going into them, since they always output<br />
the value +1. We also let <math>s_l</math> denote the number of nodes in layer <math>l</math> (not counting the bias unit).<br />
<br />
We will write <math>a^{(l)}_i</math> to denote the '''activation''' (meaning output value) of<br />
unit <math>i</math> in layer <math>l</math>. For <math>l=1</math>, we also use <math>a^{(1)}_i = x_i</math> to denote the <math>i</math>-th input.<br />
Given a fixed setting of<br />
the parameters <math>W,b</math>, our neural<br />
network defines a hypothesis <math>h_{W,b}(x)</math> that outputs a real number. Specifically, the<br />
computation that this neural network represents is given by:<br />
:<math><br />
\begin{align}<br />
a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\<br />
a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\<br />
a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\<br />
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) <br />
\end{align}<br />
</math><br />
<br />
In the sequel, we also let <math>z^{(l)}_i</math> denote the total weighted sum of inputs to unit <math>i</math> in layer <math>l</math>,<br />
including the bias term (e.g., <math>\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i</math>), so that<br />
<math>a^{(l)}_i = f(z^{(l)}_i)</math>.<br />
<br />
Note that this easily lends itself to a more compact notation. Specifically, if we extend the<br />
activation function <math>f(\cdot)</math><br />
to apply to vectors in an element-wise fashion (i.e.,<br />
<math>f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]</math>), then we can write<br />
the equations above more<br />
compactly as:<br />
:<math>\begin{align}<br />
z^{(2)} &= W^{(1)} x + b^{(1)} \\<br />
a^{(2)} &= f(z^{(2)}) \\<br />
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\<br />
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})<br />
\end{align}</math><br />
We call this step '''forward propagation.''' More generally, recalling that we also use <math>a^{(1)} = x</math> to denote the values from the input layer,<br />
then given layer <math>l</math>'s activations <math>a^{(l)}</math>, we can compute layer <math>l+1</math>'s activations <math>a^{(l+1)}</math> as:<br />
:<math>\begin{align}<br />
z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\<br />
a^{(l+1)} &= f(z^{(l+1)})<br />
\end{align}</math><br />
By organizing our parameters in matrices and using matrix-vector operations, we can take<br />
advantage of fast linear algebra routines to quickly perform calculations in our network.<br />
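A sketch of these forward propagation equations in Python/NumPy (the course exercises are in Matlab/Octave, so the function name here is ours):<br />

```python
import numpy as np

def forward(W, b, x, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    # a^(1) = x; then z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1)).
    a = x
    for Wl, bl in zip(W, b):
        a = f(Wl @ a + bl)
    return a

# The 3-3-1 example network (all-zero parameters, so every unit outputs f(0)).
W = [np.zeros((3, 3)), np.zeros((1, 3))]
b = [np.zeros(3), np.zeros(1)]
h = forward(W, b, np.array([1.0, 2.0, 3.0]))
```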
<br />
<br />
We have so far focused on one example neural network, but one can also build neural<br />
networks with other '''architectures''' (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.<br />
The most common choice is a <math>\textstyle n_l</math>-layered network<br />
where layer <math>\textstyle 1</math> is the input layer, layer <math>\textstyle n_l</math> is the output layer, and each<br />
layer <math>\textstyle l</math> is densely connected to layer <math>\textstyle l+1</math>. In this setting, to compute the<br />
output of the network, we can successively compute all the activations in layer<br />
<math>\textstyle L_2</math>, then layer <math>\textstyle L_3</math>, and so on, up to layer <math>\textstyle L_{n_l}</math>, using the equations above that describe the forward propagation step. This is one<br />
example of a '''feedforward''' neural network, since the connectivity graph<br />
does not have any directed loops or cycles.<br />
<br />
<br />
Neural networks can also have multiple output units. For example, here is a network<br />
with two hidden layers, <math>L_2</math> and <math>L_3</math>, and two output units in layer <math>L_4</math>:<br />
<br />
[[Image:Network3322.png|500px|center]]<br />
<br />
To train this network, we would need training examples <math>(x^{(i)}, y^{(i)})</math><br />
where <math>y^{(i)} \in \Re^2</math>. This sort of network is useful if there are multiple<br />
outputs that you're interested in predicting. (For example, in a medical<br />
diagnosis application, the vector <math>x</math> might give the input features of a<br />
patient, and the different outputs <math>y_i</math>'s might indicate presence or absence<br />
of different diseases.)<br />
<br />
{{Sparse_Autoencoder}}</div>
Template:Sparse Autoencoder
<hr />
<div><div style="text-align: center;font-size:x-small;"><br />
[[Neural Networks]] | [[Backpropagation Algorithm]] | [[Gradient checking and advanced optimization]] | [[Autoencoders and Sparsity]] | [[Visualizing a Trained Autoencoder]] | [[Sparse Autoencoder Notation Summary]] | [[Exercise:Sparse Autoencoder]]<br />
</div></div>
Exercise:Softmax Regression
<hr />
<div>== Softmax regression ==<br />
<br />
In this problem set, you will use [[softmax regression]] on pixels to classify MNIST images. However, since you will also be using softmax regression for the [[Self-Taught Learning]] exercise later, your implementation should be a more general implementation that works on any arbitrary input.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/softmax_exercise.zip softmax_exercise.zip]</tt>, we have provided some starter code. You should write your code in the places indicated by "YOUR CODE HERE" in the files. You will need to modify <tt>softmaxCost.m</tt> and <tt>softmaxPredict.m</tt> for this exercise.<br />
<br />
=== Support Code/Data ===<br />
<br />
The following additional files are required for this exercise:<br />
* [http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz MNIST Dataset (Training Images)]<br />
* [http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz MNIST Dataset (Training Labels)]<br />
* [[Using the MNIST Dataset | Support functions for loading MNIST in Matlab ]]<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
Two constants, <tt>inputSize</tt> and <tt>outputSize</tt>, corresponding to the size of each input vector and the number of class labels have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise. We also initialize <tt>lambda</tt>, the weight decay parameter here.<br />
<br />
=== Step 1: Load data ===<br />
<br />
The starter code loads the MNIST images and labels into inputData and outputData respectively. The images are pre-processed to scale the pixel values to the range <math>[0, 1]</math>, and the label 0 is remapped to 10 for convenience of implementation. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.<br />
<br />
=== Step 2: Implement softmaxCost ===<br />
<br />
In softmaxCost.m, implement code to compute the softmax cost function. Since minFunc minimises this cost, we consider the '''negative''' of the log-likelihood <math>-\ell(\theta)</math>, in order to maximise <math>\ell(\theta)</math>. Remember also to include the weight decay term in the cost as well. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later).<br />
<br />
<br />
'''Implementation Tip''': Computing the ground truth matrix - In your code, you may need to compute the ground truth matrix <tt>M</tt>, such that <tt>M(r, c)</tt> is 1 if <math>y^{(c)} = r</math> and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions <tt>sparse</tt> and <tt>full</tt>. <tt>sparse(r, c, v)</tt> creates a sparse matrix such that <tt>M(r(i), c(i)) = v(i)</tt> for all i. That is, the vectors <tt>r</tt> and <tt>c</tt> give the position of the elements whose values we wish to set, and <tt>v</tt> the corresponding values of the elements. Running <tt>full</tt> on a sparse matrix gives the full representation of the matrix for use. Note that the code for using <tt>sparse</tt> and <tt>full</tt> to compute the ground truth matrix has already been included in softmaxCost.m.<br />
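For readers working outside Matlab, the same ground truth matrix can be built with vectorized indexing; here is a NumPy sketch (the function name is ours, not the starter code's):<br />

```python
import numpy as np

def ground_truth(labels, num_classes):
    # M[r, c] = 1 iff labels[c] == r + 1 (labels run 1..num_classes,
    # matching the exercise's remapping of digit 0 to label 10).
    labels = np.asarray(labels)
    M = np.zeros((num_classes, labels.size))
    M[labels - 1, np.arange(labels.size)] = 1.0   # vectorized, no loop
    return M

M = ground_truth([2, 1, 3, 3], num_classes=3)
```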
<br />
<br />
'''Implementation Tip:''' Preventing overflows - in softmax regression, you will have to compute the hypothesis<br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) = <br />
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
When the products <math>\theta_i^T x^{(i)}</math> are large, the exponential function <math>e^{\theta_i^T x^{(i)}}</math> will become very large and possibly overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution - observe that we can multiply the top and bottom of the hypothesis by some constant without changing the output: <br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) &=<br />
<br />
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ e^{-\alpha} }{ e^{-\alpha} \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ 1 }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} - \alpha }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} - \alpha } \\<br />
e^{ \theta_2^T x^{(i)} - \alpha } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} - \alpha } \\<br />
\end{bmatrix} \\<br />
<br />
<br />
\end{align}<br />
<br />
</math><br />
<br />
Hence, to prevent overflow, simply subtract some large constant value from each of the <math>\theta_j^T x^{(i)}</math> terms before computing the exponential. In practice, for each example, you can use the maximum of the <math>\theta_j^T x^{(i)}</math> terms as the constant. Assuming you have a matrix <tt>M</tt> containing these terms such that <tt>M(r, c)</tt> is <math>\theta_r^T x^{(c)}</math>, then you can use the following code to accomplish this:<br />
<br />
% M is the matrix as described in the text<br />
M = bsxfun(@minus, M, max(M, [], 1));<br />
<br />
<tt>max(M, [], 1)</tt> yields a row vector with each element giving the maximum value in that column. <tt>bsxfun</tt> (short for binary singleton expansion function) expands this row vector and subtracts it from each row of <tt>M</tt>, hence subtracting the maximum of each column from every element in the column. <br />
<br />
'''Implementation Tip: ''' Computing the predictions - you may also find <tt>bsxfun</tt> useful in computing your predictions - if you have a matrix <tt>M</tt> containing the <math>e^{\theta_j^T x^{(i)}}</math> terms, such that <tt>M(r, c)</tt> contains the <math>e^{\theta_r^T x^{(c)}}</math> term, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):<br />
<br />
% M is the matrix as described in the text<br />
 M = bsxfun(@rdivide, M, sum(M));<br />
<br />
The operation of <tt>bsxfun</tt> in this case is analogous to the earlier example.<br />
<br />
=== Step 3: Gradient checking ===<br />
<br />
Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of <math>10^{-9}</math>. <br />
<br />
'''Implementation Tip:''' Faster gradient checking - when debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable <tt>DEBUG</tt> to true, as described in step 1 of the code.<br />
<br />
=== Step 4: Learning parameters ===<br />
<br />
Now that you've verified that your gradients are correct, you can train your softmax model using the function <tt>softmaxTrain</tt> in <tt>softmaxTrain.m</tt>. <tt>softmaxTrain</tt> uses the L-BFGS algorithm, as implemented in the function <tt>minFunc</tt>, to minimise the cost. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, taking less than 5 minutes for 100 iterations.<br />
<br />
Factoring <tt>softmaxTrain</tt> out as a function means that you will be able to easily reuse it to train softmax models on other data sets in the future by invoking the function with different parameters.<br />
<br />
Use the following parameter when training your softmax classifier:<br />
<br />
lambda = 1e-3<br />
<br />
=== Step 5: Testing ===<br />
<br />
Now that you've trained your model, you will test it against the MNIST test set, comprising 10000 28x28 images. However, to do so, you will first need to complete the function <tt>softmaxPredict</tt> in <tt>softmaxPredict.m</tt>, a function which generates predictions for input data under a trained softmax model. <br />
<br />
Once that is done, you will be able to compute the accuracy (the proportion of correctly classified images) of your model using the code provided. Our implementation achieved an accuracy of '''92%'''. If your model's accuracy is significantly less (less than 91%), check your code, ensure that you are using the trained weights, and that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.<br />
<br />
[[Category:Exercises]]</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:Softmax_RegressionExercise:Softmax Regression2011-05-06T01:13:53Z<p>Watsuen: /* Step 0: Initialise constants and parameters */</p>
<hr />
<div>== Softmax regression ==<br />
<br />
In this problem set, you will use [[softmax regression]] on pixels to classify MNIST images. However, since you will also be using softmax regression in the [[Self-Taught Learning]] exercise later, your implementation should be general enough to work on arbitrary input.<br />
<br />
In the file <tt>[http://ufldl.stanford.edu/wiki/resources/softmax_exercise.zip softmax_exercise.zip]</tt>, we have provided some starter code. You should write your code in the places indicated by "YOUR CODE HERE" in the files. You will need to modify <tt>softmaxCost.m</tt> and <tt>softmaxPredict.m</tt> for this exercise.<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
Two constants, <tt>inputSize</tt> and <tt>outputSize</tt>, corresponding to the size of each input vector and the number of class labels respectively, have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise. We also initialize <tt>lambda</tt>, the weight decay parameter, here.<br />
<br />
=== Step 1: Load data ===<br />
<br />
The starter code loads the MNIST images and labels into <tt>inputData</tt> and <tt>outputData</tt> respectively. The images are pre-processed to scale the pixel values to the range <math>[0, 1]</math>, and the label 0 is remapped to 10 for convenience of implementation. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.<br />
<br />
=== Step 2: Implement softmaxCost ===<br />
<br />
In <tt>softmaxCost.m</tt>, implement code to compute the softmax cost function. Since <tt>minFunc</tt> performs minimisation, we minimise the '''negative''' of the log-likelihood, <math>-\ell(\theta)</math>, which is equivalent to maximising <math>\ell(\theta)</math>. Remember to include the weight decay term in the cost. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later).<br />
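For concreteness, here is a small pure-Python sketch of this cost (the exercise itself is in MATLAB); the toy <tt>theta</tt>, <tt>X</tt>, <tt>y</tt> and <tt>lam</tt> values are made up for illustration, and the starter code may additionally average the log-likelihood over the examples:<br />

```python
import math

# Toy problem (made-up values): n = 2 classes, 2 features, m = 2 examples.
theta = [[0.5, -0.5], [-0.5, 0.5]]   # theta[j] = parameter vector for class j
X = [[1.0, 0.0], [0.0, 1.0]]         # one example per row
y = [0, 1]                           # label of each example (0-indexed here)
lam = 1e-3                           # weight decay parameter

def softmax_cost(theta, X, y, lam):
    """Negative log-likelihood plus the weight decay term (lam/2) * sum(theta^2)."""
    cost = 0.0
    for xi, yi in zip(X, y):
        # scores[j] = theta_j . x_i
        scores = [sum(t * f for t, f in zip(tj, xi)) for tj in theta]
        a = max(scores)  # subtracting the max prevents overflow in exp
        log_norm = a + math.log(sum(math.exp(s - a) for s in scores))
        cost -= scores[yi] - log_norm          # -log P(y = yi | xi)
    cost += (lam / 2.0) * sum(t * t for row in theta for t in row)
    return cost

cost = softmax_cost(theta, X, y, lam)
```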
<br />
<br />
'''Implementation Tip''': Computing the ground truth matrix - in your code, you may need to compute the ground truth matrix <tt>M</tt>, such that <tt>M(r, c)</tt> is 1 if <math>y^{(c)} = r</math> and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions <tt>sparse</tt> and <tt>full</tt>. <tt>sparse(r, c, v)</tt> creates a sparse matrix such that <tt>M(r(i), c(i)) = v(i)</tt> for all i. That is, the vectors <tt>r</tt> and <tt>c</tt> give the positions of the elements whose values we wish to set, and <tt>v</tt> their corresponding values. Running <tt>full</tt> on a sparse matrix converts it to its full (dense) representation. Note that the code for using <tt>sparse</tt> and <tt>full</tt> to compute the ground truth matrix has already been included in <tt>softmaxCost.m</tt>.<br />
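As a language-neutral illustration of what <tt>sparse</tt>/<tt>full</tt> builds (the function name below is hypothetical, not from the starter code), this Python sketch constructs the same ground truth matrix directly:<br />

```python
def ground_truth(labels, num_classes):
    """Build M with M[r][c] = 1 if labels[c] == r, else 0 -- the analogue of
    MATLAB's full(sparse(labels, 1:m, 1)). Labels are 0-indexed here, whereas
    the exercise remaps label 0 to 10 to fit MATLAB's 1-based indexing."""
    m = len(labels)
    M = [[0] * m for _ in range(num_classes)]
    for c, r in enumerate(labels):
        M[r][c] = 1
    return M

M = ground_truth([0, 2, 1], 3)  # columns are examples, rows are classes
```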
<br />
<br />
'''Implementation Tip:''' Preventing overflows - in softmax regression, you will have to compute the hypothesis<br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) = <br />
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix}<br />
\end{align}<br />
</math><br />
<br />
When the products <math>\theta_j^T x^{(i)}</math> are large, the exponentials <math>e^{\theta_j^T x^{(i)}}</math> become very large and may overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution - observe that we can multiply both the top and the bottom of the hypothesis by the same constant without changing the output: <br />
<br />
<math><br />
\begin{align} <br />
h(x^{(i)}) &=<br />
<br />
\frac{1}{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ e^{-\alpha} }{ e^{-\alpha} \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} } \\<br />
e^{ \theta_2^T x^{(i)} } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} } \\<br />
\end{bmatrix} \\<br />
<br />
&=<br />
<br />
\frac{ 1 }{ \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} - \alpha }} }<br />
\begin{bmatrix} <br />
e^{ \theta_1^T x^{(i)} - \alpha } \\<br />
e^{ \theta_2^T x^{(i)} - \alpha } \\<br />
\vdots \\<br />
e^{ \theta_n^T x^{(i)} - \alpha } \\<br />
\end{bmatrix} \\<br />
<br />
<br />
\end{align}<br />
<br />
</math><br />
<br />
Hence, to prevent overflow, simply subtract some large constant value from each of the <math>\theta_j^T x^{(i)}</math> terms before computing the exponential. In practice, for each example, you can use the maximum of the <math>\theta_j^T x^{(i)}</math> terms as the constant. Assuming you have a matrix <tt>M</tt> containing these terms such that <tt>M(r, c)</tt> is <math>\theta_r^T x^{(c)}</math>, then you can use the following code to accomplish this:<br />
<br />
% M is the matrix as described in the text<br />
 M = bsxfun(@minus, M, max(M, [], 1));<br />
<br />
<tt>max(M, [], 1)</tt> yields a row vector whose elements are the maxima of the corresponding columns of <tt>M</tt>. <tt>bsxfun</tt> (short for binary singleton expansion function) expands this row vector down the rows of <tt>M</tt> and applies the subtraction, hence subtracting each column's maximum from every element in that column. <br />
<br />
'''Implementation Tip: ''' Computing the predictions - you may also find <tt>bsxfun</tt> useful in computing your predictions. If you have a matrix <tt>M</tt> containing the <math>e^{\theta_j^T x^{(i)}}</math> terms, such that <tt>M(r, c)</tt> contains the <math>e^{\theta_r^T x^{(c)}}</math> term, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):<br />
<br />
% M is the matrix as described in the text<br />
 M = bsxfun(@rdivide, M, sum(M));<br />
<br />
The operation of <tt>bsxfun</tt> in this case is analogous to the earlier example.<br />
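Putting the two <tt>bsxfun</tt> tips together, here is a pure-Python sketch of the numerically stable hypothesis computation (the function name and toy scores are made up for illustration; the exercise itself uses MATLAB):<br />

```python
import math

def softmax_columns(M):
    """Given M[r][c] = theta_r . x_c, return H[r][c] = P(y_c = r): subtract each
    column's maximum (the @minus step), exponentiate, then divide each column
    by its sum (the @rdivide step)."""
    n, m = len(M), len(M[0])
    H = [[0.0] * m for _ in range(n)]
    for c in range(m):
        col_max = max(M[r][c] for r in range(n))
        exps = [math.exp(M[r][c] - col_max) for r in range(n)]
        total = sum(exps)
        for r in range(n):
            H[r][c] = exps[r] / total
    return H

# First column has scores of 1000: a naive exp() would overflow to inf.
H = softmax_columns([[1000.0, 1.0],
                     [1000.0, 2.0]])
```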
<br />
=== Step 3: Gradient checking ===<br />
<br />
Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of <math>10^{-9}</math>. <br />
<br />
'''Implementation Tip:''' Faster gradient checking - when debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable <tt>DEBUG</tt> to true, as described in step 1 of the code.<br />
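The numerical check itself is just central differences; here is a minimal Python sketch (the quadratic test function is made up for illustration, not part of the starter code):<br />

```python
def numerical_gradient(f, theta, eps=1e-4):
    """Central-difference approximation:
    grad[i] = (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps)."""
    grad = []
    for i in range(len(theta)):
        plus, minus = theta[:], theta[:]
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

# Check a simple function: f(t) = sum(t_i^2), whose analytic gradient is 2*t.
f = lambda t: sum(x * x for x in t)
theta = [1.0, -2.0, 3.0]
numeric = numerical_gradient(f, theta)
analytic = [2 * x for x in theta]
diff = max(abs(a - b) for a, b in zip(numeric, analytic))  # should be tiny
```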
<br />
=== Step 4: Learning parameters ===<br />
<br />
Now that you've verified that your gradients are correct, you can train your softmax model using the function <tt>softmaxTrain</tt> in <tt>softmaxTrain.m</tt>. <tt>softmaxTrain</tt> uses the L-BFGS algorithm, as implemented in the function <tt>minFunc</tt>, to minimise the cost. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, taking less than 5 minutes for 100 iterations.<br />
<br />
Factoring <tt>softmaxTrain</tt> out as a function means that you will be able to easily reuse it to train softmax models on other data sets in the future by invoking the function with different parameters.<br />
<br />
Use the following parameter when training your softmax classifier:<br />
<br />
lambda = 1e-3<br />
<br />
=== Step 5: Testing ===<br />
<br />
Now that you've trained your model, you will test it against the MNIST test set, comprising 10000 28x28 images. However, to do so, you will first need to complete the function <tt>softmaxPredict</tt> in <tt>softmaxPredict.m</tt>, a function which generates predictions for input data under a trained softmax model. <br />
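<tt>softmaxPredict</tt> ultimately reduces to taking the most likely class in each column of the hypothesis; a minimal Python sketch (the toy probabilities below are made up):<br />

```python
def predict(H):
    """Given H[r][c] = P(y_c = r), return the most likely class for each
    column. Since exp is monotone, the raw scores theta_r . x_c give the
    same argmax, so normalisation is not strictly needed for prediction."""
    n, m = len(H), len(H[0])
    return [max(range(n), key=lambda r: H[r][c]) for c in range(m)]

preds = predict([[0.7, 0.2, 0.1],
                 [0.3, 0.8, 0.9]])
```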
<br />
Once that is done, you will be able to compute the accuracy (the proportion of correctly classified images) of your model using the code provided. Our implementation achieved an accuracy of '''92%'''. If your model's accuracy is significantly less (less than 91%), check your code, ensure that you are using the trained weights, and that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.<br />
<br />
[[Category:Exercises]]</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Exercise:_Implement_deep_networks_for_digit_classificationExercise: Implement deep networks for digit classification2011-04-28T21:56:34Z<p>Watsuen: /* Overview */</p>
<hr />
<div>===Overview===<br />
<br />
In this exercise, you will use a stacked autoencoder for digit classification. This exercise is very similar to the self-taught learning exercise, in which we trained a digit classifier using an autoencoder layer followed by a softmax layer. The only difference is that in this exercise we will be using two autoencoder layers instead of one.<br />
<br />
The code you have already implemented will allow you to stack various layers and perform layer-wise training. However, to perform fine-tuning, you will need to implement back-propagation as well. We will see that fine-tuning significantly improves the model's performance.<br />
<br />
In the file <tt>stacked_ae_exercise.zip</tt> [http://ufldl.stanford.edu/wiki/resources/sparseae_exercise.zip], we have provided some starter code. You will need to edit <tt>stackedAECost.m</tt>. You should also read <tt>stackedAETrain.m</tt> and ensure that you understand the steps.<br />
<br />
=== Step 0: Initialize constants and parameters ===<br />
<br />
Open <tt>stackedAETrain.m</tt>. In this step, we set the meta-parameters to the same values that were used in the previous exercise, which should produce reasonable results. You may modify the meta-parameters if you wish.<br />
<br />
=== Step 1: Train the data on the first stacked autoencoder ===<br />
<br />
Train the first autoencoder on the training images to obtain its parameters. This step is identical to the corresponding step in the sparse autoencoder and STL assignments, so if you have implemented your <tt>autoencoderCost.m</tt> correctly, this step should run properly without needing any modifications. <br />
<br />
=== Step 2: Train the data on the second stacked autoencoder ===<br />
<br />
Run the training set through the first autoencoder to obtain the hidden unit activations, then train the second autoencoder on these activations. Since this is just a standard autoencoder applied to new input data, it should run identically to the first.<br />
<br />
Note: This step assumes that you have changed the method signature of sparseAutoencoderCost from<br />
<tt>function [cost, grad] = sparseAutoencoderCost(...)</tt> to <tt>function [cost, grad, activation] = sparseAutoencoderCost(...)</tt> in the [[Exercise:Self-Taught_Learning|previous assignment]].<br />
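The feed-forward step this relies on can be sketched in a few lines of Python, assuming a sigmoid autoencoder; <tt>W</tt>, <tt>b</tt> and the input below are hypothetical stand-ins for the trained first-layer parameters, not values from the exercise:<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(W, b, x):
    """Hidden-unit activations of a trained autoencoder: a = sigmoid(W x + b).
    The second autoencoder is then trained on these activations instead of
    the raw input."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Hypothetical 2-hidden-unit encoder applied to one 2-dimensional example.
a1 = encode([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], [2.0, 1.0])
```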
<br />
=== Step 3: Implement fine-tuning ===<br />
<br />
To implement fine-tuning, we need to consider all three layers as a single model. Implement <tt>stackedAECost.m</tt> to return the cost, gradient and predictions of the model. The cost function should be defined as the negative log-likelihood plus a weight decay term. The gradient should be computed using back-propagation, as discussed earlier. The predictions should consist of the activations of the output layer of the softmax model.<br />
<br />
To help you check that your implementation is correct, you can use the <tt>stackedAECheck.m</tt> script. The first part of the script runs the same input on your combined-model function, and on your separate autoencoder and softmax functions, and checks that they return the same cost and predictions. The second part of the script checks that the numerical gradient of the function is the same as your computed analytic gradient. If these two checks pass, you will have implemented fine-tuning correctly.<br />
<br />
'''Note:''' Recall that the cost function is given by:<br />
<br />
<math><br />
\begin{align}<br />
J(\theta) = -\ell(\theta) + w(\theta) \\<br />
w(\theta) = \frac{\lambda}{2} \sum_{j}{ \sum_{k}{ \theta_{jk}^2 } } \\<br />
\ell(\theta) = \sum_{i=1}^{m}{ \left[ \theta^T_{y^{(i)}} x^{(i)} - \ln \sum_{j=1}^{n}{e^{ \theta_j^T x^{(i)} }} \right] }<br />
\end{align}<br />
</math><br />
<br />
When adding the weight decay term to the cost, only the weights for the topmost (softmax) layer need to be considered. Doing so does not impact the results adversely, but simplifies the implementation significantly.<br />
<br />
=== Step 4: Test the model ===<br />
After completing these steps, running the entire script in <tt>stackedAETrain.m</tt> will perform layer-wise training of the stacked autoencoder, fine-tune the model, and measure its performance on the test set. If you've done all the steps correctly, you should get an accuracy of about X percent.</div>Watsuenhttp://ufldl.stanford.edu/wiki/index.php/Fine-tuning_Stacked_AEsFine-tuning Stacked AEs2011-04-22T01:33:58Z<p>Watsuen: /* Recap of the Backpropagation Algorithm */</p>
<hr />
<div>=== Introduction ===<br />
Fine-tuning is a strategy commonly used in deep learning, and it can also be used to greatly improve the performance of a stacked autoencoder. From a high-level perspective, fine-tuning treats all layers of a stacked autoencoder as a single model, so that in each iteration we improve all of the weights in the stacked autoencoder.<br />
<br />
=== General Strategy ===<br />
Luckily, we already have all the tools necessary to implement fine-tuning for stacked autoencoders! In order to compute the gradients for all the layers of the stacked autoencoder in each iteration, we use the [[Backpropagation Algorithm]], as discussed in the sparse autoencoder section. As the backpropagation algorithm can be extended to apply to an arbitrary number of layers, we can use it on a stacked autoencoder of arbitrary depth.<br />
<br />
Note: most stacked autoencoders don't go past five layers.<br />
<br />
=== Recap of the Backpropagation Algorithm ===<br />
For your convenience, a summary of the backpropagation algorithm in element-wise notation is given below:<br />
<br />
: 1. Perform a feedforward pass, computing the activations for layers <math>\textstyle L_2</math>, <math>\textstyle L_3</math>, up to the output layer <math>\textstyle L_{n_l}</math>, using the equations defining the forward propagation steps.<br />
: 2. For the output layer (layer <math>\textstyle n_l</math>), set <br />
::<math>\begin{align}<br />
\delta^{(n_l)}<br />
= - (\nabla_{a^{n_l}}J) \bullet f'(z^{(n_l)})<br />
\end{align}</math><br />
::(When using softmax regression, the softmax layer has <math>\nabla J = \theta^T(I-P)</math>, where <math>I</math> is the ground-truth indicator matrix of the input labels and <math>P</math> is the matrix of predicted label probabilities.)<br />
: 3. For <math>\textstyle l = n_l-1, n_l-2, n_l-3, \ldots, 2</math> <br />
::Set<br />
:::<math>\begin{align}<br />
\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)})<br />
\end{align}</math><br />
: 4. Compute the desired partial derivatives: <br />
::<math>\begin{align}<br />
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T, \\<br />
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}.<br />
\end{align}</math><br />
<br />
:<math>\begin{align}<br />
J(W,b)<br />
&= \left[ \frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)}) \right]<br />
\end{align}</math></div>Watsuen