Stacked Autoencoders

===Overview===

The greedy layerwise approach for pretraining a deep network works by training each layer in turn. In this page, you will find out how autoencoders can be "stacked" in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network.

A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer is wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with n layers. Using notation from the autoencoder section, let <math>W^{(k, 1)}, W^{(k, 2)}, b^{(k, 1)}, b^{(k, 2)}</math> denote the parameters <math>W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}</math> for kth autoencoder. Then the encoding step for the stacked autoencoder is given by running the encoding step of each layer in forward order:

<math>
\begin{align}
a^{(l)} = f(z^{(l)}) \\
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, l)}
\end{align}
</math>

The decoding step is given by running the decoding stack of each autoencoder in reverse order:

<math>
\begin{align}
a^{(n + l)} = f(z^{(n + l)}) \\
z^{(n + l + 1)} = W^{(n - l, 2)}a^{(n + l)} + b^{(n - l, 2)}
\end{align}
</math>

The information of interest is contained within <math>a^{(n)}</math>, which is the activation of the deepest layer of hidden units. This vector gives us a representation of the input in terms of higher-order features. 

The features from the stacked autoencoder can be used for classification problems by feeding <math>a(n)</math> to a softmax classifier.

===Training===
A good way to obtain good parameters for a stacked autoencoder is to use greedy layer-wise training. To do this, first train the first layer on raw input to obtain parameters W1, W2, b1 and b2. Use the first layer to transform the raw input into a vector consisting of activation of the hidden units, A. Train the second layer on this vector to obtain parameters W1, W2, b1 and b2. Repeat for subsequent layers, using the output of each layer as input for the subsequent layer.

This method trains the parameters of each layer individually while freezing parameters for the remainder of the model. To produce better results, after this phase of training is complete, [[Fine-tuning Stacked AEs | fine-tuning]] using backpropagation can be used to improve the results by tuning the parameters of all layers are changed at the same time. 

<!-- In practice, fine-tuning should be use when the parameters have been brought close to convergence through layer-wise training. Attempting to use fine-tuning with the weights initialized randomly will lead to poor results due to local optima. -->

{{Quote|
If one is only interested in finetuning for the purposes of classification, the common practice is to then discard the "decoding" layers of the stacked autoencoder and link the last hidden layer <math>a^(n)</math> to the softmax classifier. The gradients from the (softmax) classification error will then be backpropagated into the encoding layers.
}}

===Motivation===

A stacked autoencoder inherits all the benefits of any deep network: greater expressive power and greater statistical efficiency. In addition, its purpose can be described in an intuitive sense as follows.

Recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input. The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features. Higher layers of the stacked autoencoder tend to learn even higher-order features.

For instance, in the context of image input, the first layers usually learns to recognize edges. The second layer usually learns features that arise from combinations of the edges, such as corners. With certain types of network configuration and input modes, the higher layers can learn meaningful combinations of features. For instance, if the input set consists of images of faces, higher layers may learn features corresponding to parts of the face such as eyes, noses or mouths.