Stacked Autoencoders

===Overview===

A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. Formally, consider a stacked autoencoder with n layers. Using notation from the autoencoder section, let <math>W^{(k, 1)}, W^{(k, 2)}, b^{(k, 1)}, b^{(k, 2)}</math> denote the parameters <math>W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}</math> for the kth autoencoder. Then the encoding step for the stacked autoencoder is given by running the encoding step of each layer in forward order:

<math>
\begin{align}
a^{(l)} = f(z^{(l)}) \\
z^{(l + 1)} = W^{(l, 1)}a^{(l)} + b^{(l, 1)}
\end{align}
</math>

The decoding step is given by running the decoding step of each autoencoder in reverse order:

<math>
\begin{align}
a^{(n + l)} = f(z^{(n + l)}) \\
z^{(n + l + 1)} = W^{(n - l, 2)}a^{(n + l)} + b^{(n - l, 2)}
\end{align}
</math>
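
The two recurrences above can be sketched in NumPy as follows. This is only an illustrative sketch: it assumes a sigmoid for <math>f</math>, treats the raw input as <math>a^{(1)}</math>, and uses small random weights in place of trained parameters. The names <code>encode</code>, <code>decode</code> and <code>init_layer</code> are hypothetical helpers, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(n_in, n_hidden, rng):
    """Random stand-in for one autoencoder's parameters (W1, b1, W2, b2)."""
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
    W2 = rng.normal(scale=0.1, size=(n_in, n_hidden))
    return W1, np.zeros(n_hidden), W2, np.zeros(n_in)

def encode(x, layers):
    """Apply the encoding step of each autoencoder in forward order."""
    a = x
    for W1, b1, _, _ in layers:
        a = sigmoid(W1 @ a + b1)
    return a

def decode(code, layers):
    """Apply the decoding step of each autoencoder in reverse order."""
    a = code
    for _, _, W2, b2 in reversed(layers):
        a = sigmoid(W2 @ a + b2)
    return a

rng = np.random.default_rng(0)
sizes = [8, 5, 3]  # assumed layer widths: input, first hidden, deepest hidden
layers = [init_layer(sizes[k], sizes[k + 1], rng) for k in range(len(sizes) - 1)]

x = rng.random(8)
code = encode(x, layers)     # activation of the deepest hidden layer
x_hat = decode(code, layers) # reconstruction with the input's dimension
```

Here <code>code</code> plays the role of the deepest hidden activation, and <code>x_hat</code> has the same dimension as the input because the decoders are applied in the reverse of the encoding order.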

The information of interest is contained within <math>a^{(n)}</math>, which is the activation of the deepest layer of hidden units. This vector gives us a representation of the input in terms of higher-order features. The stacked autoencoder can be used for classification problems by feeding <math>a^{(n)}</math> to a softmax classifier.

===Training===
Greedy layer-wise training is a good way to obtain parameters for a stacked autoencoder. First, train the first layer on the raw input to obtain parameters <math>W^{(1, 1)}, W^{(1, 2)}, b^{(1, 1)}, b^{(1, 2)}</math>. Use the first layer to transform the raw input into a vector A consisting of the activations of the hidden units. Train the second layer on this vector to obtain parameters <math>W^{(2, 1)}, W^{(2, 2)}, b^{(2, 1)}, b^{(2, 2)}</math>. Repeat for subsequent layers, using the output of each layer as input for the subsequent layer.
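
The layer-wise procedure can be sketched as a simple loop. The sketch below is a simplification under stated assumptions: each autoencoder is trained with plain batch gradient descent on squared reconstruction error only, without the sparsity penalty or weight decay a sparse autoencoder would use, and the functions <code>train_autoencoder</code> and <code>greedy_layerwise</code>, the learning rate and the epoch count are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, rng, lr=0.5, epochs=200):
    """Train one autoencoder on X (one example per column).

    Simplification: squared-error loss only, no sparsity or weight-decay term.
    """
    n_in, m = X.shape
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b1 = np.zeros((n_hidden, 1))
    W2 = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b2 = np.zeros((n_in, 1))
    for _ in range(epochs):
        A = sigmoid(W1 @ X + b1)       # hidden activations
        X_hat = sigmoid(W2 @ A + b2)   # reconstruction of the input
        # Back-propagate the reconstruction error through both layers.
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (W2.T @ d_out) * A * (1 - A)
        W2 -= lr * (d_out @ A.T) / m
        b2 -= lr * d_out.mean(axis=1, keepdims=True)
        W1 -= lr * (d_hid @ X.T) / m
        b1 -= lr * d_hid.mean(axis=1, keepdims=True)
    return W1, b1, W2, b2

def greedy_layerwise(X, hidden_sizes, rng):
    """Train each layer on the hidden activations of the layer before it."""
    layers, inputs = [], X
    for n_hidden in hidden_sizes:
        W1, b1, W2, b2 = train_autoencoder(inputs, n_hidden, rng)
        layers.append((W1, b1, W2, b2))
        # Earlier layers stay fixed; their activations become the next input.
        inputs = sigmoid(W1 @ inputs + b1)
    return layers, inputs

rng = np.random.default_rng(0)
X = rng.random((8, 50))  # toy data: 50 examples with 8 features each
layers, features = greedy_layerwise(X, [5, 3], rng)
```

Each call to <code>train_autoencoder</code> sees only the activations produced by the layers trained before it, which is what makes the procedure greedy.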

This method trains the parameters of each layer individually while freezing the parameters of the remainder of the model. After this phase of training is complete, fine-tuning can be used to improve the results. In fine-tuning, the parameters of all layers are changed at the same time. The gradient of the loss function can be back-propagated to each preceding layer.

In practice, fine-tuning should be used once the parameters have been brought close to convergence through layer-wise training. Attempting to fine-tune with randomly initialized weights tends to lead to poor results due to local optima.
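
As a rough illustration of fine-tuning, the sketch below back-propagates a classification loss through every encoding layer and updates all parameters in the same step. Several things here are assumptions made for the example: the softmax classifier on top of the deepest activations, the cross-entropy loss, the learning rate, and the helper names <code>forward</code> and <code>finetune_step</code>. To keep the block self-contained, the encoder weights are random stand-ins; in practice they would come from layer-wise training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(X, enc, Ws, bs):
    """Return all layer activations and the softmax class probabilities."""
    acts = [X]
    for W, b in enc:
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts, softmax(Ws @ acts[-1] + bs)

def cross_entropy(P, Y):
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=0))

def finetune_step(X, Y, enc, Ws, bs, lr=0.5):
    """One gradient-descent step that updates every layer at once."""
    m = X.shape[1]
    acts, P = forward(X, enc, Ws, bs)
    delta = (P - Y) / m  # gradient at the softmax pre-activations
    new_Ws = Ws - lr * (delta @ acts[-1].T)
    new_bs = bs - lr * delta.sum(axis=1, keepdims=True)
    # Propagate the error back into the deepest encoding layer.
    delta = (Ws.T @ delta) * acts[-1] * (1 - acts[-1])
    new_enc = [None] * len(enc)
    for k in reversed(range(len(enc))):
        W, b = enc[k]
        new_enc[k] = (W - lr * (delta @ acts[k].T),
                      b - lr * delta.sum(axis=1, keepdims=True))
        if k > 0:
            # Pass the error on to the preceding layer.
            delta = (W.T @ delta) * acts[k] * (1 - acts[k])
    return new_enc, new_Ws, new_bs

rng = np.random.default_rng(0)
X = rng.random((8, 40))                 # toy inputs, one example per column
labels = rng.integers(0, 3, size=40)    # toy labels for 3 classes
Y = np.eye(3)[:, labels]                # one-hot targets, shape (3, 40)

sizes = [8, 5, 3]
# Random stand-ins for encoder weights that would normally be pretrained.
enc = [(rng.normal(scale=0.1, size=(sizes[k + 1], sizes[k])),
        np.zeros((sizes[k + 1], 1))) for k in range(len(sizes) - 1)]
Ws, bs = rng.normal(scale=0.1, size=(3, sizes[-1])), np.zeros((3, 1))

loss_before = cross_entropy(forward(X, enc, Ws, bs)[1], Y)
for _ in range(100):
    enc, Ws, bs = finetune_step(X, Y, enc, Ws, bs)
loss_after = cross_entropy(forward(X, enc, Ws, bs)[1], Y)
```

Unlike the layer-wise phase, no layer is frozen here: the same error signal reaches the classifier and every encoding layer beneath it.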

===Motivation===

A stacked autoencoder inherits all the benefits of any deep network: greater expressive power and greater statistical efficiency. In addition, its purpose can be described in an intuitive sense as follows.

Recall that an autoencoder tends to learn features that form a good representation of its input. The first layer of a stacked autoencoder tends to learn first-order features in the raw input. The second layer of a stacked autoencoder tends to learn second-order features corresponding to patterns in the appearance of first-order features. Higher layers of the stacked autoencoder tend to learn even higher-order features.

For instance, in the context of image input, the first layer usually learns to recognize edges. The second layer usually learns features that arise from combinations of the edges, such as corners. With certain types of network configuration and input modes, the higher layers can learn meaningful combinations of features. For instance, if the input set consists of images of faces, higher layers may learn features corresponding to parts of the face such as eyes, noses or mouths.