# Feature extraction using convolution

Revision as of 17:55, 27 May 2011 (Ang)

## Overview

In the previous exercises, you worked through problems involving relatively low-resolution images, such as small image patches and small images of hand-written digits. In this section, we will develop methods that scale these techniques up to more realistic datasets with larger images.

## Fully Connected Networks

In the sparse autoencoder, one design choice that we made was to "fully connect" all the hidden units to all the input units. On relatively small images (e.g., 8x8 patches for the sparse autoencoder assignment, 28x28 images for the MNIST dataset), it is computationally feasible to learn features on the entire image. However, with larger images (e.g., 96x96 images), learning features that span the entire image (fully connected networks) is very computationally expensive: you would have about $10^4$ input units, and assuming you want to learn 100 features, you would have on the order of $10^6$ parameters to learn. The feedforward and backpropagation computations would also be about $10^2$ times slower, compared to 28x28 images.
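The parameter estimates above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `fully_connected_params` is made up for illustration, and it counts only the input-to-hidden weights and biases, assuming 100 hidden units as in the text.

```python
def fully_connected_params(height, width, n_features):
    """Count weights and biases of one fully connected hidden layer.

    Every hidden unit connects to every pixel, so there are
    (height * width) weights per unit, plus one bias per unit.
    """
    n_inputs = height * width
    return n_inputs * n_features + n_features


# Image sizes mentioned in the text: 8x8 patches, 28x28 MNIST, 96x96 images.
for side in (8, 28, 96):
    n_inputs = side * side
    n_params = fully_connected_params(side, side, 100)
    print(f"{side}x{side}: {n_inputs} input units, {n_params} parameters")
```

For the 96x96 case this gives 9216 input units and 921700 parameters, which matches the rough figures of $10^4$ and $10^6$.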

## Locally Connected Networks

One simple solution to the problem is to restrict the connections between the hidden units and the input units, allowing each hidden unit to connect to only a select number of input units. The selection of connections between the hidden and input units can often be determined based on the input modality -- e.g., for images, we will have hidden units that connect to local contiguous regions of pixels.

This idea of having locally connected networks also draws inspiration from how the early visual system is wired up. Specifically, neurons in the visual cortex are found to have localized receptive fields (i.e., they respond only to stimuli in a certain location).
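To make the idea of a local receptive field concrete, here is a small sketch of a single hidden unit wired to one contiguous patch of the image. The function name `local_unit_activation`, the toy 6x6 image, and the 3x3 patch size are all assumptions chosen for illustration; the sigmoid activation follows the sparse autoencoder used elsewhere in this tutorial.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_unit_activation(image, top, left, weights, bias):
    """Activation of one hidden unit connected only to a local patch.

    `weights` is a 2-D list; the unit sees the image region whose
    top-left corner is (top, left) and whose size matches `weights`.
    Pixels outside that region have no connection to this unit.
    """
    z = bias
    for di, weight_row in enumerate(weights):
        for dj, w in enumerate(weight_row):
            z += w * image[top + di][left + dj]
    return sigmoid(z)


# Toy setup: a blank 6x6 image and a unit with a 3x3 receptive field at (1, 1).
image = [[0.0] * 6 for _ in range(6)]
weights = [[0.5] * 3 for _ in range(3)]

before = local_unit_activation(image, 1, 1, weights, 0.0)

image[5][5] = 1.0  # pixel outside the receptive field
after_outside = local_unit_activation(image, 1, 1, weights, 0.0)

image[2][2] = 1.0  # pixel inside the receptive field
after_inside = local_unit_activation(image, 1, 1, weights, 0.0)

print(before, after_outside, after_inside)
```

Changing a pixel outside the patch leaves the activation unchanged, while changing a pixel inside the patch moves it. This is the sense in which the unit responds only to stimuli in a certain location.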

## Weight Sharing (Convolution)

Natural images have the property of being **stationary**, meaning that the statistics of one part of the image are the same as those of any other part. This suggests that the features we learn at one part of the image can also be applied to other parts of the image, so we can use the same features at all locations.

More precisely, having learned features over small (say 8x8) patches sampled randomly from the larger image, we can then apply this learned 8x8 feature detector anywhere in the image. Specifically, we can take the learned 8x8 features and **convolve** them with the larger image, thus obtaining a different feature activation value at each location in the image.
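One way to see what reusing the same feature detector buys us is to count parameters. The sketch below compares two hypothetical layouts on a 96x96 image with 8x8 patches and 100 features: an "untied" layout, assumed here to have a separate local unit with its own weights at every patch position, and the shared layout in which the same 8x8 weights are reused everywhere. The function names and the untied layout are illustrative assumptions, not part of the original text.

```python
def untied_local_params(r, c, a, b, k):
    """Parameters if every patch position had its own k local units."""
    positions = (r - a + 1) * (c - b + 1)  # number of a x b patch positions
    return k * positions * (a * b + 1)     # a*b weights plus 1 bias per unit


def shared_params(a, b, k):
    """Parameters if the same k feature detectors are reused everywhere."""
    return k * (a * b + 1)


untied = untied_local_params(96, 96, 8, 8, 100)
shared = shared_params(8, 8, 100)
print(f"untied: {untied} parameters, shared: {shared} parameters")
```

Under these assumptions, sharing reduces the count from 51486500 to 6500, which is exactly the size of the sparse autoencoder trained on 8x8 patches.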

To give a concrete example, suppose you have learned features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at $(1, 1), (1, 2), \ldots, (89, 89)$, you would extract the 8x8 patch and run it through your trained sparse autoencoder to get the feature activations. With 100 learned features, this would result in 100 sets of 89x89 convolved features, one set per feature (since 96 - 8 + 1 = 89).
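The procedure in this example can be sketched directly as a loop over patch positions. This is a plain-Python illustration rather than an efficient implementation: the function name `convolve_features` is made up, the weights are random stand-ins for a trained sparse autoencoder, and only 3 features are used (instead of 100) to keep the pure-Python loop fast. Strictly speaking, sliding a filter without flipping it is a cross-correlation; the sketch follows the text's usage and calls it convolution.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def convolve_features(image, W, b, patch_h, patch_w):
    """Apply k learned patch features at every position of `image`.

    image: list of rows of pixel values.
    W:     k weight vectors, each of length patch_h * patch_w (row-major).
    b:     k biases.
    Returns a nested list indexed as convolved[feature][row][col].
    """
    rows, cols = len(image), len(image[0])
    out_h, out_w = rows - patch_h + 1, cols - patch_w + 1
    convolved = [[[0.0] * out_w for _ in range(out_h)] for _ in range(len(W))]
    for i in range(out_h):
        for j in range(out_w):
            # Extract the patch whose top-left corner is (i, j), flattened row-major.
            patch = [image[i + di][j + dj]
                     for di in range(patch_h) for dj in range(patch_w)]
            for f, (w_f, b_f) in enumerate(zip(W, b)):
                z = sum(w * x for w, x in zip(w_f, patch)) + b_f
                convolved[f][i][j] = sigmoid(z)
    return convolved


random.seed(0)
k = 3  # assumption: 3 features instead of 100, for speed
image = [[random.random() for _ in range(96)] for _ in range(96)]
W = [[random.gauss(0.0, 0.1) for _ in range(8 * 8)] for _ in range(k)]
b = [0.0] * k

convolved = convolve_features(image, W, b, 8, 8)
print(len(convolved), len(convolved[0]), len(convolved[0][0]))
```

Each feature produces an 89x89 grid of activations, so the result has shape 3 x 89 x 89 here and would have shape 100 x 89 x 89 with 100 features. Note that the code uses 0-based indices, so the text's positions $(1, 1)$ through $(89, 89)$ correspond to `(0, 0)` through `(88, 88)`.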

Formally, given some large $r \times c$ images $x_{large}$, we first train a sparse autoencoder on small $a \times b$ patches $x_{small}$ sampled from these images, learning $k$ features $f = \sigma(W^{(1)}x_{small} + b^{(1)})$ (where $\sigma$ is the sigmoid function), given by the weights $W^{(1)}$ and biases $b^{(1)}$ from the visible units to the hidden units. For every $a \times b$ patch $x_s$ in the large image, we compute $f_s = \sigma(W^{(1)}x_s + b^{(1)})$, giving us $f_{convolved}$, a $k \times (r - a + 1) \times (c - b + 1)$ array of convolved features.
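As a quick check of the array size, we can plug in the values from the earlier example: $r = c = 96$, $a = b = 8$, and $k = 100$ (the number of features assumed in the fully connected discussion):

$$
k \times (r - a + 1) \times (c - b + 1) = 100 \times (96 - 8 + 1) \times (96 - 8 + 1) = 100 \times 89 \times 89 = 792{,}100
$$

So under these values, $f_{convolved}$ holds 792,100 activations for a single 96x96 image.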

In the next section, we further describe how to "pool" these features together to get even better features for classification.