Feature extraction using convolution

Indeed, this intuition leads us to the method of '''feature extraction using convolution''' for large images. The idea is to first learn some features on smaller patches (say 8x8 patches) sampled from the large image, and then to '''convolve''' these features with the larger image to get the feature activations at various points in the image. Convolution corresponds precisely to the intuitive notion of translating the features. To give a concrete example, suppose you have learned 100 features on 8x8 patches sampled from a 96x96 image. To get the convolved features, for every 8x8 region of the 96x96 image, that is, the 8x8 regions starting at <math>(1, 1), (1, 2), \ldots, (89, 89)</math>, you would extract the 8x8 patch and run it through your trained sparse autoencoder to get the feature activations. This would result in a set of 100 convolved features, each of size 89x89. These convolved features can then be '''[[#pooling | pooled]]''' together to produce a smaller set of pooled features, which can then be used for classification.
[[File:Convolution_schematic.gif]]
Formally, given some large <math>r \times c</math> images <math>x_{large}</math>, we first train a sparse autoencoder on small <math>a \times b</math> patches <math>x_{small}</math> sampled from these images, learning <math>k</math> features <math>f = \sigma(W^{(1)}x_{small} + b^{(1)})</math> (where <math>\sigma</math> is the sigmoid function), given by the weights <math>W^{(1)}</math> and biases <math>b^{(1)}</math> from the visible units to the hidden units. For every <math>a \times b</math> patch <math>x_s</math> in the large image, we compute <math>f_s = \sigma(W^{(1)}x_s + b^{(1)})</math>, giving us <math>f_{convolved}</math>, a <math>k \times (r - a + 1) \times (c - b + 1)</math> array of convolved features. These convolved features can then be [[#pooling | pooled]] for classification, as described below.
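To make the convolution step concrete, here is a minimal NumPy sketch under some illustrative assumptions: the image is a single grayscale channel, <math>W^{(1)}</math> is stored as a <math>k \times ab</math> matrix whose rows are flattened patch filters, and any preprocessing (such as whitening) is omitted. The helper name <code>convolve_features</code> is hypothetical, not part of the tutorial.

<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical helper: computes the k feature maps f_convolved, each of
# shape (r - a + 1) x (c - b + 1), for an r x c grayscale image, using
# weights W (shape k x a*b) and biases bias (shape (k,)) taken from a
# sparse autoencoder trained on a x b patches.
def convolve_features(image, W, bias, a, b):
    r, c = image.shape
    k = W.shape[0]
    f_convolved = np.zeros((k, r - a + 1, c - b + 1))
    for i in range(r - a + 1):
        for j in range(c - b + 1):
            patch = image[i:i + a, j:j + b].reshape(-1)  # flatten the a x b patch
            f_convolved[:, i, j] = sigmoid(W @ patch + bias)
    return f_convolved
</pre>

On the example above, <code>convolve_features(image, W, bias, 8, 8)</code> applied to a 96x96 image with <math>k = 100</math> learned features would return an array of shape (100, 89, 89).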
Hence, what we are really interested in is the '''translation-invariant''' feature activation: we want to know whether there is an edge, regardless of whether it is at <math>(1, 1), (3, 3)</math> or <math>(5, 5)</math>, though perhaps if it is at <math>(50, 50)</math> we might want to treat it as a separate edge. This suggests that we should take the maximum (or perhaps mean) activation of the convolved features over a certain small region, making our resultant pooled features less sensitive to small translations.
[[File:Pooling_schematic.gif]]
Formally, after obtaining our convolved features as described earlier, we decide the size of the region, say <math>m \times n</math>, over which to pool our convolved features. We then divide our convolved features into disjoint <math>m \times n</math> regions, and take the maximum (or mean) feature activation over these regions to obtain the pooled convolved features. These pooled features can then be used for classification.
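Continuing the sketch above (same conventions, with the hypothetical helper name <code>pool_features</code>), the pooling step might look as follows; it divides each map into disjoint <math>m \times n</math> regions, silently dropping any remainder at the edges when the map dimensions are not exact multiples of <math>m</math> and <math>n</math>.

<pre>
import numpy as np

# Hypothetical helper: pools each of the k convolved feature maps over
# disjoint m x n regions, using op (np.mean for mean pooling, np.max
# for max pooling). Any partial regions at the edges are ignored.
def pool_features(f_convolved, m, n, op=np.mean):
    k, rows, cols = f_convolved.shape
    pooled = np.zeros((k, rows // m, cols // n))
    for i in range(rows // m):
        for j in range(cols // n):
            region = f_convolved[:, i * m:(i + 1) * m, j * n:(j + 1) * n]
            pooled[:, i, j] = op(region, axis=(1, 2))
    return pooled
</pre>

For instance, <code>pool_features(f_convolved, 10, 10, op=np.max)</code> would max-pool each disjoint 10x10 region of every feature map down to a single value.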
