# Pooling

Revision as of 23:41, 21 May 2011 by Jngiam (previous revision: 23:29, 21 May 2011, also by Jngiam).

After obtaining features using convolution, the next step is to use them for classification. In theory, one could use all the extracted features with a classifier (e.g., softmax regression), but this can be computationally challenging. Consider, for instance, images of size 96x96 pixels and 400 features, each 8x8, convolved over the entire image. Each feature, after (valid) convolution, yields $(96-8+1) \times (96-8+1) = 7921$ values, and since we have 400 features, this results in a vector of $89^2 \times 400 = 3,168,400$ features per example. Learning a classifier with inputs having 3+ million features can be unwieldy and is also prone to over-fitting.

However, thinking about why we decided to obtain convolved features suggests a further step that could improve our feature extraction pipeline. Recall that we decided to obtain convolved features because images have the ''stationarity'' property: features that are useful in one region are likely to be useful in other regions too. Then, to describe a large image, one natural approach is to aggregate statistics of these features at various locations: ''pooling'' over regions of the image.
For example, one could compute the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension than the full set of extracted features, and can also improve results by reducing over-fitting.

The following image shows how pooling is done over 4 non-overlapping regions of the image.

[[File:Pooling_schematic.gif]]

== Pooling for Invariance ==

If one chooses the pooling regions to be contiguous areas in the image and only pools features generated from the same (replicated) hidden units, then these pooling units will be '''translation invariant'''. This means that the same (pooled) feature will be active even when the image undergoes (small) translations. Translation-invariant features are often desirable; in many tasks (e.g., object detection, audio recognition), the label of the example (image) is the same even when the image is translated. For example, if you were to take an MNIST digit and translate it left or right, you would want your classifier to still accurately classify it as the same digit regardless of its final position.
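The effect of pooling on small translations can be illustrated with a toy example. The sketch below is a 1-D simplification (a single row of activations rather than an image), and the helper name `max_pool_1d` and the window width of 4 are assumptions made only for illustration. It shows that the pooled output is unchanged while the activation stays inside the same pooling region, and changes once it moves into a different region.

```python
def max_pool_1d(activations, width):
    """Max over disjoint windows of `width` entries.

    Entries that do not fill a complete window are dropped
    (an assumption of this sketch).
    """
    return [max(activations[i:i + width])
            for i in range(0, len(activations) - width + 1, width)]


# One feature's activations along a row of the image (toy values).
original = [0, 9, 0, 0, 0, 0, 0, 0]  # feature fires at position 1
shifted = [0, 0, 9, 0, 0, 0, 0, 0]   # same feature, moved by one position
far = [0, 0, 0, 0, 0, 9, 0, 0]       # moved into the second pooling region

pooled_original = max_pool_1d(original, 4)  # [9, 0]
pooled_shifted = max_pool_1d(shifted, 4)    # [9, 0], same as the original
pooled_far = max_pool_1d(far, 4)            # [0, 9], a different pooled unit
```

The invariance is therefore only to translations small enough to keep the activation within one pooling region; larger regions tolerate larger shifts at the cost of coarser location information.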
== Notes ==

Formally, after obtaining our convolved features as earlier, we decide the size of the region, say $m \times n$, over which to pool our convolved features. Then, we divide our convolved features into disjoint $m \times n$ regions and take the maximum (or mean) feature activation over these regions to obtain the pooled convolved features. These pooled features can then be used for classification.
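The pooling procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `pool`, the `mode` parameter, and the choice to drop rows and columns that do not fill a complete $m \times n$ region are all assumptions of the sketch.

```python
def pool(feature_map, m, n, mode="mean"):
    """Pool a 2-D feature map over disjoint m x n regions.

    feature_map is a list of equal-length rows. Rows and columns that
    do not fill a complete region are dropped (an assumption here).
    """
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - m + 1, m):
        pooled_row = []
        for c in range(0, cols - n + 1, n):
            # Collect the activations in one disjoint m x n region.
            region = [feature_map[i][j]
                      for i in range(r, r + m)
                      for j in range(c, c + n)]
            if mode == "max":
                pooled_row.append(max(region))
            else:
                pooled_row.append(sum(region) / len(region))
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 convolved feature map for a single feature.
convolved = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]
max_pooled = pool(convolved, 2, 2, mode="max")    # [[4, 1], [2, 8]]
mean_pooled = pool(convolved, 2, 2, mode="mean")  # [[2.5, 0.5], [0.75, 6.5]]
```

Here the 16 convolved activations are reduced to 4 pooled values per feature. In practice the same pooling would be applied separately to the convolved map of each feature before passing the results to the classifier.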