Description
P.A. Viola, and M.J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, 57(2):137-154, 2004.
T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham, “Active Shape Models - their training and application,” Computer Vision and Image Understanding, 61(1):38-59, January 1995.
Both papers are uploaded in the Files section.
Describe either one of the two methods above, intuitively, mathematically, and schematically. You may use figures, diagrams, and examples along with your text. Use formal, technical language, not informal language.
– The text should be 1000 words or more (excluding title, your name, etc.) (-30% if otherwise)
– Do not copy-paste parts of the papers (-20% if otherwise)
– Do not use the illustrations in the papers (-10% if otherwise)
– Improper language (-10%)
– Your submission should be in pdf format (-10% if otherwise)
– Late submissions: see Syllabus.
– You can collaborate but cannot copy each other. (There will be plagiarism detection)
This assignment has three aims: (i) to learn some details about a method that we will see and use later in the semester; (ii) to practice reading and understanding a technical paper, including the intuition, the math, and details beyond the paper itself (that is, the most important references); and (iii) to practice writing in formal, technical language.
International Journal of Computer Vision 57(2), 137–154, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Robust Real-Time Face Detection
PAUL VIOLA
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
[email protected]
MICHAEL J. JONES
Mitsubishi Electric Research Laboratory, 201 Broadway, Cambridge, MA 02139, USA
[email protected]
Received September 10, 2001; Revised July 10, 2003; Accepted July 11, 2003
Abstract. This paper describes a face detection framework that is capable of processing images extremely rapidly
while achieving high detection rates. There are three key contributions. The first is the introduction of a new
image representation called the “Integral Image” which allows the features used by our detector to be computed
very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of
potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like
regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and
Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per
second.
Keywords: face detection, boosting, human sensing
1. Introduction
This paper brings together new algorithms and insights
to construct a framework for robust and extremely rapid
visual detection. Toward this end we have constructed
a frontal face detection system which achieves detection and false positive rates which are equivalent to
the best published results (Sung and Poggio, 1998;
Rowley et al., 1998; Osuna et al., 1997a; Schneiderman
and Kanade, 2000; Roth et al., 2000). This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely
rapidly. Operating on 384 by 288 pixel images, faces
are detected at 15 frames per second on a conventional
700 MHz Intel Pentium III. In other face detection
systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, has been used to achieve high frame rates. Our
system achieves high frame rates working only with
the information present in a single grey scale image.
These alternative sources of information can also be integrated with our system to achieve even higher frame
rates.
There are three main contributions of our face detection framework. We will introduce each of these ideas
briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image
representation called an integral image that allows for
very fast feature evaluation. Motivated in part by the
work of Papageorgiou et al. (1998) our detection system does not work directly with image intensities. Like
these authors we use a set of features which are reminiscent of Haar Basis functions (though we will also
use related filters which are more complex than Haar
filters). In order to compute these features very rapidly
at many scales we introduce the integral image representation for images (the integral image is very similar
to the summed area table used in computer graphics
(Crow, 1984) for texture mapping). The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these
Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a simple
and efficient classifier that is built by selecting a small
number of important features from a huge library of potential features using AdaBoost (Freund and Schapire,
1995). Within any image sub-window the total number of Haar-like features is very large, far larger than
the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small
set of critical features. Motivated by the work of Tieu
and Viola (2000) feature selection is achieved using
the AdaBoost learning algorithm by constraining each
weak classifier to depend on only a single feature. As a
result each stage of the boosting process, which selects
a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning
algorithm and strong bounds on generalization performance (Schapire et al., 1998).
The third major contribution of this paper is a method
for combining successively more complex classifiers
in a cascade structure which dramatically increases the
speed of the detector by focusing attention on promising regions of the image. The notion behind focus
of attention approaches is that it is often possible to
rapidly determine where in an image a face might occur (Tsotsos et al., 1995; Itti et al., 1998; Amit and
Geman, 1999; Fleuret and Geman, 2001). More complex processing is reserved only for these promising
regions. The key measure of such an approach is the
“false negative” rate of the attentional process. It must
be the case that all, or almost all, face instances are
selected by the attentional filter.
We will describe a process for training an extremely
simple and efficient classifier which can be used as a
“supervised” focus of attention operator.1 A face detection attentional operator can be learned which will
filter out over 50% of the image while preserving 99%
of the faces (as evaluated over a large dataset). This
filter is exceedingly efficient; it can be evaluated in 20
simple operations per location/scale (approximately 60
microprocessor instructions).
Those sub-windows which are not rejected by the
initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any
classifier rejects the sub-window, no further processing
is performed. The structure of the cascaded detection
process is essentially that of a degenerate decision tree,
and as such is related to the work of Fleuret and Geman
(2001) and Amit and Geman (1999).
The complete face detection cascade has 38 classifiers, which total over 80,000 operations. Nevertheless
the cascade structure results in extremely rapid average
detection times. On a difficult dataset, containing 507
faces and 75 million sub-windows, faces are detected
using an average of 270 microprocessor instructions
per sub-window. In comparison, this system is about
15 times faster than an implementation of the detection
system constructed by Rowley et al. (1998).2
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. This increase in
speed will enable real-time face detection applications
on systems where they were previously infeasible. In
applications where rapid frame-rates are not necessary,
our system will allow for significant additional postprocessing and analysis. In addition our system can be
implemented on a wide range of small low power devices, including hand-helds and embedded processors.
In our lab we have implemented this face detector on a
low power 200 mips Strong Arm processor which lacks
floating point hardware and have achieved detection at
two frames per second.
1.1. Overview
The remaining sections of the paper will discuss the
implementation of the detector, related theory, and experiments. Section 2 will detail the form of the features
as well as a new scheme for computing them rapidly.
Section 3 will discuss the method in which these features are combined to form a classifier. The machine
learning method used, an application of AdaBoost, also
acts as a feature selection mechanism. While the classifiers that are constructed in this way have good computational and classification performance, they are far too
slow for a real-time classifier. Section 4 will describe a
method for constructing a cascade of classifiers which
together yield an extremely reliable and efficient face
detector. Section 5 will describe a number of experimental results, including a detailed description of our
experimental methodology. Finally Section 6 contains
a discussion of this system and its relationship to related systems.
2. Features
Our face detection procedure classifies images based
on the value of simple features. There are many motivations for using features rather than the pixels directly.
The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn
using a finite quantity of training data. For this system
there is also a second critical motivation for features:
the feature-based system operates much faster than a
pixel-based system.
The simple features used are reminiscent of Haar
basis functions which have been used by Papageorgiou
et al. (1998). More specifically, we use three kinds of
features. The value of a two-rectangle feature is the
difference between the sums of the pixels within two
rectangular regions. The regions have the same size
and shape and are horizontally or vertically adjacent
(see Fig. 1). A three-rectangle feature computes the
sum within two outside rectangles subtracted from the
sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of
rectangles.
Given that the base resolution of the detector is 24 × 24, the exhaustive set of rectangle features is quite large, 160,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.3

Figure 1. Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
2.1. Integral Image
Rectangle features can be computed very rapidly using
an intermediate representation for the image which we
call the integral image.4 The integral image at location
x, y contains the sum of the pixels above and to the left
of x, y, inclusive:
ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′),
where ii(x, y) is the integral image and i(x, y) is the
original image (see Fig. 2). Using the following pair of
recurrences:
s(x, y) = s(x, y − 1) + i(x, y)      (1)
ii(x, y) = ii(x − 1, y) + s(x, y)    (2)
(where s(x, y) is the cumulative row sum, s(x, −1) =
0, and ii(−1, y) = 0) the integral image can be computed in one pass over the original image.
Using the integral image any rectangular sum can be
computed in four array references (see Fig. 3). Clearly
the difference between two rectangular sums can be
computed in eight references. Since the two-rectangle
features defined above involve adjacent rectangular
sums they can be computed in six array references,
eight in the case of the three-rectangle features, and
nine for four-rectangle features.
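The construction above can be sketched in plain Python (the running sum here accumulates along rows, the symmetric counterpart of recurrences (1)–(2), with the same result; out-of-range references count as zero):

```python
def integral_image(img):
    """ii(x, y): sum of img over all pixels above and to the left, inclusive.
    Computed in a single pass using a running sum along each row."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        s = 0                                # running sum along the current row
        for x in range(w):
            s += img[y][x]
            ii[y][x] = (ii[y - 1][x] if y > 0 else 0) + s
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the w x h rectangle with top-left pixel (x, y), using the
    four-corner identity of Fig. 3; out-of-range references are 0."""
    def at(cx, cy):
        return ii[cy][cx] if cx >= 0 and cy >= 0 else 0
    return (at(x + w - 1, y + h - 1) - at(x - 1, y + h - 1)
            - at(x + w - 1, y - 1) + at(x - 1, y - 1))
```

Any rectangle feature then reduces to a handful of `rect_sum` calls, which is where the six-to-nine reference counts above come from.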
One alternative motivation for the integral image comes from the “boxlets” work of Simard et al.
Figure 2. The value of the integral image at point (x, y) is the sum
of all the pixels above and to the left.
(1999). The authors point out that in the case of linear operations (e.g. f · g), any invertible linear operation can be applied to f or g if its inverse is applied to the result. For example in the case of convolution, if the derivative operator is applied both to the image and the kernel the result must then be double integrated:

f ∗ g = ∫∫ (f′ ∗ g′).

The authors go on to show that convolution can be significantly accelerated if the derivatives of f and g are sparse (or can be made so). A similar insight is that an invertible linear operation can be applied to f if its inverse is applied to g:

(f″) ∗ ∫∫ g = f ∗ g.

Viewed in this framework, computation of the rectangle sum can be expressed as a dot product, i · r, where i is the image and r is the box car image (with value 1 within the rectangle of interest and 0 outside). This operation can be rewritten:

i · r = (∫∫ i) · r″.

The integral image is in fact the double integral of the image (first along rows and then along columns). The second derivative of the rectangle (first in row and then in column) yields four delta functions at the corners of the rectangle. Evaluation of the second dot product is accomplished with four array accesses.

Figure 3. The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D. The sum within D can be computed as 4 + 1 − (2 + 3).

2.2. Feature Discussion
Rectangle features are somewhat primitive when
compared with alternatives such as steerable filters
(Freeman and Adelson, 1991; Greenspan et al., 1994).
Steerable filters, and their relatives, are excellent for the
detailed analysis of boundaries, image compression,
and texture analysis. While rectangle features are also
sensitive to the presence of edges, bars, and other simple image structure, they are quite coarse. Unlike steerable filters, the only orientations available are vertical,
horizontal and diagonal. Since orthogonality is not central to this feature set, we choose to generate a very
large and varied set of rectangle features. Typically the
representation is about 400 times overcomplete. This
overcomplete set provides features of arbitrary aspect
ratio and of finely sampled location. Empirically it appears as though the set of rectangle features provide
a rich image representation which supports effective
learning. The extreme computational efficiency of rectangle features provides ample compensation for their
limitations.
In order to appreciate the computational advantage
of the integral image technique, consider a more conventional approach in which a pyramid of images is
computed. Like most face detection systems, our detector scans the input at many scales; starting at the
base scale in which faces are detected at a size of
24 × 24 pixels, a 384 by 288 pixel image is scanned
at 12 scales each a factor of 1.25 larger than the last.
The conventional approach is to compute a pyramid of
12 images, each 1.25 times smaller than the previous
image. A fixed scale detector is then scanned across
each of these images. Computation of the pyramid,
while straightforward, requires significant time. Implemented efficiently on conventional hardware (using bilinear interpolation to scale each level of the pyramid) it
takes around 0.05 seconds to compute a 12 level pyramid
of this size (on an Intel PIII 700 MHz processor).5
In contrast we have defined a meaningful set of rectangle features, which have the property that a single
feature can be evaluated at any scale and location in a
few operations. We will show in Section 4 that effective face detectors can be constructed with as few as two
rectangle features. Given the computational efficiency
of these features, the face detection process can be completed for an entire image at every scale at 15 frames per
second, about the same time required to evaluate the 12
level image pyramid alone. Any procedure which requires a pyramid of this type will necessarily run slower
than our detector.
3. Learning Classification Functions
Given a feature set and a training set of positive and
negative images, any number of machine learning approaches could be used to learn a classification function. Sung and Poggio use a mixture of Gaussian model
(Sung and Poggio, 1998). Rowley et al. (1998) use a
small set of simple image features and a neural network. Osuna et al. (1997b) used a support vector machine. More recently Roth et al. (2000) have proposed
a new and unusual image representation and have used
the Winnow learning procedure.
Recall that there are 160,000 rectangle features associated with each image sub-window, a number far
larger than the number of pixels. Even though each
feature can be computed very efficiently, computing
the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a
very small number of these features can be combined
to form an effective classifier. The main challenge is to
find these features.
In our system a variant of AdaBoost is used both
to select the features and to train the classifier (Freund
and Schapire, 1995). In its original form, the AdaBoost
learning algorithm is used to boost the classification
performance of a simple learning algorithm (e.g., it
might be used to boost the performance of a simple perceptron). It does this by combining a collection of weak
classification functions to form a stronger classifier. In
the language of boosting the simple learning algorithm
is called a weak learner. So, for example the perceptron learning algorithm searches over the set of possible
perceptrons and returns the perceptron with the lowest
classification error. The learner is called weak because
we do not expect even the best classification function to
classify the training data well (i.e. for a given problem
the best perceptron may only classify the training data
correctly 51% of the time). In order for the weak learner
to be boosted, it is called upon to solve a sequence of
learning problems. After the first round of learning, the
examples are re-weighted in order to emphasize those
which were incorrectly classified by the previous weak
classifier. The final strong classifier takes the form of a
perceptron, a weighted combination of weak classifiers
followed by a threshold.6
The formal guarantees provided by the AdaBoost
learning procedure are quite strong. Freund and
Schapire proved that the training error of the strong
classifier approaches zero exponentially in the number
of rounds. More importantly a number of results
were later proved about generalization performance
(Schapire et al., 1997). The key insight is that generalization performance is related to the margin of the
examples, and that AdaBoost achieves large margins
rapidly.
The conventional AdaBoost procedure can be easily interpreted as a greedy feature selection process.
Consider the general problem of boosting, in which a
large set of classification functions are combined using
a weighted majority vote. The challenge is to associate
a large weight with each good classification function
and a smaller weight with poor functions. AdaBoost is
an aggressive mechanism for selecting a small set of
good classification functions which nevertheless have
significant variety. Drawing an analogy between weak
classifiers and features, AdaBoost is an effective procedure for searching out a small number of good “features” which nevertheless have significant variety.
One practical method for completing this analogy is
to restrict the weak learner to the set of classification
functions each of which depend on a single feature.
In support of this goal, the weak learning algorithm is
designed to select the single rectangle feature which
best separates the positive and negative examples (this
is similar to the approach of Tieu and Viola (2000) in
the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold
classification function, such that the minimum number of examples are misclassified. A weak classifier
(h(x, f, p, θ )) thus consists of a feature ( f ), a threshold (θ ) and a polarity ( p) indicating the direction of the
inequality:
h(x, f, p, θ) = 1 if p f(x) < p θ, and 0 otherwise.
Here x is a 24 × 24 pixel sub-window of an image.
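Written as code, the weak classifier is a one-line decision stump (here f is any callable that maps a sub-window to a scalar feature value; the example feature at the end is purely illustrative):

```python
def weak_classifier(x, f, p, theta):
    """h(x, f, p, theta): 1 if p * f(x) < p * theta, else 0.
    The polarity p (+1 or -1) selects the direction of the inequality."""
    return 1 if p * f(x) < p * theta else 0

# e.g. a stump that fires when the (toy) feature value, here a pixel sum,
# falls below an assumed threshold of 10:
h = lambda window: weak_classifier(window, f=sum, p=1, theta=10)
```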
In practice no single feature can perform the classification task with low error. Features which are selected
early in the process yield error rates between 0.1 and
0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and
0.5. Table 1 shows the learning algorithm.
The weak classifiers that we use (thresholded single
features) can be viewed as single node decision trees.
Table 1. The boosting algorithm for learning a query online. T hypotheses are constructed each using a single feature. The final hypothesis is a weighted linear combination of the T hypotheses where the weights are inversely proportional to the training errors.

• Given example images (x1, y1), . . . , (xn, yn) where yi = 0, 1 for negative and positive examples respectively.
• Initialize weights w1,i = 1/(2m) for yi = 0 and w1,i = 1/(2l) for yi = 1, where m and l are the number of negatives and positives respectively.
• For t = 1, . . . , T:
  1. Normalize the weights: wt,i ← wt,i / Σ_{j=1}^{n} wt,j.
  2. Select the best weak classifier with respect to the weighted error:
     εt = min_{f, p, θ} Σ_i wi | h(xi, f, p, θ) − yi |.
     See Section 3.1 for a discussion of an efficient implementation.
  3. Define ht(x) = h(x, ft, pt, θt) where ft, pt, and θt are the minimizers of εt.
  4. Update the weights: wt+1,i = wt,i βt^(1−ei), where ei = 0 if example xi is classified correctly, ei = 1 otherwise, and βt = εt / (1 − εt).
• The final strong classifier is:
  C(x) = 1 if Σ_{t=1}^{T} αt ht(x) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise, where αt = log(1/βt).
Such structures have been called decision stumps in
the machine learning literature. The original work of
Freund and Schapire (1995) also experimented with
boosting decision stumps.
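As an illustration, the boosting loop of Table 1 can be sketched over a small, precomputed feature matrix. This is a naive version: the threshold search simply scans observed feature values, and the small guard against a zero-error round is an assumption of this sketch; in practice the sorted single-pass search of Section 3.1 replaces the inner loops.

```python
import math

def adaboost(feature_vals, labels, T):
    """Select T decision stumps per Table 1.
    feature_vals[k][i] = value of feature k on example i; labels[i] in {0, 1}."""
    m = labels.count(0)                           # number of negatives
    l = labels.count(1)                           # number of positives
    w = [1 / (2 * m) if y == 0 else 1 / (2 * l) for y in labels]
    strong = []                                   # (feature, polarity, theta, alpha)
    for _ in range(T):
        total = sum(w)
        w = [wi / total for wi in w]              # 1. normalize the weights
        best = None                               # 2. best weak classifier
        for k, vals in enumerate(feature_vals):
            for theta in set(vals):
                for p in (1, -1):
                    err = sum(wi for wi, v, y in zip(w, vals, labels)
                              if (1 if p * v < p * theta else 0) != y)
                    if best is None or err < best[0]:
                        best = (err, k, p, theta)
        err, k, p, theta = best                   # 3. define h_t
        beta = max(err, 1e-10) / (1 - err)        # guard: avoid log(1/0) at err = 0
        preds = [1 if p * v < p * theta else 0 for v in feature_vals[k]]
        w = [wi * (beta if pred == y else 1.0)    # 4. update: w *= beta^(1 - e_i)
             for wi, pred, y in zip(w, preds, labels)]
        strong.append((k, p, theta, math.log(1 / beta)))
    return strong

def classify(strong, example_features):
    """C(x) = 1 iff the alpha-weighted vote reaches half the total alpha."""
    vote = sum(a for k, p, th, a in strong
               if p * example_features[k] < p * th)
    return 1 if vote >= sum(a for _, _, _, a in strong) / 2 else 0
```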
3.1. Learning Discussion
The algorithm described in Table 1 is used to select
key weak classifiers from the set of possible weak
classifiers. While the AdaBoost process is quite efficient, the set of weak classifiers is extraordinarily large.
Since there is one weak classifier for each distinct feature/threshold combination, there are effectively KN
weak classifiers, where K is the number of features
and N is the number of examples. In order to appreciate the dependency on N , suppose that the examples
are sorted by a given feature value. With respect to the
training process any two thresholds that lie between the
same pair of sorted examples are equivalent. Therefore
the total number of distinct thresholds is N . Given a
task with N = 20000 and K = 160000 there are 3.2
billion distinct binary weak classifiers.
The wrapper method can also be used to learn a perceptron which utilizes M weak classifiers (John et al.,
1994). The wrapper method also proceeds incrementally by adding one weak classifier to the perceptron in
each round. The weak classifier added is the one which
when added to the current set yields a perceptron with
lowest error. Each round takes at least O(NKN) (or 60 trillion operations); the time to enumerate all binary
features and evaluate each example using that feature.
This neglects the time to learn the perceptron weights.
Even so, the final work to learn a 200 feature classifier would be something like O(MNKN), which is about 10^16 operations.
The key advantage of AdaBoost as a feature selection mechanism, over competitors such as the wrapper
method, is the speed of learning. Using AdaBoost a
200 feature classifier can be learned in O(MNK) or about 10^11 operations. One key advantage is that in
each round the entire dependence on previously selected features is efficiently and compactly encoded
using the example weights. These weights can then be
used to evaluate a given weak classifier in constant time.
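These operation counts follow directly from the quantities given above (N = 20,000 examples, K = 160,000 features, M = 200 rounds), and can be checked by back-of-envelope arithmetic:

```python
N, K, M = 20_000, 160_000, 200

print(f"distinct weak classifiers   K*N     = {K * N:.1e}")      # 3.2e+09
print(f"one wrapper round           N*K*N   = {N * K * N:.1e}")  # 6.4e+13 (~60 trillion)
print(f"wrapper, M features         M*N*K*N = {M * N * K * N:.1e}")  # 1.3e+16
print(f"AdaBoost, M features        M*N*K   = {M * N * K:.1e}")  # 6.4e+11
```

The AdaBoost figure of 6.4 × 10^11 matches the "about 10^11" order of magnitude quoted in the text.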
The weak classifier selection algorithm proceeds as
follows. For each feature, the examples are sorted based
on feature value. The AdaBoost optimal threshold for
that feature can then be computed in a single pass over
this sorted list. For each element in the sorted list, four
sums are maintained and evaluated: the total sum of
positive example weights T + , the total sum of negative
example weights T − , the sum of positive weights below
the current example S + and the sum of negative weights
below the current example S − . The error for a threshold
which splits the range between the current and previous
example in the sorted list is:
e = min( S+ + (T− − S−), S− + (T+ − S+) ),
or the minimum of the error of labeling all examples
below the current example negative and labeling the examples above positive versus the error of the converse.
These sums are easily updated as the search proceeds.
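A sketch of this single-pass search follows, under the convention that polarity p = 1 labels the examples below the threshold positive; the exact placement of candidate thresholds between sorted examples is a detail this sketch simplifies.

```python
def best_threshold(values, labels, weights):
    """One pass over the examples sorted by feature value, maintaining the
    running sums T+, T-, S+, S- described above.
    Returns (weighted error, threshold, polarity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)  # T+
    t_neg = sum(w for w, y in zip(weights, labels) if y == 0)  # T-
    s_pos = s_neg = 0.0                                        # S+, S-
    best = (float("inf"), None, None)
    for i in order:
        # threshold at values[i]: examples strictly below fall on one side
        e_neg_below = s_pos + (t_neg - s_neg)   # label everything below negative
        e_pos_below = s_neg + (t_pos - s_pos)   # label everything below positive
        if e_neg_below < best[0]:
            best = (e_neg_below, values[i], -1)
        if e_pos_below < best[0]:
            best = (e_pos_below, values[i], 1)
        if labels[i] == 1:
            s_pos += weights[i]
        else:
            s_neg += weights[i]
    return best
```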
Many general feature selection procedures have been
proposed (see chapter 8 of Webb (1999) for a review).
Our final application demanded a very aggressive process which would discard the vast majority of features.
For a similar recognition problem Papageorgiou et al.
(1998) proposed a scheme for feature selection based
Robust Real-Time Face Detection
on feature variance. They demonstrated good results selecting 37 features out of a total of 1734 features. While
this is a significant reduction, the number of features
evaluated for every image sub-window is still reasonably large.
Roth et al. (2000) propose a feature selection process
based on the Winnow exponential perceptron learning
rule. These authors use a very large and unusual feature
set, where each pixel is mapped into a binary vector of d
dimensions (when a particular pixel takes on the value
x, in the range [0, d − 1], the x-th dimension is set to
1 and the other dimensions to 0). The binary vectors
for each pixel are concatenated to form a single binary
vector with nd dimensions (n is the number of pixels).
The classification rule is a perceptron, which assigns
one weight to each dimension of the input vector. The
Winnow learning process converges to a solution where
many of these weights are zero. Nevertheless a very
large number of features are retained (perhaps a few
hundred or thousand).
3.2. Learning Results
While details on the training and performance of the
final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a classifier constructed from 200 features
would yield reasonable results (see Fig. 4). Given a
detection rate of 95% the classifier yielded a false positive rate of 1 in 14084 on a testing dataset. This is
promising, but for a face detector to be practical for
real applications, the false positive rate must be closer
to 1 in 1,000,000.
For the task of face detection, the initial rectangle
features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus
on the property that the region of the eyes is often darker
than the region of the nose and cheeks (see Fig. 5). This
feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive
to size and location of the face. The second feature selected relies on the property that the eyes are darker
than the bridge of the nose.
In summary the 200-feature classifier provides initial evidence that a boosted classifier constructed from
rectangle features is an effective technique for face detection. In terms of detection, these results are compelling but not sufficient for many real-world tasks. In
terms of computation, this classifier is very fast, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.

Figure 4. Receiver operating characteristic (ROC) curve for the 200 feature classifier.

Figure 5. The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.

4. The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

Stages in the cascade are constructed by training classifiers using AdaBoost. Starting with a two-feature strong classifier, an effective face filter can be obtained by adjusting the strong classifier threshold to minimize false negatives. The initial AdaBoost threshold, (1/2) Σ_{t=1}^{T} αt, is designed to yield a low error rate on the training data. A lower threshold yields higher detection rates and higher false positive rates. Based on performance measured using a validation training set, the two-feature classifier can be adjusted to detect 100% of the faces with a false positive rate of 50%. See Fig. 5 for a description of the two features used in this classifier.

The detection performance of the two-feature classifier is far from acceptable as a face detection system. Nevertheless the classifier can significantly reduce the number of sub-windows that need further processing with very few operations:

1. Evaluate the rectangle features (requires between 6 and 9 array references per feature).
2. Compute the weak classifier for each feature (requires one threshold operation per feature).
3. Combine the weak classifiers (requires one multiply per feature, an addition, and finally a threshold).

A two-feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template would require at least 20 times as many operations per sub-window.
The overall form of the detection process is that of
a degenerate decision tree, what we call a “cascade”
(Quinlan, 1986) (see Fig. 6). A positive result from
the first classifier triggers the evaluation of a second
classifier which has also been adjusted to achieve very
high detection rates. A positive result from the second
classifier triggers a third classifier, and so on. A negative
outcome at any point leads to the immediate rejection
of the sub-window.
The structure of the cascade reflects the fact that
within any single image an overwhelming majority of
sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the
earliest stage possible. While a positive instance will
Figure 6. Schematic depiction of the detection cascade. A series
of classifiers are applied to every sub-window. The initial classifier
eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require
additional computation. After several stages of processing the number of sub-windows have been reduced radically. Further processing
can take any form such as additional stages of the cascade (as in our
detection system) or an alternative detection system.
trigger the evaluation of every classifier in the cascade,
this is an exceedingly rare event.
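This degenerate decision tree can be sketched as a simple rejection chain, where each stage stands in for a thresholded boosted classifier returning 1 to pass the sub-window on and 0 to reject it (the three stages shown are hypothetical stand-ins):

```python
def cascade(stages, window):
    """Apply classifiers in order; a negative outcome at any stage
    immediately rejects the sub-window."""
    for stage in stages:
        if stage(window) == 0:
            return 0          # rejected: no further processing
    return 1                  # survived every stage: labelled positive

# e.g. three increasingly strict toy stages on a scalar "window" score:
stages = [lambda w: int(w > 1), lambda w: int(w > 2), lambda w: int(w > 3)]
```

Because most sub-windows fail an early stage, the average cost per window is dominated by the first, cheapest classifiers.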
Much like a decision tree, subsequent classifiers are
trained using those examples which pass through all
the previous stages. As a result, the second classifier
faces a more difficult task than the first. The examples
which make it through the first stage are “harder” than
typical examples. The more difficult examples faced
by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given
detection rate, deeper classifiers have correspondingly
higher false positive rates.
4.1. Training a Cascade of Classifiers
The number of features evaluated when scanning
real images is necessarily a probabilistic process. Any
given sub-window will progress down through the cascade, one classifier at a time, until it is decided that
the window is negative or, in rare circumstances, the
window succeeds in each test and is labelled positive.
The expected behavior of this process is determined
by the distribution of image windows in a typical test
set. The key measure of each classifier is its “positive
rate”, the proportion of windows which are labelled as
potentially containing a face. The expected number of
features which are evaluated is:
N = n0 + Σ_{i=1}^{K} ( ni Π_{j<i} pj ),

where K is the number of classifier stages, ni is the number of features in the i-th stage, and pj is the positive rate of the j-th stage.
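Under the assumption that stage i evaluates ni features and passes windows on at positive rate pi, this expectation can be computed stage by stage (variable names here are illustrative):

```python
def expected_features(n, p):
    """Expected number of features evaluated per sub-window.
    n[0] plays the role of n0 (always evaluated); n[i] for i >= 1 is the
    feature count of stage i, reached only if all earlier stages passed,
    so its cost is discounted by the product of the earlier positive rates p."""
    total = n[0]
    reach = 1.0
    for n_i, p_prev in zip(n[1:], p):
        reach *= p_prev          # fraction of windows reaching this stage
        total += reach * n_i
    return total

# e.g. 2 features always, then 10 features seen by 50% of windows,
# then 25 features seen by 50% * 40% of windows:
print(expected_features([2, 10, 25], [0.5, 0.4]))  # 2 + 5 + 5 = 12.0
```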
The cascade design process is driven from a set of detec