Robust Object Tracking with
Online Multiple Instance Learning
Boris Babenko, Student Member, IEEE, Ming-Hsuan Yang, Senior Member, IEEE, and
Serge Belongie, Member, IEEE
Abstract—In this paper, we address the problem of tracking an object in a video given its location in the first frame and no other
information. Recently, a class of tracking techniques called “tracking by detection” has been shown to give promising results at
real-time speeds. These methods train a discriminative classifier in an online manner to separate the object from the background. This
classifier bootstraps itself by using the current tracker state to extract positive and negative examples from the current frame. Slight
inaccuracies in the tracker can therefore lead to incorrectly labeled training examples, which degrade the classifier and can cause drift.
In this paper, we show that using Multiple Instance Learning (MIL) instead of traditional supervised learning avoids these problems and
can therefore lead to a more robust tracker with fewer parameter tweaks. We propose a novel online MIL algorithm for object tracking
that achieves superior results with real-time performance. We present thorough experimental results (both qualitative and quantitative)
on a number of challenging video clips.
Index Terms—Visual tracking, multiple instance learning, online boosting.

1 INTRODUCTION

Object tracking is a well-studied problem in computer
vision and has many practical applications. The
problem and its difficulty depend on several factors, such
as the amount of prior knowledge about the target object and
the number and type of parameters being tracked (e.g.,
location, scale, detailed contour). Although there has been
some success with building trackers for specific object classes
(e.g., faces [1], humans [2], mice [3], rigid objects [4]), tracking
generic objects has remained challenging because an object
can drastically change appearance when deforming, rotating
out of plane, or when the illumination of the scene changes.
A typical tracking system consists of three components:
1) an appearance model, which can evaluate the likelihood
that the object of interest is at some particular location, 2) a
motion model, which relates the locations of the object over
time, and 3) a search strategy for finding the most likely
location in the current frame. The contributions of this paper
deal with the first of these three components; we refer the
reader to [5] for a thorough review of the other components.
Although many tracking methods employ static appearance
models that are either defined manually or trained using
only the first frame [2], [4], [6], [7], [8], [9], these methods are
often unable to cope with significant appearance changes.
Such challenges are particularly difficult when there is
limited a priori knowledge about the object of interest. In
this scenario, it has been shown that an adaptive appearance
model, which evolves during the tracking process as the
appearance of the object changes, is the key to good
performance [10], [11], [12]. Training adaptive appearance
models, however, is itself a difficult task, with many
questions yet to be answered. Such models often involve
many parameters that must be tuned to get good perfor-
mance (e.g., “forgetting factors” that control how fast the
appearance model can change), and can suffer from drift
problems when an object undergoes partial occlusion.
In this paper, we focus on the problem of tracking an
arbitrary object with no prior knowledge other than its
location in the first video frame (sometimes referred to as
“model-free” tracking). Our goal is to develop a more
robust way of updating an adaptive appearance model; we
would like our system to be able to handle partial
occlusions without significant drift and for it to work well
with minimal parameter tuning. To do this, we turn to a
discriminative learning paradigm called Multiple Instance
Learning (MIL) [13] that can handle ambiguities in the
training data. This technique has found recent success in
other computer vision areas, such as object detection [14],
[15] and object recognition [16], [17], [18].
We will focus on the problem of tracking the location and
scale of a single object, using a rectangular bounding box to
approximate these parameters. It is plausible that the ideas
presented here can be applied to other types of tracking
problems like tracking multiple objects (e.g., [19]), tracking
contours (e.g., [20], [21]), or tracking deformable objects
(e.g., [22]), but this is outside the scope of our work.
The remainder of this paper is organized as follows: In
Section 2, we review the current state of the art in adaptive
appearance models; in Section 3, we introduce our tracking
algorithm; in Section 4, we present qualitative and
quantitative results of our tracker on a number of challen-
ging video clips. We conclude in Section 5.
. B. Babenko and S. Belongie are with the Department of Computer Science
and Engineering, University of California, San Diego, 9500 Gilman Dr.,
La Jolla, CA 92093-0404. E-mail: {bbabenko, sjb}@cs.ucsd.edu.
. M.-H. Yang is with the Department of Computer Science, University of
California, Merced, CA 95344. E-mail: mhyang@ucmerced.edu.
Manuscript received 16 Feb. 2010; revised 1 July 2010; accepted 5 Aug. 2010; published online 13 Dec. 2010.
Recommended for acceptance by H. Bischof.
Digital Object Identifier no. 10.1109/TPAMI.2010.226.

2 ADAPTIVE APPEARANCE MODELS
An important choice in the design of appearance models is
whether to model only the object [12], [23] or both the object
and the background [24], [25], [26], [27], [28], [29], [30]. Many
of the latter approaches have shown that training a model to
separate the object from the background via a discriminative
classifier can often achieve superior results. These methods
are closely related to object detection—an area that has seen
great progress in the last decade—and are referred to as
“tracking-by-detection” or “tracking by repeated recogni-
tion” [31]. In particular, the recent advances in face detection
[32] have inspired some successful real-time tracking
algorithms [25], [26].
A major challenge that is often not discussed in the
literature is how to choose positive and negative examples
when updating the adaptive appearance model. Most
commonly this is done by taking the current tracker
location as one positive example and sampling the
neighborhood around the tracker location for negatives. If
the tracker location is not precise, however, the appearance
model ends up getting updated with a suboptimal positive
example. Over time this can degrade the model and can
cause drift. On the other hand, if multiple positive examples
are used (taken from a small neighborhood around the
current tracker location), the model can become confused
and its discriminative power can suffer (cf. Figs. 1a, 1b).
Alternatively, Grabner et al. [33] recently proposed a semi-
supervised approach where labeled examples come from
the first frame only and subsequent training examples are
left unlabeled. This method is particularly well suited for
scenarios where the object leaves the field of view
completely, but it throws away a lot of useful information
by not taking advantage of the problem domain (e.g., it is
safe to assume small interframe motion).
Object detection faces issues similar to those described
above in that it is difficult for a human labeler to be
consistent with respect to how the positive examples are
cropped. In fact, Viola et al. [14] argue that object detection
has inherent ambiguities that cause difficulty for traditional
supervised learning methods. For this reason they suggest
the use of a MIL [13] approach for object detection. We give
a more formal definition of MIL in Section 3.2, but the basic
idea of this learning paradigm is that during training,
examples are presented in sets (often called “bags”) and
labels are provided for the bags rather than individual
instances. If a bag is labeled positive, it is assumed to
contain at least one positive instance; otherwise the bag is
negative. For example, in the context of object detection, a
positive bag could contain a few possible bounding boxes
around each labeled object (e.g., a human labeler clicks on
the center of the object and the algorithm crops several
rectangles around that point). Therefore, the ambiguity is
passed on to the learning algorithm, which now has to
figure out which instance in each positive bag is the most
“correct.” Although one could argue that this learning
problem is more difficult in the sense that less information
is provided to the learner, in some ways it is actually easier
because the learner is allowed some flexibility in finding a
decision boundary. Viola et al. present convincing results
showing that a face detector trained with weaker labeling
(just the center of the face) and a MIL algorithm outper-
forms a state-of-the-art supervised algorithm trained with
explicit bounding boxes.
In this paper, we make an analogous argument to that of
Viola et al. [14] and propose to use a MIL-based appearance
model for object tracking (cf. Fig. 1c). In fact, in the object
tracking domain there is even more ambiguity than in object
detection because the tracker has no human input and has to
bootstrap itself. Therefore, we expect the benefits of a MIL
approach to be even more significant than in the object
detection problem. In order to incorporate MIL into a
tracker, an online MIL algorithm is required. The algorithm
we propose is based on boosting and is related to the
MILBoost algorithm [14] as well as the Online AdaBoost
algorithm [34] (to our knowledge this is the first online MIL
algorithm in the literature). We present empirical results on
challenging video sequences which show that using an
online MIL-based appearance model can lead to more robust
and stable tracking than existing methods in the literature.
3 TRACKING WITH ONLINE MIL
In this section, we introduce our tracking algorithm, MIL-
Track, which uses a MIL-based appearance model. We begin
with an overview of our tracking system which includes a
description of the motion model we use. Next, we review the
MIL problem and briefly describe the MILBoost algorithm
[14]. We then review online boosting [25], [34] and present a
novel boosting-based algorithm for online MIL. Finally, we
review various implementation details.
3.1 System Overview and Motion Model
The basic flow of the tracking system we implemented in
this work is illustrated in Fig. 2 and summarized in
Algorithm 1. Our image representation consists of a set of
Haar-like features that are computed for each image patch
[32], [35]; this is discussed in more detail in Section 3.6. The
appearance model is composed of a discriminative classifier

which is able to return p(y = 1|x) (we will use p(y|x) as shorthand), where x is an image patch (or the representation of an image patch in feature space) and y is a binary variable indicating the presence of the object of interest in that image patch. At every time step t, our tracker maintains the object location \ell^*_t. Let \ell(x) denote the location of image patch x (for now, let's assume this consists of only the (x, y) coordinates of the patch center and that scale is fixed; below, we consider tracking scale as well). For each new frame, we crop out a set of image patches X^s = \{x : \|\ell(x) - \ell^*_{t-1}\| < s\} that are within some search radius s of the current tracker location, and compute p(y|x) for all x \in X^s. We then use a greedy strategy to update the tracker location:

\ell^*_t = \ell\Big( \arg\max_{x \in X^s} p(y|x) \Big). \qquad (1)
In other words, we do not maintain a distribution of the target's location at every frame, and our motion model is such that the location of the tracker at time t is equally likely to appear within a radius s of the tracker location at time (t - 1):

p(\ell^*_t \mid \ell^*_{t-1}) \propto \begin{cases} 1 & \text{if } \|\ell^*_t - \ell^*_{t-1}\| < s \\ 0 & \text{otherwise} \end{cases} \qquad (2)

This could be extended with something more sophisticated, such as a particle filter, as is done in [12], [29], [36]; however, we again emphasize that our focus is on the appearance model.

Fig. 1. Updating a discriminative appearance model: (a) Using a single positive image patch to update a traditional discriminative classifier. The positive image patch chosen does not capture the object perfectly. (b) Using several positive image patches to update a traditional discriminative classifier. This can make it difficult for the classifier to learn a tight decision boundary. (c) Using one positive bag consisting of several image patches to update a MIL classifier. See Section 4 for empirical results of these three strategies.
Algorithm 1. MILTrack
Input: Video frame number k
1: Crop out a set of image patches, X^s = \{x : \|\ell(x) - \ell^*_{t-1}\| < s\}, and compute feature vectors.
2: Use MIL classifier to estimate p(y = 1|x) for x \in X^s.
3: Update tracker location: \ell^*_t = \ell(\arg\max_{x \in X^s} p(y|x)).
4: Crop out two sets of image patches: X^r = \{x : \|\ell(x) - \ell^*_t\| < r\} and X^{r,\beta} = \{x : r < \|\ell(x) - \ell^*_t\| < \beta\}.
5: Update MIL appearance model with one positive bag X^r and |X^{r,\beta}| negative bags, each containing a single image patch from the set X^{r,\beta}.
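To make the flow of Algorithm 1 concrete, the following Python sketch implements one tracking step. It is an illustration rather than the authors' implementation: the classifier interface (predict(patch) returning p(y = 1|x) and update(bags, labels)), the extract_patch helper, and the numeric defaults are hypothetical stand-ins; the actual parameter settings are described in Section 4.

import numpy as np

def extract_patch(frame, center, size=32):
    # Crop a fixed-size square patch around `center` (clipped at borders).
    cy, cx = center
    half = size // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    return frame[y0:y0 + size, x0:x0 + size]

def track_step(frame, loc, clf, s=35, r=4, beta=50, n_neg=65):
    # One step of a MILTrack-style loop (Algorithm 1). `clf` is an
    # assumed online MIL classifier; parameter defaults are placeholders.
    h, w = frame.shape

    def centers_in(center, lo, hi):
        # All pixel locations whose distance from `center` is in [lo, hi).
        ys, xs = np.mgrid[0:h, 0:w]
        d = np.hypot(ys - center[0], xs - center[1])
        return list(zip(*np.where((d >= lo) & (d < hi))))

    # Steps 1-3: score all patches within radius s, move greedily (eq. (1)).
    candidates = centers_in(loc, 0, s)
    scores = [clf.predict(extract_patch(frame, c)) for c in candidates]
    new_loc = candidates[int(np.argmax(scores))]

    # Step 4: positive bag from radius r; negatives from the (r, beta)
    # annulus, randomly subsampled to keep the set small.
    pos_bag = [extract_patch(frame, c) for c in centers_in(new_loc, 0, r)]
    annulus = centers_in(new_loc, r, beta)
    pick = np.random.choice(len(annulus), min(n_neg, len(annulus)), replace=False)
    neg_bags = [[extract_patch(frame, annulus[i])] for i in pick]

    # Step 5: one positive bag, each negative in its own singleton bag.
    clf.update([pos_bag] + neg_bags, [1] + [0] * len(neg_bags))
    return new_loc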
Once the tracker location is updated, we proceed to update the appearance model. We crop out a set of patches X^r = \{x : \|\ell(x) - \ell^*_t\| < r\}, where r < s is a scalar radius (measured in pixels), and label this bag positive (recall that in MIL we train the algorithm with labeled bags). In contrast, if a standard learning algorithm were used, there would be two options: Set r = 1 and use this as a single positive instance, or set r > 1 and label all of these instances positive. For negatives, we crop out patches from an annular region X^{r,\beta} = \{x : r < \|\ell(x) - \ell^*_t\| < \beta\}, where r is the same as before and \beta is another scalar. Since this generates a potentially large set, we then take a random subset of these image patches and label them negative. We place each negative example into its own negative bag, though placing them all into one negative bag yields the same result.
Incorporating scale tracking into this system is straightforward. First, we define an extra parameter to be the scale-space step size. When searching for the location of the object in a new frame, we crop out image patches from the image at the current scale s_t, as well as one scale step larger and smaller; once we find the location with the maximum response, we update the current state (both position and scale) accordingly. When updating the appearance model, we have the option of cropping training image patches only from the current scale or from the neighboring scales as well; in our current implementation we do the former.
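As a minimal sketch of this three-scale search: the callable interface and the step size of 1.2 below are assumptions of this sketch, not values fixed by the paper.

def search_over_scales(search_at_scale, loc, scale, step=1.2):
    # Evaluate the in-plane search at the current scale and one step
    # larger and smaller; keep the (location, scale) pair with the
    # highest classifier response. `search_at_scale(loc, sc)` is an
    # assumed callable that runs the radius-s search of Algorithm 1
    # with patches extracted at scale sc and returns (best_loc, score).
    best_loc, best_scale, best_score = loc, scale, float("-inf")
    for sc in (scale / step, scale, scale * step):
        cand_loc, score = search_at_scale(loc, sc)
        if score > best_score:
            best_loc, best_scale, best_score = cand_loc, sc, score
    return best_loc, best_scale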
It is important to note that tracking in scale-space is a
double-edged sword. In some ways the problem becomes
more difficult because the parameter space becomes larger,
and consequently there is more room for error. However,
tracking this additional parameter may mean that the image
patches we crop out are better aligned, making it easier for our
classifier to learn the correct appearance. In our experiments,
we have noticed both behaviors—sometimes adding scale
tracking helps and other times it hurts performance.
Details on how all of the above parameters were set are
in Section 4, although we use the same parameters
throughout all of the experiments. We continue with a
more detailed review of MIL.
3.2 Multiple Instance Learning
Traditional discriminative learning algorithms for training a binary classifier that estimates p(y|x) require a training data set of the form \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i is an instance (in our case a feature vector computed for an image patch) and y_i \in \{0, 1\} is a binary label. In Multiple Instance Learning, training data has the form \{(X_1, y_1), \ldots, (X_n, y_n)\}, where a bag X_i = \{x_{i1}, \ldots, x_{im}\} and y_i is a bag label. The bag labels are defined as:

y_i = \max_j (y_{ij}), \qquad (3)

where y_{ij} are the instance labels, which are not known
during training. In other words, a bag is considered positive
if it contains at least one positive instance. Numerous algorithms have been proposed for solving the MIL problem [13], [14], [16]. The algorithm that is most closely related to our work is the MILBoost algorithm proposed by Viola et al. in [14]. MILBoost uses the gradient boosting framework [37] to train a boosting classifier that maximizes the log likelihood of bags:

\log \mathcal{L} = \sum_i \log p(y_i \mid X_i). \qquad (4)

Fig. 2. Tracking by detection with a greedy motion model: An illustration of how most tracking by detection systems work.
Notice that the likelihood is defined over bags and not
instances because instance labels are unknown during
training, and yet the goal is to train an instance classifier that estimates p(y|x). We therefore need to express p(y_i \mid X_i), the probability of a bag being positive, in terms of its instances. In [14], the Noisy-OR (NOR) model is adopted for doing this:

p(y_i \mid X_i) = 1 - \prod_j \big( 1 - p(y_i \mid x_{ij}) \big), \qquad (5)
although other models could be swapped in (e.g., [38]). The
equation above has the desired property that if one of the
instances in a bag has a high probability, the bag probability
will be high as well. As mentioned in [14], with this
formulation, the likelihood is the same whether we put all of
the negative instances in one bag or if we put each in its own
bag. Intuitively this makes sense because no matter how we
arrange things, we know that every instance in a negative
bag is negative. We refer the reader to [14] for further details
on MILBoost. Finally, we note that MILBoost is a batch
algorithm (meaning it needs the entire training data set at
once) and cannot be trained in an online manner as we need
in our tracking application. Nevertheless, we adopt the loss
function in (4) and the bag probability model in (5) when we
develop our online MIL algorithm in Section 3.4.
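Both quantities are straightforward to compute. The following NumPy sketch evaluates the Noisy-OR bag probability of (5) and the bag log likelihood of (4) from per-instance probabilities; the epsilon clamp is an added numerical safeguard, not part of the paper's formulation.

import numpy as np

def bag_probability(inst_probs):
    # Noisy-OR model, eq. (5): the bag fires unless every instance
    # fails to fire.
    return 1.0 - np.prod(1.0 - np.asarray(inst_probs, dtype=float))

def bag_log_likelihood(inst_probs_per_bag, bag_labels, eps=1e-12):
    # Log likelihood of bags, eq. (4); for a negative bag the relevant
    # probability is 1 - p(y = 1 | X_i).
    ll = 0.0
    for probs, y in zip(inst_probs_per_bag, bag_labels):
        p = float(np.clip(bag_probability(probs), eps, 1.0 - eps))
        ll += np.log(p) if y == 1 else np.log(1.0 - p)
    return ll

# Example: one positive bag with a confident instance, one singleton
# negative bag.
print(bag_log_likelihood([[0.9, 0.1, 0.2], [0.15]], [1, 0]))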
3.3 Online Boosting
Our algorithm for online MIL is based on the boosting
framework [39] and is related to the work on Online
AdaBoost [34] and its adaptation in [25]. The goal of
boosting is to combine many weak classifiers h(x) (usually decision stumps) into an additive strong classifier:

H(x) = \sum_{k=1}^{K} \alpha_k h_k(x), \qquad (6)

where \alpha_k are scalar weights. There have been many boosting
algorithms proposed to learn this model in batch mode [39],
[40]; typically this is done in a greedy manner where the
weak classifiers are trained sequentially. After each weak
classifier is trained, the training examples are reweighted
such that examples that were previously misclassified
receive more weight. If each weak classifier is a decision
stump, then it chooses one feature that has the most
discriminative power for the entire weighted training set.
In this case, boosting can be viewed as performing feature
selection, choosing a total of K features which is generally
much smaller than the size of the entire feature pool. This
has proven particularly useful in computer vision because it
creates classifiers that are efficient at run time [32].
In [34], Oza develops an online variant of the popular
AdaBoost algorithm [39], which minimizes the exponential
loss function. This variant requires that all h can be trained
in an online manner. The basic flow of Oza’s algorithm is as
follows: For an incoming example x, each h_k is updated sequentially and the weight of example x is adjusted after each update. Since the formulas for the example weights and classifier weights in AdaBoost depend only on the error of the weak classifiers, Oza proposes keeping a running average of the error of each h_k, which allows the algorithm
to estimate both the example weight and the classifier
weights in an online manner.
In Oza’s framework, if every h is restricted to be a
decision stump, the algorithm has no way of choosing the
most discriminative feature because the entire training set is
never available at one time. Therefore, the features for each
h_k must be picked a priori. This is a potential problem for
computer vision applications since they often rely on the
feature selection property of boosting. Grabner et al. [25]
proposed an extension of Oza’s algorithm which performs
feature selection by maintaining a pool of M > K candidate
weak stump classifiers h. When a new example is passed in,
all of the candidate weak classifiers are updated in parallel.
Then, the algorithm sequentially chooses K weak classifiers
from this pool by keeping running averages of errors for
each, as in [34], and updates the weights of h accordingly.
We employ a similar feature selection technique in our
Online MIL algorithm, although the criterion for choosing
weak classifiers is different.
3.4 Online Multiple Instance Boosting
The algorithms in [34] and [25] rely on the special properties
of the exponential loss function of AdaBoost, and therefore
cannot be readily adapted to the MIL problem. We now
present our novel online boosting algorithm for MIL. As in
[40], we take a statistical view of boosting, where the
algorithm is trying to optimize a specific objective
function J. In this view, the weak classifiers are chosen
sequentially to optimize the following criterion:

(h_k, \alpha_k) = \arg\max_{h \in \mathcal{H}, \alpha} J(H_{k-1} + \alpha h), \qquad (7)

where H_{k-1} is the strong classifier made up of the first (k - 1) weak classifiers and \mathcal{H} is the set of all possible weak
classifiers. In batch boosting algorithms, the objective
function J is computed over the entire training data set.
In our case, for the current video frame we are given a
training data set \{(X_1, y_1), (X_2, y_2), \ldots\}, where X_i = \{x_{i1}, x_{i2}, \ldots\}. We would like to update our classifier to maximize the log likelihood of this data (4). We model the instance probability as

p(y|x) = \sigma\big(H(x)\big), \qquad (8)
where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function; the bag probabilities p(y|X) are modeled using the NOR model in (5). To simplify the problem, we absorb the scalar weights \alpha_t into the weak classifiers by allowing them to return real values rather than binary.
At all times our algorithm maintains a pool of M > K
candidate weak stump classifiers h. To update the classifier,
we first update all weak classifiers in parallel, similar to [25].
Note that although instances are in bags, the weak classifiers
in a MIL algorithm are instance classifiers and therefore
require instance labels y_{ij}. Since these are unavailable, we pass in the bag label y_i for all instances x_{ij} to the weak training procedure. We then choose K weak classifiers h from the candidate pool sequentially by maximizing the log likelihood of bags:

h_k = \arg\max_{h \in \{h_1, \ldots, h_M\}} \log \mathcal{L}(H_{k-1} + h). \qquad (9)
See Algorithm 2 for the pseudocode of Online MILBoost
and Fig. 3 for an illustration of tracking with this algorithm.

Algorithm 2. Online MILBoost (OMB)
Input: Data set \{X_i, y_i\}_{i=1}^{N}, where X_i = \{x_{i1}, x_{i2}, \ldots\}, y_i \in \{0, 1\}
1: Update all M weak classifiers in the pool with data \{x_{ij}, y_i\}
2: Initialize H_{ij} = 0 for all i, j
3: for k = 1 to K do
4:   for m = 1 to M do
5:     p^m_{ij} = \sigma(H_{ij} + h_m(x_{ij}))
6:     p^m_i = 1 - \prod_j (1 - p^m_{ij})
7:     \mathcal{L}^m = \sum_i \big( y_i \log(p^m_i) + (1 - y_i) \log(1 - p^m_i) \big)
8:   end for
9:   m^* = \arg\max_m \mathcal{L}^m
10:  h_k(x) \leftarrow h_{m^*}(x)
11:  H_{ij} = H_{ij} + h_k(x_{ij})
12: end for
Output: Classifier H(x) = \sum_k h_k(x), where p(y|x) = \sigma(H(x))
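The Python sketch below is a direct transcription of Algorithm 2. It makes a few assumptions not fixed by the paper: each weak classifier exposes update(f, y) and predict(f) on its own scalar feature (as in Section 3.6.1), each instance is represented as its length-M vector of feature responses, one per candidate classifier, and the weak responses are cached once to keep the selection loop cheap.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_milboost_update(pool, bags, labels, K):
    # One update of Online MILBoost (Algorithm 2).
    #   pool:   list of M online weak classifiers (assumed interface)
    #   bags:   list of bags; each instance is a length-M feature vector
    #   labels: bag labels y_i in {0, 1}
    # Returns the indices of the K selected weak classifiers.
    eps = 1e-12

    # Line 1: update every candidate with all instances, each labeled
    # with its bag's label (instance labels are unavailable).
    for m, h in enumerate(pool):
        for X, y in zip(bags, labels):
            for x in X:
                h.update(x[m], y)

    # Cache the weak responses h_m(x_ij) for the selection loop.
    resp = [[np.array([h.predict(x[m]) for x in X]) for X in bags]
            for m, h in enumerate(pool)]

    H = [np.zeros(len(X)) for X in bags]       # line 2: H_ij = 0
    selected = []
    for _ in range(K):                         # line 3
        best_m, best_L = 0, -np.inf
        for m in range(len(pool)):             # lines 4-8
            L = 0.0
            for i, y in enumerate(labels):
                p_ij = sigmoid(H[i] + resp[m][i])        # line 5
                p_i = 1.0 - np.prod(1.0 - p_ij)          # line 6 (Noisy-OR)
                p_i = min(max(p_i, eps), 1.0 - eps)
                L += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)  # line 7
            if L > best_L:
                best_m, best_L = m, L          # line 9
        selected.append(best_m)                # line 10
        for i in range(len(bags)):             # line 11
            H[i] = H[i] + resp[best_m][i]
    return selected

The selection loop follows the pseudocode as written, so a candidate may be chosen more than once; the strong classifier is then the sum of the selected weak responses, squashed through the sigmoid.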
3.5 Discussion
There are a couple of important issues to point out about
this algorithm. First, we acknowledge the fact that training
the weak classifiers with positive labels for all instances in
the positive bags is suboptimal because some of the
instances in the positive bags may actually not be “correct.”
The algorithm makes up for this when it is choosing the
weak classifiers h based on the bag likelihood loss function.
We have also experimented with using online GradientBoost [41]
to compute weights (via the gradient of the loss function)
for all instances, but found this to make little difference in
accuracy while making the system slower. Second, if we
compare (7) and (9), we see that the latter has a much more
restricted choice of weak classifiers. This approximation
does not seem to degrade the performance of the classifier
in practice, as noted in [42]. Finally, we note that the
likelihood being optimized in (9) is computed only on the
current examples. Thus, it has the potential of overfitting to
current examples and not retaining information about
previously seen data. This is averted by using online weak
classifiers that do retain information about previously seen
data, which balances out the overall algorithm between
fitting the current data and retaining history.
3.6 Implementation Details
3.6.1 Weak Classifiers
Recall that we require weak classifiers h that can be updated
online. In our system, each weak classifier h_k is composed of a Haar-like feature f_k and four parameters (\mu_1, \sigma_1, \mu_0, \sigma_0) that are estimated online. The classifiers return the log odds ratio:

h_k(x) = \log \left[ \frac{p_t(y = 1 \mid f_k(x))}{p_t(y = 0 \mid f_k(x))} \right], \qquad (10)
where p_t(f_k(x) \mid y = 1) \sim \mathcal{N}(\mu_1, \sigma_1) and similarly for y = 0. We let p(y = 1) = p(y = 0) and use Bayes' rule to compute the above equation. When the weak classifier receives new data \{(x_1, y_1), \ldots, (x_n, y_n)\}, we use the following update rules:

\mu_1 \leftarrow \gamma \mu_1 + (1 - \gamma) \frac{1}{n} \sum_{i \mid y_i = 1} f_k(x_i),

\sigma_1 \leftarrow \gamma \sigma_1 + (1 - \gamma) \sqrt{\frac{1}{n} \sum_{i \mid y_i = 1} \big( f_k(x_i) - \mu_1 \big)^2},

where 0 < \gamma < 1 is a learning rate parameter. The update rules for \mu_0 and \sigma_0 are similarly defined.
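A compact Python version of such a weak classifier is sketched below. It takes precomputed scalar feature values f_k(x) (feature extraction is kept separate), and the default learning rate and the variance floor are illustrative additions of this sketch, not the paper's settings.

import numpy as np

class GaussianStump:
    # Online weak classifier of Section 3.6.1: per-class Gaussians over
    # a scalar Haar feature, returning the log odds ratio of eq. (10).

    def __init__(self, lr=0.85):
        self.lr = lr              # learning rate gamma (illustrative default)
        self.mu = [0.0, 0.0]      # mu_0, mu_1
        self.sig = [1.0, 1.0]     # sigma_0, sigma_1

    def update(self, f, y):
        # f: one feature value or an array of values sharing label y.
        f = np.atleast_1d(np.asarray(f, dtype=float))
        g = self.lr
        self.mu[y] = g * self.mu[y] + (1 - g) * f.mean()
        self.sig[y] = g * self.sig[y] + (1 - g) * np.sqrt(((f - self.mu[y]) ** 2).mean())

    def predict(self, f):
        # Log odds under N(mu_1, sigma_1) vs. N(mu_0, sigma_0) with
        # equal class priors p(y = 1) = p(y = 0).
        def log_gauss(f, mu, sig):
            sig = max(sig, 1e-6)  # variance floor (added numerical safeguard)
            return -0.5 * np.log(2 * np.pi * sig ** 2) - (f - mu) ** 2 / (2 * sig ** 2)
        return log_gauss(f, self.mu[1], self.sig[1]) - log_gauss(f, self.mu[0], self.sig[0])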
3.6.2 Image Features
We represent each image patch as a vector of Haar-like
features [32], which are randomly generated, similarly to [35].
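For concreteness, here is one way such randomly generated Haar-like features can be constructed and evaluated with an integral image; the rectangle counts and weight range are illustrative guesses rather than the paper's exact generation scheme.

import numpy as np

def make_random_haar(patch_size, rng, max_rects=4):
    # A random Haar-like feature: 2 to max_rects axis-aligned rectangles
    # inside the patch, each with a random signed weight.
    rects = []
    for _ in range(rng.integers(2, max_rects + 1)):
        y0, x0 = rng.integers(0, patch_size - 1, size=2)
        y1 = rng.integers(y0 + 1, patch_size + 1)
        x1 = rng.integers(x0 + 1, patch_size + 1)
        rects.append((y0, x0, y1, x1, rng.uniform(-1.0, 1.0)))
    return rects

def haar_response(patch, rects):
    # Feature value: weighted sum of rectangle sums, each O(1) via the
    # integral image [32] (padded with a zero row and column).
    ii = np.pad(patch.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    return sum(w * (ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0])
               for y0, x0, y1, x1, w in rects)

# Example: a pool of random features for 32x32 patches.
rng = np.random.default_rng(0)
pool = [make_random_haar(32, rng) for _ in range(250)]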
Fig. 3. An illustration of how using MIL for tracking can deal with occlusions. Frame 1: Consider a simple case where the classifier is allowed to only
pick one feature from the pool. The first frame is labeled. One positive patch and several negative patches (not shown) are extracted and the
classifiers are initialized. Both OAB and MIL result in identical classifiers—both choose feature #1 because it responds well with the mouth of the
face (feature #3 would have performed well also, but suppose #1 is slightly better). Frame 2: In the second frame there is some occlusion. In
particular, the mouth is occluded, and the classifier trained in the previous step does not perform well. Thus, the most probable image patch is no
longer centered on the object. OAB uses just this patch to update; MIL uses this patch along with its neighbors. Note that MIL includes the correct
image patch in the positive bag. Frame 3: When updating, the classifiers try to pick the feature that best discriminates the current example as well
as the ones previously seen. OAB has trouble with this because the current and previous positive examples are too different. It chooses a bad feature.
MIL is able to pick the feature that discriminates the eyes of the face because one of the examples in the positive bag was correctly cropped (even
though the mouth was occluded). MIL is therefore able to successfully classify future frames. Note that if we assign positive labels to all of the image
patches in the MIL bag and use these to train OAB, it would still have trouble picking a good feature.

References

P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.

J.H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 2001.

M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," Int'l J. Computer Vision, 2010.

Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, 1997.

J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," The Annals of Statistics, 2000.