Robust Object Tracking with
Online Multiple Instance Learning
Boris Babenko, Student Member, IEEE, Ming-Hsuan Yang, Senior Member, IEEE, and
Serge Belongie, Member, IEEE
Abstract—In this paper, we address the problem of tracking an object in a video given its location in the first frame and no other
information. Recently, a class of tracking techniques called “tracking by detection” has been shown to give promising results at
real-time speeds. These methods train a discriminative classifier in an online manner to separate the object from the background. This
classifier bootstraps itself by using the current tracker state to extract positive and negative examples from the current frame. Slight
inaccuracies in the tracker can therefore lead to incorrectly labeled training examples, which degrade the classifier and can cause drift.
In this paper, we show that using Multiple Instance Learning (MIL) instead of traditional supervised learning avoids these problems and
can therefore lead to a more robust tracker with fewer parameter tweaks. We propose a novel online MIL algorithm for object tracking
that achieves superior results with real-time performance. We present thorough experimental results (both qualitative and quantitative)
on a number of challenging video clips.
Index Terms—Visual tracking, multiple instance learning, online boosting.

1 INTRODUCTION

Object tracking is a well-studied problem in computer
vision and has many practical applications. The
problem and its difficulty depend on several factors, such
as the amount of prior knowledge about the target object and
the number and type of parameters being tracked (e.g.,
location, scale, detailed contour). Although there has been
some success with building trackers for specific object classes
(e.g., faces [1], humans [2], mice [3], rigid objects [4]), tracking
generic objects has remained challenging because an object
can drastically change appearance when deforming, rotating
out of plane, or when the illumination of the scene changes.
A typical tracking system consists of three components:
1) an appearance model, which can evaluate the likelihood
that the object of interest is at some particular location, 2) a
motion model, which relates the locations of the object over
time, and 3) a search strategy for finding the most likely
location in the current frame. The contributions of this paper
deal with the first of these three components; we refer the
reader to [5] for a thorough review of the other components.
Although many tracking methods employ static appearance
models that are either defined manually or trained using
only the first frame [2], [4], [6], [7], [8], [9], these methods are
often unable to cope with significant appearance changes.
Such challenges are particularly difficult when there is
limited a priori knowledge about the object of interest. In
this scenario, it has been shown that an adaptive appearance
model, which evolves during the tracking process as the
appearance of the object changes, is the key to good
performance [10], [11], [12]. Training adaptive appearance
models, however, is itself a difficult task, with many
questions yet to be answered. Such models often involve
many parameters that must be tuned to get good perfor-
mance (e.g., “forgetting factors” that control how fast the
appearance model can change), and can suffer from drift
problems when an object undergoes partial occlusion.
In this paper, we focus on the problem of tracking an
arbitrary object with no prior knowledge other than its
location in the first video frame (sometimes referred to as
“model-free” tracking). Our goal is to develop a more
robust way of updating an adaptive appearance model; we
would like our system to be able to handle partial
occlusions without significant drift and for it to work well
with minimal parameter tuning. To do this, we turn to a
discriminative learning paradigm called Multiple Instance
Learning (MIL) [13] that can handle ambiguities in the
training data. This technique has found recent success in
other computer vision areas, such as object detection [14],
[15] and object recognition [16], [17], [18].
We will focus on the problem of tracking the location and
scale of a single object, using a rectangular bounding box to
approximate these parameters. It is plausible that the ideas
presented here can be applied to other types of tracking
problems like tracking multiple objects (e.g., [19]), tracking
contours (e.g., [20], [21]), or tracking deformable objects
(e.g., [22]), but this is outside the scope of our work.
The remainder of this paper is organized as follows: In
Section 2, we review the current state of the art in adaptive
appearance models; in Section 3, we introduce our tracking
algorithm; in Section 4, we present qualitative and
quantitative results of our tracker on a number of challen-
ging video clips. We conclude in Section 5.
. B. Babenko and S. Belongie are with the Department of Computer Science
and Engineering, University of California, San Diego, 9500 Gilman Dr.,
La Jolla, CA 92093-0404. E-mail: {bbabenko, sjb}@cs.ucsd.edu.
. M.-H. Yang is with the Department of Computer Science, University of
California, Merced, CA 95344. E-mail: mhyang@ucmerced.edu.
Manuscript received 16 Feb. 2010; revised 1 July 2010; accepted 5 Aug. 2010; published online 13 Dec. 2010.
Recommended for acceptance by H. Bischof.
Digital Object Identifier no. 10.1109/TPAMI.2010.226.

2 ADAPTIVE APPEARANCE MODELS
An important choice in the design of appearance models is
whether to model only the object [12], [23] or both the object
and the background [24], [25], [26], [27], [28], [29], [30]. Many
of the latter approaches have shown that training a model to
separate the object from the background via a discriminative
classifier can often achieve superior results. These methods
are closely related to object detection—an area that has seen
great progress in the last decade—and are referred to as
“tracking-by-detection” or “tracking by repeated recogni-
tion” [31]. In particular, the recent advances in face detection
[32] have inspired some successful real-time tracking
algorithms [25], [26].
A major challenge that is often not discussed in the
literature is how to choose positive and negative examples
when updating the adaptive appearance model. Most
commonly this is done by taking the current tracker
location as one positive example and sampling the
neighborhood around the tracker location for negatives. If
the tracker location is not precise, however, the appearance
model ends up getting updated with a suboptimal positive
example. Over time this can degrade the model and can
cause drift. On the other hand, if multiple positive examples
are used (taken from a small neighborhood around the
current tracker location), the model can become confused
and its discriminative power can suffer (cf. Figs. 1a, 1b).
Alternatively, Grabner et al. [33] recently proposed a semi-
supervised approach where labeled examples come from
the first frame only and subsequent training examples are
left unlabeled. This method is particularly well suited for
scenarios where the object leaves the field of view
completely, but it throws away a lot of useful information
by not taking advantage of the problem domain (e.g., it is
safe to assume small interframe motion).
Object detection faces issues similar to those described
above in that it is difficult for a human labeler to be
consistent with respect to how the positive examples are
cropped. In fact, Viola et al. [14] argue that object detection
has inherent ambiguities that cause difficulty for traditional
supervised learning methods. For this reason they suggest
the use of a MIL [13] approach for object detection. We give
a more formal definition of MIL in Section 3.2, but the basic
idea of this learning paradigm is that during training,
examples are presented in sets (often called “bags”) and
labels are provided for the bags rather than individual
instances. If a bag is labeled positive, it is assumed to
contain at least one positive instance; otherwise the bag is
negative. For example, in the context of object detection, a
positive bag could contain a few possible bounding boxes
around each labeled object (e.g., a human labeler clicks on
the center of the object and the algorithm crops several
rectangles around that point). Therefore, the ambiguity is
passed on to the learning algorithm, which now has to
figure out which instance in each positive bag is the most
“correct.” Although one could argue that this learning
problem is more difficult in the sense that less information
is provided to the learner, in some ways it is actually easier
because the learner is allowed some flexibility in finding a
decision boundary. Viola et al. present convincing results
showing that a face detector trained with weaker labeling
(just the center of the face) and a MIL algorithm outper-
forms a state-of-the-art supervised algorithm trained with
explicit bounding boxes.
In this paper, we make an analogous argument to that of
Viola et al. [14] and propose to use a MIL-based appearance
model for object tracking (cf. Fig. 1c). In fact, in the object
tracking domain there is even more ambiguity than in object
detection because the tracker has no human input and has to
bootstrap itself. Therefore, we expect the benefits of a MIL
approach to be even more significant than in the object
detection problem. In order to incorporate MIL into a
tracker, an online MIL algorithm is required. The algorithm
we propose is based on boosting and is related to the
MILBoost algorithm [14] as well as the Online AdaBoost
algorithm [34] (to our knowledge this is the first online MIL
algorithm in the literature). We present empirical results on
challenging video sequences which show that using an
online MIL-based appearance model can lead to more robust
and stable tracking than existing methods in the literature.
3 TRACKING WITH ONLINE MIL
In this section, we introduce our tracking algorithm, MIL-
Track, which uses a MIL-based appearance model. We begin
with an overview of our tracking system which includes a
description of the motion model we use. Next, we review the
MIL problem and briefly describe the MILBoost algorithm
[14]. We then review online boosting [25], [34] and present a
novel boosting-based algorithm for online MIL. Finally, we
review various implementation details.
3.1 System Overview and Motion Model
The basic flow of the tracking system we implemented in
this work is illustrated in Fig. 2 and summarized in
Algorithm 1. Our image representation consists of a set of
Haar-like features that are computed for each image patch
[32], [35]; this is discussed in more detail in Section 3.6. The
appearance model is composed of a discriminative classifier

which is able to return p(y = 1|x) (we will use p(y|x) as shorthand), where x is an image patch (or the representation of an image patch in feature space) and y is a binary variable indicating the presence of the object of interest in that image patch. At every time step t, our tracker maintains the object location \ell^*_t. Let \ell(x) denote the location of image patch x (for now, let's assume this consists of only the (x, y) coordinates of the patch center and that scale is fixed; below, we consider tracking scale as well). For each new frame, we crop out a set of image patches X^s = \{x : \|\ell(x) - \ell^*_{t-1}\| < s\} that are within some search radius s of the current tracker location, and compute p(y|x) for all x \in X^s. We then use a greedy strategy to update the tracker location:

\ell^*_t = \ell\Big( \arg\max_{x \in X^s} p(y|x) \Big). \qquad (1)
In other words, we do not maintain a distribution of the target's location at every frame, and our motion model is such that the location of the tracker at time t is equally likely to appear within a radius s of the tracker location at time (t - 1):

p(\ell^*_t \mid \ell^*_{t-1}) \propto \begin{cases} 1 & \text{if } \|\ell^*_t - \ell^*_{t-1}\| < s \\ 0 & \text{otherwise} \end{cases} \qquad (2)

This could be extended with something more sophisticated, such as a particle filter, as is done in [12], [29], [36]; however, we again emphasize that our focus is on the appearance model.

Fig. 1. Updating a discriminative appearance model: (a) Using a single positive image patch to update a traditional discriminative classifier. The positive image patch chosen does not capture the object perfectly. (b) Using several positive image patches to update a traditional discriminative classifier. This can make it difficult for the classifier to learn a tight decision boundary. (c) Using one positive bag consisting of several image patches to update a MIL classifier. See Section 4 for empirical results of these three strategies.
Algorithm 1. MILTrack
Input: Video frame number k
1: Crop out a set of image patches, X^s = \{x : \|\ell(x) - \ell^*_{t-1}\| < s\}, and compute feature vectors.
2: Use MIL classifier to estimate p(y = 1|x) for x \in X^s.
3: Update tracker location: \ell^*_t = \ell(\arg\max_{x \in X^s} p(y|x)).
4: Crop out two sets of image patches: X^r = \{x : \|\ell(x) - \ell^*_t\| < r\} and X^{r,\beta} = \{x : r < \|\ell(x) - \ell^*_t\| < \beta\}.
5: Update MIL appearance model with one positive bag X^r and |X^{r,\beta}| negative bags, each containing a single image patch from the set X^{r,\beta}.
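To make the flow of Algorithm 1 concrete, the following Python sketch implements one tracking step. It is an illustration rather than the authors' implementation: the classifier interface (predict(patch) returning p(y = 1|x) and update(bags, labels)), the extract_patch helper, and the numeric defaults are hypothetical stand-ins; the actual parameter settings are described in Section 4.

import numpy as np

def extract_patch(frame, center, size=32):
    # Crop a fixed-size square patch around `center` (clipped at borders).
    cy, cx = center
    half = size // 2
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    return frame[y0:y0 + size, x0:x0 + size]

def track_step(frame, loc, clf, s=35, r=4, beta=50, n_neg=65):
    # One step of a MILTrack-style loop (Algorithm 1). `clf` is an
    # assumed online MIL classifier; parameter defaults are placeholders.
    h, w = frame.shape

    def centers_in(center, lo, hi):
        # All pixel locations whose distance from `center` is in [lo, hi).
        ys, xs = np.mgrid[0:h, 0:w]
        d = np.hypot(ys - center[0], xs - center[1])
        return list(zip(*np.where((d >= lo) & (d < hi))))

    # Steps 1-3: score all patches within radius s, move greedily (eq. (1)).
    candidates = centers_in(loc, 0, s)
    scores = [clf.predict(extract_patch(frame, c)) for c in candidates]
    new_loc = candidates[int(np.argmax(scores))]

    # Step 4: positive bag from radius r; negatives from the (r, beta)
    # annulus, randomly subsampled to keep the set small.
    pos_bag = [extract_patch(frame, c) for c in centers_in(new_loc, 0, r)]
    annulus = centers_in(new_loc, r, beta)
    pick = np.random.choice(len(annulus), min(n_neg, len(annulus)), replace=False)
    neg_bags = [[extract_patch(frame, annulus[i])] for i in pick]

    # Step 5: one positive bag, each negative in its own singleton bag.
    clf.update([pos_bag] + neg_bags, [1] + [0] * len(neg_bags))
    return new_loc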
Once the tracker location is updated, we proceed to update the appearance model. We crop out a set of patches X^r = \{x : \|\ell(x) - \ell^*_t\| < r\}, where r < s is a scalar radius (measured in pixels), and label this bag positive (recall that in MIL we train the algorithm with labeled bags). In contrast, if a standard learning algorithm were used, there would be two options: Set r = 1 and use this as a single positive instance, or set r > 1 and label all of these instances positive. For negatives, we crop out patches from an annular region X^{r,\beta} = \{x : r < \|\ell(x) - \ell^*_t\| < \beta\}, where r is the same as before and \beta is another scalar. Since this generates a potentially large set, we then take a random subset of these image patches and label them negative. We place each negative example into its own negative bag, though placing them all into one negative bag yields the same result.
Incorporating scale tracking into this system is straightforward. First, we define an extra parameter to be the scale-space step size. When searching for the location of the object in a new frame, we crop out image patches from the image at the current scale s_t, as well as one scale step larger and smaller; once we find the location with the maximum response, we update the current state (both position and scale) accordingly. When updating the appearance model, we have the option of cropping training image patches only from the current scale or from the neighboring scales as well; in our current implementation we do the former.
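As a minimal sketch of this three-scale search: the callable interface and the step size of 1.2 below are assumptions of this sketch, not values fixed by the paper.

def search_over_scales(search_at_scale, loc, scale, step=1.2):
    # Evaluate the in-plane search at the current scale and one step
    # larger and smaller; keep the (location, scale) pair with the
    # highest classifier response. `search_at_scale(loc, sc)` is an
    # assumed callable that runs the radius-s search of Algorithm 1
    # with patches extracted at scale sc and returns (best_loc, score).
    best_loc, best_scale, best_score = loc, scale, float("-inf")
    for sc in (scale / step, scale, scale * step):
        cand_loc, score = search_at_scale(loc, sc)
        if score > best_score:
            best_loc, best_scale, best_score = cand_loc, sc, score
    return best_loc, best_scale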
It is important to note that tracking in scale-space is a
double-edged sword. In some ways the problem becomes
more difficult because the parameter space becomes larger,
and consequently there is more room for error. However,
tracking this additional parameter may mean that the image
patches we crop out are better aligned, making it easier for our
classifier to learn the correct appearance. In our experiments,
we have noticed both behaviors—sometimes adding scale
tracking helps and other times it hurts performance.
Details on how all of the above parameters were set are
in Section 4, although we use the same parameters
throughout all of the experiments. We continue with a
more detailed review of MIL.
3.2 Multiple Instance Learning
Traditional discriminative learning algorithms for training a binary classifier that estimates p(y|x) require a training data set of the form \{(x_1, y_1), \ldots, (x_n, y_n)\}, where x_i is an instance (in our case a feature vector computed for an image patch) and y_i \in \{0, 1\} is a binary label. In Multiple Instance Learning, training data has the form \{(X_1, y_1), \ldots, (X_n, y_n)\}, where a bag X_i = \{x_{i1}, \ldots, x_{im}\} and y_i is a bag label. The bag labels are defined as:

y_i = \max_j (y_{ij}), \qquad (3)

where y_{ij} are the instance labels, which are not known
during training. In other words, a bag is considered positive
if it contains at least one positive instance. Numerous algorithms have been proposed for solving the MIL problem [13], [14], [16]. The algorithm that is most closely related to our work is the MILBoost algorithm proposed by Viola et al. in [14]. MILBoost uses the gradient boosting framework [37] to train a boosting classifier that maximizes the log likelihood of bags:

\log \mathcal{L} = \sum_i \log p(y_i \mid X_i). \qquad (4)

Fig. 2. Tracking by detection with a greedy motion model: An illustration of how most tracking by detection systems work.
Notice that the likelihood is defined over bags and not
instances because instance labels are unknown during
training, and yet the goal is to train an instance classifier that estimates p(y|x). We therefore need to express p(y_i \mid X_i), the probability of a bag being positive, in terms of its instances. In [14], the Noisy-OR (NOR) model is adopted for doing this:

p(y_i \mid X_i) = 1 - \prod_j \big( 1 - p(y_i \mid x_{ij}) \big), \qquad (5)
although other models could be swapped in (e.g., [38]). The
equation above has the desired property that if one of the
instances in a bag has a high probability, the bag probability
will be high as well. As mentioned in [14], with this
formulation, the likelihood is the same whether we put all of
the negative instances in one bag or if we put each in its own
bag. Intuitively this makes sense because no matter how we
arrange things, we know that every instance in a negative
bag is negative. We refer the reader to [14] for further details
on MILBoost. Finally, we note that MILBoost is a batch
algorithm (meaning it needs the entire training data set at
once) and cannot be trained in an online manner as we need
in our tracking application. Nevertheless, we adopt the loss
function in (4) and the bag probability model in (5) when we
develop our online MIL algorithm in Section 3.4.
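Both quantities are straightforward to compute. The following NumPy sketch evaluates the Noisy-OR bag probability of (5) and the bag log likelihood of (4) from per-instance probabilities; the epsilon clamp is an added numerical safeguard, not part of the paper's formulation.

import numpy as np

def bag_probability(inst_probs):
    # Noisy-OR model, eq. (5): the bag fires unless every instance
    # fails to fire.
    return 1.0 - np.prod(1.0 - np.asarray(inst_probs, dtype=float))

def bag_log_likelihood(inst_probs_per_bag, bag_labels, eps=1e-12):
    # Log likelihood of bags, eq. (4); for a negative bag the relevant
    # probability is 1 - p(y = 1 | X_i).
    ll = 0.0
    for probs, y in zip(inst_probs_per_bag, bag_labels):
        p = float(np.clip(bag_probability(probs), eps, 1.0 - eps))
        ll += np.log(p) if y == 1 else np.log(1.0 - p)
    return ll

# Example: one positive bag with a confident instance, one singleton
# negative bag.
print(bag_log_likelihood([[0.9, 0.1, 0.2], [0.15]], [1, 0]))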
3.3 Online Boosting
Our algorithm for online MIL is based on the boosting
framework [39] and is related to the work on Online
AdaBoost [34] and its adaptation in [25]. The goal of
boosting is to combine many weak classifiers h(x) (usually decision stumps) into an additive strong classifier:

H(x) = \sum_{k=1}^{K} \alpha_k h_k(x), \qquad (6)

where \alpha_k are scalar weights. There have been many boosting
algorithms proposed to learn this model in batch mode [39],
[40]; typically this is done in a greedy manner where the
weak classifiers are trained sequentially. After each weak
classifier is trained, the training examples are reweighted
such that examples that were previously misclassified
receive more weight. If each weak classifier is a decision
stump, then it chooses one feature that has the most
discriminative power for the entire weighted training set.
In this case, boosting can be viewed as performing feature
selection, choosing a total of K features which is generally
much smaller than the size of the entire feature pool. This
has proven particularly useful in computer vision because it
creates classifiers that are efficient at run time [32].
In [34], Oza develops an online variant of the popular
AdaBoost algorithm [39], which minimizes the exponential
loss function. This variant requires that all h can be trained
in an online manner. The basic flow of Oza’s algorithm is as
follows: For an incoming example x, each h_k is updated sequentially and the weight of example x is adjusted after each update. Since the formulas for the example weights and classifier weights in AdaBoost depend only on the error of the weak classifiers, Oza proposes keeping a running average of the error of each h_k, which allows the algorithm
to estimate both the example weight and the classifier
weights in an online manner.
In Oza’s framework, if every h is restricted to be a
decision stump, the algorithm has no way of choosing the
most discriminative feature because the entire training set is
never available at one time. Therefore, the features for each
h_k must be picked a priori. This is a potential problem for
computer vision applications since they often rely on the
feature selection property of boosting. Grabner et al. [25]
proposed an extension of Oza’s algorithm which performs
feature selection by maintaining a pool of M > K candidate
weak stump classifiers h. When a new example is passed in,
all of the candidate weak classifiers are updated in parallel.
Then, the algorithm sequentially chooses K weak classifiers
from this pool by keeping running averages of errors for
each, as in [34], and updates the weights of h accordingly.
We employ a similar feature selection technique in our
Online MIL algorithm, although the criterion for choosing
weak classifiers is different.
3.4 Online Multiple Instance Boosting
The algorithms in [34] and [25] rely on the special properties
of the exponential loss function of AdaBoost, and therefore
cannot be readily adapted to the MIL problem. We now
present our novel online boosting algorithm for MIL. As in
[40], we take a statistical view of boosting, where the
algorithm is trying to optimize a specific objective
function J. In this view, the weak classifiers are chosen
sequentially to optimize the following criterion:

(h_k, \alpha_k) = \arg\max_{h \in \mathcal{H}, \alpha} J(H_{k-1} + \alpha h), \qquad (7)

where H_{k-1} is the strong classifier made up of the first (k - 1) weak classifiers and \mathcal{H} is the set of all possible weak
classifiers. In batch boosting algorithms, the objective
function J is computed over the entire training data set.
In our case, for the current video frame we are given a
training data set \{(X_1, y_1), (X_2, y_2), \ldots\}, where X_i = \{x_{i1}, x_{i2}, \ldots\}. We would like to update our classifier to maximize the log likelihood of this data (4). We model the instance probability as

p(y|x) = \sigma\big(H(x)\big), \qquad (8)
where \sigma(x) = \frac{1}{1 + e^{-x}} is the sigmoid function; the bag probabilities p(y|X) are modeled using the NOR model in (5). To simplify the problem, we absorb the scalar weights \alpha_t into the weak classifiers by allowing them to return real values rather than binary.
At all times our algorithm maintains a pool of M > K
candidate weak stump classifiers h. To update the classifier,
we first update all weak classifiers in parallel, similar to [25].
Note that although instances are in bags, the weak classifiers
in a MIL algorithm are instance classifiers and therefore
require instance labels y_{ij}. Since these are unavailable, we pass in the bag label y_i for all instances x_{ij} to the weak training procedure. We then choose K weak classifiers h from the candidate pool sequentially by maximizing the log likelihood of bags:

h_k = \arg\max_{h \in \{h_1, \ldots, h_M\}} \log \mathcal{L}(H_{k-1} + h). \qquad (9)
See Algorithm 2 for the pseudocode of Online MILBoost
and Fig. 3 for an illustration of tracking with this algorithm.

Algorithm 2. Online MILBoost (OMB)
Input: Data set \{X_i, y_i\}_{i=1}^{N}, where X_i = \{x_{i1}, x_{i2}, \ldots\}, y_i \in \{0, 1\}
1: Update all M weak classifiers in the pool with data \{x_{ij}, y_i\}
2: Initialize H_{ij} = 0 for all i, j
3: for k = 1 to K do
4:   for m = 1 to M do
5:     p^m_{ij} = \sigma(H_{ij} + h_m(x_{ij}))
6:     p^m_i = 1 - \prod_j (1 - p^m_{ij})
7:     \mathcal{L}^m = \sum_i \big( y_i \log(p^m_i) + (1 - y_i) \log(1 - p^m_i) \big)
8:   end for
9:   m^* = \arg\max_m \mathcal{L}^m
10:  h_k(x) \leftarrow h_{m^*}(x)
11:  H_{ij} = H_{ij} + h_k(x_{ij})
12: end for
Output: Classifier H(x) = \sum_k h_k(x), where p(y|x) = \sigma(H(x))
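The Python sketch below is a direct transcription of Algorithm 2. It makes a few assumptions not fixed by the paper: each weak classifier exposes update(f, y) and predict(f) on its own scalar feature (as in Section 3.6.1), each instance is represented as its length-M vector of feature responses, one per candidate classifier, and the weak responses are cached once to keep the selection loop cheap.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_milboost_update(pool, bags, labels, K):
    # One update of Online MILBoost (Algorithm 2).
    #   pool:   list of M online weak classifiers (assumed interface)
    #   bags:   list of bags; each instance is a length-M feature vector
    #   labels: bag labels y_i in {0, 1}
    # Returns the indices of the K selected weak classifiers.
    eps = 1e-12

    # Line 1: update every candidate with all instances, each labeled
    # with its bag's label (instance labels are unavailable).
    for m, h in enumerate(pool):
        for X, y in zip(bags, labels):
            for x in X:
                h.update(x[m], y)

    # Cache the weak responses h_m(x_ij) for the selection loop.
    resp = [[np.array([h.predict(x[m]) for x in X]) for X in bags]
            for m, h in enumerate(pool)]

    H = [np.zeros(len(X)) for X in bags]       # line 2: H_ij = 0
    selected = []
    for _ in range(K):                         # line 3
        best_m, best_L = 0, -np.inf
        for m in range(len(pool)):             # lines 4-8
            L = 0.0
            for i, y in enumerate(labels):
                p_ij = sigmoid(H[i] + resp[m][i])        # line 5
                p_i = 1.0 - np.prod(1.0 - p_ij)          # line 6 (Noisy-OR)
                p_i = min(max(p_i, eps), 1.0 - eps)
                L += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)  # line 7
            if L > best_L:
                best_m, best_L = m, L          # line 9
        selected.append(best_m)                # line 10
        for i in range(len(bags)):             # line 11
            H[i] = H[i] + resp[best_m][i]
    return selected

The selection loop follows the pseudocode as written, so a candidate may be chosen more than once; the strong classifier is then the sum of the selected weak responses, squashed through the sigmoid.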
3.5 Discussion
There are a couple of important issues to point out about
this algorithm. First, we acknowledge the fact that training
the weak classifiers with positive labels for all instances in
the positive bags is suboptimal because some of the
instances in the positive bags may actually not be “correct.”
The algorithm makes up for this when it is choosing the
weak classifiers h based on the bag likelihood loss function.
We have also experimented with using online GradientBoost [41]
to compute weights (via the gradient of the loss function)
for all instances, but found this to make little difference in
accuracy while making the system slower. Second, if we
compare (7) and (9), we see that the latter has a much more
restricted choice of weak classifiers. This approximation
does not seem to degrade the performance of the classifier
in practice, as noted in [42]. Finally, we note that the
likelihood being optimized in (9) is computed only on the
current examples. Thus, it has the potential of overfitting to
current examples and not retaining information about
previously seen data. This is averted by using online weak
classifiers that do retain information about previously seen
data, which balances out the overall algorithm between
fitting the current data and retaining history.
3.6 Implementation Details
3.6.1 Weak Classifiers
Recall that we require weak classifiers h that can be updated
online. In our system, each weak classifier h_k is composed of a Haar-like feature f_k and four parameters (\mu_1, \sigma_1, \mu_0, \sigma_0) that are estimated online. The classifiers return the log odds ratio:

h_k(x) = \log \left[ \frac{p_t(y = 1 \mid f_k(x))}{p_t(y = 0 \mid f_k(x))} \right], \qquad (10)
where p_t(f_k(x) \mid y = 1) \sim \mathcal{N}(\mu_1, \sigma_1) and similarly for y = 0. We let p(y = 1) = p(y = 0) and use Bayes' rule to compute the above equation. When the weak classifier receives new data \{(x_1, y_1), \ldots, (x_n, y_n)\}, we use the following update rules:

\mu_1 \leftarrow \gamma \mu_1 + (1 - \gamma) \frac{1}{n} \sum_{i \mid y_i = 1} f_k(x_i),

\sigma_1 \leftarrow \gamma \sigma_1 + (1 - \gamma) \sqrt{\frac{1}{n} \sum_{i \mid y_i = 1} \big( f_k(x_i) - \mu_1 \big)^2},

where 0 < \gamma < 1 is a learning rate parameter. The update rules for \mu_0 and \sigma_0 are similarly defined.
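A compact Python version of such a weak classifier is sketched below. It takes precomputed scalar feature values f_k(x) (feature extraction is kept separate), and the default learning rate and the variance floor are illustrative additions of this sketch, not the paper's settings.

import numpy as np

class GaussianStump:
    # Online weak classifier of Section 3.6.1: per-class Gaussians over
    # a scalar Haar feature, returning the log odds ratio of eq. (10).

    def __init__(self, lr=0.85):
        self.lr = lr              # learning rate gamma (illustrative default)
        self.mu = [0.0, 0.0]      # mu_0, mu_1
        self.sig = [1.0, 1.0]     # sigma_0, sigma_1

    def update(self, f, y):
        # f: one feature value or an array of values sharing label y.
        f = np.atleast_1d(np.asarray(f, dtype=float))
        g = self.lr
        self.mu[y] = g * self.mu[y] + (1 - g) * f.mean()
        self.sig[y] = g * self.sig[y] + (1 - g) * np.sqrt(((f - self.mu[y]) ** 2).mean())

    def predict(self, f):
        # Log odds under N(mu_1, sigma_1) vs. N(mu_0, sigma_0) with
        # equal class priors p(y = 1) = p(y = 0).
        def log_gauss(f, mu, sig):
            sig = max(sig, 1e-6)  # variance floor (added numerical safeguard)
            return -0.5 * np.log(2 * np.pi * sig ** 2) - (f - mu) ** 2 / (2 * sig ** 2)
        return log_gauss(f, self.mu[1], self.sig[1]) - log_gauss(f, self.mu[0], self.sig[0])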
3.6.2 Image Features
We represent each image patch as a vector of Haar-like
features [32], which are randomly generated, similarly to [35].
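For concreteness, here is one way such randomly generated Haar-like features can be constructed and evaluated with an integral image; the rectangle counts and weight range are illustrative guesses rather than the paper's exact generation scheme.

import numpy as np

def make_random_haar(patch_size, rng, max_rects=4):
    # A random Haar-like feature: 2 to max_rects axis-aligned rectangles
    # inside the patch, each with a random signed weight.
    rects = []
    for _ in range(rng.integers(2, max_rects + 1)):
        y0, x0 = rng.integers(0, patch_size - 1, size=2)
        y1 = rng.integers(y0 + 1, patch_size + 1)
        x1 = rng.integers(x0 + 1, patch_size + 1)
        rects.append((y0, x0, y1, x1, rng.uniform(-1.0, 1.0)))
    return rects

def haar_response(patch, rects):
    # Feature value: weighted sum of rectangle sums, each O(1) via the
    # integral image [32] (padded with a zero row and column).
    ii = np.pad(patch.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    return sum(w * (ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0])
               for y0, x0, y1, x1, w in rects)

# Example: a pool of random features for 32x32 patches.
rng = np.random.default_rng(0)
pool = [make_random_haar(32, rng) for _ in range(250)]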
Fig. 3. An illustration of how using MIL for tracking can deal with occlusions. Frame 1: Consider a simple case where the classifier is allowed to only
pick one feature from the pool. The first frame is labeled. One positive patch and several negative patches (not shown) are extracted and the
classifiers are initialized. Both OAB and MIL result in identical classifiers—both choose feature #1 because it responds well with the mouth of the
face (feature #3 would have performed well also, but suppose #1 is slightly better). Frame 2: In the second frame there is some occlusion. In
particular, the mouth is occluded, and the classifier trained in the previous step does not perform well. Thus, the most probable image patch is no
longer centered on the object. OAB uses just this patch to update; MIL uses this patch along with its neighbors. Note that MIL includes the correct
image patch in the positive bag. Frame 3: When updating, the classifiers try to pick the feature that best discriminates the current example as well
as the ones previously seen. OAB has trouble with this because the current and previous positive examples are too different. It chooses a bad feature.
MIL is able to pick the feature that discriminates the eyes of the face because one of the examples in the positive bag was correctly cropped (even
though the mouth was occluded). MIL is therefore able to successfully classify future frames. Note that if we assign positive labels to all of the image
patches in the MIL bag and use these to train OAB, it would still have trouble picking a good feature.

References

P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.

J.H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, 2001.

M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," Int'l J. Computer Vision, 2010.

Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, 1997.

J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," The Annals of Statistics, 2000.