Du, Y., Yan, Y., Chen, S., Hua, Y., & Wang, H. (2018). Object-Adaptive LSTM Network for Visual Tracking. In Proceedings of the International Conference on Pattern Recognition (ICPR). https://doi.org/10.1109/ICPR.2018.8545096

Object-Adaptive LSTM Network for Visual Tracking

Yihan Du^1, Yan Yan^1, Si Chen^2, Yang Hua^3, Hanzi Wang^1
^1 School of Information Science and Engineering, Xiamen University, China
^2 School of Computer and Information Engineering, Xiamen University of Technology, China
^3 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK
Email: yihandu@stu.xmu.edu.cn, {yanyan, hanzi.wang}@xmu.edu.cn, chensi@xmut.edu.cn, y.hua@qub.ac.uk
Abstract—Convolutional Neural Networks (CNNs) have shown
outstanding performance in visual object tracking. However, most classification-based tracking methods using CNNs are time-consuming due to the expensive computation of complex online fine-tuning and massive feature extraction. Besides, these methods
suffer from the problem of over-fitting since the training and
testing stages of CNN models are based on the videos from the
same domain. Recently, matching-based tracking methods (such
as Siamese networks) have shown remarkable speed superiority,
while they cannot handle target appearance variations and complex scenes well, owing to their inherent lack of online adaptability
and background information. In this paper, we propose a novel
object-adaptive LSTM network, which can effectively exploit
sequence dependencies and dynamically adapt to the temporal
object variations via constructing an intrinsic model for object
appearance and motion. In addition, we develop an efficient
strategy for proposal selection, where the densely sampled pro-
posals are firstly pre-evaluated using the fast matching-based
method and then the well-selected high-quality proposals are fed
to the sequence-specific learning LSTM network. This strategy
enables our method to adaptively track an arbitrary object
and operate faster than conventional CNN-based classification
tracking methods. To the best of our knowledge, this is the
first work to apply an LSTM network for classification in
visual object tracking. Experimental results on OTB and TC-128
benchmarks show that the proposed method achieves state-of-
the-art performance, which exhibits the great potential of recurrent
structures for visual object tracking.
I. INTRODUCTION
Visual object tracking is one of the most important and
challenging problems in computer vision, which has a variety
of applications including video surveillance, driverless cars,
etc. Given only the annotation in the first frame of a video, tracking algorithms aim to locate the object in the subsequent frames, often confronting various appearance and
motion changes caused by deformation, rotation, background
clutter and so forth.
In recent years, Convolutional Neural Networks (CNNs)
have been widely used in visual object tracking [1, 2, 3, 4].
These CNN-based tracking methods can be roughly divided
into two categories: classification-based tracking methods and
matching-based tracking methods. Classification-based track-
ing methods [1, 2, 5] treat tracking as a binary classification
problem, where the classifiers are trained to distinguish the
object from the background. These methods achieve excellent
performance at the cost of high computational complexity due
to massive proposal evaluation and sophisticated online fine-
tuning. Besides, some high-accuracy trackers (e.g., MDNet [1]
and SANet [5]) use videos from the same domain or two
intersecting datasets (e.g., OTB [6] and VOT [7]) to train and
test their models, which leads to the problem of over-fitting.
Matching-based tracking methods [3, 4] match the candidate
patches with the target template and do not involve any up-
dating procedures. Thus, they can operate at real-time speeds
[3, 4]. However, the lack of online adaptability and background
information makes these methods prone to drift or failure,
especially when the target appearance significantly changes in
some challenging scenes.
Most existing CNN-based trackers [1, 2] perform object
detection in each frame independently and thus they ignore the
temporal dependencies among successive frames in a video
sequence. Recently, Recurrent Neural Networks (RNNs) have
drawn extensive attention in computer vision [8, 9] owing
to their powerful capability of modeling sequential data and
capturing time dependencies. However, few RNN-based object
tracking works have demonstrated state-of-the-art performance
on the canonical benchmarks. We presume that this is mainly
due to the absence of an effective recurrent network well suited
for the tracking task.
Motivated by the above analysis, in this paper we pro-
pose a novel object-adaptive LSTM network (OA-LSTM)
for visual object tracking. Different from conventional deep
models, we advocate the LSTM (Long Short-Term Memory)
recurrence to characterize long-range sequential information,
while maintaining an intrinsic object representation model
(via memorizing target appearance variations and ignoring
confusing distractors). Due to its intrinsic recurrent structure,
our network is able to dynamically update the internal state
during forward passes. Moreover, to lighten the computa-
tional burden and enable our network to track an arbitrary
object, we adopt an efficient strategy for proposal selection.
Specifically, the densely sampled proposals are firstly pre-
evaluated using the fast matching-based method and then
the well-selected high-quality proposals are classified based
on the online learning LSTM network. Possessing the above
properties, our method can effectively track an arbitrary object
with adaptability to target appearance variations and sequence-
specific circumstances. Fig. 1 illustrates the pipeline of the proposed method for visual object tracking.

Fig. 1. Pipeline of the proposed method for visual object tracking. The arrows from the LSTM layers of frame t to those of frame t+1 denote the forward propagation in time of memory information.
In this paper, we make an important step towards the
promising application of recurrent structure for visual object
tracking. Our main contributions are summarized as follows:
• We propose a novel object-adaptive LSTM network (OA-
LSTM), which not only effectively exploits long-range
temporal dependencies in a video sequence, but also
dynamically adapts to the object appearance variations
via constructing an intrinsic model for object appearance
and motion during online tracking. To the best of our
knowledge, this is the first work to apply an LSTM
network for classification in visual object tracking with
both state-of-the-art performance and a moderate speed
at a practical level.
• We present an efficient strategy for proposal selection.
Specifically, the densely sampled proposals are firstly
pre-evaluated using the fast matching-based method and
then the well-selected high-quality proposals are provided
to the sequence-specific learning LSTM network. This
strategy enables our method to track an arbitrary object
and operate faster than existing CNN-based classification
tracking methods.
• Our method achieves state-of-the-art performance on
public tracking benchmarks (OTB [6] and TC-128 [10]),
at speeds that exceed existing representative CNN-based
classification tracking methods, which demonstrates that
the proposed recurrent structure is well suited for visual
object tracking.
II. RELATED WORK
Visual object tracking has been actively studied for decades.
In the following, we mainly discuss three types of deep
learning based tracking methods related to ours.
CNN-based tracking-by-detection: CNN-based tracking-
by-detection methods train convolutional classification net-
works to distinguish the object from the background. MD-
Net [1], the winner of the VOT2015 challenge [7], firstly
trains the CNN with respect to each video offline and then
learns a per-object classifier online. Although this method has
achieved outstanding accuracy, massive feature extractions and
sophisticated online fine-tuning techniques limit its efficiency.
To deal with this problem, we leverage the fast matching-
based method to select high-quality proposals so that the
heavy computational burden for irrelevant proposal generation
and evaluation is avoided. Besides, we adopt simple online
updating techniques for computational efficiency (since the LSTM network can dynamically update the recurrent parameters
during forward passes). Furthermore, different from MDNet
[1] which performs training and testing on two intersecting
datasets (i.e., VOT [7] and OTB [6]), we learn an object-
adaptive LSTM network online to track the object from
arbitrary video domains.
Siamese network based tracking: Siamese network based
methods [3, 4] learn generic matching models offline based
on image pairs. For example, SiamFC [3] adopts a novel
fully-convolutional Siamese network and it can operate in real-
time. However, it neither utilizes background information nor
captures target appearance variations during online tracking.
DSiam [11] uses transformation learning to provide online
adaptation ability for Siamese networks, exhibiting state-of-
the-art performance. In contrast to the above Siamese network
based methods, we propose an object-adaptive LSTM network
for classification, which effectively utilizes background infor-
mation and highly adapts to the changes of target appearance.
Recurrent network based tracking: Milan et al. [12]
present an RNN-based method for multi-target tracking, where
the RNN is used to predict target motion and determine target
initiation and termination. Gan et al. [8] and Kahou et al.
[9] apply attention-based recurrent neural networks to object
tracking. However, these methods have only been shown to
work on simple datasets (such as MNIST digits) instead of
natural videos. SANet [5] fuses CNN and RNN feature maps
to model the object structure, but the heavy computational
burden constrains its speed (< 1 fps). Gordon et al. [13]
propose real-time recurrent regression networks (Re3) to learn
the changes of target appearance and motion offline. Despite
its speed superiority, Re3 is prone to drift in complex scenes
due to its lack of online adaptability. In contrast, we propose to adopt
an object-adaptive LSTM network, which exploits sequence-
specific information more fully and possesses better
object adaptability.

III. LSTM NETWORK FOR OBJECT-ADAPTIVE MODELING
RNNs have shown great power in handling sequential data owing to their capability of storing the representations of recent inputs using feedback connections. As a variant of RNNs, the LSTM recurrence can not only properly capture
long-range temporal dependencies, but also ignore distracting
information. Therefore, we take advantage of LSTM for visual
tracking in this paper.
During online tracking, new target positions need to be
located successively when new frames come one after another.
To fit in with the tracking task, we use the LSTM network to
evaluate the proposals and identify the optimal target state in
each frame. More specifically, we calculate the forward pass
of the LSTM network with Eqs. (1) to (9). The superscript
t represents the frame index. The subscripts ι, ν and ω
respectively refer to the input gate, forget gate and output
gate of the LSTM memory block. The subscript c refers to the
memory cells. U, V, and W are the weight matrices for the
input, recurrent, and peephole connections, respectively. b is
the bias vector. φ and σ are the non-linear activation functions.
⊙ denotes point-wise multiplication. x^t and h^{t-1} are the input and previous output vectors, respectively. The output vector h^t is used to classify the proposal. The cell state c holds memory information of the object appearance and motion. Both h^t and c^t are produced by the forward pass in the t-th frame and fed to the following forward pass in the next frame, which allows the object information to propagate forward in time.
Input Gates
z_ι^t = U_ι x^t + V_ι h^{t-1} + W_ι c^{t-1} + b_ι    (1)
i^t = σ(z_ι^t)    (2)
Forget Gates
z_ν^t = U_ν x^t + V_ν h^{t-1} + W_ν c^{t-1} + b_ν    (3)
f^t = σ(z_ν^t)    (4)
Cells
z_c^t = U_c x^t + V_c h^{t-1} + b_c    (5)
c^t = f^t ⊙ c^{t-1} + i^t ⊙ φ(z_c^t)    (6)
Output Gates
z_ω^t = U_ω x^t + V_ω h^{t-1} + W_ω c^t + b_ω    (7)
o^t = σ(z_ω^t)    (8)
Cell Outputs
h^t = o^t ⊙ φ(c^t)    (9)
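For illustration, Eqs. (1) to (9) can be written as a short NumPy routine. The sketch below is illustrative only: the parameter names, the dictionary layout and the use of full peephole matrices are readability assumptions on our part, not details of the actual TensorFlow implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_t, h_prev, c_prev, p, phi=np.tanh):
    # Input gate, Eqs. (1)-(2)
    z_i = p["U_i"] @ x_t + p["V_i"] @ h_prev + p["W_i"] @ c_prev + p["b_i"]
    i_t = sigmoid(z_i)
    # Forget gate, Eqs. (3)-(4)
    z_f = p["U_f"] @ x_t + p["V_f"] @ h_prev + p["W_f"] @ c_prev + p["b_f"]
    f_t = sigmoid(z_f)
    # Cell update, Eqs. (5)-(6)
    z_c = p["U_c"] @ x_t + p["V_c"] @ h_prev + p["b_c"]
    c_t = f_t * c_prev + i_t * phi(z_c)
    # Output gate, Eqs. (7)-(8); the peephole term uses the new cell state c_t
    z_o = p["U_o"] @ x_t + p["V_o"] @ h_prev + p["W_o"] @ c_t + p["b_o"]
    o_t = sigmoid(z_o)
    # Cell output, Eq. (9); h_t is later used to score (classify) a proposal
    h_t = o_t * phi(c_t)
    return h_t, c_t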
In order to effectively utilize sequence-specific information
and obtain per-object adaptability, we train the LSTM network
online using training data from the video itself. In this paper,
instead of training the network with a sequence of inputs,
we use the previous network state and the samples from the
current frame to train this network online. In this manner,
the loss directly comes from the current classification results,
allowing the parameters to be updated according to the hidden
state and training data. Based on this training scheme, the
recurrent network quickly converges since the loss does not
need to propagate through noisy intermediate timesteps.
Specifically, in the first frame, we initialize the network
state with the original target appearance. Then, we use this
initialized network state and training samples from the first
frame to train the LSTM network. In the subsequent frames,
we update the network using the previous network state and training samples from the current frame to obtain better
online adaptability. The backward pass during training can
be calculated with Eqs. (10) to (16), where L is the softmax cross-entropy loss function used for training. ε and δ represent the derivatives defined in Eq. (10). k denotes the subsequent layer of the cell outputs. The subscript j refers to the gates of the LSTM memory block, i.e., j ∈ {ι, ν, ω}.

ε_h^t ≜ ∂L/∂h^t,   ε_c^t ≜ ∂L/∂c^t,   δ_j^t ≜ ∂L/∂z_j^t    (10)
Cell Outputs
ε_h^t = Σ_k w_{hk} δ_k^t    (11)
Output Gates
δ_ω^t = σ'(z_ω^t) ⊙ φ(c^t) ⊙ ε_h^t    (12)
Cells
ε_c^t = W_ω δ_ω^t + o^t ⊙ φ'(c^t) ⊙ ε_h^t    (13)
δ_c^t = i^t ⊙ φ'(z_c^t) ⊙ ε_c^t    (14)
Forget Gates
δ_ν^t = σ'(z_ν^t) ⊙ c^{t-1} ⊙ ε_c^t    (15)
Input Gates
δ_ι^t = σ'(z_ι^t) ⊙ φ(z_c^t) ⊙ ε_c^t    (16)
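Under the same illustrative assumptions as the forward-pass sketch above (a full peephole matrix W_ω and a tanh cell activation φ), Eqs. (11) to (16) correspond to the following single-timestep gradient sketch. Here eps_h is the error signal ∂L/∂h^t received from the classification layer, and the transpose on W_ω is how the peephole path back-propagates when the peephole weights form a full matrix.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_backward(eps_h, cache, p, phi=np.tanh):
    # cache holds quantities saved during the forward pass of the same frame
    z_i, z_f, z_c, z_o = cache["z_i"], cache["z_f"], cache["z_c"], cache["z_o"]
    i_t, o_t, c_prev, c_t = cache["i"], cache["o"], cache["c_prev"], cache["c"]

    d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
    d_phi = lambda z: 1.0 - phi(z) ** 2   # derivative of tanh (assumes phi = np.tanh)

    # Output gate, Eq. (12)
    delta_o = d_sigmoid(z_o) * phi(c_t) * eps_h
    # Cell state, Eq. (13): peephole path through z_o plus the path through h^t
    eps_c = p["W_o"].T @ delta_o + o_t * d_phi(c_t) * eps_h
    # Cell input, Eq. (14)
    delta_c = i_t * d_phi(z_c) * eps_c
    # Forget gate, Eq. (15)
    delta_f = d_sigmoid(z_f) * c_prev * eps_c
    # Input gate, Eq. (16)
    delta_i = d_sigmoid(z_i) * phi(z_c) * eps_c
    return delta_i, delta_f, delta_c, delta_o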
IV. PROPOSED TRACKING ALGORITHM
A. Overview
Our tracking pipeline is depicted in Fig. 1. A search region
centered at the previously estimated target position in the
current frame is firstly taken as the input. Then, a similarity
learning model is utilized to select high-quality proposals by
comparing the target template with the search region, which
significantly reduces redundant computation for irrelevant pro-
posals (see Section IV-B). Next, the features of these high-
quality proposals extracted from the convolutional layers are
fed to the online learning LSTM network (see Section III),
together with the previously estimated target state. Finally, the
LSTM network is able to identify the optimal target state from
these proposals according to the memorized target appearance
and learned sequence-specific information (see Section IV-C).
Note that the internal state of the LSTM network can be simul-
taneously updated when a forward pass is performed during
online tracking so that sophisticated fine-tuning techniques
used in conventional CNN-based tracking methods [1, 5] are

unnecessary. Therefore, we can perform simple and effective
online updating to enhance the adaptability of our network.
B. Fast Proposal Selection
In conventional tracking frameworks, massive proposals are
generated via dense sampling [14]. However, the majority of
these proposals are typically irrelevant and trivial, leading to
expensive and redundant computation for proposal evaluation.
In order to optimize the efficiency of our tracking method, we
propose to leverage the fast matching-based method to pre-
evaluate the densely sampled proposals in the search region
and then select high-quality proposals to feed to the subsequent classification network.
Recently, SiamFC [3] has achieved a good trade-off between
speed and accuracy, and it also offers an explicit confidence
map for our proposal selection purpose. Therefore, inspired by
the success of [3], we employ a similarity learning function
(as in [3]) to compare the target template from the first frame
with the search region centered at the previously estimated
target position in each frame. Based on the similarity learning
function, we can firstly obtain a confidence map, which cor-
responds to all the translated sub-regions in the search region.
Then, the sub-regions with high confidence scores are selected
and translated from the final confidence map to the original
frame. Finally, we feed the subsequent LSTM network with
these well-selected proposals for high efficiency, which signif-
icantly improves the speed of classical tracking-by-detection
framework. Note that the proposal selection strategy is mainly
used to filter out irrelevant proposals for algorithm acceleration
purpose, while the LSTM network is designed to effectively
determine the tracked object from all the selected proposals.
Both components are tightly coupled to boost the performance
of tracking in both speed and accuracy, even in challenging
scenes.
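As a rough sketch of this selection step, the sliding-window correlation below is a slow stand-in for the learned similarity function of [3], and the stride, keep ratio and box parameterisation are illustrative assumptions rather than implementation details.

import numpy as np

def similarity_map(template_feat, search_feat):
    # Naive cross-correlation of the template features with every translated
    # sub-region of the search-region features, yielding a confidence map.
    th, tw, _ = template_feat.shape
    sh, sw, _ = search_feat.shape
    conf = np.zeros((sh - th + 1, sw - tw + 1))
    for r in range(conf.shape[0]):
        for c in range(conf.shape[1]):
            conf[r, c] = np.sum(search_feat[r:r + th, c:c + tw] * template_feat)
    return conf

def select_proposals(conf_map, prev_center, stride, target_size, keep_ratio=0.02):
    # Keep only the top keep_ratio fraction of sub-regions and map their
    # response-map offsets back to box centres in the original frame.
    h, w = conf_map.shape
    n_keep = max(1, int(keep_ratio * h * w))
    top = np.argsort(conf_map.ravel())[::-1][:n_keep]
    rows, cols = np.unravel_index(top, conf_map.shape)
    boxes = []
    for r, c in zip(rows, cols):
        dx = (c - (w - 1) / 2.0) * stride
        dy = (r - (h - 1) / 2.0) * stride
        boxes.append((prev_center[0] + dx, prev_center[1] + dy,
                      target_size[0], target_size[1]))
    return boxes, conf_map.ravel()[top]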
C. Tracking with OA-LSTM
In the proposed LSTM network, for the purpose of sup-
plying the network with rich target appearance information,
we adopt five convolutional layers pre-trained on the ILSVRC15 [15] dataset to extract high-level target features. These
convolutional layers are kept fixed during the online learning
stage since they have learned the capability of generic feature
extraction. The fully-connected feature vectors can be directly processed by the subsequent LSTM layers.
Initially, given the annotation of the target in the first
frame, we feed the target features to the LSTM network
to obtain the original network state which stores the initial
target appearance. Then we use training data sampled from
the first frame, together with the original network state, to
perform the online training procedure, as discussed in Sec-
tion III. In the subsequent frames, the well-selected proposals
via the matching-based method are evaluated by the LSTM
network according to the past target appearance, which is
modeled into the LSTM memory cells. The network outputs
correspond to the positive scores and negative scores of the
proposals, denoting their probabilities of belonging to the target class and the background class, respectively.

Algorithm 1 The proposed tracking algorithm
Input: initial target position x_1, similarity learning function F, threshold θ
Output: estimated target position x̂_t
1: Initialize the LSTM network using x_1;
2: Sample training data S_1 from the first frame;
3: Train the LSTM network using S_1;
4: repeat
5:    Apply the similarity learning function F to obtain a confidence map M;
6:    Select N high-score proposals {x_t^i}_{i=1}^{N} from M;
7:    Evaluate {x_t^i}_{i=1}^{N} with the previous LSTM state ĉ_{t-1} to obtain their positive scores {p(x_t^i)}_{i=1}^{N};
8:    Find the tracked result by x̂_t = arg max_{x_t^i} p(x_t^i);
9:    Set the optimal LSTM state ĉ_t corresponding to x̂_t;
10:   if p(x̂_t) > θ then
11:      Sample training data S_t with hard negative mining;
12:      Update the LSTM network using S_t;
13:   end if
14: until end of sequence

The proposal with
the highest positive score is selected as the tracked result
and its appearance is stored by the network, which supplies
the latest target appearance information for the next proposal
evaluation. During forward passes, the recurrent parameters
can be dynamically updated as the target appearance changes.
By taking advantage of its intrinsic recurrent structure, the
LSTM network is able to capture the temporal dependencies
of the video sequence and memorize the changes of target
appearance and motion during online tracking.
In addition, in order to obtain great adaptability to complex
scenes, we perform online updating using an efficient hard
negative mining approach. More specifically, we directly use
the confidence map from the proposal selection step to select
hard negative examples. Therefore, unlike some sophisticated
trackers [1, 5] that require numerous iterations to identify
the hard negative examples, our hard negative mining ap-
proach does not require extra computational cost for example
evaluation. As the online updating proceeds, the network
efficiently learns the variations of both the target appearance
and background with hard negative examples, thus becoming
more discriminative and robust. Algorithm 1 summarizes the
main steps of the proposed tracking algorithm.
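A minimal sketch of this confidence-map-based hard negative mining is given below; the (row, column) indexing of the confidence map, the num_neg budget and the simple min_dist exclusion rule around the tracked target are illustrative assumptions.

import numpy as np

def sample_hard_negatives(conf_map, target_rc, num_neg=16, min_dist=2):
    # Walk the confidence map from the highest score downwards and keep
    # high-scoring locations that are not too close to the tracked target;
    # these serve as hard negative examples for the online update.
    order = np.argsort(conf_map.ravel())[::-1]
    tr, tc = target_rc
    negatives = []
    for idx in order:
        r, c = np.unravel_index(idx, conf_map.shape)
        if abs(r - tr) >= min_dist or abs(c - tc) >= min_dist:
            negatives.append((r, c))
        if len(negatives) == num_neg:
            break
    return negatives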
V. EXPERIMENTS
We evaluate the proposed tracking method on two chal-
lenging object tracking benchmarks, i.e., OTB [6] and TC-128
[10]. The proposed method is implemented in Python using
TensorFlow [16] and runs at an average speed of 11.5 fps with
a 2.00GHz Intel Xeon E5-2660 CPU and an NVIDIA GTX TITAN
X GPU. During implementation, we set the hidden units of
both LSTM layers to 2048 and select high-quality proposals
from the dense sub-regions at a percentage of 2%. We perform
the pre-training process for the convolutional layers on the
