Du, Y., Yan, Y., Chen, S., Hua, Y., & Wang, H. (2018). Object-Adaptive LSTM Network for Visual Tracking. In Proceedings of the International Conference on Pattern Recognition (ICPR). https://doi.org/10.1109/ICPR.2018.8545096

Object-Adaptive LSTM Network for Visual Tracking

Yihan Du^1, Yan Yan^1, Si Chen^2, Yang Hua^3, Hanzi Wang^1
^1 School of Information Science and Engineering, Xiamen University, China
^2 School of Computer and Information Engineering, Xiamen University of Technology, China
^3 School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK
Email: yihandu@stu.xmu.edu.cn, {yanyan, hanzi.wang}@xmu.edu.cn, chensi@xmut.edu.cn, y.hua@qub.ac.uk
Abstract—Convolutional Neural Networks (CNNs) have shown
outstanding performance in visual object tracking. However, most classification-based tracking methods using CNNs are time-consuming due to the expensive computation of complex online fine-tuning and massive feature extraction. Besides, these methods
suffer from the problem of over-fitting since the training and
testing stages of CNN models are based on the videos from the
same domain. Recently, matching-based tracking methods (such
as Siamese networks) have shown remarkable speed superiority,
while they cannot handle target appearance variations and complex scenes well, owing to their inherent lack of online adaptability
and background information. In this paper, we propose a novel
object-adaptive LSTM network, which can effectively exploit
sequence dependencies and dynamically adapt to the temporal
object variations via constructing an intrinsic model for object
appearance and motion. In addition, we develop an efficient
strategy for proposal selection, where the densely sampled pro-
posals are firstly pre-evaluated using the fast matching-based
method and then the well-selected high-quality proposals are fed
to the sequence-specific learning LSTM network. This strategy
enables our method to adaptively track an arbitrary object
and operate faster than conventional CNN-based classification
tracking methods. To the best of our knowledge, this is the
first work to apply an LSTM network for classification in
visual object tracking. Experimental results on OTB and TC-128
benchmarks show that the proposed method achieves state-of-
the-art performance, which exhibits the great potential of recurrent
structures for visual object tracking.
I. INTRODUCTION
Visual object tracking is one of the most important and
challenging problems in computer vision, which has a variety
of applications including video surveillance, driverless cars,
etc. Given only the annotation in the first frame of a video, tracking algorithms aim to locate the object in the subsequent frames, often confronting various appearance and
motion changes caused by deformation, rotation, background
clutter and so forth.
In recent years, Convolutional Neural Networks (CNNs)
have been widely used in visual object tracking [1, 2, 3, 4].
These CNN-based tracking methods can be roughly divided
into two categories: classification-based tracking methods and
matching-based tracking methods. Classification-based track-
ing methods [1, 2, 5] treat tracking as a binary classification
problem, where the classifiers are trained to distinguish the
object from the background. These methods achieve excellent
performance at the cost of high computational complexity due
to massive proposal evaluation and sophisticated online fine-
tuning. Besides, some high-accuracy trackers (e.g., MDNet [1]
and SANet [5]) use videos from the same domain or two
intersecting datasets (e.g., OTB [6] and VOT [7]) to train and
test their models, which leads to the problem of over-fitting.
Matching-based tracking methods [3, 4] match the candidate
patches with the target template and do not involve any up-
dating procedures. Thus, they can operate at real-time speeds
[3, 4]. However, the lack of online adaptability and background
information makes these methods prone to drift or failure,
especially when the target appearance significantly changes in
some challenging scenes.
Most existing CNN-based trackers [1, 2] perform object
detection in each frame independently and thus they ignore the
temporal dependencies among successive frames in a video
sequence. Recently, Recurrent Neural Networks (RNNs) have
drawn extensive attention in computer vision [8, 9] owing
to their powerful capability of modeling sequential data and
capturing time dependencies. However, few RNN-based object
tracking works have demonstrated state-of-the-art performance
on the canonical benchmarks. We presume that this is mainly
due to the absence of an effective recurrent network well suited
for the tracking task.
Motivated by the above analysis, in this paper we pro-
pose a novel object-adaptive LSTM network (OA-LSTM)
for visual object tracking. Different from conventional deep
models, we advocate the LSTM (Long Short-Term Memory)
recurrence to characterize long-range sequential information,
while maintaining an intrinsic object representation model
(via memorizing target appearance variations and ignoring
confusing distractors). Due to its intrinsic recurrent structure,
our network is able to dynamically update the internal state
during forward passes. Moreover, to lighten the computa-
tional burden and enable our network to track an arbitrary
object, we adopt an efficient strategy for proposal selection.
Specifically, the densely sampled proposals are firstly pre-
evaluated using the fast matching-based method and then
the well-selected high-quality proposals are classified based
on the online learning LSTM network. Possessing the above
properties, our method can effectively track an arbitrary object
with adaptability to target appearance variations and sequence-
specific circumstances. Fig. 1 illustrates the pipeline of the proposed method for visual object tracking.

Fig. 1. Pipeline of the proposed method for visual object tracking. The arrows from the LSTM layers of frame t to those of frame t+1 denote the forward propagation in time of memory information.
In this paper, we make an important step towards the
promising application of recurrent structure for visual object
tracking. Our main contributions are summarized as follows:
• We propose a novel object-adaptive LSTM network (OA-
LSTM), which not only effectively exploits long-range
temporal dependencies in a video sequence, but also
dynamically adapts to the object appearance variations
via constructing an intrinsic model for object appearance
and motion during online tracking. To the best of our
knowledge, this is the first work to apply an LSTM
network for classification in visual object tracking with
both state-of-the-art performance and a moderate speed
at a practical level.
• We present an efficient strategy for proposal selection.
Specifically, the densely sampled proposals are firstly
pre-evaluated using the fast matching-based method and
then the well-selected high-quality proposals are provided
to the sequence-specific learning LSTM network. This
strategy enables our method to track an arbitrary object
and operate faster than existing CNN-based classification
tracking methods.
• Our method achieves state-of-the-art performance on
public tracking benchmarks (OTB [6] and TC-128 [10]),
at speeds that exceed existing representative CNN-based
classification tracking methods, which demonstrates that
the proposed recurrent structure is well suited for visual
object tracking.
II. RELATED WORK
Visual object tracking has been actively studied for decades.
In the following, we mainly discuss three types of deep
learning based tracking methods related to ours.
CNN-based tracking-by-detection: CNN-based tracking-
by-detection methods train convolutional classification net-
works to distinguish the object from the background. MD-
Net [1], the winner of the VOT2015 challenge [7], firstly
trains the CNN with respect to each video offline and then
learns a per-object classifier online. Although this method has
achieved outstanding accuracy, massive feature extractions and
sophisticated online fine-tuning techniques limit its efficiency.
To deal with this problem, we leverage the fast matching-
based method to select high-quality proposals so that the
heavy computational burden for irrelevant proposal generation
and evaluation is avoided. Besides, we adopt simple online
updating techniques for computational efficiency (since the LSTM network can dynamically update the recurrent parameters
during forward passes). Furthermore, different from MDNet
[1] which performs training and testing on two intersecting
datasets (i.e., VOT [7] and OTB [6]), we learn an object-
adaptive LSTM network online to track the object from
arbitrary video domains.
Siamese network based tracking: Siamese network based
methods [3, 4] learn generic matching models offline based
on image pairs. For example, SiamFC [3] adopts a novel
fully-convolutional Siamese network and it can operate in real-
time. However, it neither utilizes background information nor
captures target appearance variations during online tracking.
DSiam [11] uses transformation learning to provide online
adaptation ability for Siamese networks, exhibiting state-of-
the-art performance. In contrast to the above Siamese network
based methods, we propose an object-adaptive LSTM network
for classification, which effectively utilizes background infor-
mation and highly adapts to the changes of target appearance.
Recurrent network based tracking: Milan et al. [12]
present an RNN-based method for multi-target tracking, where
the RNN is used to predict target motion and determine target
initiation and termination. Gan et al. [8] and Kahou et al.
[9] apply attention-based recurrent neural networks to object
tracking. However, these methods have only been shown to
work on simple datasets (such as MNIST digits) instead of
natural videos. SANet [5] fuses CNN and RNN feature maps
to model the object structure, but the heavy computational
burden constrains its speed (< 1 fps). Gordon et al. [13]
propose real-time recurrent regression networks (Re3) to learn
the changes of target appearance and motion offline. Despite
its speed superiority, Re3 is prone to drift in complex scenes
due to its lack of online adaptability. In contrast, we propose to adopt
an object-adaptive LSTM network, which exploits sequence-
specific information more fully and possesses better
object adaptability.

III. LSTM NETWORK FOR OBJECT-ADAPTIVE MODELING
RNNs have shown great power in handling sequential data owing to their capability of storing the representations of recent inputs using feedback connections. As a variant of RNNs, the LSTM recurrence can not only properly capture
long-range temporal dependencies, but also ignore distracting
information. Therefore, we take advantage of LSTM for visual
tracking in this paper.
During online tracking, new target positions need to be
located successively when new frames come one after another.
To fit in with the tracking task, we use the LSTM network to
evaluate the proposals and identify the optimal target state in
each frame. More specifically, we calculate the forward pass
of the LSTM network with Eqs. (1) to (9). The superscript
t represents the frame index. The subscripts ι, ν and ω
respectively refer to the input gate, forget gate and output
gate of the LSTM memory block. The subscript c refers to the
memory cells. U, V, and W are the weight matrices for the
input, recurrent, and peephole connections, respectively. b is
the bias vector. φ and σ are the non-linear activation functions.
⊙ denotes point-wise multiplication. x^t and h^{t-1} are the input and previous output vectors, respectively. The output vector h^t is used to classify the proposal. The cell state c holds memory information of the object appearance and motion. Both h^t and c^t are produced by the forward pass in the t-th frame and fed to the following forward pass in the next frame, which allows the object information to propagate forward in time.
Input Gates
z_ι^t = U_ι x^t + V_ι h^{t-1} + W_ι c^{t-1} + b_ι    (1)
i^t = σ(z_ι^t)    (2)
Forget Gates
z_ν^t = U_ν x^t + V_ν h^{t-1} + W_ν c^{t-1} + b_ν    (3)
f^t = σ(z_ν^t)    (4)
Cells
z_c^t = U_c x^t + V_c h^{t-1} + b_c    (5)
c^t = f^t ⊙ c^{t-1} + i^t ⊙ φ(z_c^t)    (6)
Output Gates
z_ω^t = U_ω x^t + V_ω h^{t-1} + W_ω c^t + b_ω    (7)
o^t = σ(z_ω^t)    (8)
Cell Outputs
h^t = o^t ⊙ φ(c^t)    (9)
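For illustration, Eqs. (1) to (9) can be written as a short NumPy routine. The sketch below is illustrative only: the parameter names, the dictionary layout and the use of full peephole matrices are readability assumptions on our part, not details of the actual TensorFlow implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_t, h_prev, c_prev, p, phi=np.tanh):
    # Input gate, Eqs. (1)-(2)
    z_i = p["U_i"] @ x_t + p["V_i"] @ h_prev + p["W_i"] @ c_prev + p["b_i"]
    i_t = sigmoid(z_i)
    # Forget gate, Eqs. (3)-(4)
    z_f = p["U_f"] @ x_t + p["V_f"] @ h_prev + p["W_f"] @ c_prev + p["b_f"]
    f_t = sigmoid(z_f)
    # Cell update, Eqs. (5)-(6)
    z_c = p["U_c"] @ x_t + p["V_c"] @ h_prev + p["b_c"]
    c_t = f_t * c_prev + i_t * phi(z_c)
    # Output gate, Eqs. (7)-(8); the peephole term uses the new cell state c_t
    z_o = p["U_o"] @ x_t + p["V_o"] @ h_prev + p["W_o"] @ c_t + p["b_o"]
    o_t = sigmoid(z_o)
    # Cell output, Eq. (9); h_t is later used to score (classify) a proposal
    h_t = o_t * phi(c_t)
    return h_t, c_t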
In order to effectively utilize sequence-specific information
and obtain per-object adaptability, we train the LSTM network
online using training data from the video itself. In this paper,
instead of training the network with a sequence of inputs,
we use the previous network state and the samples from the
current frame to train this network online. In this manner,
the loss directly comes from the current classification results,
allowing the parameters to be updated according to the hidden
state and training data. Based on this training scheme, the
recurrent network quickly converges since the loss does not
need to propagate through noisy intermediate timesteps.
Specifically, in the first frame, we initialize the network
state with the original target appearance. Then, we use this
initialized network state and training samples from the first
frame to train the LSTM network. In the subsequent frames,
we update the network using the previous network state and training samples from the current frame to obtain better
online adaptability. The backward pass during training can
be calculated with Eqs. (10) to (16), where L is the softmax cross-entropy loss function used for training. ε and δ represent the derivatives defined in Eq. (10). k denotes the subsequent layer of the cell outputs. The subscript j refers to the gates of the LSTM memory block, i.e., j ∈ {ι, ν, ω}.

ε_h^t ≜ ∂L/∂h^t,   ε_c^t ≜ ∂L/∂c^t,   δ_j^t ≜ ∂L/∂z_j^t    (10)
Cell Outputs
ε_h^t = Σ_k w_{hk} δ_k^t    (11)
Output Gates
δ_ω^t = σ'(z_ω^t) ⊙ φ(c^t) ⊙ ε_h^t    (12)
Cells
ε_c^t = W_ω δ_ω^t + o^t ⊙ φ'(c^t) ⊙ ε_h^t    (13)
δ_c^t = i^t ⊙ φ'(z_c^t) ⊙ ε_c^t    (14)
Forget Gates
δ_ν^t = σ'(z_ν^t) ⊙ c^{t-1} ⊙ ε_c^t    (15)
Input Gates
δ_ι^t = σ'(z_ι^t) ⊙ φ(z_c^t) ⊙ ε_c^t    (16)
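Under the same illustrative assumptions as the forward-pass sketch above (a full peephole matrix W_ω and a tanh cell activation φ), Eqs. (11) to (16) correspond to the following single-timestep gradient sketch. Here eps_h is the error signal ∂L/∂h^t received from the classification layer, and the transpose on W_ω is how the peephole path back-propagates when the peephole weights form a full matrix.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_backward(eps_h, cache, p, phi=np.tanh):
    # cache holds quantities saved during the forward pass of the same frame
    z_i, z_f, z_c, z_o = cache["z_i"], cache["z_f"], cache["z_c"], cache["z_o"]
    i_t, o_t, c_prev, c_t = cache["i"], cache["o"], cache["c_prev"], cache["c"]

    d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
    d_phi = lambda z: 1.0 - phi(z) ** 2   # derivative of tanh (assumes phi = np.tanh)

    # Output gate, Eq. (12)
    delta_o = d_sigmoid(z_o) * phi(c_t) * eps_h
    # Cell state, Eq. (13): peephole path through z_o plus the path through h^t
    eps_c = p["W_o"].T @ delta_o + o_t * d_phi(c_t) * eps_h
    # Cell input, Eq. (14)
    delta_c = i_t * d_phi(z_c) * eps_c
    # Forget gate, Eq. (15)
    delta_f = d_sigmoid(z_f) * c_prev * eps_c
    # Input gate, Eq. (16)
    delta_i = d_sigmoid(z_i) * phi(z_c) * eps_c
    return delta_i, delta_f, delta_c, delta_o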
IV. PROPOSED TRACKING ALGORITHM
A. Overview
Our tracking pipeline is depicted in Fig. 1. A search region
centered at the previously estimated target position in the
current frame is firstly taken as the input. Then, a similarity
learning model is utilized to select high-quality proposals by
comparing the target template with the search region, which
significantly reduces redundant computation for irrelevant pro-
posals (see Section IV-B). Next, the features of these high-
quality proposals extracted from the convolutional layers are
fed to the online learning LSTM network (see Section III),
together with the previously estimated target state. Finally, the
LSTM network is able to identify the optimal target state from
these proposals according to the memorized target appearance
and learned sequence-specific information (see Section IV-C).
Note that the internal state of the LSTM network can be simul-
taneously updated when a forward pass is performed during
online tracking so that sophisticated fine-tuning techniques
used in conventional CNN-based tracking methods [1, 5] are

unnecessary. Therefore, we can perform simple and effective
online updating to enhance the adaptability of our network.
B. Fast Proposal Selection
In conventional tracking frameworks, massive proposals are
generated via dense sampling [14]. However, the majority of
these proposals are typically irrelevant and trivial, leading to
expensive and redundant computation for proposal evaluation.
In order to optimize the efficiency of our tracking method, we
propose to leverage the fast matching-based method to pre-
evaluate the densely sampled proposals in the search region
and then select high-quality proposals to feed to the subsequent classification network.
Recently, SiamFC [3] has achieved a good trade-off between
speed and accuracy, and it also offers an explicit confidence
map for our proposal selection purpose. Therefore, inspired by
the success of [3], we employ a similarity learning function
(as in [3]) to compare the target template from the first frame
with the search region centered at the previously estimated
target position in each frame. Based on the similarity learning
function, we can firstly obtain a confidence map, which cor-
responds to all the translated sub-regions in the search region.
Then, the sub-regions with high confidence scores are selected
and translated from the final confidence map to the original
frame. Finally, we feed the subsequent LSTM network with
these well-selected proposals for high efficiency, which signif-
icantly improves the speed of classical tracking-by-detection
framework. Note that the proposal selection strategy is mainly
used to filter out irrelevant proposals for algorithm acceleration
purpose, while the LSTM network is designed to effectively
determine the tracked object from all the selected proposals.
Both components are tightly coupled to boost the performance
of tracking in both speed and accuracy, even in challenging
scenes.
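As a rough sketch of this selection step, the sliding-window correlation below is a slow stand-in for the learned similarity function of [3], and the stride, keep ratio and box parameterisation are illustrative assumptions rather than implementation details.

import numpy as np

def similarity_map(template_feat, search_feat):
    # Naive cross-correlation of the template features with every translated
    # sub-region of the search-region features, yielding a confidence map.
    th, tw, _ = template_feat.shape
    sh, sw, _ = search_feat.shape
    conf = np.zeros((sh - th + 1, sw - tw + 1))
    for r in range(conf.shape[0]):
        for c in range(conf.shape[1]):
            conf[r, c] = np.sum(search_feat[r:r + th, c:c + tw] * template_feat)
    return conf

def select_proposals(conf_map, prev_center, stride, target_size, keep_ratio=0.02):
    # Keep only the top keep_ratio fraction of sub-regions and map their
    # response-map offsets back to box centres in the original frame.
    h, w = conf_map.shape
    n_keep = max(1, int(keep_ratio * h * w))
    top = np.argsort(conf_map.ravel())[::-1][:n_keep]
    rows, cols = np.unravel_index(top, conf_map.shape)
    boxes = []
    for r, c in zip(rows, cols):
        dx = (c - (w - 1) / 2.0) * stride
        dy = (r - (h - 1) / 2.0) * stride
        boxes.append((prev_center[0] + dx, prev_center[1] + dy,
                      target_size[0], target_size[1]))
    return boxes, conf_map.ravel()[top]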
C. Tracking with OA-LSTM
In the proposed LSTM network, for the purpose of sup-
plying the network with rich target appearance information,
we adopt five convolutional layers pre-trained on the ILSVRC15 [15] dataset to extract high-level target features. These
convolutional layers are kept fixed during the online learning
stage since they have learned the capability of generic feature
extraction. The fully-connected feature vectors can be directly processed by the subsequent LSTM layers.
Initially, given the annotation of the target in the first
frame, we feed the target features to the LSTM network
to obtain the original network state which stores the initial
target appearance. Then we use training data sampled from
the first frame, together with the original network state, to
perform the online training procedure, as discussed in Sec-
tion III. In the subsequent frames, the well-selected proposals
via the matching-based method are evaluated by the LSTM
network according to the past target appearance, which is
modeled into the LSTM memory cells. The network outputs
correspond to the positive scores and negative scores of the
proposals, denoting their probabilities of belonging to the target class and the background class, respectively.

Algorithm 1 The proposed tracking algorithm
Input: initial target position x_1, similarity learning function F, threshold θ
Output: estimated target position x̂_t
1: Initialize the LSTM network using x_1;
2: Sample training data S_1 from the first frame;
3: Train the LSTM network using S_1;
4: repeat
5:    Apply the similarity learning function F to obtain a confidence map M;
6:    Select N high-score proposals {x_t^i}_{i=1}^{N} from M;
7:    Evaluate {x_t^i}_{i=1}^{N} with the previous LSTM state ĉ_{t-1} to obtain their positive scores {p(x_t^i)}_{i=1}^{N};
8:    Find the tracked result by x̂_t = arg max_{x_t^i} p(x_t^i);
9:    Set the optimal LSTM state ĉ_t corresponding to x̂_t;
10:   if p(x̂_t) > θ then
11:      Sample training data S_t with hard negative mining;
12:      Update the LSTM network using S_t;
13:   end if
14: until end of sequence

The proposal with
the highest positive score is selected as the tracked result
and its appearance is stored by the network, which supplies
the latest target appearance information for the next proposal
evaluation. During forward passes, the recurrent parameters
can be dynamically updated as the target appearance changes.
By taking advantage of its intrinsic recurrent structure, the
LSTM network is able to capture the temporal dependencies
of the video sequence and memorize the changes of target
appearance and motion during online tracking.
In addition, in order to obtain great adaptability to complex
scenes, we perform online updating using an efficient hard
negative mining approach. More specifically, we directly use
the confidence map from the proposal selection step to select
hard negative examples. Therefore, unlike some sophisticated
trackers [1, 5] that require numerous iterations to identify
the hard negative examples, our hard negative mining ap-
proach does not require extra computational cost for example
evaluation. As the online updating proceeds, the network
efficiently learns the variations of both the target appearance
and background with hard negative examples, thus becoming
more discriminative and robust. Algorithm 1 summarizes the
main steps of the proposed tracking algorithm.
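A minimal sketch of this confidence-map-based hard negative mining is given below; the (row, column) indexing of the confidence map, the num_neg budget and the simple min_dist exclusion rule around the tracked target are illustrative assumptions.

import numpy as np

def sample_hard_negatives(conf_map, target_rc, num_neg=16, min_dist=2):
    # Walk the confidence map from the highest score downwards and keep
    # high-scoring locations that are not too close to the tracked target;
    # these serve as hard negative examples for the online update.
    order = np.argsort(conf_map.ravel())[::-1]
    tr, tc = target_rc
    negatives = []
    for idx in order:
        r, c = np.unravel_index(idx, conf_map.shape)
        if abs(r - tr) >= min_dist or abs(c - tc) >= min_dist:
            negatives.append((r, c))
        if len(negatives) == num_neg:
            break
    return negatives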
V. EXPERIMENTS
We evaluate the proposed tracking method on two chal-
lenging object tracking benchmarks, i.e., OTB [6] and TC-128
[10]. The proposed method is implemented in Python using
TensorFlow [16] and runs at an average speed of 11.5 fps with
a 2.00GHz Intel Xeon E5-2660 CPU and an NVIDIA GTX TITAN
X GPU. During implementation, we set the hidden units of
both LSTM layers to 2048 and select high-quality proposals
from the dense sub-regions at a percentage of 2%. We perform
the pre-training process for the convolutional layers on the
