
Object Detection in Videos with Tubelet Proposal Networks
Kai Kang^{1,2}   Hongsheng Li^{2,⋆}   Tong Xiao^{1,2}   Wanli Ouyang^{2,5}   Junjie Yan^{3}   Xihui Liu^{4}   Xiaogang Wang^{2,⋆}
^{1} Shenzhen Key Lab of Comp. Vis. & Pat. Rec., Shenzhen Institutes of Advanced Technology, CAS, China
^{2} The Chinese University of Hong Kong   ^{3} SenseTime Group Limited   ^{4} Tsinghua University   ^{5} The University of Sydney
{kkang,hsli,xiaotong,wlouyang,xgwang}@ee.cuhk.edu.hk   yanjunjie@sensetime.com   xh-liu13@mails.tsinghua.edu.cn
Abstract
Object detection in videos has drawn increasing attention recently with the introduction of the large-scale ImageNet VID dataset. Different from object detection in static images, temporal information in videos is vital for object detection. To fully utilize temporal information, state-of-the-art methods [15, 14] are based on spatiotemporal tubelets, which are essentially sequences of associated bounding boxes across time. However, the existing methods have major limitations in generating tubelets in terms of quality and efficiency. Motion-based [14] methods are able to obtain dense tubelets efficiently, but the lengths are generally only several frames, which is not optimal for incorporating long-term temporal information. Appearance-based [15] methods, usually involving generic object tracking, could generate long tubelets, but are usually computationally expensive. In this work, we propose a framework for object detection in videos, which consists of a novel tubelet proposal network to efficiently generate spatiotemporal proposals, and a Long Short-term Memory (LSTM) network that incorporates temporal information from tubelet proposals for achieving high object detection accuracy in videos. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed framework for object detection in videos.
1. Introduction
The performance of object detection has been significantly improved recently with the emergence of deep neural networks. Novel neural network structures, such as GoogLeNet [29], VGG [27] and ResNet [8], were proposed to improve the learning capability on large-scale computer vision datasets for various computer vision tasks, such as object detection [5, 24, 23, 21], semantic segmentation [20, 2, 16], tracking [31, 1, 33], scene understanding [25, 26, 19], person search [18, 32], etc. State-of-the-art
⋆ Corresponding authors
Figure 1. Proposal methods for video object detection. (a) Original frames. (b) Static proposals have no temporal association, which makes it hard to incorporate temporal information for proposal classification. (c) Bounding-box regression methods tend to focus on the dominant object, lose proposal diversity and may also cause a recall drop, since all proposals tend to aggregate on the dominant objects. (d) Ideal proposals should have temporal association and follow the same motion patterns as the objects while keeping their diversity.
object detection frameworks for static images are based on these networks and consist of three main stages [6]. Bounding box proposals are first generated from the input image based on how likely each location contains an object of interest. Appearance features are then extracted from each box proposal to classify it as one of the object classes. Such bounding boxes and their associated class scores are refined by post-processing techniques (e.g., Non-Maximum Suppression) to obtain the final detection results. Multiple frameworks, such as Fast R-CNN [5] and Faster R-CNN [24], followed this research direction and eventually formulated the object detection problem as training end-to-end deep neural networks.
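The post-processing stage mentioned above can be illustrated with a short sketch. Below is a minimal NumPy implementation of greedy non-maximum suppression; the 0.3 IoU threshold is an arbitrary illustrative value, not a setting taken from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) detection confidences.
    Returns indices of the kept boxes, highest-scoring first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current top-scoring box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```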
Although great success has been achieved in detecting
objects on static images, video object detection remains
a challenging problem. Several factors contribute to the
difficulty of this problem, which include the drastic ap-
pearance and scale changes of the same object over time,
object-to-object occlusions, motion blur, and the mismatch

between the static-image data and video data. The new
task of detecting objects in videos (VID) introduced by the
ImageNet challenge in 2015 provides a large-scale video
dataset, which requires labeling every object of 30 classes
in each frame of the videos. Driven by this new dataset,
multiple systems [7, 14, 15] were proposed to extend static-image object detectors for videos.
Similar to the bounding box proposals in static object detection, their counterparts in videos are called tubelets, which are essentially sequences of bounding box proposals. State-of-the-art algorithms for video object detection utilize tubelets to some extent to incorporate temporal information for obtaining detection results. However, tubelet generation is usually based on frame-by-frame detection results, which is extremely time consuming. For instance, the tracking algorithm used by [14, 15] needs 0.5 second to process each detection box in each frame, which prevents the systems from generating enough tubelet proposals for classification in an acceptable amount of time, since a video usually contains hundreds of frames with hundreds of detection boxes on each frame. Motion-based methods, such as optical-flow-guided propagation [14], can generate dense tubelets efficiently, but the lengths are usually limited to only several frames (e.g., 7 frames in [14]) because of their inconsistent performance for long-term tracking. The ideal tubelets for video object detection should be long enough to incorporate temporal information while diverse enough to ensure high recall rates (Figure 1).
To mitigate the problems, we propose a framework for
object detection in videos. It consists of a Tubelet Proposal
Network (TPN) that simultaneously obtains hundreds of di-
verse tubelets starting from static proposals, and a Long
Short-Term Memory (LSTM) sub-network for estimating
object confidences based on temporal information from the
tubelets. Our TPN can efficiently generate tubelet propos-
als via feature map pooling. Given a static box proposal at
a starting frame, we pool features from the same box loca-
tions across multiple frames to train an efficient multi-frame
regression neural network as the TPN. It is able to learn
complex motion patterns of the foreground objects to gen-
erate robust tubelet proposals. Hundreds of proposals in a
video can be tracked simultaneously. Such tubelet proposals
are shown to be of better quality than the ones obtained on
each frame independently, which demonstrates the impor-
tance of temporal information in videos. The visual features
extracted from the tubelet boxes are automatically aligned
into feature sequences and are suitable for learning temporal
features with the following LSTM network, which is able
to capture long-term temporal dependency for accurate pro-
posal classification.
The contribution of this paper is that we propose a new
deep learning framework that combines tubelet proposal
generation and temporal classification with visual-temporal
features. An efficient tubelet proposal generation algo-
rithm is developed to generate tubelet proposals that cap-
ture spatiotemporal locations of objects in videos. A tempo-
ral LSTM model is adopted for classifying tubelet propos-
als with both visual features and temporal features. Such
high-level temporal features are generally ignored by exist-
ing detection systems but are crucial for object detection in
videos.
2. Related work
Object detection in static images. State-of-the-art object detection systems are all based on deep CNNs. Girshick et al. [6] proposed R-CNN, which decomposes the object detection problem into multiple stages including region proposal generation, CNN finetuning, and region classification. To accelerate the training process of R-CNN, Fast R-CNN [5] was proposed to avoid the time-consuming step of feeding each image patch from the bounding box proposals into the CNN to obtain feature representations. Features of multiple bounding boxes within the same image are instead warped from the same feature map efficiently via ROI-pooling operations. To accelerate the generation of candidate bounding box proposals, Faster R-CNN [24] integrates a Region Proposal Network into the Fast R-CNN framework, and is able to generate box proposals directly with neural networks.
Object detection in videos. Since the introduction of the VID task by the ImageNet challenge, there have been multiple object detection systems for detecting objects in videos. These methods focused on post-processing the class scores of static-image detectors to enforce temporal consistency of the scores. Han et al. [7] associated initial detection results into sequences. Weaker class scores along the sequences within the same video were boosted to improve the initial frame-by-frame detection results. Kang et al. [15] generated new tubelet proposals by applying tracking algorithms to static-image bounding box proposals. The class scores along each tubelet were first evaluated by the static-image object detector and then re-scored by a 1D CNN model. The same group [14] also tried a different strategy for tubelet classification and re-scoring. In addition, initial detection boxes were propagated to nearby frames according to optical flows between frames, and the class scores not belonging to the top classes were suppressed to enforce temporal consistency of the class scores.
Object localization in videos. There have been works and datasets [3, 13, 22] on object localization in videos. However, they have a simplified problem setting, where each video is assumed to contain only one known or unknown class and only one of the objects needs to be annotated in each frame.
3. Tubelet proposal networks
Existing methods for object detection in videos generate tubelet proposals utilizing either generic single-object trackers starting at a few key frames [15] or data association methods (i.e., tracking-by-detection) on per-frame object detection results [7]. These methods are either too computationally expensive to generate enough dense

Figure 2. The proposed object detection system, which consists of two main parts. The first is a tubelet proposal network for efficiently generating tubelet proposals. The tubelet proposal network extracts multi-frame features within the spatial anchors, predicts the object motion patterns relative to the spatial anchors and generates tubelet proposals. The gray box indicates the video clip and different colors indicate the proposal processes of different spatial anchors. The second part is an encoder-decoder CNN-LSTM network that extracts tubelet features and classifies each proposal box into different classes. The tubelet features are first fed into the encoder LSTM by a forward pass to capture the appearance features of the entire sequence. Then the states are copied to the decoder LSTM for a backward pass with the tubelet features. The encoder-decoder LSTM processes the entire clip before outputting class probabilities for each frame.
tubelets, or are likely to drift and result in tracking failures. Even for a 100-fps single-object tracker, it might take about 56 GPU days to generate tubelets with 300 bounding boxes per frame for the large-scale ImageNet VID dataset.
We propose a Tubelet Proposal Network (TPN) which is
able to generate tubelet proposals efficiently for videos. As
shown in Figure 2, the Tubelet Proposal Network consists of two main components: the first sub-network extracts visual features across time based on static region proposals at a single frame. Our key observation is that, since the receptive fields (RF) of CNNs are generally large enough, we can perform feature map pooling simply at the same bounding box locations across time to extract the visual features of moving objects. Based on the pooled visual features, the second component is a regression layer for estimating the bounding boxes' temporal displacements to generate tubelet proposals.
3.1. Preliminaries on ROI-pooling for regression
There are existing works that utilize feature map pooling for object detection. The Fast R-CNN framework [5] utilizes ROI-pooling on visual feature maps for object classification and bounding box regression. The input image is fed into a CNN and forward propagated to generate visual feature maps. Given different object proposals, their visual features are directly ROI-pooled from the feature maps according to the box coordinates. In this way, the CNN only needs to forward propagate once for each input image and saves
much computational time. Let $b_t^i = (x_t^i, y_t^i, w_t^i, h_t^i)$ denote the $i$th static box proposal at time $t$, where $x$, $y$, $w$ and $h$ represent the two coordinates of the box center, and the width and height of the box proposal. The ROI-pooling obtains visual features $r_t^i \in \mathbb{R}^f$ at box $b_t^i$.
The ROI-pooled features $r_t^i$ for each object bounding box proposal can be used for object classification and, more interestingly, for bounding box regression, which indicates that the visual features obtained by feature map pooling contain the necessary information describing objects' locations. Inspired by this technique, we propose to extract multi-frame visual features via ROI-pooling, and use such features for generating tubelet proposals via regression.
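To make the ROI-pooling idea concrete, the following is a minimal NumPy sketch of pooling a fixed-size grid of max-pooled features from a feature map given a box; the bin boundaries, rounding and the 1/16 spatial scale are simplified assumptions, not the exact Fast R-CNN operator.

```python
import numpy as np

def roi_pool(feature_map, box, output_size=7, spatial_scale=1.0 / 16):
    """Max-pool the feature-map region under `box` into an output_size x output_size grid.

    feature_map: (C, H, W) array of CNN features.
    box: (x1, y1, x2, y2) in image coordinates.
    """
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in box]
    x1, y1 = min(max(x1, 0), W - 1), min(max(y1, 0), H - 1)
    x2, y2 = min(max(x2, x1 + 1), W), min(max(y2, y1 + 1), H)

    pooled = np.zeros((C, output_size, output_size), dtype=feature_map.dtype)
    xs = np.linspace(x1, x2, output_size + 1).astype(int)
    ys = np.linspace(y1, y2, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled
```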
3.2. Static object proposals as spatial anchors
Static object proposals are class-free bounding boxes indicating the possible locations of objects, which could be efficiently obtained by different proposal methods such as SelectiveSearch [30], Edge Boxes [34] and Region Proposal Networks [24]. For object detection in videos, however, we need both spatial and temporal locations of the objects, which are crucial to incorporate temporal information for accurate object proposal classification.
For general objects in videos, movements are usually complex and difficult to predict. Static object proposals usually have high recall rates (e.g., >90%) at individual frames, which is important because recall is the upper bound of object detection performance. It is therefore natural to use static proposals as starting anchors for estimating their movements in the following frames to generate tubelet proposals. If their movements can be robustly estimated, high object recall rates at the following frames can be maintained.
Let $b_1^i$ denote a static proposal of interest at time $t = 1$. Particularly, to generate a tubelet proposal starting at $b_1^i$, visual features within the w-frame temporal window from frame 1 to w are pooled at the same location $b_1^i$ as $r_1^i, r_2^i, \cdots, r_w^i$ in order to generate the tubelet proposal. We call $b_1^i$ a "spatial anchor". The pooled regression features encode the visual appearances of the objects. Recovering correspondences between the visual features $(r_1^i, r_2^i, \cdots, r_w^i)$ leads to accurate tubelet proposals, which is modeled by a regression layer detailed in the next subsection.
The reason why we are able to pool multi-frame features
from the same spatial location for tubelet proposals is that
CNN feature maps at higher layers usually have large re-
ceptive fields. Even if visual features are pooled from a
small bounding box, its visual context is far greater than
the bounding box itself. Pooling at the same box locations
across time is therefore capable of capturing large possible
movements of objects. In Figure 2, we illustrate the "spatial anchors" for tubelet proposal generation. The features
in the same locations are aligned to predict the movement
of the object.
We use a GoogLeNet with Batch Normalization (BN) model [12] for the TPN. In our settings, the ROI-pooling layer is connected to "inception_4d" of the BN model, which has a receptive field of 363 pixels. Therefore, the network is able to handle up to 363-pixel movements when ROI-pooling the same box locations across time, which is more than enough to capture short-term object movements. Each static proposal is regarded as an anchor point for feature extraction within a temporal window w.
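As a sketch of the spatial-anchor idea, the snippet below pools features at the same box location from each of w frames and concatenates them into a single fw-dimensional regression feature. The simple mean-pool over the cropped region is only a stand-in for the actual ROI-pooling layer, and the feature-map and box shapes are illustrative assumptions.

```python
import numpy as np

def pool_at_anchor(feature_maps, box, spatial_scale=1.0 / 16):
    """Mean-pool each frame's (C, H, W) feature map inside the same anchor box.

    feature_maps: list of w arrays, one per frame of the temporal window.
    box: (x1, y1, x2, y2) spatial anchor in image coordinates, kept fixed over time.
    Returns a concatenated (w * C,) feature vector [r_1, ..., r_w].
    """
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in box]
    pooled = []
    for fmap in feature_maps:
        C, H, W = fmap.shape
        crop = fmap[:, max(y1, 0):max(y2, y1 + 1), max(x1, 0):max(x2, x1 + 1)]
        pooled.append(crop.reshape(C, -1).mean(axis=1))  # r_t, a C-dim vector
    return np.concatenate(pooled)

# Example: w = 5 frames of 1024-channel feature maps, one spatial anchor.
frames = [np.random.randn(1024, 36, 63) for _ in range(5)]
anchor = (100.0, 80.0, 260.0, 240.0)
r = pool_at_anchor(frames, anchor)  # shape (5 * 1024,)
```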
3.3. Supervisions for tubelet proposal generation
Our goal is to generate tubelet proposals that have high object recall rates at each frame and can accurately track objects. Based on the pooled visual features $r_1^i, r_2^i, \cdots, r_w^i$ at box locations $b_t^i$, we train a regression network $R(\cdot)$ that effectively estimates the relative movements w.r.t. the spatial anchors,
$$m_1^i, m_2^i, \cdots, m_w^i = R(r_1^i, r_2^i, \cdots, r_w^i), \quad (1)$$
where the relative movements $m_t^i = (\Delta x_t^i, \Delta y_t^i, \Delta w_t^i, \Delta h_t^i)$ are calculated as
$$\Delta x_t^i = (x_t^i - x_1^i)/w_1^i, \quad \Delta y_t^i = (y_t^i - y_1^i)/h_1^i, \quad \Delta w_t^i = \log(w_t^i / w_1^i), \quad \Delta h_t^i = \log(h_t^i / h_1^i). \quad (2)$$
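A minimal NumPy sketch of the relative-movement encoding in Eq. (2), together with its inverse, which recovers the actual tubelet boxes mentioned in the next sentence; boxes are assumed to be (x_center, y_center, w, h) rows, one per frame.

```python
import numpy as np

def movement_targets(boxes):
    """Relative movements of a box track w.r.t. its first-frame spatial anchor (Eq. 2).

    boxes: (w, 4) array of (x, y, w, h) box parameters over a temporal window.
    Returns a (w, 4) array of (dx, dy, dw, dh); the first row is always zero.
    """
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    dx = (x - x[0]) / w[0]
    dy = (y - y[0]) / h[0]
    dw = np.log(w / w[0])
    dh = np.log(h / h[0])
    return np.stack([dx, dy, dw, dh], axis=1)

def apply_movements(anchor, movements):
    """Invert Eq. (2): recover absolute boxes from the anchor box and relative movements."""
    x0, y0, w0, h0 = anchor
    x = movements[:, 0] * w0 + x0
    y = movements[:, 1] * h0 + y0
    w = np.exp(movements[:, 2]) * w0
    h = np.exp(movements[:, 3]) * h0
    return np.stack([x, y, w, h], axis=1)
```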
Once we obtain such relative movements, the actual box locations of the tubelet could be easily inferred. We adopt a fully-connected layer that takes the concatenated visual features $[r_1^i, \cdots, r_w^i]^T$ as the input, and outputs the 4w movement values of a tubelet proposal by
$$[m_1^i, \cdots, m_w^i]^T = W_w [r_1^i, \cdots, r_w^i]^T + b_w, \quad (3)$$
where $W_w \in \mathbb{R}^{fw \times 4w}$ and $b_w \in \mathbb{R}^{4w}$ are the learnable parameters of the layer.
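Eq. (3) is a single fully-connected layer over the concatenated per-frame features. The sketch below only illustrates the shapes involved, with randomly initialized parameters standing in for the learned ones.

```python
import numpy as np

f, w = 1024, 5                       # per-frame feature dimension and temporal window
rng = np.random.default_rng(0)

W_w = rng.standard_normal((f * w, 4 * w)) * 0.01   # learnable weights of Eq. (3)
b_w = np.zeros(4 * w)                              # learnable bias

r = rng.standard_normal(f * w)       # concatenated features [r_1, ..., r_w]
m = r @ W_w + b_w                    # 4w movement values [m_1, ..., m_w]
movements = m.reshape(w, 4)          # one (dx, dy, dw, dh) row per frame
```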
The remaining problem is how to design proper supervisions for learning the relative movements. Our key assumption is that the tubelet proposals should have consistent movement patterns with the ground-truth objects. However, given static object proposals as the starting boxes for tubelet generation, they usually do not have a perfect 100% Intersection-over-Union (IoU) ratio with the ground-truth object boxes. Therefore, we require static box proposals that are close to ground-truth boxes to follow the movement patterns of the ground-truth boxes. More specifically, if a static object proposal $b_t^i$ has a greater-than-0.5 IoU value with a ground-truth box $\hat{b}_t^i$, and this IoU value is greater than those with other ground-truth boxes, our regression layer tries to generate tubelet boxes following the same movement patterns $\hat{m}_t^i$ of the ground truth $\hat{b}_t^i$ as much as possible. The relative movement targets $\hat{m}_t^i = (\Delta\hat{x}_t^i, \Delta\hat{y}_t^i, \Delta\hat{w}_t^i, \Delta\hat{h}_t^i)$ are defined w.r.t. the ground-truth boxes at time 1, $\hat{b}_1^i$, in a similar way as Eq. (2). It is trivial to see that $\hat{m}_1^i = (0, 0, 0, 0)$. Therefore, we only need to predict $\hat{m}_2^i$ to $\hat{m}_w^i$. Note that by learning relative movements w.r.t. the spatial anchors at the first frame, we can avoid the cumulative errors of conventional tracking algorithms to some extent.
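The assignment rule described above (a proposal inherits the movement pattern of the ground-truth box it overlaps most, provided the IoU exceeds 0.5) can be sketched as follows; the box format and the helper names are assumptions for illustration only.

```python
import numpy as np

def iou(box, gt_boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of ground-truth boxes."""
    xx1 = np.maximum(box[0], gt_boxes[:, 0])
    yy1 = np.maximum(box[1], gt_boxes[:, 1])
    xx2 = np.minimum(box[2], gt_boxes[:, 2])
    yy2 = np.minimum(box[3], gt_boxes[:, 3])
    inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area + gt_area - inter)

def assign_supervision(proposal, gt_boxes, iou_threshold=0.5):
    """Return the index of the ground-truth track whose movements supervise this proposal,
    or None if no ground-truth box overlaps it by more than the threshold."""
    overlaps = iou(proposal, gt_boxes)
    best = int(np.argmax(overlaps))
    return best if overlaps[best] > iou_threshold else None
```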
The movement targets are normalized by their means $\bar{m}_t$ and standard deviations $\sigma_t$ as the regression objectives,
$$\tilde{m}_t^i = (\hat{m}_t^i - \bar{m}_t)/\sigma_t, \quad \text{for } t = 1, \dots, w. \quad (4)$$
To generate N tubelets that follow the movement patterns of their associated ground-truth boxes, we minimize the following objective function w.r.t. all $\Delta x_t^i, \Delta y_t^i, \Delta w_t^i, \Delta h_t^i$,
$$L(\{\tilde{M}\}, \{M\}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{w} \sum_{k \in \{x, y, w, h\}} d(\Delta k_t^i), \quad (5)$$
where $\{\tilde{M}\}$ and $\{M\}$ are the sets of all normalized movement targets and network outputs, and
$$d(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise,} \end{cases} \quad (6)$$
is the smoothed $L_1$ loss for robust box regression in [5].
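A NumPy sketch of the smoothed L1 objective of Eqs. (5)-(6), averaged over N tubelets; here the loss is assumed to be taken on the difference between the normalized movement targets and the network outputs.

```python
import numpy as np

def smooth_l1(x):
    """Eq. (6): 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def tubelet_regression_loss(targets, outputs):
    """Eq. (5): mean over N tubelets of the summed smoothed-L1 terms.

    targets, outputs: (N, w, 4) arrays of normalized movements (dx, dy, dw, dh).
    """
    n = targets.shape[0]
    return smooth_l1(outputs - targets).sum() / n
```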
The network outputs $\dot{m}_t^i$ are mapped back to the real relative movements $m_t^i$ by
$$m_t^i = \dot{m}_t^i \cdot \sigma_t + \bar{m}_t. \quad (7)$$
By our definition, if a static object proposal covers some area of the object, it should cover the same portion of the object in the later frames (see Figure 1 (d) for examples).
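The target normalization of Eq. (4) and the inverse mapping of Eq. (7) form the usual standardize/de-standardize pair; a small sketch, with the per-frame mean and standard-deviation statistics assumed to be precomputed over the training set:

```python
import numpy as np

def normalize_targets(m_hat, mean_t, std_t):
    """Eq. (4): standardize movement targets with per-frame statistics.

    m_hat, mean_t, std_t: (w, 4) arrays of targets, means and standard deviations.
    """
    return (m_hat - mean_t) / std_t

def denormalize_outputs(m_dot, mean_t, std_t):
    """Eq. (7): map normalized network outputs back to real relative movements."""
    return m_dot * std_t + mean_t
```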
3.4. Initialization for multi-frame regression layer
The size of the temporal window is also a key factor in the TPN. The simplest model is a 2-frame model. For a given frame, the features within the spatial anchors on the current frame and the next frame are extracted and concatenated, $[r_1^i, r_2^i]^T$, to estimate the movements of $b_1^i$ on the next frame. However, since the 2-frame model only utilizes minimal temporal information within a very short temporal window, the generated tubelets may be non-smooth and easy to drift. Increasing the temporal window utilizes more temporal information so as to estimate more complex movement patterns.
Given the temporal window size w, the dimension of the extracted features is fw, where f is the dimension of the visual features in a single frame within the spatial anchors (e.g., the 1024-dimensional "inception_5b" features from the BN model in our settings). Therefore, the parameter matrix of the regression layer is of $\mathbb{R}^{fw \times 4w}$ and grows quadratically with the temporal window size w.
If the temporal window size is large, randomly initializing such a large matrix makes it difficult to learn a good regression layer. We propose a "block" initialization method that uses the learned 2-frame model to initialize the multi-frame models.
In Figure 3, we show how to use a pre-trained 2-frame model's regression layer to initialize that of a 5-frame model. Since the target $\hat{m}_1^i$ in Equation (2) is always $(0, 0, 0, 0)$, we only need to estimate movements for the later frames. The parameter matrix $W_2$ is of size $\mathbb{R}^{2f \times 4}$ since the input features are concatenations of two frames, and the bias term $b_2$ is of size $\mathbb{R}^4$. For the 5-frame regression layer, the parameter matrix $W_5$ is of size $\mathbb{R}^{5f \times (4 \times 4)}$ and the bias term

Figure 3. Illustration of the "block" initialization method. The 2-frame model's regression layer has weights $W_2$ and bias $b_2$; $W_2$ consists of two sub-matrices $A$ and $B$ corresponding to the features of the first and second frames. A 5-frame model's regression layer can then be initialized with these sub-matrices as shown in the figure. The bias term $b_5$ is a simple repetition of $b_2$.
$b_5$ is of $\mathbb{R}^{4 \times 4}$. Essentially, we utilize visual features from frames 1 & 2 to estimate the movements in frame 2, frames 1 & 3 for frame 3, and so on. The matrix $W_2$ is therefore divided into two sub-matrices $A \in \mathbb{R}^{f \times 4}$ and $B \in \mathbb{R}^{f \times 4}$ to fill the corresponding entries of matrix $W_5$. The bias term $b_5$ is a repetition of $b_2$ for 4 times.
In our experiments, we first train a 2-frame model with
random initialization and then use the 2-frame model to ini-
tialize the multi-frame regression layer.
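The "block" initialization of Figure 3 copies the two f×4 sub-matrices of the trained 2-frame weights into the corresponding entries of the larger multi-frame matrix. A NumPy sketch under the layout described above (frame-1 features paired with each later frame); the exact block placement is an illustration of that description, not code from the paper.

```python
import numpy as np

def block_init(W_2, b_2, window):
    """Initialize a `window`-frame regression layer from a trained 2-frame one (Fig. 3).

    W_2: (2f, 4) weights and b_2: (4,) bias of the 2-frame model.
    Returns W of shape (window * f, 4 * (window - 1)) and b of shape (4 * (window - 1)):
    movements are predicted only for frames 2..window, since the frame-1 target is zero.
    """
    f = W_2.shape[0] // 2
    A, B = W_2[:f], W_2[f:]                   # frame-1 and "other frame" sub-matrices
    W = np.zeros((window * f, 4 * (window - 1)))
    for k in range(window - 1):               # movements for frame k + 2
        cols = slice(4 * k, 4 * (k + 1))
        W[0:f, cols] = A                      # frame-1 features always contribute A
        W[(k + 1) * f:(k + 2) * f, cols] = B  # frame-(k+2) features contribute B
    b = np.tile(b_2, window - 1)
    return W, b

# Example: build a 5-frame layer from a (pretend) trained 2-frame layer with f = 1024.
f = 1024
W_2, b_2 = np.random.randn(2 * f, 4) * 0.01, np.zeros(4)
W_5, b_5 = block_init(W_2, b_2, window=5)     # shapes (5120, 16) and (16,)
```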
4. Overall detection framework with tubelet generation and tubelet classification
Based on the Tubelet Proposal Networks, we propose a framework that is efficient for object detection in videos. Compared with state-of-the-art single-object trackers, it takes our TPN only 9 GPU days to generate dense tubelet proposals on the ImageNet VID dataset. The framework is also capable of utilizing useful temporal information from tubelet proposals to increase detection accuracy. As shown in Figure 2, the framework consists of two networks: the first is the TPN for generating candidate object tubelets, and the second is a CNN-LSTM classification network that classifies each bounding box on the tubelets into different object categories.
4.1. Efficient tubelet proposal generation
The TPN is able to estimate the movements of each static object proposal within a temporal window w. For object detection in videos on large-scale datasets, we need to not only efficiently generate tubelets for hundreds of spatial anchors in parallel, but also generate tubelets of sufficient length to incorporate enough temporal information.
To generate tubelets of length l (see the illustration in Figure 4 (a)), we utilize static object proposals on the first frame as spatial anchors, and then iteratively apply the TPN with temporal window w until the tubelets cover all l frames.
Figure 4. Efficiently generating tubelet proposals. (a) The TPN generates a tubelet proposal within a temporal window w and uses the last-frame output of the proposal as the static anchor for the next iteration. This process iterates until the whole track length is covered. (b) Multiple static anchors in a frame are fed to the Fast R-CNN network with a single forward pass for simultaneously generating multiple tubelet proposals. Different colors indicate different spatial anchors.
The last estimated locations of the previous iteration are used as spatial anchors for the next iteration. This process can iterate to generate tubelet proposals of arbitrary lengths.
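The iterative chaining of TPN windows described above can be sketched as a simple loop; `run_tpn` is a hypothetical stand-in for one TPN pass that returns the boxes predicted for the w frames of its window.

```python
def generate_tubelet(anchor_box, first_frame, track_length, window, run_tpn):
    """Chain TPN predictions until `track_length` frames are covered (Fig. 4 (a)).

    anchor_box: static proposal in the first frame, used as the initial spatial anchor.
    run_tpn(anchor_box, start_frame): hypothetical one-window TPN call returning a list
    of `window` boxes for frames start_frame .. start_frame + window - 1.
    """
    tubelet = [anchor_box]
    frame = first_frame
    while len(tubelet) < track_length:
        boxes = run_tpn(tubelet[-1], frame)   # predict boxes within one temporal window
        tubelet.extend(boxes[1:])             # the first box repeats the current anchor
        frame += window - 1                   # the last prediction seeds the next window
    return tubelet[:track_length]
```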
For N static object proposals in the same starting frame, the bottom CNN only needs to conduct a single forward propagation to obtain the visual feature maps, which enables the efficient generation of hundreds of tubelet proposals (see Figure 4 (b)).
Compared to previous methods that adopt generic single-object trackers, our proposed method is dramatically faster at generating a large number of tubelets. The tracking method used in [15] has a reported running speed of 0.5 fps for a single object; for a typical frame with 300 spatial anchors, it would take 150 s per frame. Our method has an average speed of 0.488 s per frame, which is about 300× faster. Even compared to the recent 100 fps single-object tracker in [9], our method is about 6.14× faster.
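The quoted speed-ups follow directly from the per-frame timings given above; a quick arithmetic check:

```python
anchors_per_frame = 300

tracker_time = anchors_per_frame * 0.5         # 300 boxes at 0.5 s each, the 150 s/frame quoted above
tpn_time = 0.488                               # average TPN time per frame
print(tracker_time / tpn_time)                 # ~307, i.e. roughly 300x faster

fast_tracker_time = anchors_per_frame / 100.0  # a 100 fps tracker needs 3 s for 300 boxes
print(fast_tracker_time / tpn_time)            # ~6.15, matching the reported ~6.14x
```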
4.2. Encoder-decoder LSTM (ED-LSTM) for temporal classification
After generating the length-l tubelet proposal, visual features $u_1^i, \cdots, u_t^i, \cdots, u_l^i$ can be pooled from the tubelet box locations for object classification with temporal information. Existing methods [15, 7, 14] mainly use temporal information in post-processing, either propagating detections to neighboring frames or temporally smoothing detection scores. The temporal consistency of detection results is important, but to capture the complex appearance changes in the tubelets, we need to learn discriminative spatiotemporal features at the tubelet box locations.
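A minimal PyTorch sketch of an encoder-decoder LSTM of the kind described in Figure 2: the encoder reads the tubelet features in a forward pass, its final states initialize the decoder, which reads the sequence backward and emits per-frame class scores. The layer sizes, the linear classification head and the 31 output classes (30 VID classes plus background) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EncoderDecoderLSTM(nn.Module):
    """Encoder-decoder LSTM over per-frame tubelet features (sketch of Fig. 2)."""

    def __init__(self, feature_dim=1024, hidden_dim=512, num_classes=31):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, tubelet_features):
        # tubelet_features: (batch, length, feature_dim) pooled along one tubelet.
        _, (h, c) = self.encoder(tubelet_features)      # forward pass over the clip
        reversed_feats = torch.flip(tubelet_features, dims=[1])
        out, _ = self.decoder(reversed_feats, (h, c))   # backward pass, states copied
        out = torch.flip(out, dims=[1])                 # restore temporal order
        return self.classifier(out)                     # per-frame class scores

# Example: 2 tubelets of length 20 with 1024-dim features per frame.
scores = EncoderDecoderLSTM()(torch.randn(2, 20, 1024))  # shape (2, 20, 31)
```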
As shown in Figure 2, the proposed classification sub-network contains a CNN that processes input images to obtain feature maps. Classification features ROI-pooled from each tubelet proposal across time are then fed into a one-layer Long Short-Term Memory (LSTM) network [11] for temporal classification.

References
K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8), 1997.
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In CVPR, 2015.