Showing papers on "Video tracking published in 2020"

PDF

Open Access

Journal Article•DOI•

FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

[...]

Yifu Zhang¹, Chunyu Wang², Xinggang Wang¹, Wenjun Zeng², Wenyu Liu¹ - Show less +1 more•Institutions (2)

Huazhong University of Science and Technology¹, Microsoft²

04 Apr 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A simple approach which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features allows \emph{FairMOT} to obtain high levels of detection and tracking accuracy and outperform previous state-of-the-arts by a large margin on several public datasets.

...read moreread less

Abstract: There has been remarkable progress on object detection and re-identification (re-ID) in recent years which are the key components of multi-object tracking. However, little attention has been focused on jointly accomplishing the two tasks in a single network. Our study shows that the previous attempts ended up with degraded accuracy mainly because the re-ID task is not fairly learned which causes many identity switches. The unfairness lies in two-fold: (1) they treat re-ID as a secondary task whose accuracy heavily depends on the primary detection task. So training is largely biased to the detection task but ignores the re-ID task; (2) they use ROI-Align to extract re-ID features which is directly borrowed from object detection. However, this introduces a lot of ambiguity in characterizing objects because many sampling points may belong to disturbing instances or background. To solve the problems, we present a simple approach \emph{FairMOT} which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks allows \emph{FairMOT} to obtain high levels of detection and tracking accuracy and outperform previous state-of-the-arts by a large margin on several public datasets. The source code and pre-trained models are released at this https URL.

...read moreread less

507 citations

Journal Article•DOI•

Deep learning in video multi-object tracking: A survey

[...]

Gioele Ciaparrone¹, Gioele Ciaparrone², Francisco Luque Sánchez², Siham Tabik², Luigi Troiano³, Roberto Tagliaferri¹, Francisco Herrera² - Show less +3 more•Institutions (3)

University of Salerno¹, University of Granada², University of Sannio³

14 Mar 2020-Neurocomputing

TL;DR: A comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

...read moreread less

448 citations

Proceedings Article•DOI•

Siam R-CNN: Visual Tracking by Re-Detection

[...]

Paul Voigtlaender¹, Jonathon Luiten¹, Philip H. S. Torr², Bastian Leibe¹•Institutions (2)

RWTH Aachen University¹, University of Oxford²

14 Jun 2020

TL;DR: This work presents Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking, and combines this with a novel tracklet-based dynamic programming algorithm to model the full history of both the object to be tracked and potential distractor objects.

...read moreread less

Abstract: We present Siam R-CNN, a Siamese re-detection architecture which unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions, to model the full history of both the object to be tracked and potential distractor objects. This enables our approach to make better tracking decisions, as well as to re-detect tracked objects after long occlusion. Finally, we propose a novel hard example mining strategy to improve Siam R-CNN's robustness to similar looking objects. Siam R-CNN achieves the current best performance on ten tracking benchmarks, with especially strong results for long-term tracking. We make our code and models available at www.vision.rwth-aachen.de/page/siamrcnn.

...read moreread less

418 citations

Posted Content•

Center-based 3D Object Detection and Tracking.

[...]

Tianwei Yin¹, Xingyi Zhou¹, Philipp Krähenbühl¹•Institutions (1)

University of Texas at Austin¹

19 Jun 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: The framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity, and refines these estimates using additional point features on the object.

...read moreread less

Abstract: Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single model method by a large margin and ranks first among all Lidar-only submissions. The code and pretrained models are available at this https URL.

...read moreread less

397 citations

Journal Article•DOI•

UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking

[...]

Longyin Wen, Dawei Du¹, Zhaowei Cai², Zhen Lei, Ming-Ching Chang¹, Honggang Qi³, Jongwoo Lim⁴, Ming-Hsuan Yang⁵, Siwei Lyu¹ - Show less +5 more•Institutions (5)

University at Albany, SUNY¹, University of California, San Diego², Chinese Academy of Sciences³, Hanyang University⁴, University of California, Merced⁵

01 Apr 2020-Computer Vision and Image Understanding

TL;DR: This work performs a comprehensive quantitative study on the effects of object detection accuracy to the overall MOT performance, using the new large-scale University at Albany DETection and tRACking (UA-DETRAC) benchmark dataset.

...read moreread less

332 citations

Posted Content•

MOT20: A benchmark for multi object tracking in crowded scenes.

[...]

Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixé - Show less +5 more

19 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: The MOT20benchmark, consisting of 8 new sequences depicting very crowded challenging scenes, is presented, and gives to chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.

...read moreread less

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our MOT20benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark was presented first at the 4thBMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives to chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.

...read moreread less

292 citations

Journal Article•DOI•

LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images

[...]

Grégory Rogez¹, Philippe Weinzaepfel, Cordelia Schmid¹•Institutions (1)

University of Grenoble¹

01 May 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Li et al. as discussed by the authors proposed an end-to-end architecture for joint 2D and 3D human pose estimation in natural images, which consists of a pose proposal generator that suggests candidate poses at different locations in the image, a classifier that scores the different pose proposals, and a regressor that refines pose proposals both in 2D, and the final pose estimation is obtained by integrating over neighboring pose hypotheses.

...read moreread less

Abstract: We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.

...read moreread less

273 citations

Proceedings Article•DOI•

Deformable Siamese Attention Networks for Visual Object Tracking

[...]

Yuechen Yu, Yilei Xiong, Weilin Huang, Matthew R. Scott

14 Jun 2020

TL;DR: This paper proposes SiamAttn, a new Siamese attention mechanism that computes deformable self-attention and cross-att attention, capable of aggregating rich contextual interdependencies between the target template and the search image, for more accurate tracking.

...read moreread less

Abstract: Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The crossattention is capable of aggregating rich contextual interdependencies between the target template and the search image, providing an implicit manner to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state-of-the-art results, outperforming recent strong baseline, SiamRPN++, by 0.464 to 0.537 and 0.415 to 0.470 EAO on VOT 2016 and 2018.

...read moreread less

270 citations

Book Chapter•DOI•

Towards Real-Time Multi-Object Tracking

[...]

Zhongdao Wang¹, Liang Zheng², Yixuan Liu¹, Yali Li¹, Shengjin Wang¹ - Show less +1 more•Institutions (2)

Tsinghua University¹, Australian National University²

23 Aug 2020

TL;DR: Zhang et al. as discussed by the authors proposed a real-time MOT system that allows target detection and appearance embedding to be learned in a shared model, such that the model can simultaneously output detections and the corresponding embeddings.

...read moreread less

Abstract: Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Having the two models separately executed might lead to efficiency problems, as the running time is simply a sum of the two steps without investigating potential structures that can be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods but not real-time MOT system. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to the state-of-the-art trackers embodying separate detection and embedding (SDE) learning (\(64.4\%\) MOTA v.s. \(66.1\%\) MOTA on MOT-16 challenge). Code and models are available at https://github.com/Zhongdao/Towards-Realtime-MOT.

...read moreread less

260 citations

Proceedings Article•DOI•

Learning a Neural Solver for Multiple Object Tracking

[...]

Guillem Braso¹, Laura Leal-Taixé¹•Institutions (1)

Technische Universität München¹

14 Jun 2020

TL;DR: This work exploits the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs) and shows that learning in MOT does not need to be restricted to feature extraction, but it can also be applied to the data association step.

...read moreread less

Abstract: Graphs offer a natural way to formulate Multiple Object Tracking (MOT) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such structured domain is not trivial. As a consequence, most learning-based work has been devoted to learning better features for MOT and then using these with well-established optimization frameworks. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs). By operating directly on the graph domain, our method can reason globally over an entire set of detections and predict final solutions. Hence, we show that learning in MOT does not need to be restricted to feature extraction, but it can also be applied to the data association step. We show a significant improvement in both MOTA and IDF1 on three publicly available benchmarks. Our code is available at https://bit.ly/motsolv.

...read moreread less

250 citations

Journal Article•DOI•

HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking

[...]

Jonathon Luiten¹, Aljosa Osep¹, Patrick Dendorfer², Philip H. S. Torr², Andreas Geiger², Laura Leal-Taixé³, Bastian Leibe⁴ - Show less +3 more•Institutions (4)

RWTH Aachen University¹, Technische Universität München², University of Oxford³, Max Planck Society⁴

16 Sep 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work presents a novel MOT evaluation metric, higher order tracking accuracy (HOTA), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers.

...read moreread less

Abstract: Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance.

...read moreread less

Journal Article•DOI•

Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering

[...]

Margret Keuper¹, Siyu Tang², Bjoern Andres², Thomas Brox³, Bernt Schiele² - Show less +1 more•Institutions (3)

University of Mannheim¹, Max Planck Society², University of Freiburg³

01 Jan 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work states this joint problem as a co-clustering problem that is principled and tractable by existing algorithms, and demonstrates the effectiveness of this approach by combining bottom-up motion segmentation by grouping of point trajectories with high-level multiple object tracking by clustering of bounding boxes.

...read moreread less

Abstract: Models for computer vision are commonly defined either w.r.t. low-level concepts such as pixels that are to be grouped, or w.r.t. high-level concepts such as semantic objects that are to be detected and tracked. Combining bottom-up grouping with top-down detection and tracking, although highly desirable, is a challenging problem. We state this joint problem as a co-clustering problem that is principled and tractable by existing algorithms. We demonstrate the effectiveness of this approach by combining bottom-up motion segmentation by grouping of point trajectories with high-level multiple object tracking by clustering of bounding boxes. We show that solving the joint problem is beneficial at the low-level, in terms of the FBMS59 motion segmentation benchmark, and at the high-level, in terms of the Multiple Object Tracking benchmarks MOT15, MOT16, and the MOT17 challenge, and is state-of-the-art in some metrics.

...read moreread less

Proceedings Article•DOI•

D3S – A Discriminative Single Shot Segmentation Tracker

[...]

Alan Lukezic¹, Jiri Matas², Matej Kristan¹•Institutions (2)

University of Ljubljana¹, Czech Technical University in Prague²

14 Jun 2020

TL;DR: Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-artTrackers on the TrackingNet.

...read moreread less

Abstract: Template-based discriminative trackers are currently the dominant tracking paradigm due to their robustness, but are restricted to bounding box tracking and a limited range of transformation models, which reduces their localization accuracy. We propose a discriminative single-shot segmentation tracker - D3S, which narrows the gap between visual object tracking and video object segmentation. A single-shot network applies two target models with complementary geometric properties, one invariant to a broad range of transformations, including non-rigid deformations, the other assuming a rigid object to simultaneously achieve high robustness and online target segmentation. Without per-dataset finetuning and trained only for segmentation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018 and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the TrackingNet. D3S outperforms the leading segmentation tracker SiamMask on video segmentation benchmark and performs on par with top video object segmentation algorithms, while running an order of magnitude faster, close to real-time.

...read moreread less

Journal Article•DOI•

Overview and methods of correlation filter algorithms in object tracking

[...]

Shuai Liu¹, Dongye Liu², Gautam Srivastava³, Gautam Srivastava⁴, Dawid Połap⁵, Marcin Woźniak⁵ - Show less +2 more•Institutions (5)

Hunan Normal University¹, Inner Mongolia University², Brandon University³, China Medical University (Taiwan)⁴, Silesian University of Technology⁵

09 Jun 2020-Complex & Intelligent Systems

TL;DR: This article focuses on the correlation filter-based object tracking algorithms, and all kinds of methods are summarized to present tracking results in various vision problems, and a visual tracking method based on reliability is observed.

...read moreread less

Abstract: An important area of computer vision is real-time object tracking, which is now widely used in intelligent transportation and smart industry technologies. Although the correlation filter object tracking methods have a good real-time tracking effect, it still faces many challenges such as scale variation, occlusion, and boundary effects. Many scholars have continuously improved existing methods for better efficiency and tracking performance in some aspects. To provide a comprehensive understanding of the background, key technologies and algorithms of single object tracking, this article focuses on the correlation filter-based object tracking algorithms. Specifically, the background and current advancement of the object tracking methodologies, as well as the presentation of the main datasets are introduced. All kinds of methods are summarized to present tracking results in various vision problems, and a visual tracking method based on reliability is observed.

...read moreread less

Book Chapter•DOI•

The Eighth Visual Object Tracking VOT2020 Challenge Results

[...]

Matej Kristan¹, Ales Leonardis², Jiří Matas³, Michael Felsberg⁴, Roman Pflugfelder⁵, Roman Pflugfelder⁶, Joni-Kristian Kamarainen, Martin Danelljan⁷, Luka Čehovin Zajc¹, Alan Lukežič¹, Ondrej Drbohlav³, Linbo He⁴, Yushan Zhang⁸, Yushan Zhang⁴, Song Yan, Jinyu Yang², Gustavo Fernandez⁶, Alexander G. Hauptmann⁹, Alireza Memarmoghadam¹⁰, Alvaro Garcia-Martin¹¹, Andreas Robinson⁴, Anton Varfolomieiev¹², Awet Haileslassie Gebrehiwot¹¹, Bedirhan Uzun¹³, Bin Yan¹⁴, Bing Li¹⁵, Chen Qian, Chi-Yi Tsai¹⁶, Christian Micheloni¹⁷, Dong Wang¹⁴, Fei Wang, Fei Xie¹⁸, Felix Järemo Lawin⁴, Fredrik K. Gustafsson¹⁹, Gian Luca Foresti¹⁷, Goutam Bhat⁷, Guangqi Chen, Haibin Ling²⁰, Haitao Zhang, Hakan Cevikalp¹³, Haojie Zhao¹⁴, Haoran Bai²¹, Hari Chandana Kuchibhotla²², Hasan Saribas, Heng Fan²⁰, Hossein Ghanei-Yakhdan²³, Houqiang Li²⁴, Houwen Peng²⁵, Huchuan Lu¹⁴, Hui Li²⁶, Javad Khaghani²⁷, Jesús Bescós¹¹, Jianhua Li¹⁴, Jianlong Fu²⁵, Jiaqian Yu²⁸, Jingtao Xu²⁸, Josef Kittler²⁹, Jun Yin, Junhyun Lee³⁰, Kaicheng Yu³¹, Kaiwen Liu¹⁵, Kang Yang³², Kenan Dai¹⁴, Li Cheng²⁷, Li Zhang³³, Lijun Wang¹⁴, Linyuan Wang, Luc Van Gool⁷, Luca Bertinetto, Matteo Dunnhofer¹⁷, Miao Cheng, Mohana Murali Dasari²², Ning Wang³², Pengyu Zhang¹⁴, Philip H. S. Torr³³, Qiang Wang, Radu Timofte⁷, Rama Krishna Sai Subrahmanyam Gorthi²², Seokeon Choi³⁴, Seyed Mojtaba Marvasti-Zadeh²⁷, Shaochuan Zhao²⁶, Shohreh Kasaei³⁵, Shoumeng Qiu¹⁵, Shuhao Chen¹⁴, Thomas B. Schön¹⁹, Tianyang Xu²⁹, Wei Lu, Weiming Hu¹⁵, Wengang Zhou²⁴, Xi Qiu, Xiao Ke³⁶, Xiaojun Wu²⁶, Xiaolin Zhang¹⁵, Xiaoyun Yang, Xue-Feng Zhu²⁶, Yingjie Jiang²⁶, Yingming Wang¹⁴, Yiwei Chen²⁸, Yu Ye³⁶, Yuezhou Li³⁶, Yuncon Yao¹⁸, Yunsung Lee³⁰, Yuzhang Gu¹⁵, Zezhou Wang¹⁴, Zhangyong Tang²⁶, Zhen-Hua Feng²⁹, Zhijun Mai³⁷, Zhipeng Zhang¹⁵, Zhirong Wu²⁵, Ziang Ma - Show less +106 more•Institutions (37)

University of Ljubljana¹, University of Birmingham², Czech Technical University in Prague³, Linköping University⁴, Vienna University of Technology⁵, Austrian Institute of Technology⁶, ETH Zurich⁷, Beijing Institute of Technology⁸, Carnegie Mellon University⁹, University of Isfahan¹⁰, Autonomous University of Madrid¹¹, National Technical University¹², Eskişehir Osmangazi University¹³, Dalian University of Technology¹⁴, Chinese Academy of Sciences¹⁵, Tamkang University¹⁶, University of Udine¹⁷, Southeast University¹⁸, Uppsala University¹⁹, Stony Brook University²⁰, Sichuan University²¹, Indian Institutes of Technology²², Yazd University²³, University of Science and Technology of China²⁴, Microsoft²⁵, Jiangnan University²⁶, University of Alberta²⁷, Samsung²⁸, University of Surrey²⁹, Korea University³⁰, Renmin University of China³¹, Nanjing University of Information Science and Technology³², University of Oxford³³, KAIST³⁴, Sharif University of Technology³⁵, Fuzhou University³⁶, University of Electronic Science and Technology of China³⁷

23 Aug 2020

TL;DR: A significant novelty is introduction of a new VOT short-term tracking evaluation methodology, and introduction of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes will no longer be used in theVDT challenges.

...read moreread less

Abstract: The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) VOT-RT2020 challenge focused on “real-time” short-term tracking in RGB, (iii) VOT-LT2020 focused on long-term tracking namely coping with target disappearance and reappearance, (iv) VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is introduction of a new VOT short-term tracking evaluation methodology, and introduction of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelites was introduced. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

...read moreread less

Journal Article•DOI•

Spatial and semantic convolutional features for robust visual object tracking

[...]

Jianming Zhang¹, Xiaokang Jin¹, Juan Sun¹, Jin Wang¹, Arun Kumar Sangaiah² - Show less +1 more•Institutions (2)

Changsha University of Science and Technology¹, VIT University²

01 Jun 2020-Multimedia Tools and Applications

TL;DR: A novel model updating strategy is presented, and peak sidelobe ratio (PSR) and skewness are exploited to measure the comprehensive fluctuation of response map for efficient tracking performance.

...read moreread less

Abstract: Robust and accurate visual tracking is a challenging problem in computer vision. In this paper, we exploit spatial and semantic convolutional features extracted from convolutional neural networks in continuous object tracking. The spatial features retain higher resolution for precise localization and semantic features capture more semantic information and less fine-grained spatial details. Therefore, we localize the target by fusing these different features, which improves the tracking accuracy. Besides, we construct the multi-scale pyramid correlation filter of the target and extract its spatial features. This filter determines the scale level effectively and tackles target scale estimation. Finally, we further present a novel model updating strategy, and exploit peak sidelobe ratio (PSR) and skewness to measure the comprehensive fluctuation of response map for efficient tracking performance. Each contribution above is validated on 50 image sequences of tracking benchmark OTB-2013. The experimental comparison shows that our algorithm performs favorably against 12 state-of-the-art trackers.

...read moreread less

Proceedings Article•DOI•

GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning

[...]

Xinshuo Weng¹, Yongxin Wang¹, Yunze Man¹, Kris M. Kitani¹•Institutions (1)

Carnegie Mellon University¹

14 Jun 2020

TL;DR: This work proposes two techniques to improve the discriminative feature learning for MOT by introducing a novel feature interaction mechanism by introducing the Graph Neural Network and proposes a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously.

...read moreread less

Abstract: 3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve the discriminative feature learning for MOT: (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing the Graph Neural Network. As a result, the feature of one object is informed of the features of other objects so that the object feature can lean towards the object with similar feature (i.e., object probably with a same ID) and deviate from objects with dissimilar features (i.e., object probably with different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often have complementary information, the joint feature can be more discriminate than feature from each individual modality. To ensure that the joint feature extractor does not heavily rely on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at https://github.com/xinshuoweng/GNN3DMOT

...read moreread less

Journal Article•DOI•

Visual Object Tracking by Hierarchical Attention Siamese Network

[...]

Jianbing Shen¹, Xin Tang¹, Xingping Dong¹, Ling Shao•Institutions (1)

Beijing Institute of Technology¹

01 Jul 2020-IEEE Transactions on Systems, Man, and Cybernetics

TL;DR: A novel tracking method is presented by introducing the attention mechanism into the Siamese network to increase its matching discrimination and a new way to fuse multiscale response maps from each layer to obtain a more accurate position estimation of the object is proposed.

...read moreread less

Abstract: Visual tracking addresses the problem of localizing an arbitrary target in video according to the annotated bounding box. In this article, we present a novel tracking method by introducing the attention mechanism into the Siamese network to increase its matching discrimination. We propose a new way to compute attention weights to improve matching performance by a sub-Siamese network [Attention Net (A-Net)], which locates attentive parts for solving the searching problem. In addition, features in higher layers can preserve more semantic information while features in lower layers preserve more location information. Thus, in order to solve the tracking failure cases by the higher layer features, we fully utilize location and semantic information by multilevel features and propose a new way to fuse multiscale response maps from each layer to obtain a more accurate position estimation of the object. We further propose a hierarchical attention Siamese network by combining the attention weights and multilayer integration for tracking. Our method is implemented with a pretrained network which can outperform most well-trained Siamese trackers even without any fine-tuning and online updating. The comparison results with the state-of-the-art methods on popular tracking benchmarks show that our method achieves better performance. Our source code and results will be available at https://github.com/shenjianbing/HASN .

...read moreread less

Posted Content•

[...]

Jiangmiao Pang¹, Linlu Qiu², Xia Li³, Haofeng Chen⁴, Qi Li¹, Trevor Darrell⁵, Fisher Yu³ - Show less +3 more•Institutions (5)

Zhejiang University¹, Georgia Institute of Technology², ETH Zurich³, Stanford University⁴, University of California, Berkeley⁵

11 Jun 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Quasi-Dense Similarity Learning is presented, which densely samples hundreds of region proposals on a pair of images for contrastive learning and which outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks.

...read moreread less

Abstract: Similarity learning has been recognized as a crucial step for object tracking. However, existing multiple object tracking methods only use sparse ground truth matching as the training objective, while ignoring the majority of the informative regions on the images. In this paper, we present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning. We can directly combine this similarity learning with existing detection methods to build Quasi-Dense Tracking (QDTrack) without turning to displacement regression or motion priors. We also find that the resulting distinctive feature space admits a simple nearest neighbor search at the inference time. Despite its simplicity, QDTrack outperforms all existing methods on MOT, BDD100K, Waymo, and TAO tracking benchmarks. It achieves 68.7 MOTA at 20.3 FPS on MOT17 without using external training data. Compared to methods with similar detectors, it boosts almost 10 points of MOTA and significantly decreases the number of ID switches on BDD100K and Waymo datasets. The code is available at this https URL

...read moreread less

Posted Content•

Vision Meets Drones: Past, Present and Future

[...]

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Qinghua Hu, Haibin Ling - Show less +2 more

16 Jan 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: The VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from North to South, is described, being the largest such dataset ever published, and enables extensive evaluation and investigation of visual analysis algorithms on the drone platform.

...read moreread less

Abstract: Drones, or general UAVs, equipped with cameras have been fast deployed with a wide range of applications, including agriculture, aerial photography, and surveillance. Consequently, automatic understanding of visual data collected from drones becomes highly demanding, bringing computer vision and drones more and more closely. To promote and track the evelopments of object detection and tracking algorithms, we have organized two challenge workshops in conjunction with ECCV 2018, and ICCV 2019, attracting more than 100 teams around the world. We provide a large-scale drone captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. In this paper, we first presents a thorough review of object detection and tracking datasets and benchmarks, and discuss the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from North to South. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, and conclude the challenge as well as propose future directions. We expect the benchmark largely boost the research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from the website: this https URL.

...read moreread less

Journal Article•DOI•

Material Based Object Tracking in Hyperspectral Videos

[...]

Fengchao Xiong¹, Jun Zhou², Yuntao Qian¹•Institutions (2)

Zhejiang University¹, Griffith University²

15 Jan 2020-IEEE Transactions on Image Processing

TL;DR: In this article, the spectral-spatial histogram of multidimensional gradients and fractional abundances of constituted material components are embedded into correlation filters, yielding material based tracking.

...read moreread less

Abstract: Traditional color images only depict color intensities in red, green and blue channels, often making object trackers fail in challenging scenarios, e.g., background clutter and rapid changes of target appearance. Alternatively, material information of targets contained in large amount of bands of hyperspectral images (HSI) is more robust to these difficult conditions. In this paper, we conduct a comprehensive study on how material information can be utilized to boost object tracking from three aspects: dataset, material feature representation and material based tracking. In terms of dataset, we construct a dataset of fully-annotated videos, which contain both hyperspectral and color sequences of the same scene. Material information is represented by spectral-spatial histogram of multidimensional gradients, which describes the 3D local spectral-spatial structure in an HSI, and fractional abundances of constituted material components which encode the underlying material distribution. These two types of features are embedded into correlation filters, yielding material based tracking. Experimental results on the collected dataset show the potentials and advantages of material based object tracking.

...read moreread less

Journal Article•DOI•

A Solution for Large-Scale Multi-Object Tracking

[...]

Michael Beard¹, Ba-Tuong Vo¹, Ba-Ngu Vo¹•Institutions (1)

Curtin University¹

13 Apr 2020-IEEE Transactions on Signal Processing

TL;DR: A large-scale multi-object tracker based on the generalised labeled multi-Bernoulli (GLMB) filter is proposed and a new method of applying the optimal sub-pattern assignment (OSPA) metric to determine a meaningful distance between two sets of tracks is introduced.

...read moreread less

Abstract: A large-scale multi-object tracker based on the generalised labeled multi-Bernoulli (GLMB) filter is proposed. The algorithm is capable of tracking a very large, unknown and time-varying number of objects simultaneously, in the presence of a high number of false alarms, as well as missed detections and measurement origin uncertainty due to closely spaced objects. The algorithm is demonstrated on a simulated tracking scenario, where the peak number objects appearing simultaneously exceeds one million. Additionally, we introduce a new method of applying the optimal sub-pattern assignment (OSPA) metric to determine a meaningful distance between two sets of tracks. We also develop an efficient strategy for its exact computation in large-scale scenarios to evaluate the performance of the proposed tracker.

...read moreread less

Posted Content•

Graph Attention Tracking

[...]

Dongyan Guo¹, Yanyan Shao¹, Ying Cui¹, Zhenhua Wang¹, Liyan Zhang², Chunhua Shen³ - Show less +2 more•Institutions (3)

Zhejiang University of Technology¹, Nanjing University of Aeronautics and Astronautics², Monash University³

23 Nov 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A simple target-aware Siamese graph attention network for general object tracking that establishes part-to-part correspondence between the target and the search region with a complete bipartite graph, and applies the graph attention mechanism to propagate target information from the template feature to the search feature.

...read moreread less

Abstract: Siamese network based trackers formulate the visual tracking task as a similarity matching problem. Almost all popular Siamese trackers realize the similarity learning via convolutional feature cross-correlation between a target branch and a search branch. However, since the size of target feature region needs to be pre-fixed, these cross-correlation base methods suffer from either reserving much adverse background information or missing a great deal of foreground information. Moreover, the global matching between the target and search region also largely neglects the target structure and part-level information. In this paper, to solve the above issues, we propose a simple target-aware Siamese graph attention network for general object tracking. We propose to establish part-to-part correspondence between the target and the search region with a complete bipartite graph, and apply the graph attention mechanism to propagate target information from the template feature to the search feature. Further, instead of using the pre-fixed region cropping for template-feature-area selection, we investigate a target-aware area selection mechanism to fit the size and aspect ratio variations of different objects. Experiments on challenging benchmarks including GOT-10k, UAV123, OTB-100 and LaSOT demonstrate that the proposed SiamGAT outperforms many state-of-the-art trackers and achieves leading performance. Code is available at: this https URL

...read moreread less

Journal Article•DOI•

Reliability of response region: A novel mechanism in visual tracking by edge computing for IIoT environments

[...]

Shuai Liu¹, Chunli Guo², Fadi Al-Turjman³, Khan Muhammad⁴, Victor Hugo C. de Albuquerque⁵ - Show less +1 more•Institutions (5)

Hunan Normal University¹, China Mobile², Near East University³, Sejong University⁴, University of Fortaleza⁵

01 Apr 2020-Mechanical Systems and Signal Processing

TL;DR: Results showed that tracking performance of the proposed method has been increased, especially effected greatly on fast-moving, background clutter and motion blur, and the method is validated to play an important role in real industrial applications with edge computing, which is more suitable for IIoT environments and automotive industry.

...read moreread less

Book Chapter•DOI•

Know Your Surroundings: Exploiting Scene Information for Object Tracking

[...]

Goutam Bhat¹, Martin Danelljan¹, Luc Van Gool¹, Radu Timofte¹•Institutions (1)

ETH Zurich¹

23 Aug 2020

TL;DR: In this paper, the presence and locations of other objects in the surrounding scene can be propagated through the sequence and used to explicitly avoid distractor objects and eliminate target candidate regions.

...read moreread less

Abstract: Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are however prone to fail in case of e.g. fast appearance changes or presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having the knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions.

...read moreread less

Posted Content•

Rethinking the competition between detection and ReID in Multi-Object Tracking

[...]

Chao Liang, Zhipeng Zhang, Yi Lu, Xue Zhou, Bing Li, Xiyong Ye¹, Jianxiao Zou - Show less +3 more•Institutions (1)

University of Electronic Science and Technology of China¹

23 Oct 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel reciprocal network (REN) with a self-relation and cross-relation design so that to impel each branch to better learn task-dependent representations learning is proposed to alleviate the deleterious tasks competition and improve the cooperation between detection and ReID.

...read moreread less

Abstract: Due to balanced accuracy and speed, joint learning detection and ReID-based one-shot models have drawn great attention in multi-object tracking(MOT). However, the differences between the above two tasks in the one-shot tracking paradigm are unconsciously overlooked, leading to inferior performance than the two-stage methods. In this paper, we dissect the reasoning process of the aforementioned two tasks. Our analysis reveals that the competition of them inevitably hurts the learning of task-dependent representations, which further impedes the tracking performance. To remedy this issue, we propose a novel cross-correlation network that can effectively impel the separate branches to learn task-dependent representations. Furthermore, we introduce a scale-aware attention network that learns discriminative embeddings to improve the ReID capability. We integrate the delicately designed networks into a one-shot online MOT system, dubbed CSTrack. Without bells and whistles, our model achieves new state-of-the-art performances on MOT16 and MOT17. Our code is released at this https URL.

...read moreread less

Proceedings Article•DOI•

P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds

[...]

Qi Haozhe¹, Chen Feng¹, Zhiguo Cao¹, Zhao Feng¹, Yang Xiao¹ - Show less +1 more•Institutions (1)

Huazhong University of Science and Technology¹

14 Jun 2020

TL;DR: This work first localize potential target centers in 3D search area embedded with target information, then point-driven 3D target proposal and verification are executed jointly, so the time-consuming 3D exhaustive search can be avoided.

...read moreread less

Abstract: Towards 3D object tracking in point clouds, a novel point-to-box network termed P2B is proposed in an end-to-end learning manner. Our main idea is to first localize potential target centers in 3D search area embedded with target information. Then point-driven 3D target proposal and verification are executed jointly. In this way, the time-consuming 3D exhaustive search can be avoided. Specifically, we first sample seeds from the point clouds in template and search area respectively. Then, we execute permutation-invariant feature augmentation to embed target clues from template into search area seeds and represent them with target-specific features. Consequently, the augmented search area seeds regress the potential target centers via Hough voting. The centers are further strengthened with seed-wise targetness scores. Finally, each center clusters its neighbors to leverage the ensemble power for joint 3D target proposal and verification. We apply PointNet++ as our backbone and experiments on KITTI tracking dataset demonstrate P2B’s superiority (~10%’s improvement over state-of-the-art). Note that P2B can run with 40FPS on a single NVIDIA 1080Ti GPU. Our code and model are available at https://github.com/HaozheQi/P2B.

...read moreread less

Posted Content•

Know Your Surroundings: Exploiting Scene Information for Object Tracking

[...]

Goutam Bhat¹, Martin Danelljan¹, Luc Van Gool¹, Radu Timofte¹•Institutions (1)

ETH Zurich¹

24 Mar 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a novel tracking architecture which can utilize scene information as dense localized state vectors, which can encode, for example, if the local region is target, background, or distractor and combined with the appearance model output to localize the target.

...read moreread less

Abstract: Current state-of-the-art trackers only rely on a target appearance model in order to localize the object in each frame. Such approaches are however prone to fail in case of e.g. fast appearance changes or presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having the knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions. In this work, we propose a novel tracking architecture which can utilize scene information for tracking. Our tracker represents such information as dense localized state vectors, which can encode, for example, if the local region is target, background, or distractor. These state vectors are propagated through the sequence and combined with the appearance model output to localize the target. Our network is learned to effectively utilize the scene information by directly maximizing tracking performance on video segments. The proposed approach sets a new state-of-the-art on 3 tracking benchmarks, achieving an AO score of 63.6% on the recent GOT-10k dataset.

...read moreread less

Posted Content•

Probabilistic 3D Multi-Object Tracking for Autonomous Driving.

[...]

Hsu-Kuang Chiu, Antonio Prioletti, Jie Li, Jeannette Bohg

16 Jan 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents the on-line tracking method, which made the first place in the NuScenes Tracking Challenge, and outperforms the AB3DMOT baseline method by a large margin in the Average Multi-Object Tracking Accuracy (AMOTA) metric.

...read moreread less

Abstract: 3D multi-object tracking is a key module in autonomous driving applications that provides a reliable dynamic representation of the world to the planning module. In this paper, we present our on-line tracking method, which made the first place in the NuScenes Tracking Challenge, held at the AI Driving Olympics Workshop at NeurIPS 2019. Our method estimates the object states by adopting a Kalman Filter. We initialize the state covariance as well as the process and observation noise covariance with statistics from the training set. We also use the stochastic information from the Kalman Filter in the data association step by measuring the Mahalanobis distance between the predicted object states and current object detections. Our experimental results on the NuScenes validation and test set show that our method outperforms the AB3DMOT baseline method by a large margin in the Average Multi-Object Tracking Accuracy (AMOTA) metric.

...read moreread less

Journal Article•DOI•

Object Tracking in Satellite Videos by Improved Correlation Filters With Motion Estimations

[...]

Shiyu Xuan¹, Shengyang Li¹, Mingfei Han¹, Xue Wan¹, Gui-Song Xia² - Show less +1 more•Institutions (2)

Chinese Academy of Sciences¹, Wuhan University²

01 Feb 2020-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: This article addresses the problem of fast object tracking in satellite videos, by developing a novel tracking algorithm based on correlation filters embedded with motion estimations, based on the kernelized correlation filter (KCF).

...read moreread less

Abstract: As a new method of Earth observation, video satellite is capable of monitoring specific events on the Earth’s surface continuously by providing high-temporal resolution remote sensing images. The video observations enable a variety of new satellite applications such as object tracking and road traffic monitoring. In this article, we address the problem of fast object tracking in satellite videos, by developing a novel tracking algorithm based on correlation filters embedded with motion estimations. Based on the kernelized correlation filter (KCF), the proposed algorithm provides the following improvements: 1) proposing a novel motion estimation (ME) algorithm by combining the Kalman filter and motion trajectory averaging and mitigating the boundary effects of KCF by using this ME algorithm and 2) solving the problem of tracking failure when a moving object is partially or completely occluded. The experimental results demonstrate that our algorithm can track the moving object in satellite videos with 95% accuracy.

...read moreread less

Collapse