Proceedings ArticleDOI

Enhancing Detection Model for Multiple Hypothesis Tracking

01 Jul 2017 - pp. 2143-2152
TL;DR: A novel enhancing detection model combining detection-scene analysis and detection-detection analysis is incorporated into multiple hypothesis tracking; it improves the handling of close object hypotheses in crowded scenarios and achieves results competitive with current state-of-the-art trackers.
Abstract: Tracking-by-detection has become a popular tracking paradigm in recent years. Because detections within this framework are treated as points during tracking, data association becomes ambiguous, especially in crowded scenarios. To cope with this issue, we extend the multiple hypothesis tracking approach with a novel enhancing detection model comprising detection-scene analysis and detection-detection analysis; the former models the scene using dense confidence detections and handles false trajectories, while the latter estimates the correlations between individual detections and improves the handling of close object hypotheses in crowded scenarios. Our approach was tested on the MOT16 benchmark and achieved results competitive with current state-of-the-art trackers.
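
The detection-detection analysis is described only at a high level here. As a rough illustration of the kind of pairwise reasoning involved, the following minimal numpy sketch (our own toy code, not the authors') flags detection pairs whose overlap suggests they compete for the same object hypothesis; the IoU criterion and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for boxes given as an (N, 4) array of [x1, y1, x2, y2]."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def correlated_pairs(boxes, thresh=0.5):
    """Toy gating criterion (an assumption, not the paper's model): detection
    pairs that overlap heavily likely compete for the same object hypothesis."""
    iou = pairwise_iou(boxes)
    i, j = np.where(np.triu(iou, k=1) > thresh)
    return list(zip(i.tolist(), j.tolist()))

boxes = np.array([[10, 10, 50, 90], [14, 12, 54, 88], [200, 30, 240, 110]], float)
print(correlated_pairs(boxes))  # [(0, 1)]: the two heavily overlapping detections
```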


Citations
Journal ArticleDOI
TL;DR: A comprehensive survey on works that employ Deep Learning models to solve the task of MOT on single-camera videos, identifying a number of similarities among the top-performing methods and presenting some possible future research directions.

448 citations


Cites methods from "Enhancing Detection Model for Multi..."

  • ...Other examples of the use of CNNs for feature extraction can be found in [88], where a custom CNN was used to extract appearance features in a Multiple Hypothesis Tracking framework, in [89], whose tracker employed a pretrained region-based CNN [90], or in [91], where a CNN extracted visual features from fish heads, later combined with motion prediction from a Kalman Filter....


Book ChapterDOI
08 Sep 2018
TL;DR: This paper introduces a cost-sensitive tracking loss, built on a state-of-the-art visual tracker, that encourages the model to focus on hard negative distractors during online learning, and proposes Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms.
Abstract: In this paper, we propose an online Multi-Object Tracking (MOT) approach which integrates the merits of single object tracking and data association methods in a unified framework to handle noisy detections and frequent interactions between targets. Specifically, for applying single object tracking in MOT, we introduce a cost-sensitive tracking loss based on the state-of-the-art visual tracker, which encourages the model to focus on hard negative distractors during online learning. For data association, we propose Dual Matching Attention Networks (DMAN) with both spatial and temporal attention mechanisms. The spatial attention module generates dual attention maps which enable the network to focus on the matching patterns of the input image pair, while the temporal attention module adaptively allocates different levels of attention to different samples in the tracklet to suppress noisy observations. Experimental results on the MOT benchmark datasets show that the proposed algorithm performs favorably against both online and offline trackers in terms of identity-preserving metrics.
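
To make the temporal-attention idea concrete, here is a minimal sketch in which per-sample similarities along a tracklet are softmax-weighted so that noisy (e.g. occluded) samples contribute less. In DMAN the attention weights are produced by a learned network; driving the weights with the similarities themselves is a simplification, and all names and shapes below are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_score(tracklet_feats, det_feat):
    """tracklet_feats: (T, D) features of past tracklet samples; det_feat: (D,).
    Returns an attention-pooled similarity between detection and tracklet."""
    t = tracklet_feats / np.linalg.norm(tracklet_feats, axis=1, keepdims=True)
    d = det_feat / np.linalg.norm(det_feat)
    sims = t @ d                    # cosine similarity to each tracklet sample
    weights = softmax(sims)         # attention: consistent samples weigh more
    return float(weights @ sims)    # noisy samples are suppressed, not averaged in

rng = np.random.default_rng(0)
track = rng.normal(size=(5, 128))
track[2] = -track[0]                # a corrupted (e.g. occluded) sample
print(temporal_attention_score(track, track[0].copy()))
```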

370 citations

Book ChapterDOI
23 Aug 2020
TL;DR: The two major novelties, chained structure and paired attentive regression, make CTracker simple, fast, and effective, setting new MOTA records on the MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively) without relying on any extra training data.
Abstract: Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all three subtasks into an end-to-end solution (the first as far as we know). It chains paired bounding-box regression results estimated from overlapping nodes, of which each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module). The two major novelties, chained structure and paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on the MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively), without relying on any extra training data. The source code of CTracker can be found at: github.com/pjl1995/CTracker.
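
A toy sketch of the chaining step as described: adjacent nodes share a frame, so identities can be propagated by matching the shared-frame boxes greedily by IoU. This illustrates the idea under our own assumptions; it is not CTracker's implementation, which makes the matching attentive via its detection and ID-verification modules.

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def chain(prev_pairs, next_pairs, prev_ids, thresh=0.5):
    """prev_pairs: list of (box_t, box_t1) from the node covering frames (t, t+1);
    next_pairs: list of (box_t1, box_t2) from the node covering (t+1, t+2).
    Identities propagate through the shared frame t+1 by greedy IoU matching."""
    ids, used = [None] * len(next_pairs), set()
    for j, (b_shared, _) in enumerate(next_pairs):
        best, best_i = thresh, None
        for i, (_, a_shared) in enumerate(prev_pairs):
            if i in used:
                continue
            v = iou(a_shared, b_shared)
            if v > best:
                best, best_i = v, i
        if best_i is not None:
            ids[j] = prev_ids[best_i]
            used.add(best_i)
    return ids  # None entries would start new tracks
```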

239 citations


Cites methods from "Enhancing Detection Model for Multi..."

  • ...on-based MOT Methods. Yu et al. [3] proposed the POI algorithm, which constructed a high-performance detector based on Faster R-CNN [4] by adding several extra pedestrian detection datasets. Chen et al. [5] incorporated an enhanced detection model, called EDMT, by simultaneously modeling the detection-scene relation and the detection-detection relation. Furthermore, Henschel et al. [6] added a head detection m...


  • ...Results on the MOT16 test dataset (excerpt):
    Method | MOTA↑ | IDF1↑ | MOTP↑ | MT↑ | ML↓ | FP↓ | FN↓ | IDS↓ | Hz↑
    Offline:
    MHT-bLSTM [29] | 42.1 | 47.8 | 75.9 | 14.9% | 44.4% | 11637 | 93172 | 753 | 1.8
    Quad-CNN [30] | 44.1 | 38.3 | 76.4 | 14.6% | 44.9% | 6388 | 94775 | 745 | 1.8
    EDMT [5] | 45.3 | 47.9 | 75.9 | 17.0% | 39.9% | 11122 | 87890 | 639 | 1.8
    LMP [31] | 48.8 | 51.3 | 79.0 | 18.2% | 40.1% | 6654 | 86245 | 481 | 0.5
    Online:
    CDA-DDAL [32] | 43.9 | 45.1 | 74.7 | 10.7% | 44.4% | 6450 | 95175 | 676 | -
    STAM [12] | 46.0 | 50.0 | 74.9 | 14.6% | 4...


  • ...Results on the MOT17 test dataset (public detections; excerpt):
    Method | MOTA↑ | IDF1↑ | MOTP↑ | MT↑ | ML↓ | FP↓ | FN↓ | IDS↓ | Hz↑
    Offline:
    MHT-bLSTM [29] | 47.5 | 51.9 | 77.5 | 18.2% | 41.7% | 25981 | 268042 | 2069 | 1.8
    EDMT [5] | 50.0 | 51.3 | 77.3 | 21.6% | 36.3% | 32279 | 247297 | 2264 | 1.8
    JCC [38] | 51.2 | 54.5 | 75.9 | 20.9% | 37.0% | 25937 | 247822 | 1802 | -
    FWT [6] | 51.3 | 47.6 | 77.0 | 21.4% | 35.2% | 24101 | 247921 | 2648 | -
    Online:
    DMAN [13] | 48.2 | 55.7 | 75.9 | 19.3% | 3...


Journal ArticleDOI
TL;DR: The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities.
Abstract: Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis and computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modeling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at https://github.com/shijieS/SST.git .
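
A loose sketch of the exhaustive pairing idea: compute an affinity between every feature pair across two frames, padded with an extra row and column so objects can appear or disappear. DAN learns these affinities end to end; the cosine similarity and the constant `unmatched` score below are stand-in assumptions for illustration.

```python
import numpy as np

def affinity_matrix(feats_a, feats_b, unmatched=0.1):
    """feats_a: (N, D) and feats_b: (M, D) object features from two frames.
    Returns an (N+1, M+1) affinity matrix; the extra row/column give every
    object an appear/disappear option (this padding scheme is an assumption)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    aff = a @ b.T                          # exhaustive pairwise similarities
    return np.pad(aff, ((0, 1), (0, 1)), constant_values=unmatched)

rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(3, 64)), rng.normal(size=(4, 64))
A = affinity_matrix(f1, f2)
print(A.shape, A.argmax(axis=1))  # (4, 5); row-wise best match (last row is the pad)
```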

239 citations


Cites methods from "Enhancing Detection Model for Multi..."

  • ...[56] proposed a multiple hypothesis tracking method by accounting for scene detections and detection-detection correlations between video frames....


Book ChapterDOI
08 Sep 2018
TL;DR: A novel recurrent network model, the Bilinear LSTM, is proposed to improve the learning of long-term appearance models, based on intuitions drawn from recursive least squares.
Abstract: In recent deep online and near-online multi-object tracking approaches, a difficulty has been to incorporate long-term appearance models to efficiently score object tracks under severe occlusion and multiple missing detections. In this paper, we propose a novel recurrent network model, the Bilinear LSTM, in order to improve the learning of long-term appearance models via a recurrent network. Based on intuitions drawn from recursive least squares, Bilinear LSTM stores building blocks of a linear predictor in its memory, which is then coupled with the input in a multiplicative manner, instead of the additive coupling in conventional LSTM approaches. Such coupling resembles an online-learned classifier/regressor at each time step, which we found to improve performance when using LSTMs for appearance modeling. We also propose novel data augmentation approaches to efficiently train recurrent models that score object tracks on both appearance and motion, and we utilize the resulting LSTM in a multiple hypothesis tracking framework. In experiments, we show that with our novel LSTM model, we achieve state-of-the-art performance on near-online multiple object tracking on the MOT 2016 and MOT 2017 benchmarks.
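
The multiplicative coupling can be illustrated with a toy readout: a conventional LSTM combines memory and input additively, whereas the bilinear variant reshapes the memory vector into a matrix that acts on the input like an online-learned linear predictor. Everything below (shapes, the ReLU, the function names) is an illustrative assumption, not the paper's exact cell.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def additive_readout(h, x, W_h, W_x):
    """Conventional LSTM-style coupling: memory and input are summed."""
    return relu(W_h @ h + W_x @ x)

def bilinear_readout(h, x, rows):
    """Bilinear coupling: the memory vector is reshaped into a matrix of
    predictor coefficients that multiplies the input, like a regressor
    whose weights were accumulated online."""
    M = h.reshape(rows, -1)
    return relu(M @ x)

D, rows = 8, 4
rng = np.random.default_rng(2)
h = rng.normal(size=(rows * D,))   # hidden state interpreted as a (rows, D) map
x = rng.normal(size=(D,))          # appearance feature of a candidate detection
print(bilinear_readout(h, x, rows))
```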

234 citations

References
Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
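
The alternating scheme can be sketched as follows: with the model fixed, pick the best-scoring latent value for each positive; with latent values fixed, take convex hinge-loss steps while mining hard negatives against the current model. This toy subgradient version is our own simplification for illustration, not the DPM trainer.

```python
import numpy as np

def best_latent(w, latent_feats):
    """latent_feats: (Z, D) features of one positive under each latent choice
    (e.g. part placements); keep the highest-scoring one under the current model."""
    return latent_feats[np.argmax(latent_feats @ w)]

def train_latent_svm(pos_latent, negs, dim, epochs=10, lr=0.01, C=1.0):
    w = np.zeros(dim)
    for _ in range(epochs):
        # Step 1 (non-convex part): fix w, choose latent values for positives.
        X = [best_latent(w, lf) for lf in pos_latent]
        y = [1.0] * len(X)
        # Step 2 (convex part): mine hard negatives, take hinge-loss subgradient steps.
        hard = [x for x in negs if x @ w > -1.0]
        X, y = X + hard, y + [-1.0] * len(hard)
        for xi, yi in zip(X, y):
            grad = w / C - (yi * xi if yi * (xi @ w) < 1.0 else 0.0)
            w -= lr * grad
    return w

rng = np.random.default_rng(3)
pos = [rng.normal(1.0, 1.0, size=(4, 5)) for _ in range(20)]  # 4 latent choices each
neg = [rng.normal(-1.0, 1.0, size=5) for _ in range(50)]
print(train_latent_svm(pos, neg, dim=5))
```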

10,501 citations

Proceedings ArticleDOI
07 Dec 2015
TL;DR: As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets and scales to the large 500k distractor set.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale, 2) consist of hand-drawn bboxes, which are unavailable under realistic settings, 3) have only one ground truth and one query image for each identity (close environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in Market-1501 dataset are produced using the Deformable Part Model (DPM) as pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiment, we show that the proposed descriptor yields competitive accuracy on VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.

3,564 citations


"Enhancing Detection Model for Multi..." refers methods in this paper

  • ...For training the convolutional neural network, we use data from CUHK03 [19], Market-1501 [32], CUHK01 [18], VIPeR [26] and i-LIDS [26]....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel filter pairing neural network (FPNN) to jointly handle misalignment, photometric and geometric transforms, occlusions and background clutter is proposed and significantly outperforms state-of-the-art methods on this dataset.
Abstract: Person re-identification is to match pedestrian images from disjoint camera views detected by pedestrian detectors. Challenges are presented in the form of complex variations of lightings, poses, viewpoints, blurring effects, image resolutions, camera settings, occlusions and background clutter across camera views. In addition, misalignment introduced by the pedestrian detector will affect most existing person re-identification methods that use manually cropped pedestrian images and assume perfect detection. In this paper, we propose a novel filter pairing neural network (FPNN) to jointly handle misalignment, photometric and geometric transforms, occlusions and background clutter. All the key components are jointly optimized to maximize the strength of each component when cooperating with others. In contrast to existing works that use handcrafted features, our method automatically learns features optimal for the re-identification task from data. The learned filter pairs encode photometric transforms. Its deep architecture makes it possible to model a mixture of complex photometric and geometric transforms. We build the largest benchmark re-id dataset with 13,164 images of 1,360 pedestrians. Unlike existing datasets, which only provide manually cropped pedestrian images, our dataset provides automatically detected bounding boxes for evaluation close to practical applications. Our neural network significantly outperforms state-of-the-art methods on this dataset.

2,417 citations


"Enhancing Detection Model for Multi..." refers methods in this paper

  • ...For training the convolutional neural network, we use data from CUHK03 [19], Market-1501 [32], CUHK01 [18], VIPeR [26] and i-LIDS [26]....


Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper adaptively learns correlation filters on each convolutional layer to encode the target appearance and hierarchically infers the maximum response of each layer to locate targets.
Abstract: Visual object tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we exploit features extracted from deep convolutional neural networks trained on object recognition datasets to improve tracking accuracy and robustness. The outputs of the last convolutional layers encode the semantic information of targets and such representations are robust to significant appearance variations. However, their spatial resolution is too coarse to precisely localize targets. In contrast, earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchies of convolutional layers as a nonlinear counterpart of an image pyramid representation and exploit these multiple levels of abstraction for visual tracking. Specifically, we adaptively learn correlation filters on each convolutional layer to encode the target appearance. We hierarchically infer the maximum response of each layer to locate targets. Extensive experimental results on a large-scale benchmark dataset show that the proposed algorithm performs favorably against state-of-the-art methods.
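
For intuition, here is a single-channel toy version of the underlying correlation-filter machinery (our sketch, not the released code): learn a ridge-regularized filter in the Fourier domain and fuse per-layer responses with fixed weights. The real method uses multi-channel CNN features and a coarse-to-fine hierarchical search, which this omits.

```python
import numpy as np

def learn_filter(x, y, lam=1e-2):
    """x: (H, W) feature channel; y: (H, W) desired response (a Gaussian peak).
    Closed-form ridge solution for a correlation filter in the Fourier domain."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def response(A, z):
    """Correlation response of filter A on a new search patch z."""
    return np.real(np.fft.ifft2(A * np.fft.fft2(z)))

def hierarchical_response(layers, filters, weights):
    """Fixed-weight fusion across layers: deeper layers carry semantics,
    shallower layers sharpen localization."""
    return sum(w * response(A, z) for w, A, z in zip(weights, filters, layers))

H = W = 32
rng = np.random.default_rng(4)
x = rng.normal(size=(H, W))
y = np.zeros((H, W)); y[0, 0] = 1.0                       # toy peak at the origin
A = learn_filter(x, y)
print(np.unravel_index(response(A, x).argmax(), (H, W)))  # recovers (0, 0)
```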

1,812 citations


Additional excerpts

  • ...[20] exploited a novel....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: The contribution of color in a tracking-by-detection framework is investigated and an adaptive low-dimensional variant of color attributes is proposed; the results suggest that color attributes provide superior performance for visual tracking.
Abstract: Visual tracking is a challenging problem in computer vision. Most state-of-the-art visual trackers either rely on luminance information or use simple color representations for image description. Contrary to visual tracking, for object recognition and detection, sophisticated color features combined with luminance have been shown to provide excellent performance. Due to the complexity of the tracking problem, the desired color feature should be computationally efficient and possess a certain amount of photometric invariance while maintaining high discriminative power. This paper investigates the contribution of color in a tracking-by-detection framework. Our results suggest that color attributes provide superior performance for visual tracking. We further propose an adaptive low-dimensional variant of color attributes. Both quantitative and attribute-based evaluations are performed on 41 challenging benchmark color sequences. The proposed approach improves the baseline intensity-based tracker by 24% in median distance precision. Furthermore, we show that our approach outperforms state-of-the-art tracking methods while running at more than 100 frames per second.
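
One simple way to obtain a low-dimensional color representation, sketched here under our own assumptions, is to project per-pixel color-name vectors onto their dominant principal components; the paper's adaptive variant additionally smooths the projection over time, which is omitted for brevity.

```python
import numpy as np

def compress_color_attributes(cn, k=2):
    """cn: (num_pixels, 11) color-name probabilities; returns (num_pixels, k)
    projections onto the k most informative directions (plain PCA via SVD)."""
    centered = cn - cn.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T

cn = np.random.default_rng(5).dirichlet(np.ones(11), size=200)  # toy color-name probs
print(compress_color_attributes(cn).shape)  # (200, 2)
```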

1,499 citations


"Enhancing Detection Model for Multi..." refers background in this paper

  • ...[8] introduced a real-time tracking framework based on adaptive color channels....
