Book ChapterDOI

WAEF: Weighted Aggregation with Enhancement Filter for Visual Object Tracking

TL;DR: This paper proposes a different approach to regression in the temporal domain, based on weighted aggregation of distinctive visual features and feature prioritization with entropy estimation in a recursive fashion, and provides a statistics-based ensembler approach for integrating conventionally driven spatial regression results with the proposed temporal regression results to accomplish better tracking.
Abstract: In recent years, convolutional neural networks (CNN) have been extensively employed in various complex computer vision tasks, including visual object tracking. In this paper, we study the efficacy of temporal regression with Tikhonov regularization in generic object tracking. Among other major aspects, we propose a different approach to regression in the temporal domain, based on weighted aggregation of distinctive visual features and feature prioritization with entropy estimation in a recursive fashion. We provide a statistics-based ensembler approach for integrating the conventionally driven spatial regression results (such as from ECO) and the proposed temporal regression results to accomplish better tracking. Further, we exploit the obligatory dependency of deep architectures on the provided visual information, and present an image enhancement filter that helps to boost the performance on popular benchmarks. Our extensive experimentation shows that the proposed weighted aggregation with enhancement filter (WAEF) tracker outperforms the baseline (ECO) in almost all the challenging categories on the OTB50 dataset with a cumulative gain of 14.8%. As per the VOT2016 evaluation, the proposed framework offers substantial improvements of 19.04% in occlusion, 27.66% in illumination change, 33.33% in empty, 10% in size change, and 5.28% in average expected overlap.
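As a rough sketch of the abstract's temporal regression idea, the snippet below combines entropy-based frame weighting with a closed-form Tikhonov (ridge) solve; the specific weighting scheme and the function names `entropy_weights` and `weighted_tikhonov` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def entropy_weights(F):
    """Per-frame weights from Shannon entropy of the normalized feature
    magnitudes: lower-entropy (more distinctive) frames get larger weight."""
    P = np.abs(F) / (np.abs(F).sum(axis=1, keepdims=True) + 1e-12)
    H = -(P * np.log(P + 1e-12)).sum(axis=1)
    w = np.exp(-H)
    return w / w.sum()

def weighted_tikhonov(F, y, lam=0.1):
    """Tikhonov (ridge) regression over entropy-weighted frame features:
    solve (F'WF + lam*I) beta = F'Wy in closed form."""
    W = np.diag(entropy_weights(F))
    d = F.shape[1]
    return np.linalg.solve(F.T @ W @ F + lam * np.eye(d), F.T @ W @ y)

# Toy temporal setup: rows are features aggregated from past frames.
rng = np.random.default_rng(0)
F = rng.standard_normal((40, 5))                 # 40 frames, 5-dim features
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = F @ beta_true + 0.01 * rng.standard_normal(40)
beta = weighted_tikhonov(F, y, lam=0.01)
```

With a small regularizer and clean data, the recovered coefficients stay close to the generating ones; the regularizer's job is to keep the solve stable when frames are few or features are correlated.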


Citations
Journal ArticleDOI
TL;DR: This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics, and extensively evaluates and analyzes the leading visual tracking methods.
Abstract: Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years - predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

197 citations


Cites methods from "WAEF: Weighted Aggregation with Enh..."

  • ...(table excerpt) UPDT [109]: VGG-M/GoogLeNet/ResNet-50 backbones, trained on still images (ImageNet), HOG/CN/DAF hand-crafted features, Matlab/MatConvNet implementation; WAEF [119]: VGG-M backbone (Conv1, Conv5 layers), trained on still images (ImageNet), HOG/CN/DAF features, Intel Xeon(R) 3....


  • ...To achieve the goal of learning generic representations for target modeling and constructing a more robust target models, the main contributions of methods are classified into: i) offline training of CNNs on large-scale datasets for visual tracking [63], [68], [80], [89], [97], [100], [101], [104], [112], [116], [135], [137], [142], [144], [153], [165], [168], [169], [173], ii) designing specific deep convolutional networks instead of employing pre-trained models [63], [68], [70], [72], [73], [75], [76], [80], [82], [89], [97], [100], [101], [104], [105], [108], [112], [116], [127], [135], [137], [141], [142], [144], [146], [150], [153], [165], [167]–[169], [171], [173], iii) constructing multiple target models to capture variety of target appearances [75], [116], [127], [129], [130], [143], [146], [172], iv) incorporating spatial and temporal information to improve model generalization [79], [82], [106], [119], [122], [137], [151], [153], v) fusion of different deep features to exploit complementary spatial and semantic information [64], [101], [108], [109], [135], vi) learning different target models such as relative model [104] or part-based models [116], [127], [146] to handle partial occlusion and deformation, and vii) utilizing two-stream network [127] to prevent from overfitting and learn rotation information....


  • ..., feature approximation via bilinear interpolation) or oblique random forest [99] for better data capturing, iv) corrective domain adaption method [165], v) lightweight structure [72], [73], [167], vi) efficient optimization processes [98], [155], vii) exploiting advantages of correlation filters [59]–[61], [64], [69], [74], [77]–[80], [83], [85], [86], [92], [94]–[96], [98], [100], [106], [108], [109], [115], [119], [122], [126], [127], [129]– [131], [135], [140], [141], [143], [144], [149]–[151], [155], [159], [165], [167], [171], [172], [174] for efficient computations, viii) particle sampling strategy [96], and ix) utilizing attentional mechanism [100]....


  • ...These methods include the HCFT [59], DeepSRDCF [60], FCNT [61], CNNSVM [62], DPST [63], CCOT [64], GOTURN [65], SiamFC [66], SINT [67], MDNet [68], HDT [69], STCT [70], RPNT [71], DeepTrack [72], CNT [73], CF-CNN [74], TCNN [75], RDLT [76], PTAV [77], [78], CREST [79], UCT/UCTLite [80], DSiam/DSiamM [81], TSN [82], WECO [83], RFL [84], IBCCF [85], DTO [86]], SRT [87], R-FCSN [88], GNET [89], LST [90], VRCPF [91], DCPF [92], CFNet [93], ECO [94], DeepCSRDCF [95], MCPF [96], BranchOut [97], DeepLMCF [98], Obli-RaFT [99], ACFN [100], SANet [101], DCFNet/DCFNet2 [102], DET [103], DRN [104], DNT [105], STSGS [106], TripletLoss [107], DSLT [108], UPDT [109], ACT [110], DaSiamRPN [111], RT-MDNet [112], StructSiam [113], MMLT [114], CPT [115], STP [116], Siam-MCF [117], Siam-BM [118], WAEF [119], TRACA [120], VITAL [121], DeepSTRCF [122], SiamRPN [123], SA-Siam [124], FlowTrack [125], DRT [126], LSART [127], RASNet [128], MCCT [129], DCPF2 [130], VDSR-SRT [131], FCSFN [132], FRPN2TSiam [133], FMFT [134], IMLCF [135], TGGAN [136], DAT [137], DCTN [138], FPRNet [139], HCFTs [140], adaDDCF [141], YCNN [142], DeepHPFT [143], CFCF [144], CFSRL [145], P2T [146], DCDCF [147], FICFNet [148], LCTdeep [149], HSTC [150], DeepFWDCF [151], CF-FCSiam [152], MGNet [153], ORHF [154], ASRCF [155], ATOM [156], CRPN [157], GCT [158], RPCF [159], SPM [160], SiamDW [56], SiamMask [57], SiamRPN++ [55], TADT [161], UDT [162], DiMP [163], ADT [164], CODA [165], DRRL [166], SMART [167], MRCNN [168], MM [169], MTHCF [170], AEPCF [171], IMM-DFT [172], TAAT [173], DeepTACF [174], MAM [175], ADNet [176], [177], C2FT [178], DRL-IS [179], DRLT [180], EAST [181], HP [182], P-Track [183], RDT [184], and SINT++ [58]....


Journal ArticleDOI
TL;DR: In this paper, the authors systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics, and extensively evaluate and analyze the leading visual tracking algorithms.
Abstract: Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years -- predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

70 citations

Journal ArticleDOI
TL;DR: Wang et al. propose adaptive discriminative correlation filters (DCF) that can exploit CNN models with different topologies, improving the accuracy and robustness of visual trackers with respect to video characteristics.
Abstract: Due to the automatic feature extraction procedure via multi-layer nonlinear transformations, the deep learning-based visual trackers have recently achieved a great success in challenging scenarios for visual tracking purposes. Although many of those trackers utilize the feature maps from pre-trained convolutional neural networks (CNNs), the effects of selecting different models and exploiting various combinations of their feature maps are still not compared completely. To the best of our knowledge, all those methods use a fixed number of convolutional feature maps without considering the scene attributes (e.g., occlusion, deformation, and fast motion) that might occur during tracking. As a pre-requisition, this paper proposes adaptive discriminative correlation filters (DCF) based on the methods that can exploit CNN models with different topologies. First, the paper provides a comprehensive analysis of four commonly used CNN models to determine the best feature maps of each model. Second, with the aid of analysis results as attribute dictionaries, an adaptive exploitation of deep features is proposed to improve the accuracy and robustness of visual trackers regarding video characteristics. Third, the generalization of proposed method is validated on various tracking datasets as well as CNN models with similar architectures. Finally, extensive experimental results demonstrate the effectiveness of proposed adaptive method compared with the state-of-the-art visual tracking methods.

5 citations


Cites methods from "WAEF: Weighted Aggregation with Enh..."

  • ...Also, weighted aggregation with enhancement filter tracker (WAEF) [46] employs temporal Tikhonov regularization to provide better features and suppress unrelated frames....


Journal ArticleDOI
01 Jul 2021
TL;DR: In this paper, a separate correlation filter is learned to estimate the target scale by finding the scale candidate that maximizes the correlation filter's output response, and a minimum similarity rate for the online model update is defined to avoid training on failed detections.
Abstract: Despite the considerable advances that are emerged in correlation filter-based tracking, in fact, they may achieve excellent performance in robustness, speed, and accuracy; they still fail when dealing with large-scale alteration and show the inability to handle long-term tracking in complex scenarios, where the object undergoes partial occlusion, out-of-view, and deformation. In this paper, we propose a robust approach to address two important problems: the first one is scale estimation in kernelized correlation filter (KCF), and the second one is how to update the model in the process of tracking. We aim in this work to overcome the scale fixed size limitation of kernelized correlation filter-based tracking algorithms and improve the mechanism of model online training. Our approach learns a separate correlation filter to estimate the accurate target scale by finding the scale's candidate that maximizes the output response of the correlation filter mentioned above. Besides, we define a minimum rate of similarity for the online model update to avoid training with failure detections. Our approach is evaluated in terms of precision and accuracy, on a commonly used tracking benchmark with 100 challenging videos; the experimental results show that our proposed tracker outperforms the KCF algorithm and shows promising performance compared to state-of-the-art tracking methods.
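The two ideas in the abstract, exhaustive scale-candidate search and similarity-gated model updates, can be sketched in a few lines; `best_scale`, `maybe_update`, and the response threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def best_scale(response_fn, base_size, n_scales=5, step=1.02):
    """Evaluate a geometric ladder of scale candidates around the current
    target size and return the scale with the highest correlation response."""
    exps = np.arange(n_scales) - n_scales // 2
    scales = step ** exps
    responses = [response_fn(base_size * s) for s in scales]
    return scales[int(np.argmax(responses))], max(responses)

def maybe_update(model, patch, response, threshold=0.3, lr=0.02):
    """Gate the online update: only learn from confident detections."""
    if response < threshold:
        return model                      # likely occlusion/failure: skip
    return (1 - lr) * model + lr * patch  # running-average model update

# Toy response, peaked when the candidate size hits 104 px.
peak = 104.0
resp = lambda size: np.exp(-((size - peak) ** 2) / 50.0)
s, r = best_scale(resp, 100.0)            # should pick the largest candidate
```

The gating threshold is the tunable trade-off: too low and the model absorbs occluders, too high and it never adapts to appearance change.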

1 citation

Posted Content
TL;DR: In this article, adaptive discriminative correlation filters (DCF) are proposed to improve the robustness of visual trackers regarding video characteristics by considering scene attributes (e.g., occlusion, deformation, and fast motion).
Abstract: Due to the automatic feature extraction procedure via multi-layer nonlinear transformations, the deep learning-based visual trackers have recently achieved great success in challenging scenarios for visual tracking purposes. Although many of those trackers utilize the feature maps from pre-trained convolutional neural networks (CNNs), the effects of selecting different models and exploiting various combinations of their feature maps are still not compared completely. To the best of our knowledge, all those methods use a fixed number of convolutional feature maps without considering the scene attributes (e.g., occlusion, deformation, and fast motion) that might occur during tracking. As a pre-requisition, this paper proposes adaptive discriminative correlation filters (DCF) based on the methods that can exploit CNN models with different topologies. First, the paper provides a comprehensive analysis of four commonly used CNN models to determine the best feature maps of each model. Second, with the aid of analysis results as attribute dictionaries, adaptive exploitation of deep features is proposed to improve the accuracy and robustness of visual trackers regarding video characteristics. Third, the generalization of the proposed method is validated on various tracking datasets as well as CNN models with similar architectures. Finally, extensive experimental results demonstrate the effectiveness of the proposed adaptive method compared with state-of-the-art visual tracking methods.

1 citation

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
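The cell-histogram core of HOG can be sketched directly from the description above; this minimal numpy version (the function name `hog_cells` is ours) omits the block normalization and overlapping-descriptor stages that the abstract highlights as important:

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Minimal HOG core: per-cell histograms of unsigned gradient orientation,
    weighted by gradient magnitude. Assumes img dims are multiples of cell."""
    gy, gx = np.gradient(img.astype(float))          # derivatives per axis
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0     # unsigned orientation
    H, W = img.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    b = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch * cell):
        for j in range(cw * cell):
            hist[i // cell, j // cell, b[i, j]] += mag[i, j]
    return hist

# A pure horizontal intensity ramp: all gradient energy lands in bin 0.
img = np.tile(np.arange(32, dtype=float), (32, 1))
h = hog_cells(img)
```

A production descriptor would add trilinear interpolation between bins and the local contrast normalization over overlapping blocks that the paper shows is critical.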

31,952 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Journal ArticleDOI
TL;DR: A new kernelized correlation filter is derived, that unlike other kernel algorithms has the exact same complexity as its linear counterpart, which is called dual correlation filter (DCF), which outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite being implemented in a few lines of code.
Abstract: The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies—any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the discrete Fourier transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new kernelized correlation filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call dual correlation filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
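The abstract's claim that the filter fits "in a few lines of code" is easy to illustrate for the linear case: because the matrix of all cyclic shifts is circulant, the DFT diagonalizes it and the ridge solve becomes element-wise. This numpy sketch omits windowing, multi-channel features, and the kernel trick:

```python
import numpy as np

def train_cf(x, y, lam=1e-2):
    """Ridge regression over every cyclic shift of x in one shot: the
    circulant data matrix is diagonalized by the DFT, so the solve is
    a per-frequency division instead of a matrix inversion."""
    xf = np.fft.fft(x)
    return np.fft.fft(y) * np.conj(xf) / (xf * np.conj(xf) + lam)

def respond(hf, z):
    """Filter response on sample z; the peak locates the cyclic shift."""
    return np.real(np.fft.ifft(np.fft.fft(z) * hf))

N = 64
rng = np.random.default_rng(1)
x = rng.standard_normal(N)
y = np.zeros(N); y[0] = 1.0        # desired response: peak at zero shift
hf = train_cf(x, y)
shift = int(np.argmax(respond(hf, np.roll(x, 5))))   # detect a 5-px shift
```

This is the storage and computation reduction the abstract describes: training and detection cost O(N log N) via the FFT instead of solving a dense N x N system.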

4,994 citations

Proceedings ArticleDOI
23 Jun 2013
TL;DR: Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
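The benchmark's two standard evaluation criteria, precision (center location error) and success (bounding-box overlap), reduce to a few lines; this is a sketch of the commonly used definitions, not the benchmark's reference implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(preds, gts, thresh=0.5):
    """Fraction of frames whose overlap with ground truth exceeds thresh."""
    return float(np.mean([iou(p, g) > thresh for p, g in zip(preds, gts)]))

def precision(preds, gts, thresh=20.0):
    """Fraction of frames whose center location error is within thresh px."""
    err = [np.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                    (p[1] + p[3] / 2) - (g[1] + g[3] / 2))
           for p, g in zip(preds, gts)]
    return float(np.mean(np.array(err) <= thresh))

pred = [[0, 0, 10, 10], [12, 0, 10, 10]]   # second box misses entirely
gt   = [[0, 0, 10, 10], [0, 0, 10, 10]]
sr = success_rate(pred, gt)
```

The benchmark's plots sweep the threshold over a range (0 to 1 for overlap, 0 to 50 px for center error) and report the area under the resulting curve rather than a single operating point.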

3,828 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper presents a novel approach to robust scale estimation that can handle large scale variations in complex image sequences and shows promising results in terms of accuracy and efficiency.
Abstract: Robust scale estimation is a challenging problem in visual object tracking. Most existing methods fail to handle large scale variations in complex image sequences. This paper presents a novel approach ...

2,038 citations