Computationally efficient deep tracker: Guided MDNet
02 Mar 2017, pp. 1–6
TL;DR: The main objective of the paper is to recommend an essential improvement to the existing Multi-Domain Convolutional Neural Network tracker (MDNet), which is used to track an unknown object in a video stream.
Abstract: The main objective of the paper is to recommend an essential improvement to the existing Multi-Domain Convolutional Neural Network tracker (MDNet), which is used to track an unknown object in a video stream. MDNet is able to handle the major basic tracking challenges, such as fast motion, background clutter, out-of-view targets, and scale variations, through offline training and online tracking. We pre-train the Convolutional Neural Network (CNN) offline using many videos with ground truth to obtain a target representation in the network. During online tracking, MDNet draws a large number of random sample windows around the previous target to estimate the target in the current frame, which makes tracking computationally expensive at test time. The major contribution of the paper is to feed MDNet guided samples rather than random samples, so that the computation and time required by the CNN while tracking can be greatly reduced. The proposed algorithm is evaluated on videos from the ALOV300++ and VOT datasets, and the results are compared with state-of-the-art trackers.
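The contrast between random and guided sampling can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the box parameterisation, the sample counts, and the constant-velocity prediction are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_samples(prev_box, n=256, sigma=10.0):
    """MDNet-style candidates: perturb the previous target box at random."""
    cx, cy, w, h = prev_box
    offs = rng.normal(0.0, sigma, size=(n, 2))
    return np.column_stack([cx + offs[:, 0], cy + offs[:, 1],
                            np.full(n, w), np.full(n, h)])

def guided_samples(prev_boxes, n=64, sigma=4.0):
    """Guided candidates (hypothetical rule): extrapolate a constant-velocity
    estimate from the last two boxes, then draw far fewer samples in a
    tighter window around the predicted centre."""
    (cx0, cy0, _, _), (cx1, cy1, w, h) = prev_boxes[-2], prev_boxes[-1]
    px, py = 2 * cx1 - cx0, 2 * cy1 - cy0   # predicted centre = last + velocity
    offs = rng.normal(0.0, sigma, size=(n, 2))
    return np.column_stack([px + offs[:, 0], py + offs[:, 1],
                            np.full(n, w), np.full(n, h)])

rand = random_samples((100.0, 100.0, 40.0, 60.0))
guided = guided_samples([(90.0, 100.0, 40.0, 60.0), (100.0, 100.0, 40.0, 60.0)])
# Every candidate window is scored by the CNN, so drawing 64 guided samples
# instead of 256 random ones cuts the per-frame network evaluations by 4x.
print(rand.shape, guided.shape)
```

The saving comes entirely from evaluating fewer candidates; the network itself is unchanged.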
Citations
TL;DR: In this paper, a pre-trained YOLOv3 model is used for ship detection, recognition, and counting in the context of intelligent maritime surveillance, timely ocean rescue, and computer-aided decision-making.
Abstract: Automatic ship detection, recognition, and counting are crucial for intelligent maritime surveillance, timely ocean rescue, and computer-aided decision-making. A pre-trained YOLOv3 model is used for ...
7 citations
03 Jul 2019
TL;DR: A novel face recognition method for population search and criminal pursuit in smart cities and a cloud server architecture for face recognition in smart city environments are proposed.
Abstract: Face recognition technology can be applied to many aspects of a smart city, and the combination of face recognition and deep learning can bring new applications to public security. Deep-learning machine vision and video-based image retrieval can quickly and easily address the current problems of finding missing children and arresting criminal suspects. The main purpose of this paper is to propose a novel face recognition method for population search and criminal pursuit in smart cities. In large and medium-sized security deployments, the face images most similar to a query face can be accurately retrieved from tens of millions of photos. Storage at this scale requires a powerful information-processing center for storing and processing a variety of information. To fundamentally support the safe operation of such a large system, a cloud-based network architecture is considered and a smart-city cloud computing data center is built. In addition, the paper proposes a cloud server architecture for face recognition in smart city environments.
1 citation
01 Jan 2018
TL;DR: Visual tracking is a computer vision problem where the task is to follow a target through a video sequence.
Abstract: Visual tracking is a computer vision problem where the task is to follow a target through a video sequence. Tracking has many important real-world applications in several fields such as autonomous v ...
Cites methods from "Computationally efficient deep trac..."
...Approaches such as MDnet [37] and SiamFC [2] train their networks to output the location of the target....
References
TL;DR: A new kernelized correlation filter (KCF) is derived that, unlike other kernel algorithms, has the exact same complexity as its linear counterpart; a fast multi-channel extension via a linear kernel, the dual correlation filter (DCF), is also proposed. Both outperform top-ranking trackers such as Struck or TLD on a 50-video benchmark, despite being implemented in a few lines of code.
Abstract: The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies—any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the discrete Fourier transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new kernelized correlation filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call dual correlation filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
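The circulant-matrix argument in this abstract can be checked numerically for the linear-regression case. The sketch below is an assumption-laden toy (1-D signal, scalar features, a small problem size), not the paper's tracker; note that the exact placement of complex conjugates in the Fourier-domain closed form depends on the cyclic-shift direction chosen for the data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 64, 1e-2
x = rng.normal(size=n)                                   # base sample
y = np.exp(-0.5 * (np.arange(n) - n // 2) ** 2 / 4.0)    # desired Gaussian response

# Naive ridge regression over all n cyclic shifts of x (the circulant data
# matrix): w = (X^T X + lam*I)^{-1} X^T y, at O(n^3) cost.
X = np.stack([np.roll(x, i) for i in range(n)])
w_naive = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Diagonalising the circulant X with the DFT collapses the same solution to
# an O(n log n) element-wise division in the Fourier domain.
x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
w_fft = np.fft.ifft(x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)).real

print(np.allclose(w_naive, w_fft))
```

The element-wise division is why correlation-filter trackers run at hundreds of frames per second: no matrix ever has to be formed or inverted.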
4,994 citations
16 Jun 2012
TL;DR: In this paper, biologically plausible, wide and deep artificial neural network architectures are proposed that match human performance on tasks such as the recognition of handwritten digits or traffic signs, achieving near-human performance on MNIST and outperforming humans on a traffic sign benchmark.
Abstract: Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
3,717 citations
14 May 2014
TL;DR: It is shown that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, yielding an analogous performance boost, and that the dimensionality of the CNN output layer can be reduced significantly without adversely affecting performance.
Abstract: The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper is made publicly available.
3,533 citations
"Computationally efficient deep trac..." refers to methods in this paper
...The MDNet architecture is much compact as compared to other architectures for recognition such as AlexNet [10] and VGG-Nets [17], [21]....
...CNNs have played a vital role in various tasks like semantic segmentation [18], object detection [16], image classification [17], [10], [21]....
TL;DR: A tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target, and includes a method for correctly updating the sample mean and a forgetting factor to ensure less modeling power is expended fitting older observations.
Abstract: Visual tracking, in essence, deals with non-stationary image streams that change over time. While most existing algorithms are able to track objects well in controlled environments, they usually fail in the presence of significant variation of the object's appearance or surrounding illumination. One reason for such failures is that many algorithms employ fixed appearance models of the target. Such models are trained using only appearance data available before tracking begins, which in practice limits the range of appearances that are modeled, and ignores the large volume of information (such as shape changes or specific lighting conditions) that becomes available during tracking. In this paper, we present a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two important features: a method for correctly updating the sample mean, and a forgetting factor to ensure less modeling power is expended fitting older observations. Both of these features contribute measurably to improving overall tracking performance. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large changes in pose, scale, and illumination.
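The subspace idea in this abstract can be illustrated in a few lines: appearance patches are modelled by a low-dimensional linear subspace, and candidates are scored by their distance to it. This is a hedged toy sketch with synthetic data and a batch SVD in place of the paper's incremental update; the dimensions and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy appearance data: vectorised target patches that vary along a few modes.
d, k = 100, 3
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]            # true appearance modes
patches = basis @ rng.normal(size=(k, 50)) + 0.01 * rng.normal(size=(d, 50))

mean = patches.mean(axis=1, keepdims=True)
U = np.linalg.svd(patches - mean, full_matrices=False)[0][:, :k]  # learned subspace

def recon_error(candidate):
    """Distance to the learned subspace; an IVT-style observation model turns
    this into a likelihood, so the candidate with the smallest error wins."""
    c = candidate - mean[:, 0]
    return np.linalg.norm(c - U @ (U.T @ c))

on_model = basis @ rng.normal(size=k)    # resembles the tracked target
off_model = rng.normal(size=d)           # background clutter
print(recon_error(on_model) < recon_error(off_model))
```

The incremental algorithm in the paper updates `mean` and `U` frame by frame (with a forgetting factor) instead of recomputing the SVD from scratch.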
3,151 citations
"Computationally efficient deep trac..." refers to methods in this paper
...In generative methods, the generative model is described using the target object alone and is used to search the target in the image region with minimum error [1], [2], [3], [19]....
TL;DR: A novel tracking framework (TLD) is proposed that explicitly decomposes the long-term tracking task into tracking, learning, and detection, together with a novel learning method (P-N learning) which estimates the detector's errors via a pair of "experts": the P-expert estimates missed detections, and the N-expert estimates false alarms.
Abstract: This paper investigates long-term tracking of unknown objects in a video stream. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object's location and extent or indicate that the object is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector's errors and updates it to avoid these errors in the future. We study how to identify the detector's errors and learn from them. We develop a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: (1) P-expert estimates missed detections, and (2) N-expert estimates false alarms. The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found. We describe our real-time implementation of the TLD framework and the P-N learning. We carry out an extensive quantitative evaluation which shows a significant improvement over state-of-the-art approaches.
3,137 citations