
Showing papers on "Motion estimation" published in 2018


Proceedings ArticleDOI
29 Mar 2018
TL;DR: A recurrent sequence-to-sequence model observes motion histories and predicts future behavior, using a novel pooling mechanism to aggregate information across people, and outperforms prior work in terms of accuracy, variety, collision avoidance, and computational complexity.
Abstract: Understanding human motion behavior is critical for autonomous moving platforms (like self-driving cars and social robots) if they are to navigate human-centric environments. This is challenging because human motion is inherently multimodal: given a history of human motion paths, there are many socially plausible ways that people could move in the future. We tackle this problem by combining tools from sequence prediction and generative adversarial networks: a recurrent sequence-to-sequence model observes motion histories and predicts future behavior, using a novel pooling mechanism to aggregate information across people. We predict socially plausible futures by training adversarially against a recurrent discriminator, and encourage diverse predictions with a novel variety loss. Through experiments on several datasets we demonstrate that our approach outperforms prior work in terms of accuracy, variety, collision avoidance, and computational complexity.
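
To make the "variety loss" idea above concrete, here is a minimal numpy sketch of a best-of-k trajectory loss: only the sampled future closest to the ground truth incurs the penalty, which encourages diverse predictions. The array shapes, the number of samples, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def variety_loss(pred_samples, ground_truth):
    """Best-of-k trajectory loss: keep only the sample closest to the ground truth.

    pred_samples : array of shape (k, T, 2) -- k sampled future trajectories
    ground_truth : array of shape (T, 2)    -- observed future trajectory
    Returns the minimum mean squared L2 error over the k samples.
    """
    errors = np.mean(np.sum((pred_samples - ground_truth) ** 2, axis=-1), axis=-1)  # (k,)
    return errors.min()

# Toy usage: 5 hypothetical samples of a 12-step future path.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(12, 2)), axis=0)
samples = gt + rng.normal(scale=0.5, size=(5, 12, 2))
print(variety_loss(samples, gt))
```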

1,461 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: A novel end-to-end deep neural network that generates dynamic upsampling filters and a residual image, which are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation is proposed.
Abstract: Video super-resolution (VSR) has become even more important recently to provide high resolution (HR) contents for ultra high definition displays. While many deep learning based VSR methods have been proposed, most of them rely heavily on the accuracy of motion estimation and compensation. We introduce a fundamentally different framework for VSR in this paper. We propose a novel end-to-end deep neural network that generates dynamic upsampling filters and a residual image, which are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation. With our approach, an HR image is reconstructed directly from the input image using the dynamic upsampling filters, and the fine details are added through the computed residual. Our network with the help of a new data augmentation technique can generate much sharper HR videos with temporal consistency, compared with the previous methods. We also provide analysis of our network through extensive experiments to show how the network deals with motions implicitly.
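
The sketch below illustrates how per-pixel dynamic upsampling filters could be applied to a low-resolution frame. In the paper the filters and the residual are produced by the network from a spatio-temporal neighbourhood; here a single frame, a 5x5 filter size, a scale factor of 4, and random filter values are simplifying assumptions for illustration only.

```python
import numpy as np

def apply_dynamic_upsampling(lr, filters, scale=4, k=5):
    """Apply per-pixel dynamic upsampling filters to a low-resolution frame.

    lr      : (H, W) low-resolution luminance frame
    filters : (H, W, scale*scale, k, k) one k x k filter per LR pixel and HR sub-position
    Returns an (H*scale, W*scale) high-resolution frame (before adding any residual image).
    """
    H, W = lr.shape
    pad = k // 2
    lr_pad = np.pad(lr, pad, mode='edge')
    hr = np.zeros((H * scale, W * scale), dtype=lr.dtype)
    for i in range(H):
        for j in range(W):
            patch = lr_pad[i:i + k, j:j + k]                 # local LR neighbourhood
            for s in range(scale * scale):
                di, dj = divmod(s, scale)                    # HR sub-position inside the pixel
                hr[i * scale + di, j * scale + dj] = np.sum(filters[i, j, s] * patch)
    return hr

# Toy usage with random filters standing in for the network's predictions.
rng = np.random.default_rng(0)
lr = rng.random((8, 8))
f = rng.random((8, 8, 16, 5, 5))
f /= f.sum(axis=(-1, -2), keepdims=True)                     # normalise each filter
print(apply_dynamic_upsampling(lr, f).shape)                 # (32, 32)
```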

503 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: The key idea is to exploit parameterized kernel functions that span the full continuous vector space, which allows us to learn over arbitrary data structures as long as their support relationship is computable.
Abstract: Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and lidar motion estimation of driving scenes.
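
The following plain-numpy sketch illustrates the continuous-convolution idea: the kernel weight for each neighbour is produced by a small MLP evaluated on the continuous offset between points, so no grid is required. The MLP size, the radius-based neighbourhood, and the random weights are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def parametric_continuous_conv(points, feats, W1, b1, W2, b2, radius=1.0):
    """Continuous convolution over an unstructured point set.

    The kernel for a neighbour offset d (a 3-vector) is given by a small MLP
    g(d) = W2 @ relu(W1 @ d + b1) + b2, so the kernel is defined on the full
    continuous space rather than on a fixed grid.

    points : (N, 3) point coordinates
    feats  : (N, C_in) input features
    W1, b1, W2, b2 : MLP parameters mapping a 3-D offset to a (C_out, C_in) kernel
    Returns (N, C_out) output features.
    """
    N, C_in = feats.shape
    C_out = W2.shape[0] // C_in
    out = np.zeros((N, C_out))
    for i in range(N):
        d = points - points[i]                               # offsets to all points
        nbr = np.where(np.linalg.norm(d, axis=1) < radius)[0]
        for j in nbr:
            h = np.maximum(W1 @ d[j] + b1, 0.0)              # hidden layer, ReLU
            K = (W2 @ h + b2).reshape(C_out, C_in)           # predicted kernel for this offset
            out[i] += K @ feats[j]
    return out

# Toy usage: 64 random points with 4-channel features, 8 output channels.
rng = np.random.default_rng(0)
C_in, C_out, hidden = 4, 8, 16
pts = rng.random((64, 3))
fts = rng.random((64, C_in))
W1, b1 = rng.normal(size=(hidden, 3)), np.zeros(hidden)
W2, b2 = rng.normal(size=(C_out * C_in, hidden)), np.zeros(C_out * C_in)
print(parametric_continuous_conv(pts, fts, W1, b1, W2, b2, radius=0.5).shape)  # (64, 8)
```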

392 citations


Book ChapterDOI
08 Sep 2018
TL;DR: The Deep Virtual Stereo Odometry incorporates deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements and designs a novel deep network that refines predicted depth from a single image in a two-stage process.
Abstract: Monocular visual odometry approaches that purely rely on geometric cues are prone to scale drift and require sufficient motion parallax in successive frames for motion estimation and 3D reconstruction. In this paper, we propose to leverage deep monocular depth prediction to overcome limitations of geometry-based monocular visual odometry. To this end, we incorporate deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements. For depth prediction, we design a novel deep network that refines predicted depth from a single image in a two-stage process. We train our network in a semi-supervised way on photoconsistency in stereo images and on consistency with accurate sparse depth reconstructions from Stereo DSO. Our depth predictions outperform state-of-the-art approaches for monocular depth on the KITTI benchmark. Moreover, our Deep Virtual Stereo Odometry clearly exceeds previous monocular and deep-learning-based methods in accuracy. It even achieves comparable performance to the state-of-the-art stereo methods, while only relying on a single camera.

357 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, the authors proposed an end-to-end system for video-based measurement of heart and breathing rate using a deep convolutional network and an attention mechanism using appearance information to guide motion estimation.
Abstract: Non-contact video-based physiological measurement has many applications in health care and human-computer interaction. Practical applications require measurements to be accurate even in the presence of large head rotations. We propose the first end-to-end system for video-based measurement of heart and breathing rate using a deep convolutional network. The system features a new motion representation based on a skin reflection model and a new attention mechanism using appearance information to guide motion estimation, both of which enable robust measurement under heterogeneous lighting and major motions. Our approach significantly outperforms all current state-of-the-art methods on both RGB and infrared video datasets. Furthermore, it allows spatial-temporal distributions of physiological signals to be visualized via the attention mechanism.

276 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: A novel sequence-to-sequence model for probabilistic human motion prediction, trained with a modified version of improved Wasserstein generative adversarial networks (WGAN-GP), in which the model learns a probability density function of future human poses conditioned on previous poses.
Abstract: Predicting and understanding human motion dynamics has many applications, such as motion synthesis, augmented reality, security, and autonomous vehicles. Due to the recent success of generative adversarial networks (GAN), there has been much interest in probabilistic estimation and synthetic data generation using deep neural network architectures and learning algorithms. We propose a novel sequence-to-sequence model for probabilistic human motion prediction, trained with a modified version of improved Wasserstein generative adversarial networks (WGAN-GP), in which we use a custom loss function designed for human motion prediction. Our model, which we call HP-GAN, learns a probability density function of future human poses conditioned on previous poses. It predicts multiple sequences of possible future human poses, each from the same input sequence but a different vector z drawn from a random distribution. Furthermore, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment model that learns the probability that a given skeleton sequence is a real human motion. We test our algorithm on two of the largest skeleton datasets: NTU RGB+D and Human3.6M. We train our model on both single and multiple action types. Its predictive power for long-term motion estimation is demonstrated by generating multiple plausible futures of more than 30 frames from just 10 frames of input. We show that most sequences generated from the same input are judged to be a real human sequence with a probability of more than 50%. All the code used in this paper is published at https://github.com/ebarsoum/hpgan.

231 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this paper, a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion is proposed for video object detection, which steadily pushes forward the performance envelope (speed-accuracy tradeoff) towards high-performance video object detection.
Abstract: There has been significant progress in image object detection in recent years. Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. Built upon the recent works [37, 36], this work proposes a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion. Our approach extends prior works with three new techniques and steadily pushes forward the performance envelope (speed-accuracy tradeoff) towards high-performance video object detection.

217 citations


Proceedings ArticleDOI
21 May 2018
TL;DR: A method for robust dense RGB-D SLAM in dynamic environments which detects moving objects and simultaneously reconstructs the background structure and achieves similar performance in static environments and improved accuracy and robustness in dynamic scenes is proposed.
Abstract: Dynamic environments are challenging for visual SLAM as moving objects can impair camera pose tracking and cause corruptions to be integrated into the map. In this paper, we propose a method for robust dense RGB-D SLAM in dynamic environments which detects moving objects and simultaneously reconstructs the background structure. While most methods employ implicit robust penalisers or outlier filtering techniques in order to handle moving objects, our approach is to simultaneously estimate the camera motion as well as a probabilistic static/dynamic segmentation of the current RGB-D image pair. This segmentation is then used for weighted dense RGB-D fusion to estimate a 3D model of only the static parts of the environment. By leveraging the 3D model for frame-to-model alignment, as well as static/dynamic segmentation, camera motion estimation has reduced overall drift and is more robust to the presence of dynamics in the scene. Demonstrations are presented which compare the proposed method to related state-of-the-art approaches using both static and dynamic sequences. The proposed method achieves similar performance in static environments and improved accuracy and robustness in dynamic scenes.

178 citations


Posted Content
TL;DR: A novel adaptive warping layer is developed to integrate both optical flow and interpolation kernels to synthesize target frame pixels and is fully differentiable such that both the flow and kernel estimation networks can be optimized jointly.
Abstract: Motion estimation (ME) and motion compensation (MC) have been widely used for classical video frame interpolation systems over the past decades. Recently, a number of data-driven frame interpolation methods based on convolutional neural networks have been proposed. However, existing learning based methods typically estimate either flow or compensation kernels, thereby limiting performance on both computational efficiency and interpolation accuracy. In this work, we propose a motion estimation and compensation driven neural network for video frame interpolation. A novel adaptive warping layer is developed to integrate both optical flow and interpolation kernels to synthesize target frame pixels. This layer is fully differentiable such that both the flow and kernel estimation networks can be optimized jointly. The proposed model benefits from the advantages of motion estimation and compensation methods without using hand-crafted features. Compared to existing methods, our approach is computationally efficient and able to generate more visually appealing results. Furthermore, the proposed MEMC-Net can be seamlessly adapted to several video enhancement tasks, e.g., super-resolution, denoising, and deblocking. Extensive quantitative and qualitative evaluations demonstrate that the proposed method performs favorably against the state-of-the-art video frame interpolation and enhancement algorithms on a wide range of datasets.
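
A rough numpy sketch of what an adaptive warping layer computes is given below: each target pixel blends a small source neighbourhood around its flow-displaced location with a per-pixel interpolation kernel, so the flow and kernel estimates act together in one sampling step. The kernel size, the nearest-integer handling of sub-pixel flow, and the random inputs are simplifying assumptions; the actual layer is differentiable and trained end to end.

```python
import numpy as np

def adaptive_warp(frame, flow, kernels, k=4):
    """Warp a frame with per-pixel optical flow and per-pixel interpolation kernels.

    frame   : (H, W) source frame
    flow    : (H, W, 2) optical flow (dx, dy) pointing from the target into the source
    kernels : (H, W, k, k) interpolation kernel predicted for every target pixel
    For each target pixel, the k x k source neighbourhood around the flow-displaced
    location is blended with that pixel's kernel.
    """
    H, W = frame.shape
    pad = k // 2
    src = np.pad(frame, pad, mode='edge')
    out = np.zeros_like(frame)
    for i in range(H):
        for j in range(W):
            x = int(np.floor(j + flow[i, j, 0]))             # integer part of target location
            y = int(np.floor(i + flow[i, j, 1]))
            x = np.clip(x, 0, W - 1)
            y = np.clip(y, 0, H - 1)
            patch = src[y:y + k, x:x + k]                    # k x k neighbourhood (padded coords)
            out[i, j] = np.sum(kernels[i, j] * patch)
    return out

# Toy usage with a random frame, small random flow, and normalised random kernels.
rng = np.random.default_rng(0)
frame = rng.random((16, 16))
flow = rng.normal(scale=1.0, size=(16, 16, 2))
ker = rng.random((16, 16, 4, 4))
ker /= ker.sum(axis=(-1, -2), keepdims=True)
print(adaptive_warp(frame, flow, ker).shape)                 # (16, 16)
```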

170 citations


Journal ArticleDOI
TL;DR: The subtle motions from recorded video are extracted by means of Phase-based Motion Estimation (PME) and the extracted information is used to conduct damage identification on a 2.3-m long Skystream® wind turbine blade (WTB).
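
Since this entry relies on phase-based motion estimation, here is a minimal 1-D illustration of the principle: the local phase of a complex Gabor filter shifts in proportion to tiny displacements, so the phase difference between two frames divided by the filter frequency recovers sub-sample motion. The filter parameters and the synthetic sinusoid are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def gabor_phase(signal, omega0=0.5, sigma=8.0):
    """Local phase of a 1-D signal from a complex Gabor filter centred at omega0."""
    t = np.arange(-4 * sigma, 4 * sigma + 1)
    gabor = np.exp(-t**2 / (2 * sigma**2)) * np.exp(1j * omega0 * t)
    response = np.convolve(signal, gabor, mode='same')
    return np.angle(response)

# A sinusoid and a copy shifted by a sub-sample amount.
true_shift = 0.3
omega0 = 0.5
x = np.arange(256)
f0 = np.sin(omega0 * x)
f1 = np.sin(omega0 * (x - true_shift))

# Phase change between the frames, rewrapped to (-pi, pi], divided by the filter
# frequency gives the displacement in samples.
dphi = np.angle(np.exp(1j * (gabor_phase(f0, omega0) - gabor_phase(f1, omega0))))
estimate = np.median(dphi[64:192]) / omega0          # use the centre to avoid border effects
print(estimate)                                       # close to 0.3
```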

163 citations


Journal ArticleDOI
TL;DR: This model has low computational complexity and runs orders of magnitude faster than other multi-frame SR methods, and with its powerful temporal dependency modeling it can super-resolve videos with complex motions and achieve good performance.
Abstract: Super resolving a low-resolution video, namely video super-resolution (SR), is usually handled by either single-image SR or multi-frame SR. Single-image SR deals with each video frame independently and ignores the intrinsic temporal dependency of video frames, which actually plays a very important role in video SR. Multi-frame SR generally extracts motion information, e.g., optical flow, to model the temporal dependency, but often shows high computational cost. Considering that recurrent neural networks (RNNs) can model long-term temporal dependency of video sequences well, we propose a fully convolutional RNN named bidirectional recurrent convolutional network for efficient multi-frame SR. Different from vanilla RNNs, 1) the commonly used full feedforward and recurrent connections are replaced with weight-sharing convolutional connections, which greatly reduces the number of network parameters and models the temporal dependency at a finer level, i.e., patch-based rather than frame-based, and 2) connections from input layers at previous timesteps to the current hidden layer are added via 3D feedforward convolutions, which aim to capture discriminative spatio-temporal patterns for short-term fast-varying motions in local adjacent frames. Due to the cheap convolutional operations, our model has low computational complexity and runs orders of magnitude faster than other multi-frame SR methods. With the powerful temporal dependency modeling, our model can super-resolve videos with complex motions and achieve good performance.

Proceedings ArticleDOI
21 May 2018
TL;DR: This paper presents a reliable and accurate radar-only motion estimation algorithm for mobile autonomous systems, which uses a frequency-modulated continuous-wave scanning radar to extract landmarks and performs scan matching by greedily adding point correspondences based on unary descriptors and pairwise compatibility scores.
Abstract: In contrast to cameras, lidars, GPS, and proprioceptive sensors, radars are affordable and efficient systems that operate well under variable weather and lighting conditions, require no external infrastructure, and detect long-range objects. In this paper, we present a reliable and accurate radar-only motion estimation algorithm for mobile autonomous systems. Using a frequency-modulated continuous-wave (FMCW) scanning radar, we first extract landmarks with an algorithm that accounts for unwanted effects in radar returns. To estimate relative motion, we then perform scan matching by greedily adding point correspondences based on unary descriptors and pairwise compatibility scores. Our radar odometry results are robust under a variety of conditions, including those under which visual odometry and GPS/INS fail.
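
The sketch below illustrates, under simplifying assumptions, the kind of scan-matching step the abstract describes: candidate landmark correspondences are kept greedily only if they are pairwise compatible (inter-landmark distances preserved), and the surviving matches yield a least-squares rigid 2-D transform via the standard SVD (Kabsch) solution. The tolerance, toy landmarks, and identity candidate list are illustrative; they are not the paper's descriptors or scoring.

```python
import numpy as np

def greedy_compatible_matches(cands, P, Q, tol=0.5):
    """Greedily keep candidate matches whose pairwise distances are preserved.

    cands : list of (i, j) candidate correspondences, assumed sorted by descriptor score
    P, Q  : (N, 2) and (M, 2) landmark positions in the two scans
    A match is accepted only if, for every match already kept, the landmark distance
    in scan P agrees with the distance in scan Q within `tol` metres.
    """
    kept = []
    for i, j in cands:
        ok = all(abs(np.linalg.norm(P[i] - P[a]) - np.linalg.norm(Q[j] - Q[b])) < tol
                 for a, b in kept)
        if ok:
            kept.append((i, j))
    return kept

def rigid_transform(P, Q):
    """Least-squares 2-D rotation R and translation t with Q ~ P @ R.T + t (Kabsch/SVD)."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T
    t = Q.mean(0) - P.mean(0) @ R.T
    return R, t

# Toy usage: a second scan generated by rotating and translating the first.
rng = np.random.default_rng(0)
P = rng.random((10, 2)) * 50
theta = np.deg2rad(5.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Q = P @ R_true.T + np.array([2.0, -1.0])
matches = greedy_compatible_matches([(k, k) for k in range(10)], P, Q)
R, t = rigid_transform(P[[i for i, _ in matches]], Q[[j for _, j in matches]])
print(np.round(t, 3))                                        # close to [ 2. -1.]
```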

Journal ArticleDOI
TL;DR: A methodology is introduced for the reconstruction of multi-shot, multi-slice magnetic resonance imaging able to cope with both within-plane and through-plane rigid motion, and its application in structural brain imaging is described.
Abstract: Purpose To introduce a methodology for the reconstruction of multi-shot, multi-slice magnetic resonance imaging able to cope with both within-plane and through-plane rigid motion and to describe its application in structural brain imaging. Theory and Methods The method alternates between motion estimation and reconstruction using a common objective function for both. Estimates of three-dimensional motion states for each shot and slice are gradually refined by improving on the fit of current reconstructions to the partial k-space information from multiple coils. Overlapped slices and super-resolution allow recovery of through-plane motion and outlier rejection discards artifacted shots. The method is applied to T2 and T1 brain scans acquired in different views. Results The procedure has greatly diminished artifacts in a database of 1883 neonatal image volumes, as assessed by image quality metrics and visual inspection. Examples showing the ability to correct for motion and robustness against damaged shots are provided. Combination of motion corrected reconstructions for different views has shown further artifact suppression and resolution recovery. Conclusion The proposed method addresses the problem of rigid motion in multi-shot multi-slice anatomical brain scans. Tests on a large collection of potentially corrupted datasets have shown a remarkable image quality improvement.

Book ChapterDOI
16 Sep 2018
TL;DR: In this article, a Siamese-style recurrent spatial transformer network is used for joint estimation of motion and segmentation from cardiac MR image sequences, and a joint multi-scale feature encoder is learned by optimizing the segmentation branch and the motion estimation branch simultaneously, enabling the weakly-supervised segmentation by taking advantage of features that are unsupervisedly learned in the motion estimator from a large amount of unannotated data.
Abstract: Cardiac motion estimation and segmentation play important roles in quantitatively assessing cardiac function and diagnosing cardiovascular diseases. In this paper, we propose a novel deep learning method for joint estimation of motion and segmentation from cardiac MR image sequences. The proposed network consists of two branches: a cardiac motion estimation branch which is built on a novel unsupervised Siamese style recurrent spatial transformer network, and a cardiac segmentation branch that is based on a fully convolutional network. In particular, a joint multi-scale feature encoder is learned by optimizing the segmentation branch and the motion estimation branch simultaneously. This enables the weakly-supervised segmentation by taking advantage of features that are unsupervisedly learned in the motion estimation branch from a large amount of unannotated data. Experimental results using cardiac MRI images from 220 subjects show that the joint learning of both tasks is complementary and the proposed models outperform the competing methods significantly in terms of accuracy and speed.

Journal ArticleDOI
TL;DR: This paper investigates the feasibility of a two-stage motion estimation method, which is a combination of affine and nonrigid estimation, for SR US imaging and reduces the width of the motion-blurred microvessels to approximately 1.5-fold.
Abstract: The structure of microvasculature cannot be resolved using conventional ultrasound (US) imaging due to the fundamental diffraction limit at clinical US frequencies. It is possible to overcome this resolution limitation by localizing individual microbubbles through multiple frames and forming a superresolved image, which usually requires seconds to minutes of acquisition. Over this time interval, motion is inevitable and tissue movement is typically a combination of large- and small-scale tissue translation and deformation. Therefore, super-resolution (SR) imaging is prone to motion artifacts as other imaging modalities based on multiple acquisitions are. This paper investigates the feasibility of a two-stage motion estimation method, which is a combination of affine and nonrigid estimation, for SR US imaging. First, the motion correction accuracy of the proposed method is evaluated using simulations with increasing complexity of motion. A mean absolute error of 12.2 μm was achieved in simulations for the worst-case scenario. The motion correction algorithm was then applied to a clinical data set to demonstrate its potential to enable in vivo SR US imaging in the presence of patient motion. The size of the identified microvessels from the clinical SR images was measured to assess the feasibility of the two-stage motion correction method, which reduced the width of the motion-blurred microvessels to approximately 1.5-fold.

Posted Content
TL;DR: DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation and composes a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture.
Abstract: We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available at this https URL.

Journal ArticleDOI
TL;DR: A novel model for video salient object detection called spatiotemporal constrained optimization model (SCOM), which exploits spatial and temporal cues, as well as a local constraint, to achieve a global saliency optimization.
Abstract: This paper presents a novel model for video salient object detection called spatiotemporal constrained optimization model (SCOM), which exploits spatial and temporal cues, as well as a local constraint, to achieve a global saliency optimization. For a robust motion estimation of salient objects, we propose a novel approach to modeling the motion cues from the optical flow field, the saliency map of the prior video frame and the motion history of change detection, which is able to distinguish the moving salient objects from diverse changing background regions. Furthermore, an effective objectness measure is proposed with an intuitive geometrical interpretation to extract some reliable object and background regions, which are provided as the basis to define the foreground potential, background potential, and the constraint to support saliency propagation. These potentials and the constraint are formulated into the proposed SCOM framework to generate an optimal saliency map for each frame in a video. The proposed model is extensively evaluated on the widely used challenging benchmark data sets. Experiments demonstrate that our proposed SCOM substantially outperforms the state-of-the-art saliency models.

Journal ArticleDOI
TL;DR: A novel approach is introduced that relies on statistical analysis rather than physical models and uses a convolutional neural network to directly estimate the motion of successive ultrasound frames in an end-to-end fashion, yielding unprecedentedly accurate reconstructions.

Journal ArticleDOI
TL;DR: A simplified affine motion model-based coding framework is studied to overcome the limitations of the translational motion model while maintaining low computational complexity.
Abstract: In this paper, we study a simplified affine motion model-based coding framework to overcome the limitations of the translational motion model while maintaining low computational complexity. The proposed framework mainly has three key contributions. First, we propose to reduce the number of affine motion parameters from 6 to 4. The proposed four-parameter affine motion model can not only handle most of the complex motions in natural videos, but also save the bits for two parameters. Second, to efficiently encode the affine motion parameters, we propose two motion prediction modes, i.e., an advanced affine motion vector prediction scheme combined with a gradient-based fast affine motion estimation algorithm and an affine model merge scheme, where the latter attempts to reuse the affine motion parameters (instead of the motion vectors) of neighboring blocks. Third, we propose two fast affine motion compensation algorithms. One is the one-step sub-pixel interpolation that reduces the computation required for each interpolation. The other is the interpolation-precision-based adaptive block size motion compensation that performs motion compensation at the block level rather than the pixel level to reduce the number of interpolations. Our proposed techniques have been implemented based on the state-of-the-art high-efficiency video coding standard, and the experimental results show that the proposed techniques altogether achieve, on average, 11.1% and 19.3% bit savings for random access and low-delay configurations, respectively, on typical video sequences that have rich rotation or zooming motions. Meanwhile, the computational complexity increases of both the encoder and the decoder are within an acceptable range.
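
To show what a four-parameter affine motion model looks like in practice, the sketch below derives a sub-block motion-vector field from two control-point motion vectors (top-left and top-right of the block), covering translation, rotation, and zoom, in the way such simplified models are commonly formulated. The 4x4 sub-block size and the toy vectors are illustrative assumptions, not the exact scheme evaluated in the paper.

```python
import numpy as np

def affine_4param_mv_field(mv0, mv1, width, height, sub=4):
    """Sub-block motion vectors of a four-parameter (zoom + rotation + translation) affine model.

    mv0, mv1 : (x, y) motion vectors at the block's top-left and top-right control points
    width, height : block dimensions in pixels
    sub : sub-block size at which motion compensation is performed instead of per pixel
    Returns an array of shape (height//sub, width//sub, 2) of motion vectors.
    """
    a = (mv1[0] - mv0[0]) / width       # zoom component
    b = (mv1[1] - mv0[1]) / width       # rotation component
    ys, xs = np.mgrid[0:height:sub, 0:width:sub]
    cx, cy = xs + sub / 2.0, ys + sub / 2.0          # sub-block centres
    mvx = a * cx - b * cy + mv0[0]
    mvy = b * cx + a * cy + mv0[1]
    return np.stack([mvx, mvy], axis=-1)

# Toy usage: a 16x16 block rotating slightly while translating.
field = affine_4param_mv_field(mv0=(1.0, 0.5), mv1=(1.0, 1.5), width=16, height=16)
print(field.shape)      # (4, 4, 2)
print(field[0, 0])      # motion vector of the top-left 4x4 sub-block
```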

Journal ArticleDOI
TL;DR: A new patch-based empirical Bayesian video denoising algorithm that builds a Bayesian model for each group of similar space-time patches and estimates the eigenvalues of the a priori covariance as simple corrections of the eigenvalues of the sample covariance matrix, with these estimators shown empirically to lead to better empirical Wiener filters.
Abstract: In this paper we present a new patch-based empirical Bayesian video denoising algorithm. The method builds a Bayesian model for each group of similar space-time patches. These patches are not motion-compensated, and therefore avoid the risk of inaccuracies caused by motion estimation errors. The high dimensionality of spatiotemporal patches together with a limited number of available samples poses challenges when estimating the statistics needed for an empirical Bayesian method. We therefore assume that groups of similar patches have a low intrinsic dimensionality, leading to a spiked covariance model. Based on theoretical results about the estimation of spiked covariance matrices, we propose estimators of the eigenvalues of the a priori covariance in high-dimensional spaces as simple corrections of the eigenvalues of the sample covariance matrix. We demonstrate empirically that these estimators lead to better empirical Wiener filters. A comparison on classic benchmark videos demonstrates improved visual quality and an increased PSNR with respect to state-of-the-art video denoising methods.

Proceedings ArticleDOI
21 May 2018
TL;DR: A framework is described for direct visual simultaneous localization and mapping (SLAM) that combines a monocular camera with sparse depth information from Light Detection and Ranging (LiDAR), with strict pose marginalization for accurate pose-graph SLAM and depth-integrated frame matching for large-scale mapping.
Abstract: This paper describes a framework for direct visual simultaneous localization and mapping (SLAM) combining a monocular camera with sparse depth information from Light Detection and Ranging (LiDAR). To ensure real-time performance while maintaining high accuracy in motion estimation, we present (i) a sliding window-based tracking method, (ii) strict pose marginalization for accurate pose-graph SLAM and (iii) depth-integrated frame matching for large-scale mapping. Unlike conventional feature-based visual and LiDAR mapping, the proposed approach is direct, eliminating the visual feature in the objective function. We evaluated results using our portable camera-LiDAR system as well as KITTI odometry benchmark datasets. The experimental results show that the complementary characteristics of the two sensors are very effective in improving real-time performance and accuracy. Via validation, we achieved a low drift error of 0.98% in the KITTI benchmark including various environments such as a highway and residential areas.

Posted ContentDOI
07 Jun 2018-bioRxiv
TL;DR: It is shown unequivocally that respiration contaminates movement estimates in functional MRI and generates apparent head motion not associated with degraded quality of functional MRI, and a novel approach using a band-stop filter that accurately removes these respiratory effects is developed.
Abstract: Head motion represents one of the greatest technical obstacles for brain MRI. Accurate detection of artifacts induced by head motion requires precise estimation of movement. However, this estimation may be corrupted by factitious effects owing to main field fluctuations generated by body motion. In the current report, we examine head motion estimation in multiband resting state functional connectivity MRI (rs-fcMRI) data from the Adolescent Brain Cognitive Development (ABCD) Study and a comparison 'single-shot' dataset from Oregon Health & Science University. We show unequivocally that respirations contaminate movement estimates in functional MRI and that respiration generates apparent head motion not associated with degraded quality of functional MRI. We have developed a novel approach using a band-stop filter that accurately removes these respiratory effects. Subsequently, we demonstrate that utilizing this filter improves post-processing data quality. Lastly, we demonstrate the real-time implementation of motion estimate filtering in our FIRMM (Framewise Integrated Real-Time MRI Monitoring) software package.
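
A minimal sketch of the band-stop filtering idea is given below. The stop band (0.2-0.5 Hz, covering typical respiratory rates), the repetition time of 0.8 s, and the zero-phase Butterworth design are illustrative assumptions; the actual FIRMM implementation and the study's choice of respiratory band may differ.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandstop_motion_params(motion, tr=0.8, stop_band=(0.2, 0.5), order=2):
    """Notch out the respiratory band from framewise motion parameter traces.

    motion    : (T, 6) array of rigid-body motion estimates (3 translations, 3 rotations)
    tr        : repetition time in seconds (sampling interval of the traces)
    stop_band : frequency band (Hz) to suppress, chosen here to cover respiration
    Returns the filtered traces, applied forward and backward to avoid phase shift.
    """
    fs = 1.0 / tr
    b, a = butter(order, stop_band, btype='bandstop', fs=fs)
    return filtfilt(b, a, motion, axis=0)

# Toy usage: a slow drift plus a 0.3 Hz "respiratory" oscillation on one parameter.
t = np.arange(0, 300, 0.8)
motion = np.zeros((t.size, 6))
motion[:, 2] = 0.001 * t + 0.2 * np.sin(2 * np.pi * 0.3 * t)
filtered = bandstop_motion_params(motion)
print(np.std(motion[:, 2] - filtered[:, 2]))   # roughly the size of the removed 0.3 Hz component
```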

Book ChapterDOI
Liangliang Ren, Xin Yuan, Jiwen Lu, Ming Yang, Jie Zhou
08 Sep 2018
TL;DR: A deep reinforcement learning with iterative shift (DRL-IS) method for single object tracking, where an actor-critic network is introduced to predict the iterative shifts of object bounding boxes, and evaluate the shifts to take actions on whether to update object models or re-initialize tracking.
Abstract: Visual tracking is confronted by the dilemma of locating a target both accurately and efficiently while making decisions online about whether and how to adapt the appearance model or even restart tracking. In this paper, we propose a deep reinforcement learning with iterative shift (DRL-IS) method for single object tracking, where an actor-critic network is introduced to predict the iterative shifts of object bounding boxes and evaluate the shifts to take actions on whether to update object models or re-initialize tracking. Since locating an object is achieved by an iterative shift process, rather than online classification on many sampled locations, the proposed method is robust to large deformations and abrupt motion, and computationally efficient, since finding a target takes up to 10 shifts. In offline training, the critic network guides the learning of how to make decisions jointly on motion estimation and tracking status in an end-to-end manner. Experimental results on the OTB benchmarks show that, on sequences with large deformation, the proposed method improves the tracking precision by 1.7% and runs about 5 times faster than the competing state-of-the-art methods.

Proceedings ArticleDOI
13 Dec 2018
TL;DR: This work builds upon the recent developments in deep Convolutional Neural Networks (CNN) and automatically estimates the intrinsic parameters of the camera from a single input image, using the great amount of omnidirectional images available on the Internet to generate a large-scale dataset.
Abstract: Calibration of wide field-of-view cameras is a fundamental step for numerous visual media production applications, such as 3D reconstruction, image undistortion, augmented reality and camera motion estimation. However, existing calibration methods require multiple images of a calibration pattern (typically a checkerboard), assume the presence of lines, require manual interaction and/or need an image sequence. In contrast, we present a novel fully automatic deep learning-based approach that overcomes all these limitations and works with a single image of general scenes. Our approach builds upon the recent developments in deep Convolutional Neural Networks (CNN): our network automatically estimates the intrinsic parameters of the camera (focal length and distortion parameter) from a single input image. In order to train the CNN, we leverage the great amount of omnidirectional images available on the Internet to automatically generate a large-scale dataset composed of millions of wide field-of-view images with ground truth intrinsic parameters. Experiments successfully demonstrated the quality of our results, both quantitatively and qualitatively.

Journal ArticleDOI
TL;DR: This paper converts the segmentation of motion capture data into a temporal subspace clustering problem, and proposes a new segmentation method, which is robust to non-Gaussian noise, since correntropy is a localized similarity measure.
Abstract: Studies on human motion have attracted a lot of attention. Human motion capture data, which much more precisely records human motion than videos do, has been widely used in many areas. Motion segmentation is an indispensable step for many related applications, but current segmentation methods for motion capture data do not effectively model some important characteristics of motion capture data, such as the Riemannian manifold structure and the presence of non-Gaussian noise. In this paper, we convert the segmentation of motion capture data into a temporal subspace clustering problem. Under the framework of sparse subspace clustering, we propose to use the geodesic exponential kernel to model the Riemannian manifold structure, use correntropy to measure the reconstruction error, use the triangle constraint to guarantee temporal continuity in each cluster and use multi-view reconstruction to extract the relations between different joints. Therefore, exploiting some special characteristics of motion capture data, we propose a new segmentation method, which is robust to non-Gaussian noise, since correntropy is a localized similarity measure. We also develop an efficient optimization algorithm based on the block coordinate descent method to solve the proposed model. Our optimization algorithm has a linear complexity while sparse subspace clustering is originally a quadratic problem. Extensive experimental results on both simulated and real noisy data sets demonstrate the advantage of the proposed method.

Journal ArticleDOI
07 Feb 2018
TL;DR: In this paper, an autoencoder network is used to find a nonlinear representation of the optical flow manifold, and this latent space is learned jointly with the camera ego-motion estimation task, yielding the latent space visual odometry (LS-VO) architecture.
Abstract: This work proposes a novel deep network architecture to solve the camera ego-motion estimation problem. A motion estimation network generally learns features similar to optical flow (OF) fields starting from sequences of images. This OF can be described by a lower dimensional latent space. Previous research has shown how to find linear approximations of this space. We propose to use an autoencoder network to find a nonlinear representation of the OF manifold. In addition, we propose to learn the latent space jointly with the estimation task, so that the learned OF features become a more robust description of the OF input. We call this novel architecture latent space visual odometry (LS-VO). The experiments show that LS-VO achieves a considerable increase in performance with respect to baselines, while the number of parameters of the estimation network only slightly increases.

Journal ArticleDOI
TL;DR: An adaptive fractional-pixel ME skipping scheme is proposed for low-complexity HEVC ME, which reduces ME encoding time by an average of 63.22% while encoding efficiency is maintained.
Abstract: High-Efficiency Video Coding (HEVC) efficiently addresses the storage and transmission problems of high-definition videos, especially 4K videos. Variable-size Prediction Unit (PU)-based Motion Estimation (ME) contributes a significant compression rate to the HEVC encoder and also generates a huge computational load. Meanwhile, the high encoding complexity prevents widespread adoption of the HEVC encoder in multimedia systems. In this article, an adaptive fractional-pixel ME skipping scheme is proposed for low-complexity HEVC ME. First, based on the properties of the variable-size PU-based ME process and the video content partition relationship among variable-size PUs, all inter-PU modes during a coding unit encoding process are classified into a root-type PU mode and children-type PU modes. Then, according to the ME result of the root-type PU mode, the fractional-pixel ME of its children-type PU modes is adaptively skipped. Simulation results show that, compared to the original ME in the HEVC reference software, the proposed algorithm reduces ME encoding time by an average of 63.22% while encoding efficiency is maintained.
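
For context, the sketch below shows plain integer-pixel full-search block matching with a sum-of-absolute-differences criterion, the stage after which fractional-pixel refinement would normally run and which the proposed scheme decides, per children-type PU, whether to skip. The block size, search range, and toy frames are illustrative assumptions, not the HEVC reference implementation.

```python
import numpy as np

def block_match_sad(cur, ref, block=(16, 16), search=8):
    """Integer-pixel full-search block matching with the sum of absolute differences (SAD).

    cur, ref : (H, W) current and reference frames
    block    : block size (rows, cols) of each prediction unit
    search   : search range in pixels around the co-located block
    Returns an (H//bh, W//bw, 2) array of integer motion vectors (dy, dx). Fractional-pixel
    refinement would interpolate the reference around each of these integer results.
    """
    H, W = cur.shape
    bh, bw = block
    mvs = np.zeros((H // bh, W // bw, 2), dtype=int)
    for bi in range(H // bh):
        for bj in range(W // bw):
            y0, x0 = bi * bh, bj * bw
            blk = cur[y0:y0 + bh, x0:x0 + bw]
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + bh > H or x + bw > W:
                        continue
                    sad = np.abs(blk - ref[y:y + bh, x:x + bw]).sum()
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            mvs[bi, bj] = best_mv
    return mvs

# Toy usage: the reference is the current frame shifted by (2, -3) pixels.
rng = np.random.default_rng(0)
cur = rng.random((64, 64))
ref = np.roll(cur, shift=(2, -3), axis=(0, 1))
print(block_match_sad(cur, ref)[1, 1])   # interior block -> close to [ 2 -3]
```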

Journal ArticleDOI
TL;DR: A complete procedure for the automatic estimation of maritime target motion parameters by evaluating the generated Kelvin waves detected in synthetic aperture radar (SAR) images by evaluating a dual-stage low-rank plus sparse decomposition (LRSD) assisted by Radon transform (RT) for clutter reduction, sparse object detection, precise wake inclination estimation, and Kelvin wave spectral analysis.
Abstract: The problem in obtaining stable motion estimation of maritime targets is that sea clutter makes wake structure detection and reconnaissance difficult. This letter presents a complete procedure for the automatic estimation of maritime target motion parameters by evaluating the generated Kelvin waves detected in synthetic aperture radar (SAR) images. The algorithm consists in evaluating a dual-stage low-rank plus sparse decomposition (LRSD) assisted by Radon transform (RT) for clutter reduction, sparse object detection, precise wake inclination estimation, and Kelvin wave spectral analysis. The algorithm is based on the robust principal component analysis (RPCA) implemented by convex programming. The LRSD algorithm permits the extrapolation of sparse objects of interest consisting of the maritime targets and the Kelvin pattern from the unchanging low-rank background. This dual-stage RPCA and RT applied to SAR surveillance permits fast detection and enhanced motion parameter estimation of maritime targets.
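
As a small worked example of how a detected Kelvin wake relates to target motion, the snippet below applies the deep-water dispersion relation for the transverse Kelvin wave, lambda = 2*pi*V^2/g, to convert a wavelength measured in a SAR spectrum into a ship speed. The 100 m wavelength is an illustrative assumption; the paper's full procedure involves RPCA, the Radon transform, and spectral analysis of the wake.

```python
import math

def ship_speed_from_kelvin_wavelength(wavelength_m, g=9.81):
    """Deep-water dispersion relation for the transverse Kelvin wake:
    lambda = 2*pi*V**2 / g, hence V = sqrt(g * lambda / (2*pi))."""
    return math.sqrt(g * wavelength_m / (2.0 * math.pi))

# A transverse wavelength of 100 m measured in the SAR spectrum implies ~12.5 m/s (~24 kn).
v = ship_speed_from_kelvin_wavelength(100.0)
print(round(v, 2), "m/s", round(v * 1.94384, 1), "knots")
```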

Posted Content
TL;DR: A new algorithm, activation motion compensation, detects changes in the visual input and incrementally updates a previously computed activation, applying well-known motion estimation techniques to adapt to visual changes and avoid unnecessary computation on most frames.
Abstract: Hardware support for deep convolutional neural networks (CNNs) is critical to advanced computer vision in mobile and embedded devices. Current designs, however, accelerate generic CNNs; they do not exploit the unique characteristics of real-time vision. We propose to use the temporal redundancy in natural video to avoid unnecessary computation on most frames. A new algorithm, activation motion compensation, detects changes in the visual input and incrementally updates a previously-computed output. The technique takes inspiration from video compression and applies well-known motion estimation techniques to adapt to visual changes. We use an adaptive key frame rate to control the trade-off between efficiency and vision quality as the input changes. We implement the technique in hardware as an extension to existing state-of-the-art CNN accelerator designs. The new unit reduces the average energy per frame by 54.2%, 61.7%, and 87.6% for three CNNs with less than 1% loss in vision accuracy.
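
A toy numpy sketch of the activation-motion-compensation idea is given below: the activation computed on a key frame is warped towards the current frame using block motion vectors estimated on the input and rescaled to the activation's resolution, so the full CNN need not be re-run on every frame. The nearest-neighbour gather, the stride value, and the uniform toy motion field are illustrative assumptions about how such an update could look, not the hardware design described in the paper.

```python
import numpy as np

def motion_compensate_activation(act, mv_input, stride):
    """Approximate the current frame's activation by warping a cached key-frame activation.

    act      : (C, h, w) activation computed on the key frame
    mv_input : (h, w, 2) motion vectors (dy, dx) in input-pixel units, one per activation cell,
               pointing from the current frame back to the key frame
    stride   : cumulative downsampling factor of the layer, used to rescale the vectors
    Returns a (C, h, w) motion-compensated activation (nearest-neighbour gather).
    """
    C, h, w = act.shape
    mv = np.rint(mv_input / float(stride)).astype(int)        # vectors in activation cells
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + mv[..., 0], 0, h - 1)
    src_x = np.clip(xs + mv[..., 1], 0, w - 1)
    return act[:, src_y, src_x]

# Toy usage: a uniform 16-pixel shift at the input maps to a 1-cell shift at stride 16.
rng = np.random.default_rng(0)
act = rng.random((8, 14, 14))
mv = np.full((14, 14, 2), fill_value=(16.0, 0.0))
warped = motion_compensate_activation(act, mv, stride=16)
print(np.allclose(warped[:, :-1, :], act[:, 1:, :]))          # True
```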