
Showing papers on "Motion estimation published in 2021"


Journal ArticleDOI
TL;DR: In this article, a novel adaptive warping layer is developed to integrate both optical flow and interpolation kernels to synthesize target frame pixels, which is fully differentiable such that both the flow and kernel estimation networks can be optimized.
Abstract: Motion estimation (ME) and motion compensation (MC) have been widely used for classical video frame interpolation systems over the past decades. Recently, a number of data-driven frame interpolation methods based on convolutional neural networks have been proposed. However, existing learning based methods typically estimate either flow or compensation kernels, thereby limiting performance on both computational efficiency and interpolation accuracy. In this work, we propose a motion estimation and compensation driven neural network for video frame interpolation. A novel adaptive warping layer is developed to integrate both optical flow and interpolation kernels to synthesize target frame pixels. This layer is fully differentiable such that both the flow and kernel estimation networks can be optimized jointly. The proposed model benefits from the advantages of motion estimation and compensation methods without using hand-crafted features. Compared to existing methods, our approach is computationally efficient and able to generate more visually appealing results. Furthermore, the proposed MEMC-Net architecture can be seamlessly adapted to several video enhancement tasks, e.g., super-resolution, denoising, and deblocking. Extensive quantitative and qualitative evaluations demonstrate that the proposed method performs favorably against the state-of-the-art video frame interpolation and enhancement algorithms on a wide range of datasets.
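
For illustration, below is a minimal PyTorch sketch of such an adaptive warping layer: each output pixel is a kernel-weighted sum of bilinearly sampled source pixels around the flow-displaced location, so gradients reach both the flow and the kernel estimates. The tensor layout, kernel size, and function name are assumptions of this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_warp(src, flow, kernel, k=4):
    """Hypothetical sketch of an adaptive warping layer.

    src    : (B, C, H, W) source frame
    flow   : (B, 2, H, W) optical flow (dx, dy) in pixels
    kernel : (B, k*k, H, W) per-pixel interpolation weights
    Each output pixel is a kernel-weighted sum of source samples taken around
    the flow-displaced location; bilinear sampling keeps the operation
    differentiable w.r.t. both the flow and the kernel.
    """
    b, c, h, w = src.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)      # (2, H, W)

    out = torch.zeros_like(src)
    taps = [(dy, dx) for dy in range(k) for dx in range(k)]
    for i, (dy, dx) in enumerate(taps):
        # Offset of this kernel tap relative to the kernel centre, in (x, y) order.
        off = torch.tensor([dx - (k - 1) / 2, dy - (k - 1) / 2],
                           device=src.device).view(1, 2, 1, 1)
        coords = base.unsqueeze(0) + flow + off                      # (B, 2, H, W)
        # Normalise to [-1, 1] for grid_sample.
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)                         # (B, H, W, 2)
        sampled = F.grid_sample(src, grid, align_corners=True)
        out = out + kernel[:, i:i + 1] * sampled
    return out
```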

168 citations


Journal ArticleDOI
TL;DR: This paper proposes the first end-to-end deep video compression framework, which can outperform the widely used video coding standard H.264 and is even on par with the latest standard H.265.
Abstract: Traditional video compression approaches build upon the hybrid coding framework with motion-compensated prediction and residual transform coding. In this paper, we propose the first end-to-end deep video compression framework to take advantage of both the classical compression architecture and the powerful non-linear representation ability of neural networks. Our framework employs pixel-wise motion information, which is learned from an optical flow network and further compressed by an auto-encoder network to save bits. The other compression components are also implemented by the well-designed networks for high efficiency. All the modules are jointly optimized by using the rate-distortion trade-off and can collaborate with each other. More importantly, the proposed deep video compression framework is very flexible and can be easily extended by using lightweight or advanced networks for higher speed or better efficiency. We also propose to introduce the adaptive quantization layer to reduce the number of parameters for variable bitrate coding. Comprehensive experimental results demonstrate the effectiveness of the proposed framework on the benchmark datasets.
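
The joint optimization mentioned in the abstract reduces to a single rate-distortion objective. A hedged sketch follows; the variable names, the lambda value, and the way bits are estimated are placeholders rather than the paper's exact formulation.

```python
import torch

def rate_distortion_loss(x_rec, x_target, bits_motion, bits_residual, lam=1024):
    """Minimal sketch of the rate-distortion trade-off used to jointly train
    all modules of a learned video codec (names and lambda are illustrative).

    x_rec, x_target : (B, C, H, W) reconstructed and ground-truth frames
    bits_motion     : estimated bits spent on the compressed motion latents
    bits_residual   : estimated bits spent on the compressed residual latents
    """
    num_pixels = x_target.numel() / x_target.shape[1]   # B * H * W
    distortion = torch.mean((x_rec - x_target) ** 2)    # MSE distortion term
    rate = (bits_motion + bits_residual) / num_pixels   # bits per pixel
    return lam * distortion + rate
```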

123 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes a feature-space video coding network (FVC) that performs all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space.
Abstract: Learning-based video compression has attracted increasing attention in the past few years. The previous hybrid coding approaches rely on pixel-space operations to reduce spatial and temporal redundancy, which may suffer from inaccurate motion estimation or less effective motion compensation. In this work, we propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space. Specifically, in the proposed deformable compensation module, we first apply motion estimation in the feature space to produce motion information (i.e., the offset maps), which will be compressed by using the auto-encoder style network. Then we perform motion compensation by using deformable convolution and generate the predicted feature. After that, we compress the residual feature between the feature from the current frame and the predicted feature from our deformable compensation module. For better frame reconstruction, the reference features from multiple previous reconstructed frames are also fused by using the nonlocal attention mechanism in the multi-frame feature fusion module. Comprehensive experimental results demonstrate that the proposed framework achieves the state-of-the-art performance on four benchmark datasets including HEVC, UVG, VTL and MCL-JCV.
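
As an illustration of the deformable compensation step, the sketch below uses torchvision's deform_conv2d to warp a reference feature map with learned offsets. The shapes and names are illustrative; in practice the offsets and weights would come from the trained motion and compensation networks rather than random tensors.

```python
import torch
from torchvision.ops import deform_conv2d

def deformable_compensation(ref_feat, offsets, weight, k=3):
    """Hypothetical sketch of feature-space motion compensation with
    deformable convolution.

    ref_feat : (B, C, H, W) reference-frame feature
    offsets  : (B, 2*k*k, H, W) learned per-position sampling offsets
    weight   : (C, C, k, k) deformable convolution weight
    """
    pad = k // 2
    return deform_conv2d(ref_feat, offsets, weight, padding=pad)

# Illustrative shapes only.
ref = torch.randn(1, 64, 32, 32)
off = torch.randn(1, 2 * 9, 32, 32)
w = torch.randn(64, 64, 3, 3)
pred_feat = deformable_compensation(ref, off, w)   # (1, 64, 32, 32)
```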

120 citations


Proceedings ArticleDOI
04 May 2021
TL;DR: In this article, novel motion representations for animating articulated objects consisting of distinct parts are proposed, which can animate a variety of objects in a completely unsupervised manner.
Abstract: We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes. In contrast to the previous keypoint-based works, our method extracts meaningful and consistent regions, describing locations, shape, and pose. The regions correspond to semantically relevant and distinct object parts that are more easily detected in frames of the driving video. To force decoupling of foreground from background, we model non-object related global motion with an additional affine transformation. To facilitate animation and prevent the leakage of the shape of the driving object, we disentangle shape and pose of objects in the region space. Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks. We present a challenging new benchmark with high-resolution videos and show that the improvement is particularly pronounced when articulated objects are considered, reaching 96.6% user preference vs. the state of the art.
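
The principal-axis idea can be made concrete with image moments: given a soft region-assignment map, the first moment gives the region's location and the eigenvectors of the second moment give its principal axes. The sketch below is a generic moment computation under that assumption, not the authors' network code.

```python
import numpy as np

def region_pose(heatmap):
    """Illustrative sketch: recover a region's location and principal axes
    from a soft region-assignment map (not the authors' exact formulation).

    heatmap : (H, W) non-negative soft assignment of pixels to one region
    Returns the region centre (2,) and principal axes as columns of a 2x2
    matrix, obtained from the first and second spatial moments.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = heatmap / (heatmap.sum() + 1e-8)                 # normalise to a distribution
    mean = np.array([(p * xs).sum(), (p * ys).sum()])    # first moment = location
    dx, dy = xs - mean[0], ys - mean[1]
    cov = np.array([[(p * dx * dx).sum(), (p * dx * dy).sum()],
                    [(p * dx * dy).sum(), (p * dy * dy).sum()]])
    eigvals, eigvecs = np.linalg.eigh(cov)               # principal axes of the region
    return mean, eigvecs
```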

105 citations


Journal ArticleDOI
TL;DR: A new quadrant-based search algorithm with a zero motion prejudgment method is proposed for motion estimation (ME) in the HEVC (High Efficiency Video Coding) standard, obtaining efficient output with low motion estimation time.
Abstract: In this manuscript, a new quadrant-based search algorithm with zero motion prejudgment is proposed for motion estimation (ME) in the HEVC (High Efficiency Video Coding) standard, with the aim of obtaining efficient output with low motion estimation time. The proposed quadrant-based search algorithm is a fast block matching algorithm that obtains a better block match between the current block and the reference block. The zero motion prejudgment (ZMP) method is used to determine whether a block is moving or static, and it decreases the computational complexity (CC) of the proposed quadrant-based search algorithm. The proposed quadrant-based search algorithm with the ZMP technique for motion estimation in HEVC is implemented on an FPGA hardware platform. The entire architecture is implemented in Verilog HDL with Virtex-5 technology and integrated with Xilinx ISE Design Suite 14.5. The results are evaluated on CIF (352 × 288 pixels) and HD (1280 × 720 pixels) video input sequences. Evaluation metrics such as PSNR, motion estimation time, and sum of absolute differences (SAD) are analyzed against existing methods such as the hexagon, adaptive root pattern, and diamond search algorithms. Hardware parameters such as power consumption and maximum operating frequency are also measured. Hardware utilization is reduced, and the power consumption of the proposed model is lowered to 0.143 W. The maximum operating frequency of the proposed model is 440.470 MHz. The experimental outcomes demonstrate that the proposed motion estimation approach in HEVC is more effective than existing algorithms.
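
To make the role of zero motion prejudgment concrete, here is a toy software block-matching routine with a ZMP early exit. The paper's actual contribution is the quadrant-based search pattern and its Verilog/FPGA realization, neither of which is reproduced here; the full search and the threshold below are purely illustrative.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum()

def match_block(cur, ref, y, x, bsize=16, search=8, zmp_thresh=512):
    """Toy sketch of block matching with zero-motion prejudgment (ZMP).

    cur, ref : current and reference frames, (H, W) grayscale arrays
    y, x     : top-left corner of the current block
    Returns the motion vector (dy, dx) that minimises the SAD cost.
    """
    cur_blk = cur[y:y + bsize, x:x + bsize]
    # ZMP: if the zero-motion SAD is already small, treat the block as static
    # and skip the search; this is where the complexity saving comes from.
    best = sad(cur_blk, ref[y:y + bsize, x:x + bsize])
    if best < zmp_thresh:
        return (0, 0)
    best_mv = (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + bsize > ref.shape[0] or xx + bsize > ref.shape[1]:
                continue
            cost = sad(cur_blk, ref[yy:yy + bsize, xx:xx + bsize])
            if cost < best:
                best, best_mv = cost, (dy, dx)
    return best_mv
```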

104 citations


Proceedings ArticleDOI
TL;DR: In this article, Parametric Continuous Convolutional Neural Networks (PCNNs) are proposed to exploit parameterized kernel functions that span the full continuous vector space, allowing them to learn over arbitrary data structures as long as their support relationship is computable.
Abstract: Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and lidar motion estimation of driving scenes.
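
The core idea, a kernel generated by an MLP over continuous relative coordinates and summed over each point's support neighbors, can be sketched as follows. This is a simplified PyTorch reading of the operator; layer sizes, the averaging normalization, and the neighbor-index interface are assumptions.

```python
import torch
import torch.nn as nn

class ContinuousConv(nn.Module):
    """Minimal sketch of a parametric continuous convolution: the kernel is a
    small MLP over continuous relative coordinates, so the operator applies to
    non-grid data such as point clouds (details differ from the paper)."""

    def __init__(self, in_ch, out_ch, dim=3, hidden=32):
        super().__init__()
        # The MLP maps a relative offset to a full (out_ch x in_ch) kernel slice.
        self.kernel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_ch * out_ch))
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, xyz, feats, neighbor_idx):
        """xyz: (N, dim) point coordinates, feats: (N, in_ch) point features,
        neighbor_idx: (N, K) long tensor of the K supporting neighbors per point."""
        n, k = neighbor_idx.shape
        rel = xyz[neighbor_idx] - xyz[:, None, :]                  # (N, K, dim)
        w = self.kernel_mlp(rel).view(n, k, self.out_ch, self.in_ch)
        f = feats[neighbor_idx]                                    # (N, K, in_ch)
        # Weighted sum over the continuous support region.
        return torch.einsum("nkoi,nki->no", w, f) / k
```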

74 citations


Journal ArticleDOI
TL;DR: A novel progressive fusion network for video SR is proposed, in which frames are processed by progressive separation and fusion for thorough utilization of spatio-temporal information, and which incorporates multi-scale structure and hybrid convolutions to capture a wide range of dependencies.
Abstract: How to effectively fuse temporal information from consecutive frames remains a non-trivial problem in video super-resolution (SR), since most existing fusion strategies (direct fusion, slow fusion or 3D convolution) either fail to make full use of temporal information or incur too much computation. To this end, we propose a novel progressive fusion network for video SR, in which frames are processed by progressive separation and fusion for the thorough utilization of spatio-temporal information. We particularly incorporate multi-scale structure and hybrid convolutions into the network to capture a wide range of dependencies. We further propose a non-local operation to extract long-range spatio-temporal correlations directly, taking the place of traditional motion estimation and motion compensation (ME&MC). This design relieves the complicated ME&MC algorithms, but enjoys better performance than various ME&MC schemes. Finally, we improve generative adversarial training for video SR to avoid temporal artifacts such as flickering and ghosting. In particular, we propose a frame variation loss with a single-sequence training method to generate more realistic and temporally consistent videos. Extensive experiments on public datasets show the superiority of our method over state-of-the-art methods in terms of performance and complexity. Our code is available at https://github.com/psychopa4/MSHPFNL.
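
A rough sketch of the kind of non-local operation that replaces explicit ME&MC is shown below: every spatio-temporal position attends to every other one across the stacked frame features. This is a generic non-local block, not the paper's exact multi-scale design, and the full attention matrix is memory-hungry, which is why such blocks are typically applied to downscaled features.

```python
import torch
import torch.nn as nn

class NonLocalFusion(nn.Module):
    """Generic non-local block over stacked frame features (B, C, T, H, W),
    standing in for explicit motion estimation and compensation."""

    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv3d(ch, ch // 2, 1)
        self.phi = nn.Conv3d(ch, ch // 2, 1)
        self.g = nn.Conv3d(ch, ch // 2, 1)
        self.out = nn.Conv3d(ch // 2, ch, 1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2)           # (B, C/2, THW)
        k = self.phi(x).flatten(2)
        v = self.g(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, THW, THW)
        y = (v @ attn.transpose(1, 2)).view(b, c // 2, t, h, w)
        return x + self.out(y)                 # residual connection
```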

57 citations


Journal ArticleDOI
TL;DR: This paper proposes blur-invariant motion estimation learning to improve motion estimation accuracy between blurry frames; for motion compensation, instead of aligning frames by warping with estimated motions, a pixel volume containing candidate sharp pixels is used to resolve motion estimation errors.
Abstract: For the success of video deblurring, it is essential to utilize information from neighboring frames. Most state-of-the-art video deblurring methods adopt motion compensation between video frames to aggregate information from multiple frames that can help deblur a target frame. However, the motion compensation methods adopted by previous deblurring methods are not blur-invariant, and consequently, their accuracy is limited for blurry frames with different blur amounts. To alleviate this problem, we propose two novel approaches to deblur videos by effectively aggregating information from multiple video frames. First, we present blur-invariant motion estimation learning to improve motion estimation accuracy between blurry frames. Second, for motion compensation, instead of aligning frames by warping with estimated motions, we use a pixel volume that contains candidate sharp pixels to resolve motion estimation errors. We combine these two processes to propose an effective recurrent video deblurring network that fully exploits deblurred previous frames. Experiments show that our method achieves the state-of-the-art performance both quantitatively and qualitatively compared to recent methods that use deep learning.

40 citations


Proceedings Article
01 Jan 2021
TL;DR: This paper proposes a video frame interpolation algorithm based on asymmetric bilateral motion estimation (ABME), which synthesizes an intermediate frame between two input frames, and develops a new synthesis network that generates a set of dynamic filters and a residual frame using local and global information.
Abstract: We propose a novel video frame interpolation algorithm based on asymmetric bilateral motion estimation (ABME), which synthesizes an intermediate frame between two input frames. First, we predict symmetric bilateral motion fields to interpolate an anchor frame. Second, we estimate asymmetric bilateral motion fields from the anchor frame to the input frames. Third, we use the asymmetric fields to warp the input frames backward and reconstruct the intermediate frame. Last, to refine the intermediate frame, we develop a new synthesis network that generates a set of dynamic filters and a residual frame using local and global information. Experimental results show that the proposed algorithm achieves excellent performance on various datasets. The source codes and pretrained models are available at this https URL.

39 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a self-supervised learning framework is proposed to estimate motion from point clouds and paired camera images, using probabilistic motion masking and cross-sensor motion regularization.
Abstract: Autonomous driving can benefit from motion behavior comprehension when interacting with diverse traffic participants in highly dynamic environments. Recently, there has been a growing interest in estimating class-agnostic motion directly from point clouds. Current motion estimation methods usually require vast amount of annotated training data from self-driving scenes. However, manually labeling point clouds is notoriously difficult, error-prone and time-consuming. In this paper, we seek to answer the research question of whether the abundant unlabeled data collections can be utilized for accurate and efficient motion learning. To this end, we propose a learning framework that leverages free supervisory signals from point clouds and paired camera images to estimate motion purely via self-supervision. Our model involves a point cloud based structural consistency augmented with probabilistic motion masking as well as a cross-sensor motion regularization to realize the desired self-supervision. Experiments reveal that our approach performs competitively to supervised methods, and achieves the state-of-the-art result when combining our self-supervised model with supervised fine-tuning.

37 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: MultiBodySync is an end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds that incorporates spectral synchronization into an iterative deep declarative network.
Abstract: We present MultiBodySync, a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds. The two non-trivial challenges posed by this multi-scan multibody setting that we investigate are: (i) guaranteeing correspondence and segmentation consistency across multiple input point clouds capturing different spatial arrangements of bodies or body parts; and (ii) obtaining robust motion-based rigid body segmentation applicable to novel object categories. We propose an approach to address these issues that incorporates spectral synchronization into an iterative deep declarative network, so as to simultaneously recover consistent correspondences as well as motion segmentation. At the same time, by explicitly disentangling the correspondence and motion segmentation estimation modules, we achieve strong generalizability across different object categories. Our extensive evaluations demonstrate that our method is effective on various datasets ranging from rigid parts in articulated objects to individually moving objects in a 3D scene, be it single-view or full point clouds. Code at https://github.com/huangjh-pub/multibody-sync.

Journal ArticleDOI
TL;DR: In this paper, a deep learning based framework for motion estimation in echocardiography was proposed, which achieved an average end point error of (0.06±0.04) mm per frame using simulated data from an open access database, on par or better compared to previously reported state of the art.
Abstract: Deformation imaging in echocardiography has been shown to have better diagnostic and prognostic value than conventional anatomical measures such as ejection fraction. However, despite clinical availability and demonstrated efficacy, everyday clinical use remains limited at many hospitals. The reasons are complex, but practical robustness has been questioned, and a large inter-vendor variability has been demonstrated. In this work, we propose a novel deep learning based framework for motion estimation in echocardiography, and use this to fully automate myocardial function imaging. A motion estimator was developed based on a PWC-Net architecture, which achieved an average end point error of (0.06±0.04) mm per frame using simulated data from an open access database, on par or better compared to previously reported state of the art. We further demonstrate unique adaptability to image artifacts such as signal dropouts, made possible using trained models that incorporate relevant image augmentations. Further, a fully automatic pipeline consisting of cardiac view classification, event detection, myocardial segmentation and motion estimation was developed and used to estimate left ventricular longitudinal strain in vivo. The method showed promise by achieving a mean deviation of (−0.7±1.6)% compared to a semi-automatic commercial solution for N=30 patients with relevant disease, within the expected limits of agreement. We thus believe that learning-based motion estimation can facilitate extended use of strain imaging in clinical practice.
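
The quoted accuracy is an average end-point error. For reference, a minimal implementation of that metric is given below; the (H, W, 2) array layout and units of mm per frame are assumptions of this sketch.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Average end-point error (EPE): mean Euclidean distance between the
    predicted and reference motion fields, each stored as an (H, W, 2) array.
    """
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))
```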

Journal ArticleDOI
TL;DR: The proposed RespME-net has achieved similar motion-corrected CMRA image quality to the conventional registration method regarding coronary artery length and sharpness, and can predict 3D non-rigid motion fields with subpixel accuracy within ~10 seconds, being ~20 times faster than a GPU-implemented state-of-the-art non-rigid registration method.
Abstract: Non-rigid motion-corrected reconstruction has been proposed to account for the complex motion of the heart in free-breathing 3D coronary magnetic resonance angiography (CMRA). This reconstruction framework requires efficient and accurate estimation of non-rigid motion fields from undersampled images at different respiratory positions (or bins). However, state-of-the-art registration methods can be time-consuming. This article presents a novel unsupervised deep learning-based strategy for fast estimation of inter-bin 3D non-rigid respiratory motion fields for motion-corrected free-breathing CMRA. The proposed 3D respiratory motion estimation network (RespME-net) is trained as a deep encoder-decoder network, taking pairs of 3D image patches extracted from CMRA volumes as input and outputting the motion field between image patches. Using image warping by the estimated motion field, a loss function that imposes image similarity and motion smoothness is adopted to enable training without ground truth motion field. RespME-net is trained patch-wise to circumvent the challenges of training a 3D network volume-wise which requires large amounts of GPU memory and 3D datasets. We perform 5-fold cross-validation with 45 CMRA datasets and demonstrate that RespME-net can predict 3D non-rigid motion fields with subpixel accuracy (0.44 ± 0.38 mm) within ~10 seconds, being ~20 times faster than a GPU-implemented state-of-the-art non-rigid registration method. Moreover, we perform non-rigid motion-compensated CMRA reconstruction for 9 additional patients. The proposed RespME-net has achieved similar motion-corrected CMRA image quality to the conventional registration method regarding coronary artery length and sharpness.
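
The training signal described, image similarity plus motion smoothness instead of a ground-truth motion field, can be sketched as below. The similarity measure, smoothness weight, and tensor layout are placeholders rather than RespME-net's exact choices.

```python
import torch

def unsupervised_registration_loss(moved, fixed, motion, smooth_weight=0.01):
    """Sketch of a loss for training a registration network without ground
    truth: image similarity on the warped image plus a smoothness penalty on
    the motion field (the weight is illustrative).

    moved  : (B, 1, D, H, W) moving image warped by the predicted motion field
    fixed  : (B, 1, D, H, W) target (fixed) image
    motion : (B, 3, D, H, W) predicted 3D displacement field
    """
    similarity = torch.mean((moved - fixed) ** 2)
    # First-order finite differences approximate the spatial gradient of the field.
    dz = motion[:, :, 1:, :, :] - motion[:, :, :-1, :, :]
    dy = motion[:, :, :, 1:, :] - motion[:, :, :, :-1, :]
    dx = motion[:, :, :, :, 1:] - motion[:, :, :, :, :-1]
    smoothness = dz.pow(2).mean() + dy.pow(2).mean() + dx.pow(2).mean()
    return similarity + smooth_weight * smoothness
```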

Proceedings ArticleDOI
11 Jan 2021
TL;DR: This work proposes a modular network, whose architecture is motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field, and achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
Abstract: Appearance-based detectors achieve remarkable performance on common scenes, benefiting from high-capacity models and massive annotated data, but tend to fail for scenarios that lack training data. Geometric motion segmentation algorithms, however, generalize to novel scenes, but have yet to achieve comparable performance to appearance-based ones, due to noisy motion estimations and degenerate motion configurations. To combine the best of both worlds, we propose a modular network, whose architecture is motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field. It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel. The inferred rigid motions lead to a significant improvement for depth and scene flow estimation.

Journal ArticleDOI
TL;DR: A plane-edge-SLAM system using an RGB-D sensor and an adaptive weighting algorithm is developed to address the seamless fusion of planes and edges, which benefits the performance of motion estimation.
Abstract: Planes and edges are attractive features for simultaneous localization and mapping (SLAM) in indoor environments because they can be reliably extracted and are robust to illumination changes. However, it remains a challenging problem to seamlessly fuse two different kinds of features to avoid degeneracy and accurately estimate the camera motion. In this article, a plane-edge-SLAM system using an RGB-D sensor is developed to address the seamless fusion of planes and edges. Constraint analysis is first performed to obtain a quantitative measure of how the planes constrain the camera motion estimation. Then, using the results of the constraint analysis, an adaptive weighting algorithm is elaborately designed to achieve seamless fusion. Through the fusion of planes and edges, the solution to motion estimation is fully constrained, and the problem remains well-posed in all circumstances. In addition, a probabilistic plane fitting algorithm is proposed to fit a plane model to the noisy 3-D points. By exploiting the error model of the depth sensor, the proposed plane fitting is adaptive to various measurement noises corresponding to different depth measurements. As a result, the estimated plane parameters are more accurate and robust to the points with large uncertainties. Compared with the existing plane fitting methods, the proposed method definitely benefits the performance of motion estimation. The results of extensive experiments on public data sets and in real-world indoor scenes demonstrate that the plane-edge-SLAM system can achieve high accuracy and robustness. Note to Practitioners —This article is motivated by the robust localization and mapping for mobile robots. We suggest a novel simultaneous localization and mapping (SLAM) approach fusing the plane and edge features in indoor scenes (plane-edge-SLAM). This newly proposed approach works well in the textureless or dark scenes and is robust to the sensor noise. The experiments are carried out in various indoor scenes for mobile robots, and the results demonstrate the robustness and effectiveness of the proposed framework. In future work, we will address the fusion of other high-level features (for example, 3-D lines) and the active exploration of the environments.
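
As a simplified illustration of noise-adaptive plane fitting, the sketch below weights each 3-D point by the inverse variance of its depth measurement and takes the eigenvector of the smallest weighted scatter as the plane normal; the paper's probabilistic formulation and sensor error model are more detailed than this.

```python
import numpy as np

def weighted_plane_fit(points, depth_sigma):
    """Sketch of uncertainty-aware plane fitting: points with larger depth
    noise get smaller weights (the inverse-variance weighting is illustrative).

    points      : (N, 3) 3D points from the depth sensor
    depth_sigma : (N,) per-point depth noise standard deviation
    Returns (normal, d) of the plane n.x + d = 0.
    """
    w = 1.0 / (depth_sigma ** 2 + 1e-9)
    centroid = (w[:, None] * points).sum(0) / w.sum()
    centered = points - centroid
    cov = (w[:, None] * centered).T @ centered / w.sum()
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                 # direction of smallest weighted scatter
    d = -normal @ centroid
    return normal, d
```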

Proceedings ArticleDOI
19 Sep 2021
TL;DR: HOME uses a simple architecture with classic convolution networks coupled with an attention mechanism for agent interactions, and outputs an unconstrained 2D top-view representation of the agent's possible future.
Abstract: In this paper, we propose HOME, a framework tackling the motion forecasting problem with an image output representing the probability distribution of the agent's future location. This method allows for a simple architecture with classic convolution networks coupled with an attention mechanism for agent interactions, and outputs an unconstrained 2D top-view representation of the agent's possible future. Based on this output, we design two methods to sample a finite set of the agent's future locations. These methods allow us to control the optimization trade-off between miss rate and final displacement error for multiple modalities without having to retrain any part of the model. We apply our method to the Argoverse Motion Forecasting Benchmark and achieve 1st place on the online leaderboard.
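
The sampling step can be illustrated with a simple greedy procedure over the predicted heatmap. The paper defines its own two sampling methods and trade-off controls, so the routine below (number of picks, suppression radius) is only a stand-in.

```python
import numpy as np

def sample_endpoints(heatmap, k=6, radius=4):
    """Illustrative greedy sampling of k candidate future locations from a
    probability heatmap, suppressing a small neighbourhood around each pick.
    """
    probs = heatmap.astype(float)          # working copy so the input is untouched
    picks = []
    for _ in range(k):
        idx = np.unravel_index(np.argmax(probs), probs.shape)
        picks.append(idx)
        y, x = idx
        y0, y1 = max(0, y - radius), min(probs.shape[0], y + radius + 1)
        x0, x1 = max(0, x - radius), min(probs.shape[1], x + radius + 1)
        probs[y0:y1, x0:x1] = -np.inf      # suppress the neighbourhood to diversify picks
    return picks
```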

Journal ArticleDOI
TL;DR: In this article, an accurate imaging and motion estimation method based on multiple-input multiple-output (MIMO) radar is presented, in which a preprocessing strategy based on space-time adaptive processing (STAP) theory suppresses clutter signals by constructing a Doppler spectrum model.
Abstract: The image deterioration problem occurs in radar imaging of ship targets, resulting from the complex time-varying motions of the ship, the noise in channels, and the clutter on the sea surface. It is hard to solve effectively due to the coherent accumulation sampling time and the high-dimensional parametric model. Hence, an accurate imaging and motion estimation method based on multiple-input multiple-output (MIMO) radar is presented. First, the multidimensional signal model is built to characterize target features accurately. To reduce the interference from sea clutter, a preprocessing strategy based on the space-time adaptive processing (STAP) theory is applied, and clutter signals can be suppressed effectively by constructing a Doppler spectrum model. Then, for accurate imaging and motion estimation, a combined trace norm minimization problem is deduced based on the relaxation of tensor rank, where the noise in sea environments is also considered. Meanwhile, a generalized tensor total variation constraint is developed to ensure stable estimation and smooth imaging results when separating the noise term. Accordingly, an effective decomposition criterion is formulated based on the alternating direction method of multipliers (ADMM) strategy, and motion parameters can be precisely calculated based on the least squares (LS) method. Finally, theoretical analysis and simulation results demonstrate the accuracy of the proposed method.

Journal ArticleDOI
TL;DR: This paper proposes a novel multi-scale plane fitting based visual flow algorithm that is robust to the aperture problem and also computationally fast and efficient.
Abstract: Optical flow is a crucial component of the feature space for early visual processing of dynamic scenes especially in new applications such as self-driving vehicles, drones and autonomous robots. The dynamic vision sensors are well suited for such applications because of their asynchronous, sparse and temporally precise representation of the visual dynamics. Many algorithms proposed for computing visual flow for these sensors suffer from the aperture problem as the direction of the estimated flow is governed by the curvature of the object rather than the true motion direction. Some methods that do overcome this problem by temporal windowing under-utilize the true precise temporal nature of the dynamic sensors. In this paper, we propose a novel multi-scale plane fitting based visual flow algorithm that is robust to the aperture problem and also computationally fast and efficient. Our algorithm performs well in many scenarios ranging from fixed camera recording simple geometric shapes to real world scenarios such as camera mounted on a moving car and can successfully perform event-by-event motion estimation of objects in the scene to allow for predictions of up to 500 ms i.e. equivalent to 10 to 25 frames with traditional cameras.
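
The underlying plane-fitting principle can be shown in a few lines: fit t = a*x + b*y + c to a local spatio-temporal neighbourhood of events and read the velocity off the plane gradient. The multi-scale aggregation that makes the paper's method robust to the aperture problem is not reproduced in this sketch.

```python
import numpy as np

def local_flow_from_events(xs, ys, ts):
    """Toy sketch of plane-fitting event-based flow.

    xs, ys : pixel coordinates of the events in a small neighbourhood
    ts     : event timestamps (seconds)
    Returns the estimated (vx, vy) in pixels per second.
    """
    A = np.column_stack([xs, ys, np.ones_like(xs, dtype=float)])
    (a, b, c), *_ = np.linalg.lstsq(A, ts, rcond=None)   # fit t = a*x + b*y + c
    g2 = a * a + b * b
    if g2 < 1e-12:
        return 0.0, 0.0                    # no measurable motion in this patch
    return a / g2, b / g2                  # velocity along the timestamp gradient
```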

Proceedings ArticleDOI
04 Mar 2021
TL;DR: This paper proposes a novel deep learning-based, fully unsupervised method for in vivo motion tracking on t-MRI images, which estimates the motion field (INF) between any two consecutive t-MRI frames with a bi-directional generative diffeomorphic registration neural network and then estimates the Lagrangian motion field between the reference frame and any other frame through a differentiable composition layer.
Abstract: Cardiac tagging magnetic resonance imaging (t-MRI) is the gold standard for regional myocardium deformation and cardiac strain estimation. However, this technique has not been widely used in clinical diagnosis, as a result of the difficulty of motion tracking encountered with t-MRI images. In this paper, we propose a novel deep learning-based fully unsupervised method for in vivo motion tracking on t-MRI images. We first estimate the motion field (INF) between any two consecutive t-MRI frames by a bi-directional generative diffeomorphic registration neural network. Using this result, we then estimate the Lagrangian motion field between the reference frame and any other frame through a differentiable composition layer. By utilizing temporal information to perform reasonable estimations on spatiotemporal motion fields, this novel method provides a useful solution for motion tracking and image registration in dynamic medical imaging. Our method has been validated on a representative clinical t-MRI dataset; the experimental results show that our method is superior to conventional motion tracking methods in terms of landmark tracking accuracy and inference efficiency. Project page is at: https://github.com/DeepTag/cardiac_tagging_motion_estimation.
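
The step from inter-frame (INF) motion to Lagrangian motion is a composition of displacement fields. A non-differentiable NumPy sketch of that composition is given below; the paper implements it as a differentiable layer inside the network.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose_displacements(u_prev, u_inf):
    """Sketch of accumulating inter-frame motion into Lagrangian motion:
    u_lag(x) = u_prev(x) + u_inf(x + u_prev(x)), with linear interpolation.

    u_prev : (2, H, W) displacement (dy, dx) from the reference frame to frame t-1
    u_inf  : (2, H, W) displacement (dy, dx) from frame t-1 to frame t
    """
    h, w = u_prev.shape[1:]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    yq = ys + u_prev[0]                    # where each reference pixel has moved to
    xq = xs + u_prev[1]
    inc_y = map_coordinates(u_inf[0], [yq, xq], order=1, mode="nearest")
    inc_x = map_coordinates(u_inf[1], [yq, xq], order=1, mode="nearest")
    return np.stack([u_prev[0] + inc_y, u_prev[1] + inc_x])
```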

Journal ArticleDOI
TL;DR: A new continuous-time IMU motion integration model, based on a switched linear system with a closed-form discrete formulation, is proposed for visual-inertial odometry, improving motion estimation accuracy by up to 22.71% on the EuRoC dataset and 38.15% in indoor experiments.
Abstract: Accurate motion estimation plays a crucial role in state estimation of an unmanned aerial vehicle (UAV). This is usually carried out by fusing the kinematics of an inertial measurement unit (IMU) with the video output of a camera. However, the accuracy of existing approaches is hindered by the discretization effect of the model even at a high IMU sampling rate. In order to improve the accuracy, we propose a new IMU motion integration model for the IMU kinematics in continuous time. The kinematics are modeled using a switched linear system. A closed-form discrete formulation is derived to compute the mean measurement, the covariance matrix, and the Jacobian matrix. Thus, it is more accurate and more efficient for online estimation of visual-inertial odometry (VIO), particularly when there is a high dynamic change in the agent's motion or the agent travels with high speed. The proposed IMU factor framework is evaluated using both real public datasets and indoor environment under different scenarios of motion capture. Our evaluation shows that the proposed framework outperforms the state-of-the-art VIO approach by up to 22.71% accuracy improvement on the EuRoc dataset and 38.15% accuracy improvement for motion estimation under the indoor environment.

Journal ArticleDOI
TL;DR: This work introduces reference frame alignment as a key technique for deep network-based frame extrapolation, and proposes to align the reference frames, e.g. using block-based motion estimation and motion compensation, and extrapolate from the aligned frames by a trained deep network.
Abstract: Frame extrapolation is to predict future frames from the past (reference) frames, which has been studied intensively in the computer vision research and has great potential in video coding. Recently, a number of studies have been devoted to the use of deep networks for frame extrapolation, which achieves certain success. However, due to the complex and diverse motion patterns in natural video, it is still difficult to extrapolate frames with high fidelity directly from reference frames. To address this problem, we introduce reference frame alignment as a key technique for deep network-based frame extrapolation. We propose to align the reference frames, e.g. using block-based motion estimation and motion compensation, and then to extrapolate from the aligned frames by a trained deep network. Since the alignment, a preprocessing step, effectively reduces the diversity of network input, we observe that the network is easier to train and the extrapolated frames are of higher quality. We verify the proposed technique in video coding, using the extrapolated frame for inter prediction in High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC). We investigate different schemes, including whether to align between the target frame and the reference frames, and whether to perform motion estimation on the extrapolated frame. We conduct a comprehensive set of experiments to study the efficiency of the proposed method and to compare different schemes. Experimental results show that our proposal achieves on average 5.3% and 2.8% BD-rate reduction in Y component compared to HEVC, under low-delay P and low-delay B configurations, respectively. Our proposal performs much better than the frame extrapolation without reference frame alignment.

Journal ArticleDOI
TL;DR: A novel dynamic MRI reconstruction approach called MODRN and an end-to-end improved version called MODRN(e2e), both of which enhance the reconstruction quality by infusing motion information into the modeling process with deep neural networks, are proposed.

Journal ArticleDOI
Bo Zhou, Yu-Jung Tsai, Xiongchao Chen, James S. Duncan, Chi Liu
TL;DR: This paper proposes a Temporal Siamese Pyramid Network (TSP-Net) with basic units made up of (1) a Siamese Pyramid Network (SP-Net) and (2) a recurrent layer for motion estimation among the gates.
Abstract: In positron emission tomography (PET), gating is commonly utilized to reduce respiratory motion blurring and to facilitate motion correction methods. In applications where low-dose gated PET is useful, reducing the injection dose causes increased noise levels in gated images that could corrupt motion estimation and subsequent corrections, leading to inferior image quality. To address these issues, we propose MDPET, a unified motion correction and denoising adversarial network for generating motion-compensated low-noise images from low-dose gated PET data. Specifically, we propose a Temporal Siamese Pyramid Network (TSP-Net) with basic units made up of (1) a Siamese Pyramid Network (SP-Net) and (2) a recurrent layer for motion estimation among the gates. The denoising network is unified with our motion estimation network to simultaneously correct the motion and predict a motion-compensated denoised PET reconstruction. The experimental results on human data demonstrated that our MDPET can generate accurate motion estimation directly from low-dose gated images and produce high-quality motion-compensated low-noise reconstructions. Comparative studies with previous methods also show that our MDPET is able to generate superior motion estimation and denoising performance. Our code is available at https://github.com/bbbbbbzhou/MDPET.

Journal ArticleDOI
TL;DR: A fully automated PET motion correction method, MR-guided MAF, based on the co-registration of multicontrast MR images, is introduced; it can reduce artefacts introduced by head motion and improve the image sharpness and quantitative accuracy of PET images acquired using simultaneous MR-PET scanners.
Abstract: Head motion is a major source of image artefacts in neuroimaging studies and can lead to degradation of the quantitative accuracy of reconstructed PET images. Simultaneous magnetic resonance-positron emission tomography (MR-PET) makes it possible to estimate head motion information from high-resolution MR images and then correct motion artefacts in PET images. In this article, we introduce a fully automated PET motion correction method, MR-guided MAF, based on the co-registration of multicontrast MR images. The performance of the MR-guided MAF method was evaluated using MR-PET data acquired from a cohort of ten healthy participants who received a slow infusion of fluorodeoxyglucose ([18-F]FDG). Compared with conventional methods, MR-guided PET image reconstruction can reduce head motion introduced artefacts and improve the image sharpness and quantitative accuracy of PET images acquired using simultaneous MR-PET scanners. The fully automated motion estimation method has been implemented as a publicly available web-service.

Proceedings Article
18 May 2021
TL;DR: In this article, an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision is presented.
Abstract: We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme using any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps that will be utilized as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI and Cityscapes dataset, our framework is shown to outperform the state-of-the-art depth and motion estimation methods. Our code, dataset, and models are publicly available.

Journal ArticleDOI
TL;DR: This work proposes a framework for dynamic MRI reconstruction framed under a new multi-task optimisation model called Compressed Sensing Plus Motion (CS + M), and shows that the proposed scheme reduces blurring artefacts, and preserves the target shape and fine details in the reconstruction.

Journal ArticleDOI
TL;DR: In this article, the authors propose a complete compression framework for attributes of 3D dynamic point clouds, focusing on optimal inter-coding and predictive transform coding, assuming a Gaussian Markov Random Field model with respect to a spatio-temporal graph underlying the attributes of dynamic point clouds.
Abstract: As 3D scanning devices and depth sensors advance, dynamic point clouds have attracted increasing attention as a format for 3D objects in motion, with applications in various fields such as immersive telepresence, navigation for autonomous driving and gaming. Nevertheless, the tremendous amount of data in dynamic point clouds significantly burdens transmission and storage. To this end, we propose a complete compression framework for attributes of 3D dynamic point clouds, focusing on optimal inter-coding. Firstly, we derive the optimal inter-prediction and predictive transform coding assuming the Gaussian Markov Random Field model with respect to a spatio-temporal graph underlying the attributes of dynamic point clouds. The optimal predictive transform proves to be the Generalized Graph Fourier Transform in terms of spatio-temporal decorrelation. Secondly, we propose refined motion estimation via efficient registration prior to inter-prediction, which searches the temporal correspondence between adjacent frames of irregular point clouds. Finally, we present a complete framework based on the optimal inter-coding and our previously proposed intra-coding, where we determine the optimal coding mode from rate-distortion optimization with the proposed offline-trained λ-Q model. Experimental results show that we achieve around 17% bit rate reduction on average over competitive dynamic point cloud compression methods.
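
As background for the transform step, a plain graph Fourier transform over a point-cloud graph can be computed from the Laplacian eigenvectors as sketched below; the paper derives a generalized GFT over a spatio-temporal graph, which this simplified version does not capture.

```python
import numpy as np

def graph_fourier_transform(adjacency, signal):
    """Plain graph Fourier transform as a simplified stand-in for the
    generalized GFT derived in the paper.

    adjacency : (N, N) symmetric non-negative edge-weight matrix over the points
    signal    : (N,) attribute values (e.g. one color channel) on the points
    Returns (coefficients, basis) such that signal == basis @ coefficients.
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigvals, basis = np.linalg.eigh(laplacian)   # columns ordered by "graph frequency"
    coeffs = basis.T @ signal                    # decorrelated transform coefficients
    return coeffs, basis
```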

Journal ArticleDOI
TL;DR: A novel, computationally efficient, and robust light detection and ranging (LiDAR)-only odometry framework based on truncated least squares, termed T-LOAM, is proposed; it focuses on alleviating the impact of outliers to allow robust navigation in sparse, noisy, or cluttered scenarios where degeneration occurs.
Abstract: We propose a novel, computationally efficient, and robust light detection and ranging (LiDAR)-only odometry framework based on truncated least squares termed T-LOAM. Our method focuses on alleviating the impact of outliers to allow robust navigation in sparse, noisy, or cluttered scenarios where degeneration occurs. As preprocessing, the multiregion ground extraction and dynamic curved-voxel clustering methods are proposed to accomplish the segmentation of 3D point clouds and filter out unstable objects. A novel feature extraction module is tailored to discriminate four peculiar features: edge features, sphere features, planar features, and ground features. As frontend, a hierarchical feature-based LiDAR-only odometry performs precise motion estimates through the truncated least squares method for directly processing various features. The preprocessing model and motion estimation precision have been evaluated on the KITTI odometry benchmark as well as various campus scenarios. The experimental results have demonstrated the real-time capability and superior precision of the proposed T-LOAM over other state-of-the-art algorithms.
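
The robustness mechanism is the truncated least-squares cost, which caps the influence of any single residual. A minimal version, with an illustrative truncation bound, is:

```python
import numpy as np

def truncated_least_squares_cost(residuals, c=0.5):
    """Truncated least-squares robust cost: inliers are penalised quadratically
    while residuals beyond the truncation bound c contribute a constant, so
    gross outliers stop influencing the motion estimate (c is illustrative).
    """
    r2 = np.square(residuals)
    return np.minimum(r2, c * c).sum()
```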

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes spatiotemporal registration as a compelling technique for event-based rotational motion estimation, which produces feature tracks as a by-product that directly support an efficient visual odometry pipeline with graph-based optimisation for motion averaging.
Abstract: A useful application of event sensing is visual odometry, especially in settings that require high-temporal resolution. The state-of-the-art method of contrast maximisation recovers the motion from a batch of events by maximising the contrast of the image of warped events. However, the cost scales with image resolution and the temporal resolution can be limited by the need for large batch sizes to yield sufficient structure in the contrast image. In this work, we propose spatiotemporal registration as a compelling technique for event-based rotational motion estimation. We theoretically justify the approach and establish its fundamental and practical advantages over contrast maximisation. In particular, spatiotemporal registration also produces feature tracks as a by-product, which directly supports an efficient visual odometry pipeline with graph-based optimisation for motion averaging. The simplicity of our visual odometry pipeline allows it to process more than 1 M events/second. We also contribute a new event dataset for visual odometry, where motion sequences with large velocity variations were acquired using a high-precision robot arm.
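
For context, the contrast-maximisation baseline the paper compares against can be sketched as below: events are warped to a common time under a candidate rotation and the candidate is scored by the variance of the resulting image. Only in-plane rotation about the principal point is modelled in this simplification; the actual methods operate with full 3-DoF rotational motion.

```python
import numpy as np

def contrast_of_warped_events(xs, ys, ts, omega, center, img_shape):
    """Simplified illustration of contrast maximisation for event cameras.

    xs, ys, ts : event coordinates (pixels) and timestamps (seconds)
    omega      : candidate angular velocity (rad/s) about the optical axis
    center     : (cx, cy) principal point
    img_shape  : (H, W) of the accumulation image
    """
    h, w = img_shape
    ang = -omega * (ts - ts[0])            # rotate each event back to the first timestamp
    cx, cy = center
    xr = cx + np.cos(ang) * (xs - cx) - np.sin(ang) * (ys - cy)
    yr = cy + np.sin(ang) * (xs - cx) + np.cos(ang) * (ys - cy)
    img = np.zeros((h, w))
    xi = np.clip(np.round(xr).astype(int), 0, w - 1)
    yi = np.clip(np.round(yr).astype(int), 0, h - 1)
    np.add.at(img, (yi, xi), 1.0)          # accumulate the image of warped events
    return img.var()                       # higher contrast = better motion hypothesis
```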