Author

Qiankun Tang

Bio: Qiankun Tang is an academic researcher from the Chinese Academy of Sciences. The author has contributed to research in topics: Frame (networking) & Motion blur. The author has an h-index of 4 and has co-authored 7 publications receiving 83 citations.

Papers
Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper proposes a video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information, decomposing each video frame into anisotropic sub-bands with multiple frequencies.
Abstract: Video prediction is a pixel-wise dense prediction task to infer future frames based on past frames. Missing appearance details and motion blur are still two major problems for current models, leading to image distortion and temporal inconsistency. We point out the necessity of exploring multi-frequency analysis to deal with the two problems. Inspired by the frequency band decomposition characteristic of Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information. Specifically, multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and reserve fine details. On the other hand, multi-level temporal discrete wavelet transform which operates on the time axis decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements on fidelity and temporal consistency over the state-of-the-art works. Source code and videos are available at https://github.com/Bei-Jin/STMFANet.

75 citations
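The wavelet machinery the abstract relies on can be illustrated directly on raw frame data. Below is a minimal sketch (not the authors' STMFANet code) using the PyWavelets library: a 2-D multi-level DWT splits a single frame into a low-frequency approximation plus anisotropic detail sub-bands, and a 1-D multi-level DWT applied along the time axis splits the frame sequence into temporal-frequency sub-band groups. The toy video shape and wavelet choice are arbitrary.

```python
# Minimal illustration of the two decompositions described in the abstract,
# using PyWavelets; this is not the authors' network, just the data-level idea.
import numpy as np
import pywt

# A toy grayscale video: 16 frames of 64x64 pixels.
video = np.random.rand(16, 64, 64).astype(np.float32)

# Multi-level spatial DWT: each frame is split into a low-frequency
# approximation plus horizontal/vertical/diagonal detail sub-bands at
# every level (the "anisotropic sub-bands with multiple frequencies").
frame = video[0]
approx, details_level2, details_level1 = pywt.wavedec2(frame, wavelet="haar", level=2)
print(approx.shape)                        # (16, 16) coarse structure
print([d.shape for d in details_level1])   # three 32x32 fine-detail sub-bands

# Multi-level temporal DWT: the 1-D transform applied along the time axis
# splits the sequence into sub-band groups of different temporal frequencies
# (slow vs. fast motion components under a fixed frame rate).
temporal_bands = pywt.wavedec(video, wavelet="haar", level=2, axis=0)
for band in temporal_bands:
    print(band.shape)                      # (4, 64, 64), (4, 64, 64), (8, 64, 64)
```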

Proceedings Article
01 Jan 2018
TL;DR: Experimental results show that regardless of whether the input is a single depth map or RGB-D, the proposed disentangled framework can generate high-quality semantic scene completion and outperforms state-of-the-art approaches on both synthetic and real datasets.
Abstract: Semantic scene completion predicts volumetric occupancy and object category of a 3D scene, which helps intelligent agents to understand and interact with the surroundings. In this work, we propose a disentangled framework, sequentially carrying out 2D semantic segmentation, 2D-3D reprojection and 3D semantic scene completion. This three-stage framework has three advantages: (1) explicit semantic segmentation significantly boosts performance; (2) flexible ways of fusing sensor data bring good extensibility; (3) progress in any subtask will promote the holistic performance. Experimental results show that regardless of whether the input is a single depth map or RGB-D, our framework can generate high-quality semantic scene completion, and outperforms state-of-the-art approaches on both synthetic and real datasets.

71 citations
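To make the middle stage of the three-stage pipeline concrete, here is a hypothetical sketch of the 2D-3D reprojection step: per-pixel semantic labels are lifted into a voxel grid using a depth map and pinhole camera intrinsics. The intrinsics, voxel size, and grid extents below are illustrative placeholders, not values from the paper.

```python
# Hypothetical 2D-3D reprojection: back-project labelled pixels into a voxel
# grid that a 3D completion network could then densify and refine.
import numpy as np

H, W = 480, 640
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5        # assumed pinhole intrinsics
depth = np.random.uniform(0.5, 4.0, size=(H, W))    # depth map in metres
labels = np.random.randint(0, 12, size=(H, W))      # per-pixel class ids

# Back-project every pixel to a 3D point in the camera frame.
u, v = np.meshgrid(np.arange(W), np.arange(H))
x = (u - cx) * depth / fx
y = (v - cy) * depth / fy
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Rasterise the labelled points into a coarse voxel volume (0 = empty).
voxel_size = 0.08
grid_shape = (60, 36, 60)                            # illustrative grid extents
origin = np.array([-2.4, -1.44, 0.0])                # grid origin in metres
idx = np.floor((points - origin) / voxel_size).astype(int)
valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
volume = np.zeros(grid_shape, dtype=np.int64)
volume[tuple(idx[valid].T)] = labels.reshape(-1)[valid] + 1
print(volume.shape, np.count_nonzero(volume), "occupied voxels")
```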

Proceedings ArticleDOI
Beibei Jin, Yu Hu, Zeng Yiming, Qiankun Tang, Shice Liu, Jing Ye
01 Oct 2018
TL;DR: The VarNet is presented to directly predict the variations between adjacent frames, which are then fused with the current frame to generate the future frame; an adaptive re-weighting mechanism for the loss function gives each pixel a weight according to the amplitude of its variation.
Abstract: Unsupervised video prediction is a very challenging task due to the complexity and diversity in natural scenes. Prior works directly predicting pixels or optical flows either have the blurring problem or require additional assumptions. We highlight that the crux for video frame prediction lies in precisely capturing the inter-frame variations which encompass the movement of objects and the evolution of the surrounding environment. We then present an unsupervised video prediction framework — Variation Network (VarNet) — to directly predict the variations between adjacent frames, which are then fused with the current frame to generate the future frame. In addition, we propose an adaptive re-weighting mechanism for the loss function to offer each pixel a fair weight according to the amplitude of its variation. Extensive experiments for both short-term and long-term video prediction are conducted on two advanced datasets — KTH and KITTI — with two evaluation metrics — PSNR and SSIM. For the KTH dataset, the VarNet outperforms the state-of-the-art works by up to 11.9% on PSNR and 9.5% on SSIM. As for the KITTI dataset, the performance boosts are up to 55.1% on PSNR and 15.9% on SSIM. Moreover, we verify that the generalization ability of our model exceeds that of other state-of-the-art methods by testing on the unseen CalTech Pedestrian dataset after being trained on the KITTI dataset. Source code and video are available at

23 citations
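The abstract's two ideas, predicting inter-frame variations and re-weighting the loss by variation amplitude, can be sketched in a few lines. The following is an illustrative PyTorch sketch, not the released VarNet: the fusion is shown as a simple addition and the per-pixel weighting formula is an assumed placeholder, not the paper's exact scheme.

```python
# Illustrative sketch of variation-based prediction and an amplitude-weighted
# loss; the weighting form is a placeholder assumption.
import torch

def fuse(current_frame, predicted_variation):
    # Next frame = current frame + predicted change, clamped to image range.
    return torch.clamp(current_frame + predicted_variation, 0.0, 1.0)

def reweighted_loss(pred_next, true_next, current_frame, eps=1e-6):
    true_variation = true_next - current_frame
    # Pixels that change more get proportionally larger weights (assumed form).
    weight = true_variation.abs()
    weight = weight / (weight.mean() + eps)
    return (weight * (pred_next - true_next).abs()).mean()

# Toy usage with random tensors shaped (batch, channels, height, width).
cur = torch.rand(2, 3, 64, 64)
nxt = torch.rand(2, 3, 64, 64)
var_hat = torch.randn(2, 3, 64, 64) * 0.1   # stand-in for the network output
loss = reweighted_loss(fuse(cur, var_hat), nxt, cur)
print(float(loss))
```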

Proceedings ArticleDOI
04 May 2020
TL;DR: A lightweight backbone captures rich low-level features through the proposed Detail-Preserving Module; an efficient Feature-Preserving and Refinement Module effectively aggregates bottom-up and top-down features; and a lightweight prediction head further reduces the overall network complexity.
Abstract: The extensive computational burden limits the usage of accurate but complex object detectors in resource-bounded scenarios. In this paper, we present a lightweight object detector, named LightDet, to address this dilemma. We design a lightweight backbone that is able to capture rich low-level features by the proposed Detail-Preserving Module. To effectively aggregate bottom-up and top-down features, we introduce an efficient Feature-Preserving and Refinement Module. A lightweight prediction head is employed to further reduce the entire network complexity. Experimental results show that our LightDet achieves 75.5% mAP on PASCAL VOC 2007 at a speed of 250 FPS and 24.0% mAP on the MS COCO dataset.

10 citations
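The abstract names a Detail-Preserving Module and a Feature-Preserving and Refinement Module without describing their internals, so the sketch below is only a generic example of the kind of lightweight bottom-up/top-down feature fusion and prediction head such a detector might use. The class name, channel widths, anchor count, and depthwise-separable refinement are assumptions, not the actual LightDet design.

```python
# Generic lightweight fusion + detection head sketch (not the LightDet modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightFusion(nn.Module):
    def __init__(self, low_ch=64, high_ch=128, out_ch=96, num_anchors=3, num_classes=21):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, out_ch, kernel_size=1)    # keep fine details
        self.top_down = nn.Conv2d(high_ch, out_ch, kernel_size=1)  # compress deep features
        self.refine = nn.Sequential(                                # cheap depthwise-separable refine
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.ReLU(inplace=True),
        )
        # Lightweight head predicting class scores and box offsets per anchor.
        self.cls_head = nn.Conv2d(out_ch, num_anchors * num_classes, 3, padding=1)
        self.reg_head = nn.Conv2d(out_ch, num_anchors * 4, 3, padding=1)

    def forward(self, low_feat, high_feat):
        top = F.interpolate(self.top_down(high_feat), size=low_feat.shape[-2:], mode="nearest")
        fused = self.refine(self.lateral(low_feat) + top)
        return self.cls_head(fused), self.reg_head(fused)

# Toy usage: a stride-8 low-level map fused with a stride-16 deeper map.
cls_out, reg_out = LightweightFusion()(torch.rand(1, 64, 40, 40), torch.rand(1, 128, 20, 20))
print(cls_out.shape, reg_out.shape)
```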

Posted Content
TL;DR: A video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information is proposed and shows significant improvements on fidelity and temporal consistency over the state-of-the-art works.
Abstract: Video prediction is a pixel-wise dense prediction task to infer future frames based on past frames. Missing appearance details and motion blur are still two major problems for current predictive models, which lead to image distortion and temporal inconsistency. In this paper, we point out the necessity of exploring multi-frequency analysis to deal with the two problems. Inspired by the frequency band decomposition characteristic of Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to deal with spatial and temporal information in a unified manner. Specifically, the multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and reserve fine details. On the other hand, multi-level temporal discrete wavelet transform which operates on time axis decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements on fidelity and temporal consistency over state-of-the-art works.

7 citations


Cited by
Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, all sequences of the KITTI Vision Odometry Benchmark are annotated with dense point-wise labels covering the complete 360-degree field-of-view of the employed automotive LiDAR.
Abstract: Semantic scene understanding is important for various applications. In particular, self-driving cars need a fine-grained understanding of the surfaces and objects in their vicinity. Light detection and ranging (LiDAR) provides precise geometric information about the environment and is thus a part of the sensor suites of almost all self-driving cars. Despite the relevance of semantic scene understanding for this application, there is a lack of a large dataset for this task which is based on an automotive LiDAR. In this paper, we introduce a large dataset to propel research on laser-based semantic segmentation. We annotated all sequences of the KITTI Vision Odometry Benchmark and provide dense point-wise annotations for the complete 360-degree field-of-view of the employed automotive LiDAR. We propose three benchmark tasks based on this dataset: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using multiple past scans, and (iii) semantic scene completion, which requires anticipating the semantic scene in the future. We provide baseline experiments and show that there is a need for more sophisticated models to efficiently tackle these tasks. Our dataset opens the door for the development of more advanced methods, but also provides plentiful data to investigate new research directions.

669 citations
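For readers who want to work with the released data, the following is a small sketch of loading one annotated scan, assuming the commonly documented SemanticKITTI layout: points stored as N x 4 float32 (x, y, z, remission) and labels as one uint32 per point with the semantic class in the lower 16 bits and the instance id in the upper 16 bits. The file paths are examples only.

```python
# Hedged sketch of reading one annotated LiDAR scan in the assumed
# SemanticKITTI binary layout.
import numpy as np

def load_scan(bin_path, label_path):
    points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)  # x, y, z, remission
    raw = np.fromfile(label_path, dtype=np.uint32)
    semantic = raw & 0xFFFF          # class id per point (lower 16 bits)
    instance = raw >> 16             # instance id per point (upper 16 bits)
    assert semantic.shape[0] == points.shape[0]
    return points, semantic, instance

# Example paths (adjust to the local dataset copy):
# points, sem, inst = load_scan(
#     "sequences/00/velodyne/000000.bin",
#     "sequences/00/labels/000000.label",
# )
```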

Posted Content
TL;DR: A large dataset to propel research on laser-based semantic segmentation, which opens the door for the development of more advanced methods, but also provides plentiful data to investigate new research directions.
Abstract: Semantic scene understanding is important for various applications. In particular, self-driving cars need a fine-grained understanding of the surfaces and objects in their vicinity. Light detection and ranging (LiDAR) provides precise geometric information about the environment and is thus a part of the sensor suites of almost all self-driving cars. Despite the relevance of semantic scene understanding for this application, there is a lack of a large dataset for this task which is based on an automotive LiDAR. In this paper, we introduce a large dataset to propel research on laser-based semantic segmentation. We annotated all sequences of the KITTI Vision Odometry Benchmark and provide dense point-wise annotations for the complete 360-degree field-of-view of the employed automotive LiDAR. We propose three benchmark tasks based on this dataset: (i) semantic segmentation of point clouds using a single scan, (ii) semantic segmentation using multiple past scans, and (iii) semantic scene completion, which requires anticipating the semantic scene in the future. We provide baseline experiments and show that there is a need for more sophisticated models to efficiently tackle these tasks. Our dataset opens the door for the development of more advanced methods, but also provides plentiful data to investigate new research directions.

532 citations

Journal ArticleDOI
TL;DR: In this article, the authors provide a review of deep learning methods for prediction in video sequences, covering mandatory background concepts and the most used datasets, and carefully analyze existing video prediction models organized according to a proposed taxonomy.
Abstract: The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied by experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper concludes by drawing some general conclusions, identifying open research challenges and pointing out future research directions.

141 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper proposes a video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information, decomposing each video frame into anisotropic sub-bands with multiple frequencies.
Abstract: Video prediction is a pixel-wise dense prediction task to infer future frames based on past frames. Missing appearance details and motion blur are still two major problems for current models, leading to image distortion and temporal inconsistency. We point out the necessity of exploring multi-frequency analysis to deal with the two problems. Inspired by the frequency band decomposition characteristic of Human Vision System (HVS), we propose a video prediction network based on multi-level wavelet analysis to uniformly deal with spatial and temporal information. Specifically, multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and reserve fine details. On the other hand, multi-level temporal discrete wavelet transform which operates on the time axis decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements on fidelity and temporal consistency over the state-of-the-art works. Source code and videos are available at https://github.com/Bei-Jin/STMFANet.

75 citations

Proceedings ArticleDOI
20 Jun 2021
TL;DR: This paper proposes a long-term motion context memory (LMC-Memory) with memory alignment learning, which enables the model to store long-term motion contexts in the memory and to match them with sequences containing limited dynamics.
Abstract: Our work addresses long-term motion context issues for predicting future frames. To predict the future precisely, it is required to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks arising when dealing with the long-term motion context are: (i) how to predict the long-term motion context naturally matching input sequences with limited dynamics, (ii) how to predict the long-term motion context with high-dimensionality (e.g., complex motion). To address the issues, we propose novel motion context-aware video prediction. To solve bottleneck (i), we introduce a long-term motion context memory (LMC-Memory) with memory alignment learning. The proposed memory alignment learning enables storing long-term motion contexts in the memory and matching them with sequences including limited dynamics. As a result, the long-term context can be recalled from the limited input sequence. In addition, to resolve bottleneck (ii), we propose memory query decomposition to store local motion context (i.e., low-dimensional dynamics) and recall the suitable local context for each local part of the input individually. This boosts the alignment effects of the memory. Experimental results show that the proposed method outperforms other sophisticated RNN-based methods, especially in the long-term condition. Further, we validate the effectiveness of the proposed network designs by conducting ablation studies and memory feature analysis. The source code of this work is available†.

68 citations
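To make the store-and-recall idea concrete, here is an illustrative sketch of a generic external memory read using cosine-similarity addressing. It is not the paper's LMC-Memory or its alignment learning, which are more involved; the slot count and feature dimension below are arbitrary.

```python
# Generic external-memory read with soft cosine-similarity addressing,
# sketching how a query from a limited-dynamics input could recall a
# stored long-term motion context.
import torch
import torch.nn.functional as F

class MotionContextMemory(torch.nn.Module):
    def __init__(self, num_slots=128, dim=256):
        super().__init__()
        # Learnable memory slots intended to hold long-term motion contexts.
        self.slots = torch.nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, query):
        # query: (batch, dim) feature extracted from the input sequence.
        sim = F.cosine_similarity(query.unsqueeze(1), self.slots.unsqueeze(0), dim=-1)
        attn = F.softmax(sim, dim=-1)        # soft addressing over memory slots
        recalled = attn @ self.slots         # (batch, dim) recalled context vector
        return recalled, attn

# Toy usage: recall context vectors for a batch of two query features.
memory = MotionContextMemory()
context, weights = memory(torch.rand(2, 256))
print(context.shape, weights.shape)
```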