Author

Zhen Cheng

Bio: Zhen Cheng is an academic researcher from the University of Science and Technology of China. The author has contributed to research in the topics of computer science and light fields. The author has an h-index of 5 and has co-authored 7 publications receiving 39 citations.

Papers
Proceedings ArticleDOI
20 Jun 2021
TL;DR: This paper proposes a zero-shot learning framework for light field super-resolution, which learns a mapping to super-resolve the reference view with examples extracted solely from the input low-resolution light field itself.
Abstract: Deep learning provides a new avenue for light field super-resolution (SR). However, the domain gap caused by drastically different light field acquisition conditions poses a main obstacle in practice. To fill this gap, we propose a zero-shot learning framework for light field SR, which learns a mapping to super-resolve the reference view with examples extracted solely from the input low-resolution light field itself. Given highly limited training data under the zero-shot setting, however, we observe that it is difficult to train an end-to-end network successfully. Instead, we divide this challenging task into three sub-tasks, i.e., pre-upsampling, view alignment, and multi-view aggregation, and then conquer them separately with simple yet efficient CNNs. Moreover, the proposed framework can be readily extended to finetune the pre-trained model on a source dataset to better adapt to the target input, which further boosts the performance of light field SR in the wild. Experimental results validate that our method not only outperforms classic non-learning-based methods, but also generalizes better to unseen light fields than state-of-the-art deep-learning-based methods when the domain gap is large.
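To make the three sub-tasks above concrete, the following is a minimal sketch in PyTorch (an assumption; not the authors' implementation) of the pre-upsampling, view alignment, and multi-view aggregation stages. The per-view disparities used for alignment, the module sizes, and all names are illustrative placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewAggregation(nn.Module):
    """Tiny CNN that fuses the aligned views into one SR reference view."""
    def __init__(self, num_views: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(num_views, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, aligned_views):          # (B, num_views, H, W)
        return self.body(aligned_views)        # (B, 1, H, W)

def zero_shot_lf_sr(lf_lr, disparities, scale=2):
    """lf_lr: (V, 1, h, w) grayscale sub-aperture views; disparities: (V, 2, H, W)
    per-view flow towards the reference view (assumed given for this sketch)."""
    V, _, h, w = lf_lr.shape
    H, W = h * scale, w * scale
    # 1) pre-upsampling of every view
    up = F.interpolate(lf_lr, size=(H, W), mode='bicubic', align_corners=False)
    # 2) view alignment: warp each upsampled view towards the reference view
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float()                  # (2, H, W)
    grid = base.unsqueeze(0) + disparities                       # (V, 2, H, W)
    grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1                    # normalize x to [-1, 1]
    grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1                    # normalize y to [-1, 1]
    aligned = F.grid_sample(up, grid.permute(0, 2, 3, 1), align_corners=False)
    # 3) multi-view aggregation into the reference view
    #    (untrained here; in the zero-shot setting it is trained on examples
    #    extracted from the input light field itself)
    return MultiViewAggregation(V)(aligned.transpose(0, 1))      # (1, 1, H, W)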

38 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes a space-time distillation (STD) scheme to exploit both spatial and temporal knowledge in the VSR task; the scheme can be easily incorporated into any network without changing the original network architecture.
Abstract: Compact video super-resolution (VSR) networks can be easily deployed on resource-limited devices, e.g., smartphones and wearable devices, but have considerable performance gaps compared with complicated VSR networks that require a large amount of computing resources. In this paper, we aim to improve the performance of compact VSR networks without changing their original architectures, through a knowledge distillation approach that transfers knowledge from a complicated VSR network to a compact one. Specifically, we propose a space-time distillation (STD) scheme to exploit both spatial and temporal knowledge in the VSR task. For space distillation, we extract spatial attention maps that hint the high-frequency video content from both networks, which are further used for transferring spatial modeling capabilities. For time distillation, we narrow the performance gap between compact models and complicated models by distilling the feature similarity of the temporal memory cells, which are encoded from the sequence of feature maps generated in the training clips using ConvLSTM. During the training process, STD can be easily incorporated into any network without changing the original network architecture. Experimental results on standard benchmarks demonstrate that, in resource-constrained situations, the proposed method notably improves the performance of existing VSR networks without increasing the inference time.
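The two distillation signals described above can be illustrated with a short hedged sketch, assuming PyTorch: a normalized spatial attention map computed from feature maps, and a temporal loss that matches the similarity structure of ConvLSTM memory cells across a training clip. Function names and the exact attention and similarity definitions are assumptions for illustration, not the paper's formulation.

import torch
import torch.nn.functional as F

def spatial_attention_map(feat, eps=1e-6):
    """Collapse a feature map (B, C, H, W) into a normalized attention map
    (B, 1, H, W) by averaging absolute activations over channels."""
    att = feat.abs().mean(dim=1, keepdim=True)
    return att / (att.amax(dim=(2, 3), keepdim=True) + eps)

def space_distillation_loss(teacher_feat, student_feat):
    # Match the student's spatial attention to the teacher's.
    return F.l1_loss(spatial_attention_map(student_feat),
                     spatial_attention_map(teacher_feat))

def time_distillation_loss(teacher_cells, student_cells):
    """Match the similarity structure of ConvLSTM memory cells, each given as a
    list of (B, C, H, W) tensors over the frames of a training clip."""
    t = torch.stack([c.flatten(1) for c in teacher_cells], dim=1)   # (B, T, CHW)
    s = torch.stack([c.flatten(1) for c in student_cells], dim=1)
    sim_t = F.normalize(t, dim=2) @ F.normalize(t, dim=2).transpose(1, 2)  # (B, T, T)
    sim_s = F.normalize(s, dim=2) @ F.normalize(s, dim=2).transpose(1, 2)
    return F.mse_loss(sim_s, sim_t)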

36 citations

Journal ArticleDOI
TL;DR: This paper advances the classic projection-based method that exploits internal similarity by introducing an intensity consistency checking criterion and a back-projection refinement, and proposes a pixel-wise adaptive fusion network that learns a weighting matrix to combine the merits of internal and external similarities.
Abstract: Light field images taken by plenoptic cameras often have a tradeoff between spatial and angular resolutions. In this paper, we propose a novel spatial super-resolution approach for light field images by jointly exploiting internal and external similarities. The internal similarity refers to the correlations across the angular dimensions of the 4D light field itself, while the external similarity refers to the cross-scale correlations learned from an external light field dataset. Specifically, we advance the classic projection-based method that exploits the internal similarity by introducing the intensity consistency checking criterion and a back-projection refinement, while the external correlation is learned by a CNN-based method which aggregates all warped high-resolution sub-aperture images upsampled from the low-resolution input using a single image super-resolution method. By analyzing the error distributions of the above two methods and investigating the upperbound of combining them, we find that the internal and external similarities are complementary to each other. Accordingly, we further propose a pixel-wise adaptive fusion network to take advantage of both their merits by learning a weighting matrix. Experimental results on both synthetic and real-world light field datasets validate the superior performance of the proposed approach over the state-of-the-arts.
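As a hedged illustration of the pixel-wise adaptive fusion idea, the sketch below (assuming PyTorch; layer sizes and names are placeholders, not the paper's configuration) predicts a per-pixel weight map and blends the internally and externally super-resolved images.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # Small CNN that predicts a per-pixel weight W in [0, 1].
        self.weight_net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, sr_internal, sr_external):            # each (B, 1, H, W)
        w = self.weight_net(torch.cat([sr_internal, sr_external], dim=1))
        # Weighted blend of the projection-based and CNN-based results.
        return w * sr_internal + (1.0 - w) * sr_external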

26 citations

Proceedings ArticleDOI
16 Jun 2019
TL;DR: This paper conducts the first systematic benchmark evaluation of representative light field SR methods on both synthetic and real-world datasets with various downsampling kernels and scaling factors, and finds that CNN-based single image SR without using any angular information outperforms most light field SR methods, even including learning-based ones.
Abstract: Lenslet-based light field imaging generally suffers from a fundamental trade-off between spatial and angular resolutions, which limits its adoption in practical applications. To this end, substantial effort has been dedicated to light field super-resolution (SR) in recent years. Despite the demonstrated success, existing light field SR methods are often evaluated under different degradation assumptions using different datasets, and even contradictory results are reported in the literature. In this paper, we conduct the first systematic benchmark evaluation for representative light field SR methods on both synthetic and real-world datasets with various downsampling kernels and scaling factors. We then analyze and discuss the advantages and limitations of each kind of method from different perspectives. In particular, we find that CNN-based single image SR without using any angular information outperforms most light field SR methods, even including learning-based ones. This benchmark evaluation, along with the comprehensive analysis and discussion, sheds light on future research in light field SR.
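A benchmark of this kind implies an evaluation loop over downsampling kernels, scaling factors, and methods; the sketch below (NumPy, with a toy decimation stand-in for the actual degradation kernels) shows the shape of such a loop. The dataset and method interfaces are hypothetical placeholders.

import numpy as np

def psnr(ref, est, peak=1.0):
    # Peak signal-to-noise ratio between two arrays with values in [0, peak].
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def downsample(lf_hr, kernel, scale):
    # Toy stand-in for the benchmark's degradation models: plain decimation.
    # A real benchmark would apply the named blur kernel before decimating.
    return lf_hr[..., ::scale, ::scale]

def run_benchmark(dataset, methods, kernels=('bicubic', 'gaussian'), scales=(2, 4)):
    # dataset: iterable of HR light fields; methods: dict of name -> SR callable.
    results = {}
    for kernel in kernels:
        for scale in scales:
            for name, method in methods.items():
                scores = [psnr(lf_hr, method(downsample(lf_hr, kernel, scale), scale))
                          for lf_hr in dataset]
                results[(kernel, scale, name)] = float(np.mean(scores))
    return results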

20 citations

Journal ArticleDOI
TL;DR: A multi-domain collaborative transfer learning (MDCTL) method with multi-scale repeated attention mechanism (MSRAM) is proposed for improving the accuracy of underwater sonar image classification and is shown to be more powerful in feature representation by using the MDCTL and MSRAM.
Abstract: Due to the strong speckle noise caused by seabed reverberation, it is difficult to extract discriminating and noise-free features of a target, which makes recognition and classification of underwater targets using side-scan sonar (SSS) images a big challenge. Moreover, unlike classification of optical images, which can use a large dataset to train the classifier, classification of SSS images usually has to rely on a very small training dataset, which may cause classifier overfitting. Compared with traditional feature extraction methods using descriptors such as Haar, SIFT, and LBP, deep learning-based methods are more powerful in capturing discriminating features. After training on a large optical dataset, e.g., ImageNet, direct fine-tuning improves sonar image classification on a small SSS image dataset. However, due to the different statistical characteristics of optical images and sonar images, transfer learning methods such as fine-tuning lack cross-domain adaptability and therefore cannot achieve very satisfactory results. In this paper, a multi-domain collaborative transfer learning (MDCTL) method with a multi-scale repeated attention mechanism (MSRAM) is proposed for improving the accuracy of underwater sonar image classification. In the MDCTL method, low-level characteristic similarity between SSS images and synthetic aperture radar (SAR) images, and high-level representation similarity between SSS images and optical images, are used together to enhance the feature extraction ability of the deep learning model. By using different characteristics of multi-domain data to efficiently capture useful features for sonar image classification, MDCTL offers a new way of transfer learning. MSRAM is used to effectively combine multi-scale features so that the proposed model pays more attention to the shape details of the target while excluding noise. Classification experiments show that, using multi-domain datasets, the proposed method is more stable, with an overall accuracy of 99.21%, an improvement of 4.54% over the fine-tuned VGG19. Results from diverse visualization methods also demonstrate that, with MDCTL and MSRAM, the method is more powerful in feature representation.
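As a loose illustration of the multi-scale attention idea behind MSRAM, the sketch below (assuming PyTorch) pools features at several scales, derives an attention map from each, and re-weights the input features so that larger-scale shape structure can be emphasized over noise; it is not a reproduction of the authors' module, and the scales and layer choices are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One lightweight conv per scale to produce an attention response.
        self.att = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in scales
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        h, w = x.shape[-2:]
        maps = []
        for s, conv in zip(self.scales, self.att):
            pooled = F.avg_pool2d(x, kernel_size=s) if s > 1 else x
            maps.append(F.interpolate(conv(pooled), size=(h, w),
                                      mode='bilinear', align_corners=False))
        # Combine the multi-scale responses and re-weight the input features.
        return x * torch.sigmoid(sum(maps))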

9 citations


Cited by
Journal ArticleDOI
TL;DR: This paper first designs a class of domain-specific convolutions to disentangle LFs from different dimensions, and then leverages these disentangled features by designing task-specific modules, which demonstrates the effectiveness, efficiency, and generality of the disentangling mechanism.
Abstract: Light field (LF) cameras record both intensity and directions of light rays, and encode 3D cues into 4D LF images. Recently, many convolutional neural networks (CNNs) have been proposed for various LF image processing tasks. However, it is challenging for CNNs to effectively process LF images since the spatial and angular information are highly inter-twined with varying disparities. In this paper, we propose a generic mechanism to disentangle these coupled information for LF image processing. Specifically, we first design a class of domain-specific convolutions to disentangle LFs from different dimensions, and then leverage these disentangled features by designing task-specific modules. Our disentangling mechanism can well incorporate the LF structure prior and effectively handle 4D LF data. Based on the proposed mechanism, we develop three networks (i.e., DistgSSR, DistgASR and DistgDisp) for spatial super-resolution, angular super-resolution and disparity estimation. Experimental results show that our networks achieve state-of-the-art performance on all these three tasks, which demonstrates the effectiveness, efficiency, and generality of our disentangling mechanism. Project page: https://yingqianwang.github.io/DistgLF/.
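One way such domain-specific convolutions can be realized, sketched below under the assumption of a macro-pixel layout (the 4D LF reorganized into a 2D image of size angRes*H by angRes*W), is a dilated convolution that mixes only pixels of the same view and a strided convolution that mixes only the views inside one macro-pixel. Channel sizes and class names are illustrative and not taken from the released DistgLF code.

import torch
import torch.nn as nn

class SpatialConv(nn.Module):
    """Mixes spatial neighbors of each view; the dilation steps over whole
    macro-pixels, so only same-view pixels interact."""
    def __init__(self, ang_res, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=ang_res, dilation=ang_res)

    def forward(self, x):            # x: (B, C, angRes*H, angRes*W)
        return self.conv(x)          # same size as input

class AngularConv(nn.Module):
    """Mixes the angRes x angRes views inside each macro-pixel."""
    def __init__(self, ang_res, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=ang_res, stride=ang_res)

    def forward(self, x):            # output: (B, C, H, W), one value per macro-pixel
        return self.conv(x)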

45 citations

Journal ArticleDOI
TL;DR: In this article, an end-to-end learning-based method is proposed to simultaneously reconstruct all view images in light fields (LFs) with higher spatial resolution, and the proposed method outperforms other state-of-the-art methods in both visual and numerical evaluations.
Abstract: Light field (LF) cameras are considered to have many potential applications since angular and spatial information is captured simultaneously. However, the limited spatial resolution has brought many difficulties to developing related applications and has become the main bottleneck of LF cameras. In this paper, an end-to-end learning-based method is proposed to simultaneously reconstruct all view images in LFs with higher spatial resolution. Based on the epipolar geometry, view images in one LF are first grouped into several image stacks and fed into different network branches to learn sub-pixel details for each view image. Since LFs have dense sampling in the angular domain, sub-pixel details in multiple spatial directions are learned from corresponding angular directions in multiple branches, respectively. Then, sub-pixel details from different directions are further integrated to generate global high-frequency residual details. Combined with the spatially upsampled LF, the final LF with high spatial resolution is obtained. Experimental results on synthetic and real-world datasets demonstrate that the proposed method outperforms other state-of-the-art methods in both visual and numerical evaluations. We also apply the proposed method to LFs with different angular resolutions, and experiments show that it achieves superior results compared with other methods, especially for LFs with small angular resolution. Furthermore, since the epipolar geometry is fully considered, the proposed network shows good performance in preserving the inherent epipolar property of LF images.
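The directional grouping described above can be sketched as follows (PyTorch, illustrative only, not the authors' code): the sub-aperture views sharing an angular row or column with the central view form horizontal and vertical stacks that feed separate branches; real models may also use diagonal groups.

import torch

def directional_stacks(lf):                  # lf: (U, V, H, W) sub-aperture views
    U, V, H, W = lf.shape
    uc, vc = U // 2, V // 2                  # central view index
    horizontal = lf[uc]                      # (V, H, W): views along the horizontal angular axis
    vertical = lf[:, vc]                     # (U, H, W): views along the vertical angular axis
    return horizontal, vertical

# Each stack can then be fed to its own branch; the per-branch sub-pixel residuals
# are integrated into a global residual and added to the spatially upsampled LF.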

41 citations

Proceedings ArticleDOI
20 Jun 2021
TL;DR: This paper proposes a zero-shot learning framework for light field super-resolution, which learns a mapping to super-resolve the reference view with examples extracted solely from the input low-resolution light field itself.
Abstract: Deep learning provides a new avenue for light field super-resolution (SR). However, the domain gap caused by drastically different light field acquisition conditions poses a main obstacle in practice. To fill this gap, we propose a zero-shot learning framework for light field SR, which learns a mapping to super-resolve the reference view with examples extracted solely from the input low-resolution light field itself. Given highly limited training data under the zero-shot setting, however, we observe that it is difficult to train an end-to-end network successfully. Instead, we divide this challenging task into three sub-tasks, i.e., pre-upsampling, view alignment, and multi-view aggregation, and then conquer them separately with simple yet efficient CNNs. Moreover, the proposed framework can be readily extended to finetune the pre-trained model on a source dataset to better adapt to the target input, which further boosts the performance of light field SR in the wild. Experimental results validate that our method not only outperforms classic non-learning-based methods, but also generalizes better to unseen light fields than state-of-the-art deep-learning-based methods when the domain gap is large.

38 citations

Proceedings ArticleDOI
01 Sep 2018
TL;DR: An unsupervised CNN-based method for explicit depth estimation from light fields is proposed, which learns an end-to-end mapping from a 4D light field to the corresponding disparity map without the supervision of ground-truth depth.
Abstract: This paper proposes an unsupervised CNN-based method for explicit depth estimation from light fields, which learns an end-to-end mapping from a 4D light field to the corresponding disparity map without the supervision of ground-truth depth. Specifically, we design a combined loss function imposing both compliance and divergence constraints on the sub-aperture images warped to the central view, which guarantees that our network generates an accurate and robust disparity map. Furthermore, we find that increasing the number of referenced views in depth feature extraction and complementing the missing information caused by warping greatly boost the performance of our network. Due to the difficulty of obtaining ground-truth depth of real-world scenes in practice, the proposed method is much more feasible than supervised learning. On the other hand, compared with traditional non-learning methods, the proposed method better exploits the correlations in the 4D light field and generates superior depth results both quantitatively and qualitatively. Also, the proposed method helps improve the performance of subsequent applications based on the estimated depth, e.g., spatial super-resolution of light fields.
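A minimal sketch of the photometric ("compliance") part of such an unsupervised objective is given below, assuming PyTorch: each sub-aperture view is warped to the central view with the predicted disparity and compared against it. The divergence term and the network itself are omitted, and the disparity-scaling convention (shift = disparity times the angular offset from the center) is an assumption stated here, not quoted from the paper.

import torch
import torch.nn.functional as F

def warp_to_center(view, disparity, du, dv):
    """view: (B, C, H, W); disparity: (B, 1, H, W) predicted for the central view;
    (du, dv): angular offset of this view from the central one."""
    B, _, H, W = view.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    x = xs.expand(B, H, W) + disparity[:, 0] * du
    y = ys.expand(B, H, W) + disparity[:, 0] * dv
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid = torch.stack(((x / (W - 1)) * 2 - 1, (y / (H - 1)) * 2 - 1), dim=-1)
    return F.grid_sample(view, grid, align_corners=False)

def compliance_loss(center_view, side_views, offsets, disparity):
    # Penalize the difference between each warped side view and the central view.
    losses = [F.l1_loss(warp_to_center(v, disparity, du, dv), center_view)
              for v, (du, dv) in zip(side_views, offsets)]
    return torch.stack(losses).mean()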

35 citations

Proceedings ArticleDOI
12 Oct 2020
TL;DR: A novel space-time video super-resolution method that recovers a high-frame-rate and high-resolution video from its low-frame-rate and low-resolution observation, using a feature shuffling module for spatial retargeting and spatial-temporal information fusion followed by a refining module for artifact alleviation and detail enhancement, without requiring any explicit or implicit motion estimation.
Abstract: In this paper, we propose a novel space-time video super-resolution method, which aims to recover a high-frame-rate and high-resolution video from its low-frame-rate and low-resolution observation. Existing solutions seldom consider the spatial-temporal correlation and the long-term temporal context simultaneously and thus are limited in restoration performance. Inspired by the epipolar-plane image used in multi-view computer vision tasks, we first propose the concept of temporal-profile super-resolution to directly exploit the spatial-temporal correlation in the long-term temporal context. Then, we specifically design a feature shuffling module for spatial retargeting and spatial-temporal information fusion, which is followed by a refining module for artifact alleviation and detail enhancement. Different from existing solutions, our method does not require any explicit or implicit motion estimation, making it lightweight and flexible enough to handle any number of input frames. Comprehensive experimental results demonstrate that our method not only generates superior space-time video super-resolution results but also retains competitive implementation efficiency.
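The temporal-profile idea can be illustrated with a tiny sketch (PyTorch, illustrative only): fixing one image row of a video tensor and stacking it over time yields a 2D profile, analogous to an epipolar-plane image, in which motion appears as oriented structures; super-resolving these profiles exploits the long-term temporal context directly. The shuffling and refining modules are not reproduced here.

import torch

def temporal_profiles(video):                # video: (T, C, H, W)
    # One profile per image row: profiles[h] has shape (C, T, W),
    # i.e. time runs along the vertical axis of a 2D image.
    return video.permute(2, 1, 0, 3)         # (H, C, T, W)

# Example: a 30-frame clip of 64x64 RGB frames yields 64 profiles of size 30x64.
profiles = temporal_profiles(torch.rand(30, 3, 64, 64))
print(profiles.shape)                        # torch.Size([64, 3, 30, 64])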

30 citations