
Showing papers by "Junhui Hou published in 2020"


Journal ArticleDOI
TL;DR: This paper constructs an Underwater Image Enhancement Benchmark (UIEB) of 950 real-world underwater images, 890 of which have corresponding reference images, and proposes an underwater image enhancement network (Water-Net) trained on this benchmark as a baseline, demonstrating the generalization ability of the proposed UIEB for training Convolutional Neural Networks (CNNs).
Abstract: Underwater image enhancement has been attracting much attention due to its significance in marine engineering and aquatic robotics. Numerous underwater image enhancement algorithms have been proposed in the last few years. However, these algorithms are mainly evaluated using either synthetic datasets or a few selected real-world images. It is thus unclear how these algorithms would perform on images acquired in the wild and how we could gauge the progress in the field. To bridge this gap, we present the first comprehensive perceptual study and analysis of underwater image enhancement using large-scale real-world images. In this paper, we construct an Underwater Image Enhancement Benchmark (UIEB) including 950 real-world underwater images, 890 of which have corresponding reference images. We treat the remaining 60 underwater images, for which satisfactory reference images could not be obtained, as challenging data. Using this dataset, we conduct a comprehensive qualitative and quantitative study of state-of-the-art underwater image enhancement algorithms. In addition, we propose an underwater image enhancement network (called Water-Net) trained on this benchmark as a baseline, which indicates the generalization ability of the proposed UIEB for training Convolutional Neural Networks (CNNs). The benchmark evaluations and the proposed Water-Net demonstrate the performance and limitations of state-of-the-art algorithms, shedding light on future research in underwater image enhancement. The dataset and code are available at https://li-chongyi.github.io/proj_benchmark.html.
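
Since 890 of the images come with reference versions, enhanced results can be scored with standard full-reference metrics; a minimal sketch of one such metric (PSNR) is shown below, though the exact metrics and protocol used in the paper may differ.

    import numpy as np

    def psnr(reference, enhanced, peak=255.0):
        # Peak signal-to-noise ratio between an enhanced image and its reference;
        # higher values mean the result is closer to the reference image.
        mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)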

697 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network and shows that it generalizes well to diverse lighting conditions.
Abstract: The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed.
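
The pixel-wise high-order curve can be read as an iterated quadratic mapping. The sketch below shows how a stack of estimated curve-parameter maps might be applied to an image normalized to [0, 1]; variable names and the number of iterations are illustrative and not taken from the paper's implementation.

    import numpy as np

    def apply_enhancement_curves(image, alpha_maps):
        # Iteratively apply a quadratic light-enhancement curve of the form
        # LE(x) = x + alpha * x * (1 - x), one per-pixel alpha map per iteration.
        x = image.astype(np.float64)        # pixel values assumed to lie in [0, 1]
        for alpha in alpha_maps:            # curve-parameter maps predicted by a network such as DCE-Net
            x = x + alpha * x * (1.0 - x)   # stays in [0, 1] and remains monotonic for |alpha| <= 1
        return np.clip(x, 0.0, 1.0)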

447 citations


Posted Content
TL;DR: This paper proposes a zero-reference deep curve estimation (Zero-DCE) method, which formulates light enhancement as a task of image-specific curve estimation with a deep network.
Abstract: The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed. Code and model will be available at this https URL.

300 citations


Journal ArticleDOI
TL;DR: A novel depth-guided transformation model (DTM) going from RGB saliency to RGBD saliency is proposed and an optimization model is formulated to attain more consistent and accurate saliency results via an energy function, which integrates the unary data term, color smooth term, and depth consistency term.
Abstract: Depth information has been demonstrated to be useful for saliency detection. However, the existing methods for RGBD saliency detection mainly focus on designing straightforward and comprehensive models, while ignoring the transferable ability of the existing RGB saliency detection models. In this article, we propose a novel depth-guided transformation model (DTM) going from RGB saliency to RGBD saliency. The proposed model includes three components, that is: 1) multilevel RGBD saliency initialization; 2) depth-guided saliency refinement; and 3) saliency optimization with depth constraints. The explicit depth feature is first utilized in the multilevel RGBD saliency model to initialize the RGBD saliency by combining the global compactness saliency cue and local geodesic saliency cue. The depth-guided saliency refinement is used to further highlight the salient objects and suppress the background regions by introducing the prior depth domain knowledge and prior refined depth shape. Benefiting from the consistency of the entire object in the depth map, we formulate an optimization model to attain more consistent and accurate saliency results via an energy function, which integrates the unary data term, color smooth term, and depth consistency term. Experiments on three public RGBD saliency detection benchmarks demonstrate the effectiveness and performance improvement of the proposed DTM from RGB to RGBD saliency.
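
The final optimization stage can be written as minimizing an energy over the saliency values $S$. One plausible form, with illustrative notation (the exact terms and weights in the paper may differ), is

$E(S)=\sum_{i}\big(S_i-\bar{S}_i\big)^2+\lambda_{c}\sum_{(i,j)\in\mathcal{N}}w^{c}_{ij}\big(S_i-S_j\big)^2+\lambda_{d}\sum_{(i,j)\in\mathcal{N}}w^{d}_{ij}\big(S_i-S_j\big)^2,$

where $\bar{S}_i$ is the initial saliency of region $i$ (unary data term), $w^{c}_{ij}$ and $w^{d}_{ij}$ weight neighboring regions $\mathcal{N}$ by color and depth similarity (color smooth and depth consistency terms), and $\lambda_{c}$, $\lambda_{d}$ balance the terms.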

157 citations


Book ChapterDOI
24 Feb 2020
TL;DR: This paper proposes a novel deep neural network-based method, called PUGeo-Net, which learns a linear transformation matrix for each input point and projects the samples onto the curved surface by computing a displacement along the normal of the tangent plane.
Abstract: This paper addresses the problem of generating uniform dense point clouds to describe the underlying geometric structures from given sparse point clouds. Due to the irregular and unordered nature, point cloud densification as a generative task is challenging. To tackle the challenge, we propose a novel deep neural network based method, called PUGeo-Net, that learns a $3\times 3$ linear transformation matrix $\bf T$ for each input point. Matrix $\mathbf T$ approximates the augmented Jacobian matrix of a local parameterization and builds a one-to-one correspondence between the 2D parametric domain and the 3D tangent plane so that we can lift the adaptively distributed 2D samples (which are also learned from data) to 3D space. After that, we project the samples to the curved surface by computing a displacement along the normal of the tangent plane. PUGeo-Net is fundamentally different from the existing deep learning methods that are largely motivated by the image super-resolution techniques and generate new points in the abstract feature space. Thanks to its geometry-centric nature, PUGeo-Net works well for both CAD models with sharp features and scanned models with rich geometric details. Moreover, PUGeo-Net can compute the normal for the original and generated points, which is highly desired by the surface reconstruction algorithms. Computational results show that PUGeo-Net, the first neural network that can jointly generate vertex coordinates and normals, consistently outperforms the state-of-the-art in terms of accuracy and efficiency for upsampling factor $4\sim 16$.
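
A rough sketch of the geometric lifting step described above is given below. It assumes a single input point, its learned matrix $\bf T$, learned 2D samples, a unit normal, and learned normal displacements; all names are illustrative rather than PUGeo-Net's actual interface.

    import numpy as np

    def lift_samples(point, T, uv_samples, normal, displacements):
        # Map learned 2D parametric samples onto the tangent plane of `point` via the
        # learned 3x3 matrix T, then push each sample onto the surface by its learned
        # displacement along the unit normal.
        uv3 = np.concatenate([uv_samples, np.zeros((len(uv_samples), 1))], axis=1)  # (u, v, 0)
        on_tangent_plane = point + uv3 @ T          # one upsampled point per 2D sample
        return on_tangent_plane + displacements[:, None] * normal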

75 citations


Journal ArticleDOI
Hao Liu1, Hui Yuan1, Qi Liu1, Junhui Hou2, Ju Liu1 
TL;DR: Experimental results demonstrate that the coding efficiency of TMC2 is the best on average (especially for lossy geometry and lossy color compression) for dense point clouds while TMC13 achieves the optimal coding performance for sparse and noisy point clouds with lower time complexity.
Abstract: Point cloud based 3D visual representation is becoming popular due to its ability to exhibit the real world in a more comprehensive and immersive way. However, under a limited network bandwidth, it is very challenging to communicate this kind of media due to its huge data volume. Therefore, MPEG has launched the standardization of point cloud compression (PCC) and proposed three model categories, i.e., TMC1, TMC2, and TMC3. Because the 3D geometry compression methods of TMC1 and TMC3 are similar, TMC1 and TMC3 were further merged into a new platform, namely TMC13. In this paper, we first introduce some basic technologies that are usually used in 3D point cloud compression, then review the encoder architectures of these test models in detail, and finally analyze their rate-distortion performance as well as complexity quantitatively for different cases (i.e., lossless geometry and lossless color, lossless geometry and lossy color, lossy geometry and lossy color) by using 16 benchmark 3D point clouds recommended by MPEG. Experimental results demonstrate that the coding efficiency of TMC2 is the best on average (especially for lossy geometry and lossy color compression) for dense point clouds, while TMC13 achieves the optimal coding performance for sparse and noisy point clouds with lower time complexity.

73 citations


Journal ArticleDOI
TL;DR: Experimental results on two benchmark datasets demonstrate that the proposed SBOMP based VS method clearly outperforms several state-of-the-art sparse representation based methods in terms of F-score, redundancy among keyframes and robustness to outlier frames.

66 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel learning-based LF spatial SR framework in which each view of an LF image is first individually super-resolved by exploring the complementary information among views with combinatorial geometry embedding, preserving more accurate parallax details at a lower computation cost.
Abstract: Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution as the limited sampling resources have to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF views and are insufficient in accurately preserving the parallax structure of the scene. In this paper, we propose a novel learning-based LF spatial SR framework, in which each view of an LF image is first individually super-resolved by exploring the complementary information among views with combinatorial geometry embedding. For accurate preservation of the parallax structure among the reconstructed views, a regularization network trained over a structure-aware loss function is subsequently appended to enforce correct parallax relationships over the intermediate estimation. Our proposed approach is evaluated over datasets with a large number of testing images including both synthetic and real-world scenes. Experimental results demonstrate the advantage of our approach over state-of-the-art methods, i.e., our method not only improves the average PSNR by more than 1.0 dB but also preserves more accurate parallax details, at a lower computation cost.

62 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed compression scheme for the attributes of voxelized 3D point clouds is able to achieve better rate-distortion performance and visual quality, compared with state-of-the-art methods.
Abstract: 3D point clouds associated with attributes are considered as a promising paradigm for immersive communication. However, the corresponding compression schemes for this medium are still in their infancy. Moreover, in contrast to conventional image/video compression, it is a more challenging task to compress 3D point cloud data owing to its irregular structure. In this paper, we propose a novel and effective compression scheme for the attributes of voxelized 3D point clouds. In the first stage, an input voxelized 3D point cloud is divided into blocks of equal size. Then, to deal with the irregular structure of 3D point clouds, a geometry-guided sparse representation (GSR) is proposed to eliminate the redundancy within each block, which is formulated as an $\ell_0$-norm regularized optimization problem. Also, an inter-block prediction scheme is applied to remove the redundancy between blocks. Finally, by quantitatively analyzing the characteristics of the resulting transform coefficients of GSR, an effective entropy coding strategy that is tailored to our GSR is developed to generate the bitstream. Experimental results over various benchmark datasets show that the proposed compression scheme is able to achieve better rate-distortion performance and visual quality, compared with state-of-the-art methods.
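
The per-block sparse representation step can be read as an $\ell_0$-regularized fitting problem. With illustrative notation (the paper's exact formulation may differ), the attribute vector $\mathbf{y}$ of a block and a geometry-guided dictionary $\mathbf{D}$ lead to

$\min_{\mathbf{c}}\ \|\mathbf{y}-\mathbf{D}\mathbf{c}\|_2^2+\mu\|\mathbf{c}\|_0,$

where $\mathbf{c}$ collects the transform coefficients that are subsequently quantized and entropy coded, and $\mu$ trades off reconstruction fidelity against sparsity.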

57 citations


Journal ArticleDOI
TL;DR: The proposed semi-supervised non-negative matrix factorization model is capable of generating discriminable low-dimensional representations to improve clustering performance, and the proposed algorithm is theoretically proven to converge to a limiting point that meets the Karush–Kuhn–Tucker conditions.
Abstract: In this article, we propose a semi-supervised non-negative matrix factorization (NMF) model by means of elegantly modeling the label information. The proposed model is capable of generating discriminable low-dimensional representations to improve clustering performance. Specifically, a pair of complementary regularizers, i.e., similarity and dissimilarity regularizers, is incorporated into the conventional NMF to guide the factorization. They impose restrictions on both the similarity and dissimilarity of the low-dimensional representations of labeled data samples as well as of a small number of unlabeled ones. The proposed model is formulated as a well-posed constrained optimization problem and further solved with an efficient alternating iterative algorithm. Moreover, we theoretically prove that the proposed algorithm can converge to a limiting point that meets the Karush–Kuhn–Tucker conditions. Extensive experiments as well as comprehensive analysis demonstrate that the proposed model outperforms the state-of-the-art NMF methods to a large extent over five benchmark data sets, i.e., the clustering accuracy increases to 82.2% from 57.0%.

55 citations


Journal ArticleDOI
TL;DR: This work proposes a linear perceptual quality model whose variables are the V-PCC geometry and color quantization step sizes and whose coefficients can easily be computed from two features extracted from the original point cloud.
Abstract: In rate-distortion optimization, the encoder settings are determined by maximizing a reconstruction quality measure subject to a constraint on the bit rate. One of the main challenges of this approach is to define a quality measure that can be computed with low computational cost and which correlates well with perceptual quality. While several quality measures that fulfil these two criteria have been developed for images and video, no such measure exists for 3D point clouds. We address this limitation for the video-based point cloud compression (V-PCC) standard by proposing a linear perceptual quality model whose variables are the V-PCC geometry and color quantization parameters and whose coefficients can easily be computed from two features extracted from the original 3D point cloud. Subjective quality tests with 400 compressed 3D point clouds show that the proposed model correlates well with the mean opinion score, outperforming state-of-the-art full reference objective measures in terms of Spearman rank-order and Pearson's linear correlation coefficients. Moreover, we show that for the same target bit rate, rate-distortion optimization based on the proposed model offers higher perceptual quality than rate-distortion optimization based on exhaustive search with a point-to-point objective quality metric.
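
With illustrative notation, such a linear model can be written as

$\hat{Q}=c_0+c_1\,Q_g+c_2\,Q_c,$

where $Q_g$ and $Q_c$ are the geometry and color quantization step sizes, and the coefficients $c_0$, $c_1$, $c_2$ are simple functions of the two features extracted from the original point cloud; the exact variables and coefficient forms used in the paper may differ.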

Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposes an end-to-end learning-based approach aiming at angularly super-resolving a sparsely-sampled light field with a large baseline.
Abstract: The acquisition of light field images with high angular resolution is costly. Although many methods have been proposed to improve the angular resolution of a sparsely-sampled light field, they always focus on the light field with a small baseline, which is captured by a consumer light field camera. By making full use of the intrinsic geometry information of light fields, in this paper we propose an end-to-end learning-based approach aiming at angularly super-resolving a sparsely-sampled light field with a large baseline. Our model consists of two learnable modules and a physically-based module. Specifically, it includes a depth estimation module for explicitly modeling the scene geometry, a physically-based warping module for novel view synthesis, and a light field blending module specifically designed for light field reconstruction. Moreover, we introduce a novel loss function to promote the preservation of the light field parallax structure. Experimental results over various light field datasets including large baseline light field images demonstrate the significant superiority of our method when compared with state-of-the-art ones, i.e., our method improves the PSNR over the second-best method by up to 2 dB on average, while reducing the execution time by 48$\times$. In addition, our method preserves the light field parallax structure better.
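
The physically-based warping module can be sketched as reprojecting a source view to a novel viewpoint with the estimated disparity. The snippet below is a minimal, illustrative version for a horizontally displaced view and is not the paper's implementation.

    import numpy as np

    def warp_view(src_view, disparity, baseline_ratio):
        # Forward-warp a source sub-aperture image to a novel view whose horizontal
        # position differs by `baseline_ratio` view units, using per-pixel disparity.
        h, w = disparity.shape
        warped = np.zeros_like(src_view)
        ys, xs = np.mgrid[0:h, 0:w]
        new_xs = np.round(xs + baseline_ratio * disparity).astype(int)
        valid = (new_xs >= 0) & (new_xs < w)
        warped[ys[valid], new_xs[valid]] = src_view[ys[valid], xs[valid]]
        return warped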

Posted Content
TL;DR: In this paper, the complementary information among views is explored with combinatorial geometry embedding, and a regularization network trained over a structure-aware loss function is subsequently applied to enforce correct parallax relationships over the intermediate estimation, preserving the parallax structure among the reconstructed views.
Abstract: Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution as the limited sampling resources have to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF views and are insufficient in accurately preserving the parallax structure of the scene. In this paper, we propose a novel learning-based LF spatial SR framework, in which each view of an LF image is first individually super-resolved by exploring the complementary information among views with combinatorial geometry embedding. For accurate preservation of the parallax structure among the reconstructed views, a regularization network trained over a structure-aware loss function is subsequently appended to enforce correct parallax relationships over the intermediate estimation. Our proposed approach is evaluated over datasets with a large number of testing images including both synthetic and real-world scenes. Experimental results demonstrate the advantage of our approach over state-of-the-art methods, i.e., our method not only improves the average PSNR by more than 1.0 dB but also preserves more accurate parallax details, at a lower computational cost.

Posted Content
TL;DR: This paper proposes an end-to-end learning-based approach aiming at angularly super-resolving a sparsely-sampled light field with a large baseline and introduces a novel loss function to promote the preservation of the light field parallax structure.
Abstract: The acquisition of light field images with high angular resolution is costly. Although many methods have been proposed to improve the angular resolution of a sparsely-sampled light field, they always focus on the light field with a small baseline, which is captured by a consumer light field camera. By making full use of the intrinsic geometry information of light fields, in this paper we propose an end-to-end learning-based approach aiming at angularly super-resolving a sparsely-sampled light field with a large baseline. Our model consists of two learnable modules and a physically-based module. Specifically, it includes a depth estimation module for explicitly modeling the scene geometry, a physically-based warping module for novel view synthesis, and a light field blending module specifically designed for light field reconstruction. Moreover, we introduce a novel loss function to promote the preservation of the light field parallax structure. Experimental results over various light field datasets including large baseline light field images demonstrate the significant superiority of our method when compared with state-of-the-art ones, i.e., our method improves the PSNR over the second-best method by up to 2 dB on average, while reducing the execution time by 48$\times$. In addition, our method preserves the light field parallax structure better.

Journal ArticleDOI
TL;DR: A novel learning-based method is proposed, which accepts sparsely-sampled LFs with irregular structures and produces densely-sampled LFs with arbitrary angular resolution accurately and efficiently, along with a simple yet effective method for optimizing the sampling pattern.
Abstract: A densely-sampled light field (LF) is highly desirable in various applications. However, it is costly to acquire such data. Although many computational methods have been proposed to reconstruct a densely-sampled LF from a sparsely-sampled one, they still suffer from either low reconstruction quality, low computational efficiency, or the restriction on the regularity of the sampling pattern. To this end, we propose a novel learning-based method, which accepts sparsely-sampled LFs with irregular structures, and produces densely-sampled LFs with arbitrary angular resolution accurately and efficiently. We also propose a simple yet effective method for optimizing the sampling pattern. Our proposed method, an end-to-end trainable network, reconstructs a densely-sampled LF in a coarse-to-fine manner. Specifically, the coarse sub-aperture image (SAI) synthesis module first explores the scene geometry from an unstructured sparsely-sampled LF and leverages it to independently synthesize novel SAIs, in which a confidence-based blending strategy is proposed to fuse the information from different input SAIs, giving an intermediate densely-sampled LF. Then, the efficient LF refinement module learns the angular relationship within the intermediate result to recover the LF parallax structure. Comprehensive experimental evaluations demonstrate the superiority of our method on both real-world and synthetic LF images when compared with state-of-the-art methods.
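
The confidence-based blending strategy can be sketched as a per-pixel weighted average of the novel views synthesized from different input SAIs, with weights derived from predicted confidence maps; the snippet below is illustrative only.

    import numpy as np

    def blend_views(synthesized_views, confidences):
        # synthesized_views: list of (H, W, C) candidate novel views, one per input SAI.
        # confidences: list of (H, W) per-pixel confidence maps predicted by the network.
        conf = np.stack(confidences, axis=0)
        weights = np.exp(conf) / np.sum(np.exp(conf), axis=0, keepdims=True)  # softmax over candidates
        views = np.stack(synthesized_views, axis=0)
        return np.sum(weights[..., None] * views, axis=0)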

Proceedings ArticleDOI
01 Jul 2020
TL;DR: A deep learning-based framework for lossy point cloud geometry compression via a hybrid representation, in which the point cloud is adaptively decomposed into non-overlapping local patches through adaptive octree decomposition and clustering, is proposed.
Abstract: In this paper, we propose a deep learning based framework for lossy point cloud geometry compression via a hybrid representation of the point cloud. First, the input raw 3D point cloud data is adaptively decomposed into non-overlapping local patches through adaptive octree decomposition and clustering. Second, a point cloud auto-encoder network with a quantization layer is proposed for learning a compact latent feature representation from each patch. Specifically, the proposed point cloud auto-encoder networks with different input sizes are trained to achieve optimal rate-distortion (RD) performance. Finally, bitstream specifications of the proposed compression systems with additional signaled meta-data and header information are designed to support parallel decoding and successive reconstruction. Experimental results show that our proposed method can achieve an average bitrate saving of 40.20% over the existing standard Geometry-based Point Cloud Compression (G-PCC) codec.
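
A skeletal patch auto-encoder with a quantization step between encoder and decoder is sketched below; layer sizes, the quantization trick, and all names are illustrative and not taken from the paper.

    import torch
    import torch.nn as nn

    class PatchAutoEncoder(nn.Module):
        # Illustrative auto-encoder for a fixed-size patch of N points (N x 3 coordinates).
        def __init__(self, num_points=256, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(num_points * 3, 512), nn.ReLU(),
                nn.Linear(512, latent_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 512), nn.ReLU(),
                nn.Linear(512, num_points * 3),
            )

        def forward(self, patch):                      # patch: (batch, N, 3)
            z = self.encoder(patch.flatten(start_dim=1))
            if self.training:
                z_q = z + torch.rand_like(z) - 0.5     # additive noise stands in for rounding
            else:
                z_q = torch.round(z)                   # uniform quantization of the latent code
            return self.decoder(z_q).view(patch.shape)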

Journal ArticleDOI
TL;DR: A novel self-training approach named Crowd-SDNet that enables a typical object detector trained only with point-level annotations to estimate both the center points and sizes of crowded objects and proposes a confidence and order-aware refinement scheme to continuously refine the initial pseudo object sizes.
Abstract: In this paper, we propose a novel self-training approach which enables a typical object detector trained only with point-level annotations (i.e., objects are labeled with points) to estimate both the center points and sizes of crowded objects. Specifically, during training we utilize the available point annotations to directly supervise the estimation of the center points of objects. Based on a locally-uniform distribution assumption, we initialize pseudo object sizes from the point-level supervisory information, which are then leveraged to guide the regression of object sizes via a crowdedness-aware loss. Meanwhile, we propose a confidence and order-aware refinement scheme to continuously refine the initial pseudo object sizes such that the ability of the detector is increasingly boosted to simultaneously detect and count objects in crowds. Moreover, to address extremely crowded scenes, we propose an effective decoding method to improve the representation ability of the detector. Experimental results on the WiderFace benchmark show that our approach significantly outperforms state-of-the-art point-supervised methods under both detection and counting tasks, i.e., our method improves the average precision by more than 10% and reduces the counting error by 31.2%. In addition, our method obtains the best results on the dense crowd counting dataset (i.e., ShanghaiTech) and vehicle counting datasets (i.e., CARPK and PUCPR+) when compared with state-of-the-art counting-by-detection methods. We will make the code publicly available to facilitate future research.

Journal ArticleDOI
TL;DR: Comprehensive experimental results over two popular databases demonstrate that the proposed geometry based algorithm can estimate head poses with higher accuracy and lower run time than state-of-the-art geometry based methods.

Journal ArticleDOI
TL;DR: A novel full-reference image quality assessment (IQA) method for evaluating the quality of the distorted light field (LF) image against its reference LF image is proposed, called the log-Gabor feature-based light field coherence (LGF-LFC).
Abstract: In this paper, a novel full-reference image quality assessment (IQA) method for evaluating the quality of a distorted light field (LF) image against its reference LF image is proposed, called the log-Gabor feature-based light field coherence (LGF-LFC). Based on the observation that comparing two LF images essentially boils down to measuring how coherent they are, we attempt to measure the degree of their LF coherence (LFC). To pursue this goal, the salient features of the reference and distorted LF images under comparison need to be extracted. Considering that the Gabor feature can well characterize human visual system (HVS) perception, as well as the special characteristics of LF images, multi-scale and single-scale Gabor feature extraction schemes are developed to extract the multi-scale log-Gabor features from the sub-aperture images (SAIs) and the single-scale log-Gabor feature from the epipolar images (EPIs), respectively. Note that the former reflects the image details (via the SAIs), while the latter indicates the viewing consistency (via the EPIs' depth information). Similarity measurements are subsequently conducted on the comparison of their SAIs and that of their EPIs separately, followed by combining them to arrive at the final score. Extensive simulation results clearly demonstrate that the proposed LGF-LFC is more consistent with HVS perception of LF image quality than multiple classical and state-of-the-art IQA methods.

Journal ArticleDOI
TL;DR: This paper proposes a novel deep neural network-based framework, namely progressive zero-centric residual network (PZRes-Net), to address the problem of hyperspectral image (HSI) super-resolution that merges a low resolution HSI and a high resolution multispectral image (HR-MSI).
Abstract: This paper explores the problem of hyperspectral image (HSI) super-resolution that merges a low resolution HSI (LR-HSI) and a high resolution multispectral image (HR-MSI). The cross-modality distribution of the spatial and spectral information makes the problem challenging. Inspired by the classic wavelet decomposition-based image fusion, we propose a novel lightweight deep neural network-based framework, namely progressive zero-centric residual network (PZRes-Net), to address this problem efficiently and effectively. Specifically, PZRes-Net learns a high resolution and zero-centric residual image, which contains high-frequency spatial details of the scene across all spectral bands, from both inputs in a progressive fashion along the spectral dimension. The resulting residual image is then superimposed onto the up-sampled LR-HSI in a mean-value invariant manner, leading to a coarse HR-HSI, which is further refined by exploring the coherence across all spectral bands simultaneously. To learn the residual image efficiently and effectively, we employ spectral-spatial separable convolution with dense connections. In addition, we propose zero-mean normalization, implemented on the feature maps of each layer, to realize the zero-mean characteristic of the residual image. Extensive experiments over both real and synthetic benchmark datasets demonstrate that our PZRes-Net outperforms state-of-the-art methods to a significant extent in terms of four quantitative metrics as well as visual quality, e.g., our PZRes-Net improves the PSNR by more than 3 dB, while using 2.3$\times$ fewer parameters and consuming 15$\times$ fewer FLOPs.
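
The zero-mean normalization mentioned above can be sketched as removing the per-channel spatial mean of each feature map so that the predicted residual stays zero-centric; the function below is an illustrative version, not the paper's layer.

    import torch

    def zero_mean_normalize(feature_maps):
        # feature_maps: tensor of shape (batch, channels, height, width).
        # Subtract the per-channel spatial mean so every feature map has zero mean.
        return feature_maps - feature_maps.mean(dim=(2, 3), keepdim=True)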

Journal ArticleDOI
TL;DR: This paper presents a tile-based adaptive streaming method for 360-degree videos that preserves both the quality and the smoothness of tiles in FoV, thus providing the best QoE for users.
Abstract: The 360-degree video allows users to enjoy the whole scene by interactively switching viewports. However, the huge data volume of the 360-degree video limits its remote applications over networks. To provide a high quality of experience (QoE) for remote web users, this paper presents a tile-based adaptive streaming method for 360-degree videos. First, we propose a simple yet effective rate adaptation algorithm to determine the requested bitrate for downloading the current video segment by considering the balance between the buffer length and video quality. Then, we propose to use a Gaussian model to predict the field of view (FoV) at the beginning of each requested video segment. To deal with the circumstance that the view angle is switched during the display of a video segment, we propose to download all the tiles in the 360-degree video with different priorities based on a Zipf model. Finally, in order to allocate bitrates for all the tiles, a two-stage optimization algorithm is proposed to preserve the quality of tiles in the FoV and guarantee spatial and temporal smoothness. Experimental results demonstrate the effectiveness and advantage of the proposed method compared with the state-of-the-art methods. That is, our method preserves both the quality and the smoothness of tiles in the FoV, thus providing the best QoE for users.
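
The Zipf-model-based prioritization can be sketched as giving each tile a weight that decays with its rank, e.g., by angular distance from the predicted FoV center; the exponent and ranking rule below are illustrative.

    def zipf_tile_weights(ranked_tiles, exponent=1.0):
        # ranked_tiles: tile identifiers sorted from most to least likely to be viewed,
        # e.g., by angular distance from the predicted field-of-view center.
        raw = [1.0 / (rank ** exponent) for rank in range(1, len(ranked_tiles) + 1)]
        total = sum(raw)
        return {tile: weight / total for tile, weight in zip(ranked_tiles, raw)}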

Posted Content
TL;DR: This paper presents an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images, and develops a group consistency preserving decoder tailored for the CoSOD task.
Abstract: Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images. One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships. In this paper, we present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images. First, we integrate saliency priors into the backbone features to suppress the redundant background information through an online intra-saliency guidance structure. After that, we design a two-stage aggregate-and-distribute architecture to explore group-wise semantic interactions and produce the co-saliency features. In the first stage, we propose a group-attentional semantic aggregation module that models inter-image relationships to generate the group-wise semantic representations. In the second stage, we propose a gated group distribution module that adaptively distributes the learned group semantics to different individuals in a dynamic gating mechanism. Finally, we develop a group consistency preserving decoder tailored for the CoSOD task, which maintains group constraints during feature decoding to predict more consistent full-resolution co-saliency maps. The proposed CoADNet is evaluated on four prevailing CoSOD benchmark datasets, which demonstrates the remarkable performance improvement over ten state-of-the-art competitors.

Journal ArticleDOI
TL;DR: The first large-scale video quality assessment (VQA) database specifically for the screen content videos (SCVs) is constructed and the proposed spatiotemporal Gabor feature tensor-based model (SGFTM) consistently outperforms multiple classical and state-of-the-art image/video quality assessment models.
Abstract: In this article, we make the first attempt to study the subjective and objective quality assessment of screen content videos (SCVs). For that, we construct the first large-scale video quality assessment (VQA) database specifically for SCVs, called the screen content video database (SCVD). This SCVD provides 16 reference SCVs, 800 distorted SCVs, and their corresponding subjective scores, and it is made publicly available for research usage. The distorted SCVs are generated from each reference SCV with 10 distortion types and 5 degradation levels for each distortion type. Each distorted SCV is rated by at least 32 subjects in the subjective test. Furthermore, we propose the first full-reference VQA model for SCVs, called the spatiotemporal Gabor feature tensor-based model (SGFTM), to objectively evaluate the perceptual quality of the distorted SCVs. This is motivated by the observation that the 3D-Gabor filter can well simulate the visual functions of the human visual system (HVS) in perceiving videos, being more sensitive to the edge and motion information that are often encountered in SCVs. Specifically, the proposed SGFTM exploits 3D-Gabor filters to individually extract spatiotemporal Gabor feature tensors from the reference and distorted SCVs, followed by measuring their similarities and later combining them through the developed spatiotemporal feature tensor pooling strategy to obtain the final SGFTM score. Experimental results on SCVD show that the proposed SGFTM yields a high consistency with the subjective perception of SCV quality and consistently outperforms multiple classical and state-of-the-art image/video quality assessment models.

Proceedings ArticleDOI
12 Oct 2020
TL;DR: This paper proposes a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input from two complementary and parallel perspectives, to reconstruct high-resolution light field images from hybrid lenses.
Abstract: This paper explores the problem of reconstructing high-resolution light field (LF) images from hybrid lenses, including a high-resolution camera surrounded by multiple low-resolution cameras. To tackle this challenge, we propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input from two complementary and parallel perspectives. Specifically, one module regresses a spatially consistent intermediate estimation by learning a deep multidimensional and cross-domain feature representation; the other one constructs another intermediate estimation, which maintains the high-frequency textures, by propagating the information of the high-resolution view. We finally leverage the advantages of the two intermediate estimations via the learned attention maps, leading to the final high-resolution LF image. Extensive experiments demonstrate the significant superiority of our approach over state-of-the-art ones. That is, our method not only improves the PSNR by more than 2 dB, but also preserves the LF structure much better. To the best of our knowledge, this is the first end-to-end deep learning method for reconstructing a high-resolution LF image with a hybrid input. We believe our framework could potentially decrease the cost of high-resolution LF data acquisition and also be beneficial to LF data storage and transmission. The code is available at https://github.com/jingjin25/LFhybridSR-Fusion.
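
The fusion of the two intermediate estimations via learned attention maps can be sketched as a per-pixel convex combination; the names below are illustrative.

    import torch

    def fuse_estimations(consistent_est, detail_est, attention):
        # consistent_est: spatially consistent intermediate LF estimation.
        # detail_est: texture-preserving intermediate estimation propagated from the HR view.
        # attention: learned per-pixel map in [0, 1] selecting between the two.
        return attention * consistent_est + (1.0 - attention) * detail_est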

Journal ArticleDOI
TL;DR: In this paper, an ensemble rate adaptation framework for dynamic adaptive streaming over HTTP (DASH) is proposed, which aims to leverage the advantages of multiple rate adaptation methods involved in the framework to improve the quality of experience (QoE ) of users.
Abstract: Rate adaptation is one of the most important issues in dynamic adaptive streaming over HTTP (DASH). Due to the frequent fluctuations of the network bandwidth and complex variations of video content, it is difficult to deal with the varying network conditions and video content perfectly by using a single rate adaptation method. In this paper, we propose an ensemble rate adaptation framework for DASH, which aims to leverage the advantages of multiple methods involved in the framework to improve the quality of experience (QoE) of users. The proposed framework is simple yet very effective. Specifically, the proposed framework is composed of two modules, i.e., the method pool and the method controller. In the method pool, several rate adaptation methods are integrated. At each decision time, only the method that can achieve the best QoE is chosen to determine the bitrate of the requested video segment. Besides, we also propose two strategies for switching methods, i.e., InstAnt Method Switching and InterMittent Method Switching, for the method controller to determine which method can provide the best QoE. Simulation results demonstrate that the proposed framework always achieves the highest QoE for the change of channel environment and video complexity, compared with state-of-the-art rate adaptation methods.
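
The method-pool/controller idea can be sketched as querying every candidate rate adaptation method at a decision time and keeping the bitrate proposed by the one with the highest predicted QoE; the interface below is illustrative.

    def choose_bitrate(methods, state, predict_qoe):
        # methods: rate adaptation callables, each mapping the current streaming state
        # (buffer length, throughput estimate, ...) to a candidate bitrate.
        # predict_qoe: estimates the QoE obtained if a given bitrate is requested in `state`.
        candidates = [method(state) for method in methods]
        return max(candidates, key=lambda bitrate: predict_qoe(state, bitrate))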

Journal ArticleDOI
TL;DR: A novel prediction module, namely graph prediction, is proposed, in which a small number of representative points selected from previously encoded clusters are used to predict the points to be encoded by exploring the underlying graph structure constructed from the geometry information.
Abstract: 3D point clouds associated with attributes are considered as a promising data representation for immersive communication. The large amount of data, however, poses great challenges to the subsequent transmission and storage processes. In this letter, we propose a new compression scheme for the color attribute of static voxelized 3D point clouds. Specifically, we first partition the colors of a 3D point cloud into clusters by applying a k-d tree to the geometry information, which are then successively encoded. To eliminate the redundancy, we propose a novel prediction module, namely graph prediction, in which a small number of representative points selected from previously encoded clusters are used to predict the points to be encoded by exploring the underlying graph structure constructed from the geometry information. Furthermore, the prediction residuals are transformed with the graph transform, and the resulting transform coefficients are finally uniformly quantized and entropy encoded. Experimental results show that the proposed compression scheme is able to achieve better rate-distortion performance at a lower computational cost when compared with state-of-the-art methods.
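
The graph transform applied to the prediction residuals can be sketched as projecting them onto the eigenvectors of a graph Laplacian built from the geometry; the fully connected graph with Gaussian edge weights used below is illustrative.

    import numpy as np

    def graph_transform(residuals, points, sigma=1.0):
        # Build a graph over the points of a cluster with Gaussian weights on pairwise
        # distances, then project the residuals onto the eigenvectors of the resulting
        # graph Laplacian (the graph transform basis).
        d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
        W = np.exp(-d2 / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(axis=1)) - W
        _, basis = np.linalg.eigh(L)      # eigenvectors ordered by increasing eigenvalue
        return basis.T @ residuals        # transform coefficients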

Posted Content
Yuheng Jia1, Hui Liu1, Junhui Hou1, Sam Kwong1, Qingfu Zhang1 
TL;DR: A novel structured tensor low-rank norm tailored to multi-view spectral clustering (MVSC) is designed, which outperforms state-of-the-art methods to a significant extent and is able to produce perfect clustering.
Abstract: This paper explores the problem of multi-view spectral clustering (MVSC) based on tensor low-rank modeling. Unlike the existing methods that all adopt an off-the-shelf tensor low-rank norm without considering the special characteristics of the tensor in MVSC, we design a novel structured tensor low-rank norm tailored to MVSC. Specifically, we explicitly impose a symmetric low-rank constraint and a structured sparse low-rank constraint on the frontal and horizontal slices of the tensor to characterize the intra-view and inter-view relationships, respectively. Moreover, the two constraints could be jointly optimized to achieve mutual refinement. On the basis of the novel tensor low-rank norm, we formulate MVSC as a convex low-rank tensor recovery problem, which is then efficiently solved with an augmented Lagrange multiplier based method iteratively. Extensive experimental results on five benchmark datasets show that the proposed method outperforms state-of-the-art methods to a significant extent. Impressively, our method is able to produce perfect clustering. In addition, the parameters of our method can be easily tuned, and the proposed model is robust to different datasets, demonstrating its potential in practice.

Journal ArticleDOI
TL;DR: To maximize the reconstructed quality of 3D point cloud, the bit allocation problem is formulated as a constrained optimization problem and solved by an interior point method and the rate-distortion performance is close to that obtained with exhaustive search but at only 0.68% of its time complexity.
Abstract: Rate-distortion optimization plays a very important role in image/video coding. However, for 3D point clouds, this problem has not been investigated. In this paper, the rate and distortion characteristics of 3D point clouds are investigated in detail, and a typical and challenging rate-distortion optimization problem is solved for 3D point clouds. Specifically, since the quality of the reconstructed 3D point cloud depends on both the geometry and color distortions, we first propose analytical rate and distortion models for the geometry and color information in the video-based 3D point cloud compression platform, and then solve the joint bit allocation problem for geometry and color based on the derived models. To maximize the reconstructed quality of the 3D point cloud, the bit allocation problem is formulated as a constrained optimization problem and solved by an interior point method. Experimental results show that the rate-distortion performance of the proposed solution is close to that obtained with exhaustive search but at only 0.68% of its time complexity. Moreover, the proposed rate and distortion models can also be used for other rate-distortion optimization problems (such as prediction mode decision) and rate control technologies for 3D point cloud coding in the future.
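
With illustrative notation, the joint bit allocation can be written as the constrained problem

$\min_{R_g,\,R_c}\ D_g(R_g)+\omega\,D_c(R_c)\quad\text{s.t.}\quad R_g+R_c\le R_T,$

where $D_g$ and $D_c$ follow the derived analytical rate-distortion models for geometry and color, $\omega$ weights the two distortions in the overall reconstruction quality, and $R_T$ is the target bit budget; the exact formulation used in the paper may differ.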

Journal ArticleDOI
TL;DR: This article proposes a novel PCP model via dual adversarial manifold regularization to fully explore the potential of the limited initial PCs by propagating MLs and CLs with two separated variables, called similarity and dissimilarity matrices, under the guidance of the graph structure constructed from data samples.
Abstract: Pairwise constraints (PCs) composed of must-links (MLs) and cannot-links (CLs) are widely used in many semisupervised tasks. Due to the limited number of PCs, pairwise constraint propagation (PCP) has been proposed to augment them. However, the existing PCP algorithms only adopt a single matrix to contain all the information, which overlooks the differences between the two types of links such that the discriminability of the propagated PCs is compromised. To this end, this article proposes a novel PCP model via dual adversarial manifold regularization to fully explore the potential of the limited initial PCs. Specifically, we propagate MLs and CLs with two separated variables, called similarity and dissimilarity matrices, under the guidance of the graph structure constructed from data samples. At the same time, the adversarial relationship between the two matrices is taken into consideration. The proposed model is formulated as a nonnegative constrained minimization problem, which can be efficiently solved with convergence theoretically guaranteed. We conduct extensive experiments to evaluate the proposed model, including propagation effectiveness and applications on constrained clustering and metric learning, all of which validate the superior performance of our model to state-of-the-art PCP models.

Book ChapterDOI
23 Aug 2020
TL;DR: The proposed method incorporates the measurement observation into the deep learning framework elegantly to avoid relying entirely on data-driven priors for LF reconstruction, and constructs the regularization term with an efficient deep spatial-angular convolutional sub-network to comprehensively explore the signal distribution.
Abstract: Coded aperture is a promising approach for capturing the 4-D light field (LF), in which the 4-D data are compressively modulated into 2-D coded measurements that are further decoded by reconstruction algorithms. The bottleneck lies in the reconstruction algorithms, resulting in rather limited reconstruction quality. To tackle this challenge, we propose a novel learning-based framework for the reconstruction of high-quality LFs from acquisitions via learned coded apertures. The proposed method incorporates the measurement observation into the deep learning framework elegantly to avoid relying entirely on data-driven priors for LF reconstruction. Specifically, we first formulate the compressive LF reconstruction as an inverse problem with an implicit regularization term. Then, we construct the regularization term with an efficient deep spatial-angular convolutional sub-network to comprehensively explore the signal distribution free from the limited representation ability and inefficiency of deterministic mathematical modeling. Experimental results show that the reconstructed LFs not only achieve much higher PSNR/SSIM but also preserve the LF parallax structure better, compared with state-of-the-art methods on both real and synthetic LF benchmarks. In addition, experiments show that our method is efficient and robust to noise, which is an essential advantage for a real camera system. The code is publicly available at https://github.com/angmt2008/LFCA.
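
With illustrative notation, the regularized inverse problem reads

$\min_{x}\ \|y-\Phi x\|_2^2+\lambda\,\mathcal{R}(x),$

where $y$ is the 2-D coded measurement, $\Phi$ is the coded-aperture measurement operator, $x$ is the 4-D LF to be recovered, and $\mathcal{R}(\cdot)$ is the implicit regularization term realized by the deep spatial-angular convolutional sub-network.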