
Showing papers by "Zhibo Chen" published in 2019


Proceedings ArticleDOI
15 Jun 2019
TL;DR: A densely semantically aligned person re-identification framework is proposed, built on a two-stream network that consists of a main full image stream (MF-Stream) and a densely semantically-aligned guiding stream (DSAG-Stream).
Abstract: We propose a densely semantically aligned person re-identification (re-ID) framework. It fundamentally addresses the body misalignment problem caused by pose/viewpoint variations, imperfect person detection, occlusion, etc. By leveraging the estimation of the dense semantics of a person image, we construct a set of densely semantically aligned part images (DSAP-images), where the same spatial positions have the same semantics across different person images. We design a two-stream network that consists of a main full image stream (MF-Stream) and a densely semantically-aligned guiding stream (DSAG-Stream). The DSAG-Stream, with the DSAP-images as input, acts as a regulator to guide the MF-Stream to learn densely semantically aligned features from the original image. During inference, the DSAG-Stream is discarded and only the MF-Stream is needed, which makes the inference system computationally efficient and robust. To the best of our knowledge, we are the first to make use of fine-grained semantics for addressing misalignment problems in re-ID. Our method achieves rank-1 accuracy of 78.9% (new protocol) on the CUHK03 dataset, 90.4% on the CUHK01 dataset, and 95.7% on the Market1501 dataset, outperforming state-of-the-art methods.
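
To make the two-stream idea concrete, here is a minimal PyTorch sketch of the training-time guidance: a stand-in guiding stream regularizes the main stream's features and is simply not called at inference. The backbones, the MSE-style alignment loss, and all names/dimensions are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Sketch of a guided two-stream design: a main full-image stream (MF)
    and a guiding stream (DSAG) fed with semantically aligned part images."""
    def __init__(self, feat_dim=256, num_ids=751):
        super().__init__()
        self.mf_stream = nn.Sequential(  # stand-in for a deep backbone
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.dsag_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, full_img, dsap_imgs=None):
        f_main = self.mf_stream(full_img)
        if dsap_imgs is None:          # inference: guiding stream discarded
            return self.classifier(f_main), f_main
        f_guide = self.dsag_stream(dsap_imgs)
        return self.classifier(f_main), f_main, f_guide

# Training: the guiding stream regularizes the main stream's features.
model = TwoStreamReID()
full, parts = torch.randn(4, 3, 256, 128), torch.randn(4, 3, 256, 128)
labels = torch.randint(0, 751, (4,))
logits, f_main, f_guide = model(full, parts)
loss = nn.functional.cross_entropy(logits, labels) \
     + nn.functional.mse_loss(f_main, f_guide)  # alignment-style guidance
loss.backward()
```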

233 citations


Journal ArticleDOI
TL;DR: This paper proposes effective and efficient end-to-end convolutional neural network models, with an hourglass shape, for spatially super-resolving LF images; the hourglass structure allows feature extraction to be performed at the low-resolution level to save both computational and memory costs.
Abstract: Light field (LF) photography is an emerging paradigm for capturing more immersive representations of the real world. However, arising from the inherent tradeoff between the angular and spatial dimensions, the spatial resolution of LF images captured by commercial micro-lens-based LF cameras is significantly constrained. In this paper, we propose effective and efficient end-to-end convolutional neural network models for spatially super-resolving LF images. Specifically, the proposed models have an hourglass shape, which allows feature extraction to be performed at the low-resolution level to save both computational and memory costs. To make full use of the 4D structure information of LF data in both the spatial and angular domains, we propose to use 4D convolution to characterize the relationship among pixels. Moreover, as an approximation of 4D convolution, we also propose to use spatial-angular separable (SAS) convolutions for more computationally and memory-efficient extraction of spatial-angular joint features. Extensive experimental results on 57 test LF images with various challenging natural scenes show significant advantages of the proposed models over the state-of-the-art methods. That is, an average PSNR gain of more than 3.0 dB and better visual quality are achieved, and our methods better preserve the LF structure of the super-resolved LF images, which is highly desirable for subsequent applications. In addition, the SAS convolution-based model achieves a threefold speed-up with only a negligible decrease in reconstruction quality compared with the 4D convolution-based one. The source code of our method is available online.
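
As an illustration of how a spatial-angular separable convolution can approximate a 4D one, here is a minimal PyTorch sketch that folds the angular dimensions into the batch for a 2D spatial convolution, then folds the spatial dimensions into the batch for a 2D angular convolution. The 6-D tensor layout and layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SASConv(nn.Module):
    """Spatial-angular separable convolution: a 2D conv over (H, W) followed
    by a 2D conv over the angular dims (U, V), approximating a full 4D conv."""
    def __init__(self, in_ch, out_ch, k_spatial=3, k_angular=3):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, k_spatial, padding=k_spatial // 2)
        self.angular = nn.Conv2d(out_ch, out_ch, k_angular, padding=k_angular // 2)

    def forward(self, lf):                       # lf: (B, C, U, V, H, W)
        b, c, u, v, h, w = lf.shape
        x = lf.reshape(b * u * v, c, h, w)       # fold angular dims into batch
        x = self.spatial(x)
        co = x.shape[1]
        x = x.reshape(b, u, v, co, h, w).permute(0, 4, 5, 3, 1, 2)
        x = x.reshape(b * h * w, co, u, v)       # fold spatial dims into batch
        x = self.angular(x)
        x = x.reshape(b, h, w, co, u, v).permute(0, 3, 4, 5, 1, 2)
        return x                                 # (B, C_out, U, V, H, W)

lf = torch.randn(1, 3, 5, 5, 32, 32)             # toy 5x5 LF of 32x32 views
print(SASConv(3, 16)(lf).shape)                  # torch.Size([1, 16, 5, 5, 32, 32])
```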

138 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed StereoQA-Net outperforms state-of-the-art algorithms on both symmetrically and asymmetrically distorted stereoscopic image pairs of various distortion types and can effectively predict the perceptual quality of local regions.
Abstract: The goal of objective stereoscopic image quality assessment (SIQA) is to predict the human perceptual quality of stereoscopic/3D images automatically and accurately. Compared with traditional 2D image quality assessment, the quality assessment of stereoscopic images is more challenging because of complex binocular vision mechanisms and multiple quality dimensions. In this paper, inspired by the hierarchical dual-stream interactive nature of the human visual system, we propose a stereoscopic image quality assessment network (StereoQA-Net) for no-reference stereoscopic image quality assessment. The proposed StereoQA-Net is an end-to-end dual-stream interactive network containing left and right view sub-networks, where the interaction of the two sub-networks exists in multiple layers. We evaluate our method on the LIVE stereoscopic image quality databases. The experimental results show that our proposed StereoQA-Net outperforms state-of-the-art algorithms on both symmetrically and asymmetrically distorted stereoscopic image pairs of various distortion types. In a more general case, the proposed StereoQA-Net can effectively predict the perceptual quality of local regions. In addition, cross-dataset experiments also demonstrate the generalization ability of our algorithm.

77 citations


Proceedings ArticleDOI
TL;DR: A full-resolution residual network (FRRN) is proposed to fill irregular holes and is shown to be effective for progressive image inpainting; its well-designed residual architecture facilitates feature integration and texture prediction.
Abstract: Recently, learning-based algorithms for image inpainting have achieved remarkable progress in dealing with squared or irregular holes. However, they fail to generate plausible textures inside the damaged area because surrounding information is lacking. A progressive inpainting approach would be advantageous for eliminating central blurriness, i.e., restoring well and then updating masks. In this paper, we propose a full-resolution residual network (FRRN) to fill irregular holes, which proves effective for progressive image inpainting. We show that a well-designed residual architecture facilitates feature integration and texture prediction. Additionally, to guarantee completion quality during progressive inpainting, we adopt an N Blocks, One Dilation strategy, which assigns several residual blocks to one dilation step. Correspondingly, a step loss function is applied to improve the performance of intermediate restorations. The experimental results demonstrate that the proposed FRRN framework for image inpainting outperforms previous methods both quantitatively and qualitatively. Our codes are released at: \url{this https URL}.
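
A minimal sketch of the restore-then-update-masks loop with a step loss, assuming a toy stand-in network and mask dilation via max-pooling; the real FRRN's residual architecture and dilation schedule are richer than this.

```python
import torch
import torch.nn as nn

class ToyFRRN(nn.Module):
    """Stand-in for a full-resolution residual block (hypothetical)."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, ch, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, 1, 1))

    def forward(self, x):
        return self.body(x)

def progressive_inpaint(net, image, mask, steps=4, dilate=5):
    """Restore-then-update-masks loop with a step loss on each intermediate
    result; the known region grows by mask dilation after every step."""
    dilator = nn.MaxPool2d(dilate, stride=1, padding=dilate // 2)
    known, x, step_losses = mask, image * mask, []   # mask: 1 = known pixel
    for _ in range(steps):
        pred = net(torch.cat([x, known], dim=1))
        x = image * known + pred * (1 - known)       # trust only hole predictions
        new_known = dilator(known)                   # hole boundary moves inward
        step_losses.append(nn.functional.l1_loss(x * new_known, image * new_known))
        known = new_known
    return x, sum(step_losses)

img = torch.rand(1, 3, 64, 64)
msk = torch.ones(1, 1, 64, 64); msk[..., 16:48, 16:48] = 0
out, loss = progressive_inpaint(ToyFRRN(), img, msk)
```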

74 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: An Unsupervised Deraining Generative Adversarial Network (UD-GAN) is proposed, which introduces self-supervised constraints from the intrinsic statistics of unpaired rainy and clean images.
Abstract: Most existing single image deraining methods require learning supervised models from a large set of paired synthetic training data, which limits their generality and practicality in real-world multimedia applications. Besides, due to the lack of supervised constraints from labels, directly applying existing unsupervised frameworks to the image deraining task suffers from low-quality recovery. Therefore, we propose an Unsupervised Deraining Generative Adversarial Network (UD-GAN) to tackle the above problems by introducing self-supervised constraints from the intrinsic statistics of unpaired rainy and clean images. Specifically, we design two collaboratively optimized modules, namely the Rain Guidance Module (RGM) and the Background Guidance Module (BGM), to take full advantage of rainy image characteristics. UD-GAN outperforms state-of-the-art methods on various benchmarking datasets in both quantitative and qualitative comparisons.

46 citations


Journal ArticleDOI
TL;DR: A learned fast HEVC intra coding (LFHI) framework is proposed, which takes into account the comprehensive factors of fast intra coding to reach an improved configurable tradeoff between coding performance and computational complexity.
Abstract: In High Efficiency Video Coding (HEVC), excellent rate-distortion (RD) performance is achieved in part by having a flexible quadtree coding unit (CU) partition and a large number of intra-prediction modes. Such excellent RD performance is achieved at the expense of much higher computational complexity. In this paper, we propose a learned fast HEVC intra coding (LFHI) framework that takes into account the comprehensive factors of fast intra coding to reach an improved configurable tradeoff between coding performance and computational complexity. First, we design a low-complexity shallow asymmetric-kernel CNN (AK-CNN) to efficiently extract the local directional texture features of each block for both fast CU partition and fast intra-mode decision. Second, we introduce the concept of the minimum number of RDO candidates (MNRC) into fast mode decision, which utilizes the AK-CNN to predict the minimum number of best candidates for RDO calculation to further reduce the computation of intra-mode selection. Third, an evolution optimized threshold decision (EOTD) scheme is designed to achieve configurable complexity-efficiency tradeoffs. Finally, we propose an interpolation-based prediction scheme that allows our framework to be generalized to all quantization parameters (QPs) without the need to train the network on each QP. The experimental results demonstrate that the LFHI framework has a high degree of parallelism and achieves a much better complexity-efficiency tradeoff, achieving up to 75.2% intra-mode encoding complexity reduction with negligible rate-distortion performance degradation, superior to existing fast intra-coding schemes.
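
The asymmetric-kernel idea can be sketched as parallel 1xk and kx1 convolutions that cheaply capture horizontal and vertical texture directionality. The block below is a toy PyTorch illustration of that idea (channel counts, kernel size, and the two-class head are assumptions), not the paper's AK-CNN.

```python
import torch
import torch.nn as nn

class AKBlock(nn.Module):
    """Asymmetric-kernel idea: parallel 1xk and kx1 convolutions pick up
    horizontal and vertical texture directionality at low cost."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.horiz = nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, k // 2))
        self.vert = nn.Conv2d(in_ch, out_ch, (k, 1), padding=(k // 2, 0))

    def forward(self, x):
        return torch.relu(torch.cat([self.horiz(x), self.vert(x)], dim=1))

# A shallow classifier over a luma block, e.g. to predict split / no-split.
net = nn.Sequential(
    AKBlock(1, 8), nn.MaxPool2d(2),
    AKBlock(16, 16), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2))                            # 2 classes: split vs. keep
print(net(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 2])
```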

39 citations


Journal ArticleDOI
TL;DR: The results show that ImmerTai noticeably accelerates students' learning compared to non-immersive learning with a conventional PC setup, and that the quality of the learnt motion in the CAVE and HMD setups differs substantially from that on the PC.

39 citations


Posted Content
TL;DR: A Relation-Aware Global Attention (RGA) module is proposed to capture global structural information for better attention learning; it significantly enhances feature representation power and helps achieve state-of-the-art performance on several popular benchmarks.
Abstract: For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well the key of re-id, i.e., discriminative feature learning. Previous approaches typically learn attention using local convolutions, ignoring the mining of knowledge from global structure patterns. Intuitively, the affinities among spatial positions/nodes in the feature map provide clustering-like information and are helpful for inferring semantics and thus attention, especially for person images where the feasible human poses are constrained. In this work, we propose an effective Relation-Aware Global Attention (RGA) module which captures the global structural information for better attention learning. Specifically, for each feature position, in order to compactly grasp the structural information of global scope and local appearance information, we propose to stack the relations, i.e., its pairwise correlations/affinities with all the feature positions (e.g., in raster scan order), and the feature itself together to learn the attention with a shallow convolutional model. Extensive ablation studies demonstrate that our RGA can significantly enhance the feature representation power and help achieve the state-of-the-art performance on several popular benchmarks. The source code is available at this https URL.
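
To illustrate the relation-stacking idea, the sketch below computes pairwise affinities among all spatial positions of a feature map, stacks each position's relation vectors with an embedded feature, and learns attention from them with a shallow convolutional model. All sizes and the exact stacking order are illustrative assumptions, not the paper's RGA specification.

```python
import torch
import torch.nn as nn

class SpatialRGA(nn.Module):
    """Relation-aware global attention (spatial): for each position, stack its
    pairwise affinities with all N positions and learn attention from them."""
    def __init__(self, in_ch, embed_ch=8, n_pos=64):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, embed_ch, 1)
        self.to_attn = nn.Sequential(          # shallow model over stacked relations
            nn.Conv1d(2 * n_pos + embed_ch, 32, 1), nn.ReLU(),
            nn.Conv1d(32, 1, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        e = self.embed(x).flatten(2)           # (B, E, N), N = H*W
        rel = torch.einsum('ben,bem->bnm', e, e)          # pairwise affinities
        stacked = torch.cat([rel, rel.transpose(1, 2).contiguous(),
                             e.transpose(1, 2)], dim=2)   # (B, N, 2N + E)
        attn = self.to_attn(stacked.transpose(1, 2))      # (B, 1, N)
        return x * attn.reshape(b, 1, h, w)

x = torch.randn(2, 64, 8, 8)                   # N = 64 positions
print(SpatialRGA(64, n_pos=64)(x).shape)       # torch.Size([2, 64, 8, 8])
```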

38 citations


Journal ArticleDOI
TL;DR: A learning-based Facial Image Compression framework is proposed, with a novel Regionally Adaptive Pooling module whose parameters can be automatically optimized according to gradient feedback from an integrated hybrid semantic fidelity metric, including a successful exploration of applying a Generative Adversarial Network (GAN) directly as the metric in an image compression scheme.

38 citations


Proceedings ArticleDOI
15 Oct 2019
TL;DR: A full-resolution residual network (FRRN) is proposed to fill irregular holes, shown to be effective for progressive image inpainting, together with an N Blocks, One Dilation strategy that assigns several residual blocks to one dilation step.
Abstract: Recently, learning-based algorithms for image inpainting have achieved remarkable progress in dealing with squared or irregular holes. However, they fail to generate plausible textures inside the damaged area because surrounding information is lacking. A progressive inpainting approach would be advantageous for eliminating central blurriness, i.e., restoring well and then updating masks. In this paper, we propose a full-resolution residual network (FRRN) to fill irregular holes, which proves effective for progressive image inpainting. We show that a well-designed residual architecture facilitates feature integration and texture prediction. Additionally, to guarantee completion quality during progressive inpainting, we adopt an N Blocks, One Dilation strategy, which assigns several residual blocks to one dilation step. Correspondingly, a step loss function is applied to improve the performance of intermediate restorations. The experimental results demonstrate that the proposed FRRN framework for image inpainting outperforms previous methods both quantitatively and qualitatively.

36 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: Tensor theory is adopted to explore the 4D structure characteristics of the LF, and the first Blind quality Evaluator of LIght Field image (BELIF) is proposed, which outperforms existing image quality assessment algorithms.
Abstract: With the development of immersive media, Light Field Image (LFI) quality assessment is becoming more and more important, as it helps to better guide light field acquisition, processing and application. However, almost all existing LFI quality assessment schemes utilize 2D or 3D quality assessment methods while ignoring the intrinsic high-dimensional characteristics of the LFI. Therefore, we adopt tensor theory to explore the 4D structure characteristics of the LF and propose the first Blind quality Evaluator of LIght Field image (BELIF). We generate a cyclopean image tensor from the original LFI, and then features are extracted by Tucker decomposition. Specifically, Tensor Spatial Characteristic Features (TSCF) for spatial quality and a Tensor Structure Variation Index (TSVI) for angular consistency are designed to fully assess the LFI quality. Extensive experimental results on the public LFI databases demonstrate that BELIF significantly outperforms the existing image quality assessment algorithms.
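
For readers unfamiliar with Tucker decomposition, the sketch below shows how a view-stacked tensor can be factorized with the tensorly library and how simple statistics of the core and factor matrices could serve as features. The specific features here are assumptions for illustration, not the paper's TSCF/TSVI definitions.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Toy stand-in for a cyclopean image tensor: 25 angular views of a 64x64
# grayscale image stacked along the first mode (views x height x width).
lf_tensor = tl.tensor(np.random.rand(25, 64, 64))

# Tucker decomposition compresses each mode; the core tensor and factor
# matrices then serve as a compact basis for quality-aware features.
core, factors = tucker(lf_tensor, rank=[5, 16, 16])

# Example feature ideas (assumptions, not the paper's exact TSCF/TSVI):
spatial_energy = np.asarray([np.linalg.norm(core[i]) for i in range(5)])
angular_factor = factors[0]                     # (25, 5) view-mixing matrix
consistency = np.std(angular_factor, axis=0)    # variation across views
print(spatial_energy.shape, consistency.shape)  # (5,) (5,)
```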

Posted Content
TL;DR: An unpaired image-to-image translation framework, Domain-supervised GAN (DosGAN), which takes a first step towards the exploration of explicit domain supervision by pre-training a classification network to explicitly classify the domain of an image.
Abstract: Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs). However, existing approaches are mostly designed in an unsupervised manner, while little attention has been paid to domain information within unpaired data. In this paper, we treat domain information as explicit supervision and design an unpaired image-to-image translation framework, Domain-supervised GAN (DosGAN), which takes a first step towards the exploration of explicit domain supervision. In contrast to representing domain characteristics with different generators or domain codes, we pre-train a classification network to explicitly classify the domain of an image. After pre-training, this network is used to extract the domain-specific features of each image. Such features, together with the domain-independent features extracted by another encoder (shared across different domains), are used to generate an image in the target domain. Extensive experiments on multiple facial attribute translation, multiple identity translation, multiple season translation and conditional edges-to-shoes/handbags demonstrate the effectiveness of our method. In addition, we can transfer the domain-specific feature extractor obtained on the Facescrub dataset with domain supervision information to unseen domains, such as faces in the CelebA dataset. We also succeed in achieving conditional translation with any two images in CelebA, while previous models like StarGAN cannot handle this task.

Proceedings ArticleDOI
08 Jul 2019
TL;DR: A learned scalable/progressive image compression scheme based on deep neural networks (DNN), named Bidirectional Context Disentanglement Network (BCD-Net), which outperforms the state-of-the-art DNN-based scalable image compression methods in both PSNR and MS-SSIM metrics.
Abstract: In this paper, we propose a learned scalable/progressive image compression scheme based on deep neural networks (DNN), named Bidirectional Context Disentanglement Network (BCD-Net). For learning hierarchical representations, we first adopt bit-plane decomposition to decompose the information coarsely before the deep-learning-based transformation. However, the information carried by different bit-planes is not only unequal in entropy but also of different importance for reconstruction. We thus take the hidden features corresponding to different bit-planes as the context and design a network topology with bidirectional flows to disentangle the contextual information for more effective compressed representations. Our proposed scheme enables us to obtain the compressed codes with scalable rates via one-pass encoding-decoding. Experimental results demonstrate that our proposed model outperforms the state-of-the-art DNN-based scalable image compression methods in both PSNR and MS-SSIM metrics. In addition, our proposed model achieves better performance in the MS-SSIM metric than conventional scalable image codecs. The effectiveness of our technical components is also verified through extensive ablation experiments.
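
Bit-plane decomposition itself is straightforward; a minimal NumPy sketch of the coarse-to-fine idea (decoding from only the most significant planes gives a coarser reconstruction):

```python
import numpy as np

def bit_planes(img_u8):
    """Decompose an 8-bit image into 8 binary planes, MSB first. Coarse
    information (MSBs) carries most energy; LSB planes refine it."""
    planes = [(img_u8 >> b) & 1 for b in range(7, -1, -1)]
    return np.stack(planes)                    # (8, H, W), values in {0, 1}

def from_bit_planes(planes):
    """Reconstruct from the first k planes; fewer planes = coarser image."""
    k = planes.shape[0]
    weights = 2 ** np.arange(7, 7 - k, -1)
    return (planes * weights[:, None, None]).sum(0).astype(np.uint8)

img = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
planes = bit_planes(img)
assert np.array_equal(from_bit_planes(planes), img)   # all 8 planes: lossless
coarse = from_bit_planes(planes[:4])                  # scalable: 4 MSB planes
```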

Proceedings ArticleDOI
08 Jul 2019
TL;DR: This paper reduces the decision space of 360SRL from exponential to linear by introducing a sequential ABR decision structure, making it feasible to employ RL, and compares 360SRL to state-of-the-art ABR algorithms in trace-driven experiments.
Abstract: Tile-based 360-degree video (360 video) streaming, employed with adaptive bitrate (ABR) algorithms, is a promising approach to offer high video quality of experience (QoE) within limited network bandwidth. Existing ABR algorithms, however, fail to achieve optimal performance in real-world fluctuating network conditions as they heavily rely on unbiased bandwidth predictions. Recently, reinforcement learning (RL) has shown promising potential in generating better ABR algorithms in 2D video streaming. However, unlike existing work in 2D video streaming, directly applying RL in tile-based 360 video streaming is infeasible due to the resulting exponential decision space. To overcome these limitations, we propose 360SRL, an improved ABR algorithm employing sequential RL. Firstly, we reduce the decision space of 360SRL from exponential to linear by introducing a sequential ABR decision structure, thus making it feasible to employ RL. Secondly, instead of relying on accurate bandwidth predictions, 360SRL learns to make ABR decisions solely through observations of the resulting QoE performance of past decisions. Finally, we compare 360SRL to state-of-the-art ABR algorithms using trace-driven experiments. The experiment results demonstrate that 360SRL outperforms state-of-the-art algorithms with around 12% improvement in average QoE.
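
The exponential-to-linear reduction is easy to see in code: a joint decision over all tiles has LEVELS ** TILES actions, while a sequential structure makes one per-tile decision at a time. The sketch below uses toy numbers and a hypothetical policy function, not 360SRL's learned policy.

```python
import itertools
import random

TILES, LEVELS = 8, 4          # 8 viewport tiles, 4 bitrate levels per tile

def joint_action_space():
    """Naive joint decision: LEVELS ** TILES combinations (exponential)."""
    return itertools.product(range(LEVELS), repeat=TILES)

def sequential_decision(policy, state):
    """Sequential structure: pick one tile's bitrate at a time, feeding each
    choice back into the state, so the agent faces TILES x LEVELS decisions."""
    chosen = []
    for tile in range(TILES):
        level = policy(state, tile, chosen)   # one linear-sized decision
        chosen.append(level)
    return chosen

# Toy policy (assumption): favor high bitrate for early (viewport) tiles.
toy_policy = lambda s, t, c: random.randrange(LEVELS) if t > 2 else LEVELS - 1
print(len(list(joint_action_space())))        # 65536 joint actions
print(sequential_decision(toy_policy, state={"bandwidth": 10.0}))
```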

Journal ArticleDOI
TL;DR: This paper proposes a feedback-enhancement multi-branch CNN (FM-CNN), which takes three derivatives of an image as input and leverages the advantages of hierarchical details, feedback enhancement, model averaging, and stronger robustness to translation and mirroring.
Abstract: Vehicle type recognition (VTR) is a common requirement and one of the key challenges in real surveillance scenarios, such as intelligent traffic and unmanned driving. Usually coarse-grained and fine-grained VTRs are applied in different applications, and the challenge from multiple viewpoints is critical in both cases. In this paper, we propose a feedback-enhancement multi-branch CNN (FM-CNN) to solve the challenge in these two cases. The proposed FM-CNN takes three derivatives of an image as input and leverages the advantages of hierarchical details, feedback enhancement, model averaging, and stronger robustness to translation and mirroring. A single global cross-entropy loss is insufficient to train such a complex CNN, so we add extra branch losses to enhance the feedback to each branch. While reusing pre-trained parameters, we propose a novel parameter update method to adapt FM-CNN to task-specific local visual patterns and global information in new datasets. To test the effectiveness of FM-CNN, we create our own multi-view VTR (MVVTR) data set, since no such data sets are available; for fine-grained VTR, we use the CompCars data set. Compared with state-of-the-art classification solutions without special preprocessing, the proposed FM-CNN demonstrates better performance in both coarse-grained and fine-grained scenarios. For coarse-grained VTR, it achieves 94.9% Top-1 accuracy on the MVVTR data set. For fine-grained VTR, it achieves 91.0% Top-1 and 97.8% Top-5 accuracies on the CompCars data set.
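
A minimal sketch of the branch-loss idea: each branch gets its own classifier and cross-entropy term in addition to the global loss over fused features. Branch contents, class count, and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

def branch(in_ch):
    """One toy branch: conv features plus its own classifier head."""
    feat = nn.Sequential(nn.Conv2d(in_ch, 16, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
    head = nn.Linear(16, 10)                   # 10 vehicle types (toy)
    return feat, head

# Three derivatives of one image (the exact inputs are the paper's, not
# fixed here) each get feedback from their own loss.
feats_heads = [branch(3) for _ in range(3)]
fusion = nn.Linear(16 * 3, 10)

def forward_loss(views, label):
    feats, branch_losses = [], 0.0
    for (feat, head), v in zip(feats_heads, views):
        f = feat(v)
        branch_losses = branch_losses + nn.functional.cross_entropy(head(f), label)
        feats.append(f)
    global_loss = nn.functional.cross_entropy(fusion(torch.cat(feats, 1)), label)
    return global_loss + branch_losses        # global CE + per-branch feedback

views = [torch.randn(2, 3, 64, 64) for _ in range(3)]
loss = forward_loss(views, torch.randint(0, 10, (2,)))
loss.backward()
```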

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A multi-viewport based full-reference stereo 360 IQA model is proposed that achieves a significant improvement over some well-known IQA metrics and can accurately reflect the overall QoE of perceived images.
Abstract: Objective quality assessment of stereoscopic panoramic images becomes a challenging problem owing to the rapid growth of 360-degree contents. Different from traditional 2D image quality assessment (IQA), more complex aspects are involved in 3D omnidirectional IQA, especially the unlimited field of view (FoV) and extra depth perception, which makes it difficult to evaluate the quality of experience (QoE) of 3D omnidirectional images. In this paper, we propose a multi-viewport based full-reference stereo 360 IQA model. Because viewports change freely when browsing in a head-mounted display, our proposed approach processes the image inside the FoV rather than the projected one, such as the equirectangular projection (ERP). In addition, since overall QoE depends on both image quality and depth perception, we utilize features estimated from the difference map between the left and right views, which reflects disparity. The depth perception features, along with binocular image qualities, are employed to further predict the overall QoE of 3D 360 images. The experimental results on our public Stereoscopic OmnidirectionaL Image quality assessment Database (SOLID) show that the proposed method achieves a significant improvement over some well-known IQA metrics and can accurately reflect the overall QoE of perceived images.

Proceedings ArticleDOI
26 May 2019
TL;DR: This paper proposes a specialized Asymmetric-Kernel CNN (AK-CNN) for fast CTU and PU (prediction unit) partition prediction, which is superior to existing fast partition algorithms.
Abstract: High Efficiency Video Coding (HEVC) has high encoding complexity due to its sophisticated coding tree unit (CTU) partitioning with recursive rate-distortion optimization (RDO) procedures. In this paper, we propose a specialized Asymmetric-Kernel CNN (AK-CNN) for fast CTU and PU (prediction unit) partition prediction. Shallow network structures with asymmetric horizontal and vertical convolution kernels are designed to precisely extract the texture features of each block with much lower complexity. We establish our own dataset with complete CTU partition patterns together with their RD-costs for network training. A confidence threshold decision scheme is designed for the PU partition part to achieve the best trade-off between coding performance and complexity reduction. Experimental results demonstrate that our approach achieves 69.8% intra-mode encoding complexity reduction with negligible rate-distortion performance degradation, superior to existing fast partition algorithms.

Posted Content
TL;DR: In this article, a no-reference light field image quality assessment (NR-LFQA) scheme is proposed to quantify the LFI quality degradation through evaluating the spatial quality and angular consistency.
Abstract: Light field image quality assessment (LFI-QA) is a significant and challenging research problem. It helps to better guide light field acquisition, processing and applications. However, only a few objective models have been proposed and none of them completely consider the intrinsic factors affecting LFI quality. In this paper, we propose a No-Reference Light Field image Quality Assessment (NR-LFQA) scheme, where the main idea is to quantify the LFI quality degradation by evaluating the spatial quality and angular consistency. We first measure the spatial quality deterioration by capturing the naturalness distribution of the light field cyclopean image array, which is formed when a human observes the LFI. Then, as a transformed representation of the LFI, the Epipolar Plane Image (EPI) contains lines whose slopes encode the angular information. Therefore, the EPI is utilized to extract global and local features from the LFI to measure angular consistency degradation. Specifically, the distribution of the gradient direction map of the EPI is proposed to measure the global angular consistency distortion in the LFI. We further propose a weighted local binary pattern to capture the characteristics of local angular consistency degradation. Extensive experimental results on four publicly available LFI quality datasets demonstrate that the proposed method outperforms state-of-the-art 2D, 3D, multi-view, and LFI quality assessment algorithms.
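
For intuition, the sketch below slices a horizontal EPI from a toy 4D light field and computes a gradient-direction histogram over it; the indexing convention and the histogram feature are assumptions illustrating the idea, not the paper's exact global/local descriptors.

```python
import numpy as np

def horizontal_epi(lf, u0, y0):
    """Slice an Epipolar Plane Image from a 4D LF indexed as [u, v, y, x]:
    fixing the vertical angular and spatial coordinates leaves a (v, x)
    plane in which scene points trace lines whose slope encodes depth."""
    return lf[u0, :, y0, :]                   # shape (V, W)

def gradient_direction_hist(epi, bins=16):
    """Histogram of gradient directions; angular inconsistency distorts the
    line structure of the EPI and hence this distribution (an illustration
    of the idea, not the paper's exact feature)."""
    gy, gx = np.gradient(epi.astype(np.float64))
    theta = np.arctan2(gy, gx)                # direction in [-pi, pi]
    hist, _ = np.histogram(theta, bins=bins, range=(-np.pi, np.pi))
    return hist / hist.sum()

lf = np.random.rand(5, 5, 32, 32)             # toy 5x5 LF of 32x32 views
print(gradient_direction_hist(horizontal_epi(lf, 2, 16)))
```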

Journal ArticleDOI
TL;DR: A new deep convolutional neural network is proposed to improve image classification using extra light-field angular information by replacing the fully connected layer of a VGG network with a set of interleaved spatial-angular filters, thus providing more accurate classification performance over traditional models.
Abstract: Image classification is a well-studied problem. However, challenges remain for some special categories of images. This paper proposes a new deep convolutional neural network to improve image classification using extra light-field angular information. The proposed network model employs transfer learning by replacing the fully connected layer of a VGG network with a set of interleaved spatial-angular filters. The resulting model takes advantage of both the spatial and angular information of light-field images (LFIs), thus providing more accurate classification performance than traditional models. To evaluate the proposed network model, we established a light-field image dataset, currently consisting of 560 captured LFIs divided into 11 labeled categories. Based on this dataset, our experimental results show that the proposed LFI model yields an average of 92% classification accuracy, as opposed to 84% from the model using traditional 2D images and 85% from the model using stereo pair images. In particular, on classifying challenging objects such as the “screen” images, the proposed LFI model demonstrates significant improvements of 16% and 12% over the 2D image model and the stereo image model, respectively.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper proposes a learning based Semantically Structured Coding (SSC) framework to generate a Semantically Structured Bit-stream (SSB), where each part of the bit-stream represents a certain object and can be directly used for the aforementioned tasks.
Abstract: With the development of 5G and edge computing, it is increasingly important to offload intelligent media computing to edge devices. Traditional media coding schemes code the media into one binary stream without a semantic structure, which prevents many important intelligent applications from operating directly at the bit-stream level, including semantic analysis, parsing specific content, media editing, etc. Therefore, in this paper, we propose a learning based Semantically Structured Coding (SSC) framework to generate a Semantically Structured Bit-stream (SSB), where each part of the bit-stream represents a certain object and can be directly used for the aforementioned tasks. Specifically, we integrate an object detection module in our compression framework to locate and align the object in the feature domain. After applying quantization and entropy coding, the features are re-organized according to the detected and aligned objects to form a bit-stream. Besides, different from existing learning-based compression schemes that individually train models for specific bit-rates, we share most of the model parameters among various bit-rates to significantly reduce the model size for variable-rate compression. Experimental results demonstrate that, at the cost of only negligible overhead, objects can be completely reconstructed from a partial bit-stream. We also verify that classification and pose estimation can be performed directly on the partial bit-stream without performance degradation.
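
A toy sketch of the bit-stream structuring idea: coded feature chunks are grouped per detected object, and a byte-range index lets a decoder fetch a single object's part of the stream. All data structures and names here are hypothetical simplifications of the SSB concept.

```python
def structure_bitstream(coded_features, detections):
    """coded_features: dict mapping feature-map coordinates -> byte chunks.
    detections: list of (label, set_of_coordinates) from a detector.
    Returns the concatenated stream plus an object -> byte-range index."""
    stream, index = bytearray(), {}
    for label, coords in detections:
        start = len(stream)
        for xy in sorted(coords):
            stream += coded_features[xy]
        index[label] = (start, len(stream))   # object -> byte range
    return bytes(stream), index

coded = {(x, y): bytes([x * 16 + y]) for x in range(4) for y in range(4)}
dets = [("person", {(0, 0), (0, 1)}), ("car", {(2, 2), (3, 3)})]
stream, index = structure_bitstream(coded, dets)
start, end = index["car"]
print(stream[start:end])                      # decode just the "car" bytes
```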

Posted Content
TL;DR: A Semantics Aligning Network (SAN) is proposed, which consists of a base network as encoder and a decoder for reconstructing/regressing the densely semantically aligned full texture image.
Abstract: Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representations through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN), which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantically aligned full texture image. We jointly train the SAN under the supervision of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as perceptual losses. The decoder is discarded at inference and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve state-of-the-art performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID. Code for our proposed method is available at: this https URL.
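
A minimal sketch of the joint objective described above: re-ID classification plus texture reconstruction, with triplet constraints applied over decoder feature maps as perceptual losses. The batch layout, margin, and weighting are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def san_losses(logits, labels, recon, target_texture, dec_feats):
    """Sketch of the joint objective: re-ID classification + texture-image
    reconstruction + triplet constraints over decoder feature maps as
    perceptual losses (margins/weights are illustrative assumptions)."""
    id_loss = F.cross_entropy(logits, labels)
    recon_loss = F.l1_loss(recon, target_texture)
    trip_loss = 0.0
    for f in dec_feats:                        # f: (B, C, H, W), B = 3k
        anchor, pos, neg = f.chunk(3)          # batch laid out as (a, p, n)
        trip_loss = trip_loss + F.triplet_margin_loss(
            anchor.flatten(1), pos.flatten(1), neg.flatten(1), margin=0.3)
    return id_loss + recon_loss + 0.1 * trip_loss

logits = torch.randn(6, 751); labels = torch.randint(0, 751, (6,))
recon = torch.rand(6, 3, 64, 32); target = torch.rand(6, 3, 64, 32)
feats = [torch.randn(6, 16, 8, 4)]
print(san_losses(logits, labels, recon, target, feats))
```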

Posted Content
TL;DR: ZstGAN introduces an adversarial training scheme to model each domain with a domain-specific feature distribution that is semantically consistent across the vision and attribute modalities; domain-invariant features are then disentangled with a shared encoder for image generation.
Abstract: Image-to-image translation models have shown remarkable ability in transferring images among different domains. Most existing work follows the setting that the source domain and target domain are the same at training and inference phases, which cannot be generalized to scenarios where an image is translated from an unseen domain to another unseen domain. In this work, we propose the Unsupervised Zero-Shot Image-to-image Translation (UZSIT) problem, which aims to learn a model that can transfer translation knowledge from seen domains to unseen domains. Accordingly, we propose a framework called ZstGAN: by introducing an adversarial training scheme, ZstGAN learns to model each domain with a domain-specific feature distribution that is semantically consistent across the vision and attribute modalities. Then the domain-invariant features are disentangled with a shared encoder for image generation. We carry out extensive experiments on the CUB and FLO datasets, and the results demonstrate the effectiveness of the proposed method on the UZSIT task. Moreover, ZstGAN shows significant accuracy improvements over state-of-the-art zero-shot learning methods on CUB and FLO.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: Inspired by the success of deliberation networks in natural language processing, the deliberation process is extended to the field of image translation by considering an additional polishing step on the output image.
Abstract: Image-to-image translation, which transfers an image from a source domain to a target one, has attracted much attention in both academia and industry. The major approach is to adopt an encoder-decoder based framework, where the encoder extracts features from the input image and then the decoder decodes the features and generates an image in the target domain as the output. In this paper, we go beyond this learning framework by considering an additional polishing step on the output image. Polishing an image is very common in daily life, such as editing and beautifying a photo in Photoshop after taking/generating it with a digital camera. Such a deliberation process is shown to be very helpful and important in practice, and thus we believe it will also be helpful for image translation. Inspired by the success of deliberation networks in natural language processing, we extend the deliberation process to the field of image translation. We verify our proposed method on four two-domain translation tasks and one multi-domain translation task. Both the qualitative and quantitative results demonstrate the effectiveness of our method.

Posted Content
05 Apr 2019
TL;DR: This paper proposes an effective Relation-Aware Global Attention module for CNNs to fully exploit the global correlations to infer the attention and demonstrates the general applicability of RGA to vision tasks by applying it to the scene segmentation and image classification tasks resulting in consistent performance improvement.
Abstract: Attention mechanism aims to increase the representation power by focusing on important features and suppressing unnecessary ones. For convolutional neural networks (CNNs), attention is typically learned with local convolutions, which ignores the global information and the hidden relation. How to efficiently exploit the long-range context to globally learn attention is underexplored. In this paper, we propose an effective Relation-Aware Global Attention (RGA) module for CNNs to fully exploit the global correlations to infer the attention. Specifically, when computing the attention at a feature position, in order to grasp information of global scope, we propose to stack the relations, i.e., its pairwise correlations/affinities with all the feature positions, and the feature itself together for learning the attention with convolutional operations. Given an intermediate feature map, we have validated the effectiveness of this design across both the spatial and channel dimensions. When applied to the task of person re-identification, our model achieves the state-of-the-art performance. Extensive ablation studies demonstrate that our RGA can significantly enhance the feature representation power. We further demonstrate the general applicability of RGA to vision tasks by applying it to the scene segmentation and image classification tasks resulting in consistent performance improvement.

Posted Content
TL;DR: A spatial region-wise normalization named Region Normalization (RN) is proposed, which divides spatial pixels into different regions according to the input mask and computes the mean and variance in each region for normalization.
Abstract: Feature Normalization (FN) is an important technique to help neural network training, which typically normalizes features across spatial dimensions. Most previous image inpainting methods apply FN in their networks without considering the impact of the corrupted regions of the input image on normalization, e.g. mean and variance shifts. In this work, we show that the mean and variance shifts caused by full-spatial FN limit the image inpainting network training, and we propose a spatial region-wise normalization named Region Normalization (RN) to overcome the limitation. RN divides spatial pixels into different regions according to the input mask, and computes the mean and variance in each region for normalization. We develop two kinds of RN for our image inpainting network: (1) Basic RN (RN-B), which normalizes pixels from the corrupted and uncorrupted regions separately based on the original inpainting mask to solve the mean and variance shift problem; (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted regions for separate normalization, and performs a global affine transformation to enhance their fusion. We apply RN-B in the early layers and RN-L in the later layers of the network. Experiments show that our method outperforms current state-of-the-art methods quantitatively and qualitatively. We further generalize RN to other inpainting networks and achieve consistent performance improvements.
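
The core computation of basic region normalization is compact enough to sketch directly: mean and variance are computed separately over the masked (corrupted) and unmasked regions. This omits RN's learnable affine parameters and the RN-L variant.

```python
import torch

def region_norm(x, mask, eps=1e-5):
    """Basic region normalization sketch: normalize corrupted and uncorrupted
    pixels separately using per-region mean/variance (no learned affine)."""
    out = torch.zeros_like(x)
    for region in (mask, 1 - mask):            # 1 = known, 0 = hole
        cnt = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (x * region).sum(dim=(2, 3), keepdim=True) / cnt
        var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / cnt
        out = out + region * (x - mean) / torch.sqrt(var + eps)
    return out

x = torch.randn(2, 8, 32, 32)
mask = torch.ones(2, 1, 32, 32); mask[..., 8:24, 8:24] = 0
y = region_norm(x, mask)
# Known-region statistics are now unaffected by the hole contents.
print(y.mean().item(), y.std().item())
```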

Posted Content
TL;DR: A stereoscopic omnidirectional image quality evaluator (SOIQE) is proposed, based on the predictive coding theory of the human vision system, to cope with the characteristics of 3D 360-degree images.
Abstract: Objective quality assessment of stereoscopic omnidirectional images is a challenging problem since it is influenced by multiple aspects such as projection deformation, field of view (FoV) range, binocular vision, visual comfort, etc. Existing studies show that classic 2D or 3D image quality assessment (IQA) metrics are not able to perform well for stereoscopic omnidirectional images. However, very few research works have focused on evaluating the perceptual visual quality of omnidirectional images, especially stereoscopic omnidirectional images. In this paper, based on the predictive coding theory of the human vision system (HVS), we propose a stereoscopic omnidirectional image quality evaluator (SOIQE) to cope with the characteristics of 3D 360-degree images. Two modules are involved in SOIQE: a predictive coding theory based binocular rivalry module and a multi-view fusion module. In the binocular rivalry module, we introduce predictive coding theory to simulate the competition between high-level patterns and calculate the similarity and rivalry dominance to obtain the quality scores of viewport images. Moreover, we develop the multi-view fusion module to aggregate the quality scores of viewport images with the help of both content weight and location weight. The proposed SOIQE is a parametric model without the need for regression learning, which ensures its interpretability and generalization performance. Experimental results on our published stereoscopic omnidirectional image quality assessment database (SOLID) demonstrate that our proposed SOIQE method outperforms state-of-the-art metrics. Furthermore, we also verify the effectiveness of each proposed module on both public stereoscopic image datasets and panoramic image datasets.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A transformer-based decorrelation unit (DU) is designed and adopted in a scalable image compression framework to reduce the redundancy of feature representations at different levels; the framework outperforms state-of-the-art DNN-based and conventional scalable image codecs in terms of MS-SSIM.
Abstract: Scalable image compression allows complete images to be reconstructed through partial decoding. It plays an important role in image transmission and storage. In this paper, we study the problem of feature decorrelation for Deep Neural Network (DNN) based image codecs. Inspired by the self-attention mechanism [1], we design a transformer-based decorrelation unit (DU) and adopt it in our scalable image compression framework to reduce the redundancy of feature representations at different levels. Experimental results demonstrate that the proposed framework outperforms the state-of-the-art DNN-based scalable image codec and conventional scalable image codecs in terms of MS-SSIM. We also conduct ablation experiments which explicitly verify the effectiveness of the decorrelation unit in our scheme.

Proceedings ArticleDOI
01 May 2019
TL;DR: A new kind of loss, multi-path consistency loss, is introduced, which evaluates the differences between the direct translation $\mathcal{D}_s\to\mathcal{D}_t$ and the indirect translation $\mathcal{D}_s\to\mathcal{D}_a\to\mathcal{D}_t$, to regularize training.
Abstract: Image translation across different domains has attracted much attention in both machine learning and computer vision communities. Taking the translation from source domain $\mathcal{D}_s$ to target domain $\mathcal{D}_t$ as an example, existing algorithms mainly rely on two kinds of loss for training: one is the discrimination loss, which is used to differentiate images generated by the models from natural images; the other is the reconstruction loss, which measures the difference between an original image and the reconstructed version through $\mathcal{D}_s\to\mathcal{D}_t\to\mathcal{D}_s$ translation. In this work, we introduce a new kind of loss, multi-path consistency loss, which evaluates the differences between the direct translation $\mathcal{D}_s\to\mathcal{D}_t$ and the indirect translation $\mathcal{D}_s\to\mathcal{D}_a\to\mathcal{D}_t$ with $\mathcal{D}_a$ as an auxiliary domain, to regularize training. For multi-domain translation (with at least three domains), which focuses on building translation models between any two domains, at each training iteration we randomly select three domains, set them respectively as the source, auxiliary and target domains, build the multi-path consistency loss and optimize the network. For two-domain translation, we need to introduce an additional auxiliary domain and construct the multi-path consistency loss. We conduct various experiments to demonstrate the effectiveness of our proposed methods, including face-to-face translation, paint-to-photo translation, and de-raining/de-noising translation.
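
The loss itself is simple to write down: penalize the difference between the direct translation and the two-hop translation through an auxiliary domain. The sketch below uses a deliberately trivial stand-in translator G so it runs; in practice G would be the conditional translation network, and the L1 distance is an assumption.

```python
import torch
import torch.nn.functional as F

def multipath_consistency_loss(G, x_s, s, a, t):
    """Sketch of the multi-path consistency idea: the direct translation
    s -> t should agree with the indirect one s -> a -> t. G(x, src, dst)
    is a hypothetical conditional translator shared across domains."""
    direct = G(x_s, s, t)
    indirect = G(G(x_s, s, a), a, t)
    return F.l1_loss(direct, indirect)

# Toy translator: a learnable per-domain-pair bias, just to make this runnable.
bias = {}
def G(x, src, dst):
    key = (src, dst)
    if key not in bias:
        bias[key] = torch.zeros(1, requires_grad=True)
    return x + bias[key]

x = torch.rand(2, 3, 32, 32)
loss = multipath_consistency_loss(G, x, s="photo", a="sketch", t="paint")
loss.backward()
```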

Posted Content
TL;DR: Asynchronous Episodic Deep Deterministic Policy Gradient (AE-DDPG) is proposed as an extension of DDPG that achieves more effective learning with less training time required.
Abstract: Deep Deterministic Policy Gradient (DDPG) has proved to be a successful reinforcement learning (RL) algorithm for continuous control tasks. However, DDPG still suffers from data insufficiency and training inefficiency, especially in computationally complex environments. In this paper, we propose Asynchronous Episodic DDPG (AE-DDPG), an extension of DDPG which can achieve more effective learning with less training time required. First, we design a modified scheme for data collection in an asynchronous fashion. Generally, for asynchronous RL algorithms, sample efficiency and/or training stability diminish as the degree of parallelism increases. We consider this problem from the perspectives of both data generation and data utilization. In detail, we re-design experience replay by introducing the idea of episodic control so that the agent can latch on to good trajectories rapidly. In addition, we inject a new type of noise in action space to enrich the exploration behaviors. Experiments demonstrate that our AE-DDPG achieves higher rewards and requires less training time than most popular RL algorithms on the Learning to Run task, which has a computationally complex environment. Not limited to control tasks in computationally complex environments, AE-DDPG also achieves higher rewards and a 2- to 4-fold improvement in sample efficiency on average compared to other variants of DDPG in MuJoCo environments. Furthermore, we verify the effectiveness of each proposed technique component through extensive ablation studies.
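
A sketch of the episodic-control-flavored replay idea: transitions from high-return episodes go into a small elite buffer that is oversampled, so good trajectories are latched onto quickly. The threshold and mixing ratio are illustrative assumptions, not AE-DDPG's exact scheme (which also involves asynchronous collection and action-space noise).

```python
import random
from collections import deque

class EpisodicReplay:
    """Replay with an elite buffer for transitions from high-return episodes;
    sampling mixes elite and ordinary transitions (ratios are illustrative)."""
    def __init__(self, cap=100_000, elite_frac=0.3):
        self.main, self.elite = deque(maxlen=cap), deque(maxlen=cap // 10)
        self.elite_frac, self.best_return = elite_frac, float("-inf")

    def add_episode(self, transitions, episode_return):
        self.main.extend(transitions)
        self.best_return = max(self.best_return, episode_return)
        if episode_return >= 0.8 * self.best_return:   # "good" trajectory
            self.elite.extend(transitions)

    def sample(self, batch_size):
        n_elite = min(int(batch_size * self.elite_frac), len(self.elite))
        batch = random.sample(self.elite, n_elite)
        batch += random.sample(self.main, batch_size - n_elite)
        return batch

buf = EpisodicReplay()
buf.add_episode([("s", "a", 1.0, "s2")] * 50, episode_return=42.0)
print(len(buf.sample(8)))
```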

Posted Content
TL;DR: This paper proposes a No-Reference Light Field image Quality assessment model based on the MLI (LF-QMLI), which utilizes the Global Entropy Distribution and the Uniform Local Binary Pattern descriptor (ULBP) to extract features from the MLI and pools them together to measure angular consistency.
Abstract: Light field image quality assessment (LF-IQA) plays a significant role due to its guidance for Light Field (LF) content acquisition, processing and applications. The LF can be represented as a 4-D signal, and its quality depends on both angular consistency and spatial quality. However, few existing LF-IQA methods concentrate on effects caused by angular inconsistency. In particular, no-reference methods lack effective utilization of 2-D angular information. In this paper, we focus on measuring the 2-D angular consistency for LF-IQA. The Micro-Lens Image (MLI) refers to the angular domain of the LF image, which can simultaneously record the angular information in both horizontal and vertical directions. Since the MLI contains 2-D angular information, we propose a No-Reference Light Field image Quality assessment model based on the MLI (LF-QMLI). Specifically, we first utilize the Global Entropy Distribution (GED) and the Uniform Local Binary Pattern descriptor (ULBP) to extract features from the MLI, and then pool them together to measure angular consistency. In addition, the information entropy of the Sub-Aperture Image (SAI) is adopted to measure spatial quality. Extensive experimental results show that LF-QMLI achieves state-of-the-art performance.