
Showing papers on "Real image published in 2019"


Proceedings ArticleDOI
02 May 2019
TL;DR: SinGAN, an unconditional generative model that can be learned from a single natural image, is introduced, trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image.
Abstract: We introduce SinGAN, an unconditional generative model that can be learned from a single natural image. Our model is trained to capture the internal distribution of patches within the image, and is then able to generate high quality, diverse samples that carry the same visual content as the image. SinGAN contains a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image. In contrast to previous single image GAN schemes, our approach is not limited to texture images, and is not conditional (i.e. it generates samples from noise). User studies confirm that the generated samples are commonly confused with real images. We illustrate the utility of SinGAN in a wide range of image manipulation tasks.

660 citations
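
To make the multi-scale scheme concrete, here is a minimal sketch of SinGAN-style coarse-to-fine sampling in PyTorch; the trained per-scale generators `gens`, the noise amplitudes `noise_amps`, the generator call signature, and the scale factor are assumptions, not the paper's exact configuration.

```python
# A minimal sketch of SinGAN-style sampling, climbing the generator pyramid
# from the coarsest scale to the finest. All names are illustrative.
import torch
import torch.nn.functional as F

def sample(gens, noise_amps, coarse_shape, scale_factor=4 / 3):
    """Generate one sample; gens and noise_amps are ordered coarsest first."""
    b, c, h, w = coarse_shape
    x = torch.zeros(coarse_shape)
    for i, (g, amp) in enumerate(zip(gens, noise_amps)):
        if i > 0:  # move to the next (finer) scale
            h, w = round(h * scale_factor), round(w * scale_factor)
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        z = amp * torch.randn(b, c, h, w)  # fresh noise at every scale
        x = g(x + z) + x  # each generator adds a residual refinement
    return x
```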


Posted Content
TL;DR: InterFaceGAN explores the disentanglement between various semantics and manages to decouple some entangled semantics with subspace projection, leading to more precise control of facial attributes, including gender, age, expression, and the presence of eyeglasses.
Abstract: Despite the recent advances of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, there is still limited understanding of how GANs are able to map a latent code sampled from a random distribution to a photo-realistic image. Previous work assumes the latent space learned by GANs follows a distributed representation but observes the vector arithmetic phenomenon. In this work, we propose a novel framework, called InterFaceGAN, for semantic face editing by interpreting the latent semantics learned by GANs. In this framework, we conduct a detailed study on how different semantics are encoded in the latent space of GANs for face synthesis. We find that the latent code of well-trained generative models actually learns a disentangled representation after linear transformations. We explore the disentanglement between various semantics and manage to decouple some entangled semantics with subspace projection, leading to more precise control of facial attributes. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even vary the face pose as well as fix the artifacts accidentally generated by GAN models. The proposed method is further applied to achieve real image manipulation when combined with GAN inversion methods or some encoder-involved models. Extensive results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable facial attribute representation.

426 citations
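
The editing operation itself is simple once a semantic direction is found. The sketch below illustrates the idea with assumed names (`latents`, `labels`, and the SVM-based direction fit are illustrative choices, not the paper's exact recipe): fit a linear boundary for a binary attribute, move latent codes along its unit normal, and project one direction onto the orthogonal complement of another to decouple entangled attributes.

```python
# A minimal sketch of InterFaceGAN-style latent editing, assuming latent codes
# labeled with a binary attribute.
import numpy as np
from sklearn import svm

def attribute_direction(latents, labels):
    """Fit a linear boundary; its unit normal is the attribute direction."""
    clf = svm.LinearSVC(C=1.0).fit(latents, labels)
    n = clf.coef_[0]
    return n / np.linalg.norm(n)

def edit(z, n, alpha):
    """Move a latent code along the direction (alpha > 0 strengthens the attribute)."""
    return z + alpha * n

def decouple(n1, n2):
    """Subspace projection: make direction n1 orthogonal to an entangled n2."""
    n = n1 - (n1 @ n2) * n2
    return n / np.linalg.norm(n)
```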


Journal ArticleDOI
TL;DR: The authors' method can accomplish mismatch removal from thousands of putative correspondences in only a few milliseconds, achieving better or favorably competitive accuracy while cutting time cost by more than two orders of magnitude.
Abstract: Seeking reliable correspondences between two feature sets is a fundamental and important task in computer vision. This paper attempts to remove mismatches from given putative image feature correspondences. To this end, an efficient approach, termed locality preserving matching (LPM), is designed; its principle is to maintain the local neighborhood structures of potential true matches. We formulate the problem as a mathematical model and derive a closed-form solution with linearithmic time and linear space complexity. Our method can accomplish mismatch removal from thousands of putative correspondences in only a few milliseconds. To demonstrate the generality of our strategy for handling image matching problems, extensive experiments on various real image pairs are conducted for general feature matching, as well as for point set registration, visual homing, and near-duplicate image retrieval. Compared with other state-of-the-art alternatives, our LPM achieves better or favorably competitive accuracy while cutting time cost by more than two orders of magnitude.

416 citations
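
The locality-preserving principle can be illustrated in a few lines: a putative match is kept when its k nearest neighbors in the first image largely reappear among its neighbors in the second. The sketch below conveys the idea only; it is not the paper's closed-form solver, and `k` and `tau` are assumed parameters.

```python
# An illustrative neighborhood-consistency filter inspired by LPM.
import numpy as np
from scipy.spatial import cKDTree

def lpm_filter(pts1, pts2, k=6, tau=0.5):
    """pts1, pts2: (N, 2) matched keypoint coordinates; returns a boolean mask."""
    nn1 = cKDTree(pts1).query(pts1, k=k + 1)[1][:, 1:]  # drop the self-match
    nn2 = cKDTree(pts2).query(pts2, k=k + 1)[1][:, 1:]
    keep = np.zeros(len(pts1), dtype=bool)
    for i in range(len(pts1)):
        overlap = len(set(nn1[i]) & set(nn2[i]))
        keep[i] = overlap / k >= tau  # neighborhood structure is preserved
    return keep
```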


Proceedings ArticleDOI
15 Oct 2019
TL;DR: Zheng et al. propose a lightweight information multi-distillation network (IMDN) built from cascaded information multi-distillation blocks (IMDB), each containing distillation and selective fusion parts.
Abstract: In recent years, single image super-resolution (SISR) methods using deep convolutional neural networks (CNNs) have achieved impressive results. Thanks to the powerful representation capabilities of deep networks, numerous previous methods can learn the complex non-linear mapping between low-resolution (LR) image patches and their high-resolution (HR) versions. However, excessive convolutions limit the application of super-resolution technology on devices with low computing power. Moreover, super-resolution at arbitrary scale factors is a critical issue in practical applications that has not been well solved by previous approaches. To address these issues, we propose a lightweight information multi-distillation network (IMDN) by constructing cascaded information multi-distillation blocks (IMDB), each containing distillation and selective fusion parts. Specifically, the distillation module extracts hierarchical features step-by-step, and the fusion module aggregates them according to the importance of candidate features, which is evaluated by the proposed contrast-aware channel attention mechanism. To process real images of any size, we develop an adaptive cropping strategy (ACS) to super-resolve block-wise image patches using the same well-trained model. Extensive experiments suggest that the proposed method performs favorably against state-of-the-art SR algorithms in terms of visual quality, memory footprint, and inference time. Code is available at https://github.com/Zheng222/IMDN.

386 citations
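
A minimal PyTorch sketch of one information multi-distillation block may help: each step splits the features, retains a distilled slice, and forwards the remainder, and a 1x1 convolution fuses the retained slices. The contrast-aware attention is omitted here, and the channel counts are assumptions.

```python
# A simplified IMDB-style block with channel splitting and 1x1 fusion.
import torch
import torch.nn as nn

class IMDBlock(nn.Module):
    def __init__(self, ch=64, distill_ratio=0.25):
        super().__init__()
        self.d = int(ch * distill_ratio)  # distilled channels per step
        self.r = ch - self.d              # remaining channels
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(self.r, ch, 3, padding=1)
        self.c3 = nn.Conv2d(self.r, ch, 3, padding=1)
        self.c4 = nn.Conv2d(self.r, self.d, 3, padding=1)
        self.fuse = nn.Conv2d(self.d * 4, ch, 1)  # selective fusion
        self.act = nn.LeakyReLU(0.05)

    def forward(self, x):
        d1, r1 = torch.split(self.act(self.c1(x)), [self.d, self.r], dim=1)
        d2, r2 = torch.split(self.act(self.c2(r1)), [self.d, self.r], dim=1)
        d3, r3 = torch.split(self.act(self.c3(r2)), [self.d, self.r], dim=1)
        d4 = self.act(self.c4(r3))
        out = self.fuse(torch.cat([d1, d2, d3, d4], dim=1))
        return out + x  # local residual connection
```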


Proceedings ArticleDOI
15 Jun 2019
TL;DR: In this paper, the authors propose a technique to "unprocess" images by inverting each step of an image processing pipeline, thereby allowing them to synthesize realistic raw sensor measurements from commonly available Internet photos.
Abstract: Machine learning techniques work best when the data used for training resembles the data used for evaluation. This holds true for learned single-image denoising algorithms, which are applied to real raw camera sensor readings but, due to practical constraints, are often trained on synthetic image data. Though it is understood that generalizing from synthetic to real images requires careful consideration of the noise properties of camera sensors, the other aspects of an image processing pipeline (such as gain, color correction, and tone mapping) are often overlooked, despite their significant effect on how raw measurements are transformed into finished images. To address this, we present a technique to “unprocess” images by inverting each step of an image processing pipeline, thereby allowing us to synthesize realistic raw sensor measurements from commonly available Internet photos. We additionally model the relevant components of an image processing pipeline when evaluating our loss function, which allows training to be aware of all relevant photometric processing that will occur after denoising. By unprocessing and processing training data and model outputs in this way, we are able to train a simple convolutional neural network that has 14%-38% lower error rates and is 9×-18× faster than the previous state of the art on the Darmstadt Noise Dataset, and generalizes to sensors outside of that dataset as well.

369 citations
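
The unprocessing idea can be sketched compactly: invert the pipeline stages in reverse order. The tone-curve and gamma inverses below follow the standard smoothstep and power-law approximations; `ccm` and `gains` are illustrative placeholders rather than the paper's calibrated values.

```python
# A compact sketch of "unprocessing" an sRGB image back toward raw.
import numpy as np

def unprocess(srgb, ccm, gains):
    """srgb: float image in [0, 1]; ccm: 3x3 color matrix; gains: per-channel."""
    x = np.clip(srgb, 0.0, 1.0)
    # invert a smoothstep tone curve, y = 3x^2 - 2x^3
    x = 0.5 - np.sin(np.arcsin(1.0 - 2.0 * x) / 3.0)
    # invert gamma (approximated as x^(1/2.2))
    x = x ** 2.2
    # invert color correction with the matrix inverse
    x = x @ np.linalg.inv(ccm).T
    # invert white balance / digital gain
    return x / gains
```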


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, a single-stage blind real image denoising network (RIDNet) was proposed by employing a modular architecture, which uses residual on the residual structure to ease the flow of low-frequency information and apply feature attention to exploit the channel dependencies.
Abstract: Deep convolutional neural networks perform better on images containing spatially invariant noise (synthetic noise); however, their performance is limited on real-noisy photographs and requires multiple-stage network modeling. To advance the practicability of denoising algorithms, this paper proposes a novel single-stage blind real image denoising network (RIDNet) by employing a modular architecture. We use a residual on the residual structure to ease the flow of low-frequency information and apply feature attention to exploit the channel dependencies. Furthermore, the evaluation in terms of quantitative metrics and visual quality on three synthetic and four real noisy datasets against 19 state-of-the-art algorithms demonstrates the superiority of our RIDNet.

285 citations
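
The feature-attention module at the heart of RIDNet is essentially squeeze-and-excitation-style channel gating; a minimal PyTorch sketch follows, with layer sizes assumed.

```python
# A minimal feature-attention (channel gating) module.
import torch.nn as nn

class FeatureAttention(nn.Module):
    def __init__(self, ch=64, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # squeeze: B x C x 1 x 1
            nn.Conv2d(ch, ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),
            nn.Sigmoid(),                          # per-channel weights
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels by learned importance
```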


Proceedings Article
06 Sep 2019
TL;DR: DISN predicts the projected location for each 3D point on the 2D image and extracts local features from the image feature maps, which significantly improves the accuracy of the signed distance field prediction, especially in detail-rich areas.
Abstract: Reconstructing 3D shapes from single-view images has been a long-standing research problem. In this paper, we present DISN, a Deep Implicit Surface Network which can generate a high-quality detail-rich 3D mesh from a 2D image by predicting the underlying signed distance fields. In addition to utilizing global image features, DISN predicts the projected location for each 3D point on the 2D image and extracts local features from the image feature maps. Combining global and local features significantly improves the accuracy of the signed distance field prediction, especially for the detail-rich areas. To the best of our knowledge, DISN is the first method that consistently captures details such as holes and thin structures present in 3D shapes from single-view images. DISN achieves state-of-the-art single-view reconstruction performance on a variety of shape categories reconstructed from both synthetic and real images. Code is available at https://github.com/laughtervv/DISN. The supplementary can be found at https://xharlie.github.io/images/neurips_2019_supp.pdf

250 citations
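
The local-feature lookup that distinguishes DISN can be sketched as follows: project each 3D query point into the image with the camera parameters and bilinearly sample the feature maps at that location. The camera handling below is simplified and the names are illustrative.

```python
# A simplified sketch of projecting 3D query points and sampling local features.
import torch
import torch.nn.functional as F

def sample_local_features(feat, points, K, RT):
    """feat: (1, C, H, W); points: (N, 3) world coords; K: (3, 3); RT: (3, 4)."""
    n = points.shape[0]
    cam = (RT[:, :3] @ points.T + RT[:, 3:]).T  # world -> camera coords
    uv = (K @ cam.T).T                          # camera -> pixel coords
    uv = uv[:, :2] / uv[:, 2:3]
    H, W = feat.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,   # normalize to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    local = F.grid_sample(feat, grid.view(1, n, 1, 2), align_corners=True)
    return local.squeeze(-1).squeeze(0).T  # (N, C) local feature per point
```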


Proceedings ArticleDOI
01 Oct 2019
TL;DR: A new approach of domain randomization and pyramid consistency to learn a model with high generalizability for semantic segmentation of real-world self-driving scenes in a domain generalization fashion is proposed.
Abstract: We propose to harness the potential of simulation for semantic segmentation of real-world self-driving scenes in a domain generalization fashion. The segmentation network is trained without any information about target domains and tested on the unseen target domains. To this end, we propose a new approach of domain randomization and pyramid consistency to learn a model with high generalizability. First, we propose to randomize the synthetic images with styles of real images in terms of visual appearances using auxiliary datasets, in order to effectively learn domain-invariant representations. Second, we further enforce pyramid consistency across different "stylized" images and within an image, in order to learn domain-invariant and scale-invariant features, respectively. Extensive experiments are conducted on generalization from GTA and SYNTHIA to Cityscapes, BDDS, and Mapillary; and our method achieves superior results over the state-of-the-art techniques. Remarkably, our generalization results are on par with or even better than those obtained by state-of-the-art simulation-to-real domain adaptation methods, which access the target domain data at training time.

228 citations
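
One way to realize the pyramid-consistency idea is sketched below: class-probability maps predicted for differently stylized copies of the same synthetic image are average-pooled at several scales and pulled toward their mean. The pool sizes and the L1 penalty are assumptions, not the paper's exact formulation.

```python
# An illustrative pyramid-consistency penalty across stylized copies.
import torch
import torch.nn.functional as F

def pyramid_consistency(prob_maps, pool_sizes=(1, 2, 4, 8)):
    """prob_maps: list of (B, C, H, W) softmax outputs, one per style."""
    loss = 0.0
    for s in pool_sizes:
        pooled = [F.avg_pool2d(p, kernel_size=s) for p in prob_maps]
        mean = torch.stack(pooled).mean(dim=0)  # consensus prediction
        loss += sum(F.l1_loss(p, mean) for p in pooled) / len(pooled)
    return loss / len(pool_sizes)
```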


Journal ArticleDOI
TL;DR: A superpixel-based fast FCM clustering algorithm, implemented with a histogram parameter on the superpixel image, is proposed that is significantly faster and more robust than state-of-the-art clustering algorithms for color image segmentation.
Abstract: A great number of improved fuzzy c-means (FCM) clustering algorithms have been widely used for grayscale and color image segmentation. However, most of them are time-consuming and unable to provide desired segmentation results for color images due to two reasons. The first one is that the incorporation of local spatial information often causes a high computational complexity due to the repeated distance computation between clustering centers and pixels within a local neighboring window. The other one is that a regular neighboring window usually breaks up the real local spatial structure of images and thus leads to a poor segmentation. In this work, we propose a superpixel-based fast FCM clustering algorithm that is significantly faster and more robust than state-of-the-art clustering algorithms for color image segmentation. To obtain better local spatial neighborhoods, we first define a multiscale morphological gradient reconstruction operation to obtain a superpixel image with accurate contours. In contrast to a traditional neighboring window of fixed size and shape, the superpixel image provides better adaptive and irregular local spatial neighborhoods that are helpful for improving color image segmentation. Second, based on the obtained superpixel image, the original color image is simplified efficiently and its histogram is computed easily by counting the number of pixels in each region of the superpixel image. Finally, we implement FCM with a histogram parameter on the superpixel image to obtain the final segmentation result. Experiments performed on synthetic images and real images demonstrate that the proposed algorithm provides better segmentation results and takes less time than state-of-the-art clustering algorithms for color image segmentation.

188 citations
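
The histogram-weighted FCM step can be sketched compactly: cluster the mean colors of the superpixel regions, weighting each region by its pixel count, rather than clustering every pixel. Variable names and the convergence test are illustrative.

```python
# A compact sketch of FCM weighted by a superpixel histogram.
import numpy as np

def weighted_fcm(colors, counts, n_clusters=3, m=2.0, iters=100, eps=1e-6):
    """colors: (S, 3) mean color per superpixel; counts: (S,) pixels per region."""
    u = np.random.dirichlet(np.ones(n_clusters), size=len(colors))  # memberships
    for _ in range(iters):
        um = (u ** m) * counts[:, None]  # histogram (pixel-count) weighting
        centers = um.T @ colors / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(colors[:, None] - centers[None], axis=2) + eps
        u_new = 1.0 / (dist ** (2.0 / (m - 1.0)))
        u_new /= u_new.sum(axis=1, keepdims=True)
        delta = np.abs(u_new - u).max()
        u = u_new
        if delta < eps:
            break
    return u.argmax(axis=1), centers  # hard label per superpixel, cluster centers
```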


Proceedings ArticleDOI
20 May 2019
TL;DR: A method for automated dataset generation is presented; a variant of Mask R-CNN is trained with domain randomization on the generated dataset to perform category-agnostic instance segmentation without any hand-labeled data, and the model is deployed in an instance-specific grasping pipeline to demonstrate its usefulness in a robotics application.
Abstract: The ability to segment unknown objects in depth images has the potential to enhance robot skills in grasping and object tracking. Recent computer vision research has demonstrated that Mask R-CNN can be trained to segment specific categories of objects in RGB images when massive hand-labeled datasets are available. As generating these datasets is time-consuming, we instead train with synthetic depth images. Many robots now use depth sensors, and recent results suggest training on synthetic depth data can transfer successfully to the real world. We present a method for automated dataset generation and rapidly generate a synthetic training dataset of 50,000 depth images and 320,000 object masks using simulated heaps of 3D CAD models. We train a variant of Mask R-CNN with domain randomization on the generated dataset to perform category-agnostic instance segmentation without any hand-labeled data, and we evaluate the trained network, which we refer to as Synthetic Depth (SD) Mask R-CNN, on a set of real, high-resolution depth images of challenging, densely-cluttered bins containing objects with highly-varied geometry. SD Mask R-CNN outperforms point cloud clustering baselines by an absolute 15% in Average Precision and 20% in Average Recall on COCO benchmarks, and achieves performance levels similar to a Mask R-CNN trained on a massive, hand-labeled RGB dataset and fine-tuned on real images from the experimental setup. We deploy the model in an instance-specific grasping pipeline to demonstrate its usefulness in a robotics application. Code, the synthetic training dataset, and supplementary material are available at https://bit.ly/2letCuE.

Posted Content
TL;DR: This work proposes a new variational inference method for blind image denoising, which integrates both noise estimation and image denoising into a unique Bayesian framework; an approximate posterior, parameterized by deep neural networks, is presented by taking the intrinsic clean image and noise variances as latent variables conditioned on the input noisy image.
Abstract: Blind image denoising is an important yet very challenging problem in computer vision due to the complicated acquisition process of real images. In this work we propose a new variational inference method, which integrates both noise estimation and image denoising into a unique Bayesian framework, for blind image denoising. Specifically, an approximate posterior, parameterized by deep neural networks, is presented by taking the intrinsic clean image and noise variances as latent variables conditioned on the input noisy image. This posterior provides explicit parametric forms for all its involved hyper-parameters, and thus can be easily implemented for blind image denoising with automatic noise estimation for the test noisy image. On the one hand, like other data-driven deep learning methods, our method, namely the variational denoising network (VDN), can perform denoising efficiently due to its explicit form of posterior expression. On the other hand, VDN inherits the advantages of traditional model-driven approaches, especially the good generalization capability of generative models. VDN has good interpretability and can be flexibly utilized to estimate and remove complicated non-i.i.d. noise collected in real scenarios. Comprehensive experiments are performed to substantiate the superiority of our method in blind image denoising.

Posted Content
TL;DR: A novel approach, called mGANprior, is proposed to incorporate well-trained GANs as an effective prior for a variety of image processing tasks, by employing multiple latent codes to generate multiple feature maps at some intermediate layer of the generator and composing them with adaptive channel importance to recover the input image.
Abstract: Despite the success of Generative Adversarial Networks (GANs) in image synthesis, applying trained GAN models to real image processing remains challenging. Previous methods typically invert a target image back to the latent space either by back-propagation or by learning an additional encoder. However, the reconstructions from both methods are far from ideal. In this work, we propose a novel approach, called mGANprior, to incorporate well-trained GANs as an effective prior for a variety of image processing tasks. In particular, we employ multiple latent codes to generate multiple feature maps at some intermediate layer of the generator, then compose them with adaptive channel importance to recover the input image. Such an over-parameterization of the latent space significantly improves the image reconstruction quality, outperforming existing competitors. The resulting high-fidelity image reconstruction enables trained GAN models to serve as a prior for many real-world applications, such as image colorization, super-resolution, image inpainting, and semantic manipulation. We further analyze the properties of the layer-wise representation learned by GAN models and shed light on what knowledge each layer is capable of representing.
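
A minimal sketch of the inversion loop may clarify the composition step: several latent codes and per-code channel weights are optimized jointly, their intermediate feature maps are blended, and the generator's remaining layers produce the reconstruction. `G_head`/`G_tail` denote an assumed split of the generator at an intermediate layer; all sizes and the softmax blending are illustrative.

```python
# An illustrative mGANprior-style inversion with multiple latent codes.
import torch
import torch.nn.functional as F

def invert(target, G_head, G_tail, n_codes=10, z_dim=512, ch=512, steps=1000):
    z = torch.randn(n_codes, z_dim, requires_grad=True)
    alpha = torch.ones(n_codes, ch, 1, 1, requires_grad=True)  # channel importance
    opt = torch.optim.Adam([z, alpha], lr=1e-2)
    for _ in range(steps):
        feats = G_head(z)  # (n_codes, ch, h, w) intermediate feature maps
        composed = (alpha.softmax(dim=0) * feats).sum(dim=0, keepdim=True)
        loss = F.mse_loss(G_tail(composed), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach(), alpha.detach()
```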

Journal ArticleDOI
TL;DR: A two-step GAN-based DA is proposed that generates and refines brain Magnetic Resonance (MR) images with/without tumors separately: (i) Progressive Growing of GANs (PGGANs), a multi-stage noise-to-image GAN for high-resolution MR image generation, first generates realistic/diverse 256×256 images; (ii) Multimodal UNsupervised Image-to-image Translation (MUNIT), which combines GANs/Variational AutoEncoders, or SimGAN, which uses a DA-focused GAN loss, further refines the texture/shape of the PGGAN-generated images.
Abstract: Convolutional Neural Networks (CNNs) achieve excellent computer-assisted diagnosis with sufficient annotated training data. However, most medical imaging datasets are small and fragmented. In this context, Generative Adversarial Networks (GANs) can synthesize realistic/diverse additional training images to fill the data gap in the real image distribution; researchers have improved classification by augmenting data with noise-to-image (e.g., random noise samples to diverse pathological images) or image-to-image GANs (e.g., a benign image to a malignant one). Yet, no research has reported results combining noise-to-image and image-to-image GANs for a further performance boost. Therefore, to maximize the DA effect with the GAN combinations, we propose a two-step GAN-based DA that generates and refines brain Magnetic Resonance (MR) images with/without tumors separately: (i) Progressive Growing of GANs (PGGANs), a multi-stage noise-to-image GAN for high-resolution MR image generation, first generates realistic/diverse 256×256 images; (ii) Multimodal UNsupervised Image-to-image Translation (MUNIT), which combines GANs/Variational AutoEncoders, or SimGAN, which uses a DA-focused GAN loss, further refines the texture/shape of the PGGAN-generated images similarly to the real ones. We thoroughly investigate CNN-based tumor classification results, also considering the influence of pre-training on ImageNet and discarding weird-looking GAN-generated images. The results show that, when combined with classic DA, our two-step GAN-based DA can significantly outperform the classic DA alone, in tumor detection (i.e., boosting sensitivity from 93.67% to 97.48%) and also in other medical imaging tasks.

Proceedings ArticleDOI
16 Jun 2019
TL;DR: A deep iterative down-up convolutional neural network (DIDN) for image denoising, which repeatedly decreases and increases the resolution of the feature maps.
Abstract: Networks using down-scaling and up-scaling of feature maps have been studied extensively in low-level vision research owing to efficient GPU memory usage and their capacity to yield large receptive fields. In this paper, we propose a deep iterative down-up convolutional neural network (DIDN) for image denoising, which repeatedly decreases and increases the resolution of the feature maps. The basic structure of the network is inspired by U-Net, which was originally developed for semantic segmentation. We modify the down-scaling and up-scaling layers for the image denoising task. Conventional denoising networks are trained to work with a single noise level, or alternatively use noise information as inputs to address multi-level noise with a single model. Conversely, because the efficient memory usage of our network enables it to handle multiple parameters, it is capable of processing a wide range of noise levels with a single model without requiring noise-information inputs as a work-around. Consequently, our DIDN exhibits state-of-the-art performance using the benchmark dataset and also demonstrates its superiority in the NTIRE 2019 real image denoising challenge.

Posted Content
TL;DR: HO-3D, the first markerless dataset of color images with 3D annotations for both the hand and object, is created, and a single RGB image-based method is developed to predict the hand pose when interacting with objects under severe occlusions.
Abstract: We propose a method for annotating images of a hand manipulating an object with the 3D poses of both the hand and the object, together with a dataset created using this method. Our motivation is the current lack of annotated real images for this problem, as estimating the 3D poses is challenging, mostly because of the mutual occlusions between the hand and the object. To tackle this challenge, we capture sequences with one or several RGB-D cameras and jointly optimize the 3D hand and object poses over all the frames simultaneously. This method allows us to automatically annotate each frame with accurate estimates of the poses, despite large mutual occlusions. With this method, we created HO-3D, the first markerless dataset of color images with 3D annotations for both the hand and object. This dataset is currently made of 77,558 frames, 68 sequences, 10 persons, and 10 objects. Using our dataset, we develop a single RGB image-based method to predict the hand pose when interacting with objects under severe occlusions and show it generalizes to objects not seen in the dataset.

Proceedings ArticleDOI
16 Jun 2019
TL;DR: The 3rd NTIRE challenge on single-image super-resolution (restoration of rich details in a low-resolution image) is reviewed with a focus on proposed solutions and results, gauging the state-of-the-art in real-world single image super-resolution.
Abstract: This paper reviews the 3rd NTIRE challenge on single-image super-resolution (restoration of rich details in a low-resolution image) with a focus on the proposed solutions and results. The challenge had one track, aimed at the real-world single image super-resolution problem with an unknown scaling factor. Participants mapped low-resolution images captured by a DSLR camera with a shorter focal length to their high-resolution counterparts captured at a longer focal length. With this challenge, we introduced a novel real-world super-resolution dataset (RealSR). The track had 403 registered participants, and 36 teams competed in the final testing phase. The submitted methods gauge the state-of-the-art in real-world single image super-resolution.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper presents an unsupervised method for domain-specific, single-image deblurring based on disentangled representations, which enforces a KL divergence loss to regularize the distribution range of extracted blur attributes such that little content information is contained.
Abstract: Image deblurring aims to restore the latent sharp images from the corresponding blurred ones. In this paper, we present an unsupervised method for domain-specific, single-image deblurring based on disentangled representations. The disentanglement is achieved by splitting the content and blur features in a blurred image using content encoders and blur encoders. We enforce a KL divergence loss to regularize the distribution range of extracted blur attributes such that little content information is contained. Meanwhile, to handle the unpaired training data, a blurring branch and the cycle-consistency loss are added to guarantee that the content structures of the deblurred results match the original images. We also add an adversarial loss on deblurred results to generate visually realistic images and a perceptual loss to further mitigate the artifacts. We perform extensive experiments on the tasks of face and text deblurring using both synthetic datasets and real images, and achieve improved results compared to recent state-of-the-art deblurring methods.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A grouped residual dense network (GRDN) is proposed, which is an extended and generalized architecture of the state-of-the-art residual dense network (RDN), and a new generative adversarial network-based real-world noise modeling method is developed.
Abstract: Recent research on image denoising has progressed with the development of deep learning architectures, especially convolutional neural networks. However, real-world image denoising is still very challenging because it is not possible to obtain ideal pairs of ground-truth images and real-world noisy images. Owing to the recent release of benchmark datasets, the interest of the image denoising community is now moving toward the real-world denoising problem. In this paper, we propose a grouped residual dense network (GRDN), which is an extended and generalized architecture of the state-of-the-art residual dense network (RDN). The core part of RDN is defined as grouped residual dense block (GRDB) and used as a building module of GRDN. We experimentally show that the image denoising performance can be significantly improved by cascading GRDBs. In addition to the network architecture design, we also develop a new generative adversarial network-based real-world noise modeling method. We demonstrate the superiority of the proposed methods by achieving the highest score in terms of both the peak signal-to-noise ratio and the structural similarity in the NTIRE2019 Real Image Denoising Challenge - Track 2: sRGB.

Journal ArticleDOI
Zhaocheng Wang, Lan Du, Jiashun Mao, Bin Liu, Dongwen Yang
TL;DR: The single shot multibox detector (SSD), which is a real-time object detection method based on convolutional neural network, is applied to realize target detection for synthetic aperture radar (SAR) images and can obtain better detection performance than other detection methods.
Abstract: In this letter, the single shot multibox detector (SSD), which is a real-time object detection method based on a convolutional neural network, is applied to realize target detection for synthetic aperture radar (SAR) images. Since there are not sufficient labeled images for training in SAR target detection, we apply two strategies: data augmentation and transfer learning. For data augmentation, the first approach is to use some image processing methods, i.e., manually extracting subimages, adding noise, filtering, and flipping, on the original training images to generate new training images; the second approach is to employ the existing SAR target recognition data set, the MSTAR data set, to assist in accomplishing the target detection task. For transfer learning, we first apply the subaperture decomposition technique on original SAR images to acquire three-channel subaperture SAR images, and then transfer the three-channel VGGNet model pretrained on the ImageNet data set to the three-channel subaperture SAR images, in order to initialize corresponding parameters of the convolutional layers in the base network of our SSD. The feature extraction network, consisting of the base network and the auxiliary structure, is used to learn multiscale feature maps, and then convolutional predictors are used to acquire the final detection results. The experimental results on the miniSAR real image data set demonstrate that the proposed method can obtain better detection performance than other detection methods.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: In this article, a simple and effective approach, the Event-based Double Integral (EDI) model, was proposed to reconstruct a high frame-rate, sharp video from a single blurry frame and its event data.
Abstract: Event-based cameras can measure intensity changes (called ‘events’) with microsecond accuracy under high-speed motion and challenging lighting conditions. With the active pixel sensor (APS), the event camera allows simultaneous output of the intensity frames. However, the output images are captured at a relatively low frame-rate and often suffer from motion blur. A blurry image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we are able to model the blur-generation process by associating event data to a latent image. In this paper, we propose a simple and effective approach, the Event-based Double Integral (EDI) model, to reconstruct a high frame-rate, sharp video from a single blurry frame and its event data. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real images demonstrate the superiority of our EDI model and optimization method in comparison to the state-of-the-art.
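
A discretized sketch of the double-integral relation may help: the blurry frame equals the temporal average of latent frames, each tied to a reference frame by the exponentiated running sum of events, so recovering the sharp frame reduces to a single division once the scalar contrast threshold c is known. The binning and names are illustrative.

```python
# An illustrative discretization of the event-based double integral (EDI) idea.
import numpy as np

def edi_reconstruct(blurry, event_frames, c):
    """blurry: (H, W) linear intensity; event_frames: (T, H, W) signed counts."""
    E = np.cumsum(event_frames, axis=0)        # inner integral of events over time
    ratio = np.exp(c * E).mean(axis=0)         # average relative brightness change
    return blurry / np.maximum(ratio, 1e-6)    # latent frame at the reference time
```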

Proceedings ArticleDOI
16 Jun 2019
TL;DR: The proposed methods by the 15 teams represent the current state-of-the-art performance in image denoising targeting real noisy images.
Abstract: This paper reviews the NTIRE 2019 challenge on real image denoising with focus on the proposed methods and their results. The challenge has two tracks for quantitatively evaluating image denoising performance in (1) the Bayer-pattern raw-RGB and (2) the standard RGB (sRGB) color spaces. The tracks had 216 and 220 registered participants, respectively. A total of 15 teams, proposing 17 methods, competed in the final phase of the challenge. The proposed methods by the 15 teams represent the current state-of-the-art performance in image denoising targeting real noisy images.

Posted Content
TL;DR: It is demonstrated that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods.
Abstract: In this work we ask whether it is possible to create a "universal" detector for telling apart real images from those generated by a CNN, regardless of architecture or dataset used. To test this, we collect a dataset consisting of fake images generated by 11 different CNN-based image generator models, chosen to span the space of commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, cascaded refinement networks, implicit maximum likelihood estimation, second-order attention super-resolution, seeing-in-the-dark). We demonstrate that, with careful pre- and post-processing and data augmentation, a standard image classifier trained on only one specific CNN generator (ProGAN) is able to generalize surprisingly well to unseen architectures, datasets, and training methods (including the just released StyleGAN2). Our findings suggest the intriguing possibility that today's CNN-generated images share some common systematic flaws, preventing them from achieving realistic image synthesis. Code and pre-trained networks are available at this https URL.
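
The augmentation recipe the paper credits for generalization can be sketched briefly: randomly blur and JPEG-recompress training images so the classifier cannot rely on fragile high-frequency generator artifacts. The probabilities and ranges below are assumptions.

```python
# An illustrative blur + JPEG augmentation for training a fake-image detector.
import io
import random
from PIL import Image, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    if random.random() < 0.5:  # Gaussian blur with a random radius
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 3)))
    if random.random() < 0.5:  # JPEG re-compression at a random quality
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 95))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img
```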

Proceedings ArticleDOI
16 Jun 2019
TL;DR: This study introduces a densely connected hierarchical image denoising network (DHDN), which exceeds the performance of state-of-the-art image denoising solutions; additional experiments establish that the proposed network also outperforms conventional methods.
Abstract: Recently, deep convolutional neural networks have been applied in numerous image processing studies and have exhibited drastically improved performance. In this study, we introduce a densely connected hierarchical image denoising network (DHDN), which exceeds the performance of state-of-the-art image denoising solutions. Our proposed network improves image denoising performance by applying the hierarchical architecture of a modified U-Net; this makes our network use a larger number of parameters than other methods. In addition, we induce feature reuse and address the vanishing-gradient problem by applying dense connectivity and residual learning to our convolution blocks and network. Finally, we successfully apply model ensemble and self-ensemble methods, which enables us to improve the performance of the proposed network further. The performance of the proposed network is validated by winning second place in the NTIRE 2019 real image denoising challenge sRGB track and third place in the raw-RGB track. Additional experimental results on additive white Gaussian noise removal also establish that the proposed network outperforms conventional methods, notwithstanding the fact that it handles a wide range of noise levels with a single set of trained parameters.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work presents a method for fine-grained face manipulation that can synthesize another arbitrary expression by the same person by first fitting a 3D face model and then disentangling the face into a texture and a shape.
Abstract: We present a method for fine-grained face manipulation. Given a face image with an arbitrary expression, our method can synthesize another arbitrary expression by the same person. This is achieved by first fitting a 3D face model and then disentangling the face into a texture and a shape. We then learn different networks in these two spaces. In the texture space, we use a conditional generative network to change the appearance, and carefully design input formats and loss functions to achieve the best results. In the shape space, we use a fully connected network to predict the accurate shapes and use the available depth data for supervision. Both networks are conditioned on expression coefficients rather than discrete labels, allowing us to generate an unlimited amount of expressions. We show the superiority of this disentangling approach through both quantitative and qualitative studies. In a user study, our method is preferred in 85% of cases when compared to the most recent work. When compared to the ground truth, annotators cannot reliably distinguish between our synthesized images and real images, preferring our method in 53% of the cases.

Proceedings ArticleDOI
19 Sep 2019
TL;DR: In this article, the authors synthesize highly photorealistic images of 3D object models, which they use to train a convolutional neural network for detecting the objects in real images.
Abstract: We present an approach to synthesize highly photorealistic images of 3D object models, which we use to train a convolutional neural network for detecting the objects in real images. The proposed approach has three key ingredients: (1) 3D object models are rendered in 3D models of complete scenes with realistic materials and lighting, (2) plausible geometric configuration of objects and cameras in a scene is generated using physics simulation, and (3) high photorealism of the synthesized images is achieved by physically based rendering. When trained on images synthesized by the proposed approach, the Faster R-CNN object detector [1] achieves a 24% absolute improvement of mAP@.75IoU on Rutgers APC [2] and 11% on LineMod-Occluded [3] datasets, compared to a baseline where the training images are synthesized by rendering object models on top of random photographs. This work is a step towards being able to effectively train object detectors without capturing or annotating any real images. A dataset of 400K synthetic images with ground truth annotations for various computer vision tasks will be released on the project website: thodan.github.io/objectsynth.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper considers transfer learning for semantic segmentation that aims to mitigate the gap between abundant synthetic data (source domain) and limited real data (target domain), and jointly learns hierarchical weighting networks and the segmentation network in an end-to-end manner.
Abstract: The success of deep neural networks for semantic segmentation heavily relies on large-scale and well-labeled datasets, which are hard to collect in practice. Synthetic data offers an alternative for obtaining ground-truth labels for free. However, models directly trained on synthetic data often struggle to generalize to real images. In this paper, we consider transfer learning for semantic segmentation that aims to mitigate the gap between abundant synthetic data (source domain) and limited real data (target domain). Unlike previous approaches that either learn mappings to the target domain or finetune on target images, our proposed method jointly learns from real images and selectively from realistic pixels in synthetic images to adapt to the target domain. Our key idea is to have weighting networks score how similar the synthetic pixels are to real ones, and to learn such weighting at the pixel, region, and image levels. We jointly learn these hierarchical weighting networks and the segmentation network in an end-to-end manner. Extensive experiments demonstrate that our proposed approach significantly outperforms existing baselines and is applicable to scenarios with extremely limited real images.
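
The pixel-level weighting can be sketched as a weighted segmentation loss: a weighting network scores how realistic each synthetic pixel looks, and those scores scale the per-pixel cross-entropy. The function below is illustrative; the paper's region- and image-level terms are omitted.

```python
# An illustrative pixel-weighted cross-entropy for learning from synthetic data.
import torch
import torch.nn.functional as F

def weighted_seg_loss(logits, labels, pixel_weights):
    """logits: (B, C, H, W); labels: (B, H, W) long; pixel_weights: (B, H, W) in [0, 1]."""
    per_pixel = F.cross_entropy(logits, labels, reduction="none")  # (B, H, W)
    return (pixel_weights * per_pixel).sum() / pixel_weights.sum().clamp(min=1e-6)
```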