
Showing papers on "Pixel published in 2019"


Proceedings ArticleDOI
13 May 2019
TL;DR: Pixel-aligned Implicit Function (PIFu) as mentioned in this paper aligns pixels of 2D images with the global context of their corresponding 3D object to produce high-resolution surfaces including largely unseen regions such as the back of a person.
Abstract: We introduce Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu produces high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

907 citations
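As an aside, the pixel-aligned query described above can be sketched in a few lines of PyTorch. This is only an illustrative sketch, not the authors' code: the image encoder output, the calibration function, and the surface MLP are assumed interfaces.

import torch
import torch.nn.functional as F

def pixel_aligned_query(feat_2d, points, calib, surface_mlp):
    # feat_2d: (B, C, H, W) feature map from any 2D image encoder (assumed given).
    # points: (B, N, 3) 3D query points; calib maps them to normalized image
    # coordinates in [-1, 1] plus a depth value; surface_mlp is a per-point MLP.
    xy, z = calib(points)                                        # (B, N, 2), (B, N, 1)
    grid = xy.unsqueeze(2)                                       # (B, N, 1, 2)
    sampled = F.grid_sample(feat_2d, grid, align_corners=True)   # (B, C, N, 1)
    sampled = sampled.squeeze(-1)                                # pixel-aligned features
    fused = torch.cat([sampled, z.transpose(1, 2)], dim=1)       # append depth, (B, C+1, N)
    return surface_mlp(fused)                                    # per-point occupancy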


Journal ArticleDOI
TL;DR: The working principle, advantages, technical considerations and future potential of single-pixel imaging, which suits a wide variety of detector technologies, are described.
Abstract: Modern digital cameras employ silicon focal plane array (FPA) image sensors featuring millions of pixels. However, it is possible to make a camera that only needs one pixel. In these cameras, a spatial light modulator, placed before or after the object to be imaged, applies a time-varying pattern and synchronized intensity measurements are made with a single-pixel detector. The principle of compressed sensing then allows an image to be generated. As the approach suits a wide variety of detector technologies, images can be collected at wavelengths outside the reach of FPA technology, at high frame rates, or in three dimensions. Promising applications include the visualization of hazardous gas leaks and 3D situation awareness for autonomous vehicles. Rather than requiring millions of pixels, it is possible to make a camera that only needs one pixel. This Review details the working principle, advantages, technical considerations and future potential of single-pixel imaging.

464 citations
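The measurement model behind single-pixel imaging is compact enough to illustrate with a toy numpy sketch: random modulation patterns are shown one at a time, each single-pixel reading is the inner product of pattern and scene, and the image is recovered from the set of readings. Here ordinary least squares stands in for a compressed-sensing solver, which would be used when fewer measurements than pixels are taken.

import numpy as np

rng = np.random.default_rng(0)

N = 64                                     # toy scene: an 8x8 image, flattened
scene = rng.random(N)

M = 64                                     # with M < N, add a sparsity prior instead
patterns = rng.choice([-1.0, 1.0], size=(M, N))   # time-varying modulator patterns

measurements = patterns @ scene            # one single-pixel reading per pattern

recovered, *_ = np.linalg.lstsq(patterns, measurements, rcond=None)
print(np.allclose(recovered, scene))       # True for this fully determined toy case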


Posted Content
Xu Qin1, Zhilin Wang1, Yuanchao Bai1, Xiaodong Xie1, Huizhu Jia2 
TL;DR: The proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23 dB to 36.39 dB on the SOTS indoor test dataset.
Abstract: In this paper, we propose an end-to-end feature fusion attention network (FFA-Net) to directly restore the haze-free image. The FFA-Net architecture consists of three key components: 1) A novel Feature Attention (FA) module combines Channel Attention with a Pixel Attention mechanism, considering that different channel-wise features contain totally different weighted information and haze distribution is uneven across image pixels. FA treats different features and pixels unequally, which provides additional flexibility in dealing with different types of information, expanding the representational ability of CNNs. 2) A basic block structure consists of Local Residual Learning and Feature Attention, with Local Residual Learning allowing less important information such as thin haze regions or low-frequency content to be bypassed through multiple local residual connections, letting the main network architecture focus on more effective information. 3) An attention-based multi-level Feature Fusion (FFA) structure, in which the feature weights are adaptively learned from the Feature Attention (FA) module, giving more weight to important features. This structure can also retain the information of shallow layers and pass it into deep layers. The experimental results demonstrate that our proposed FFA-Net surpasses previous state-of-the-art single image dehazing methods by a very large margin both quantitatively and qualitatively, boosting the best published PSNR metric from 30.23 dB to 36.39 dB on the SOTS indoor test dataset. Code has been made available at GitHub.

406 citations
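The Feature Attention idea (channel attention followed by pixel attention) can be sketched as a small PyTorch module. The layer sizes and reduction factor below are illustrative choices, not the authors' configuration.

import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    # Channel attention reweights feature channels from global context;
    # pixel attention then reweights spatial positions (e.g., dense-haze pixels).
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.pixel_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_att(x)
        return x * self.pixel_att(x)

x = torch.randn(1, 64, 32, 32)
print(FeatureAttention(64)(x).shape)       # torch.Size([1, 64, 32, 32])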


Journal ArticleDOI
TL;DR: The authors adapt the image prior learned by GANs to the image statistics of an individual image, which allows the method to accurately reconstruct the input image and synthesize new content consistent with the appearance of the original image.
Abstract: Despite the recent success of GANs in synthesizing images conditioned on inputs such as a user sketch, text, or semantic labels, manipulating the high-level attributes of an existing natural photograph with GANs is challenging for two reasons. First, it is hard for GANs to precisely reproduce an input image. Second, after manipulation, the newly synthesized pixels often do not fit the original image. In this paper, we address these issues by adapting the image prior learned by GANs to image statistics of an individual image. Our method can accurately reconstruct the input image and synthesize new content, consistent with the appearance of the input image. We demonstrate our interactive system on several semantic image editing tasks, including synthesizing new objects consistent with background, removing unwanted objects, and changing the appearance of an object. Quantitative and qualitative comparisons against several existing methods demonstrate the effectiveness of our method.

315 citations


Journal ArticleDOI
TL;DR: A novel unsupervised context-sensitive framework—deep change vector analysis (DCVA)—for CD in multitemporal VHR images that exploits convolutional neural network (CNN) features is proposed, and experimental results on multitemporal data sets of Worldview-2, Pleiades, and Quickbird images confirm the effectiveness of the proposed method.
Abstract: Change detection (CD) in multitemporal images is an important application of remote sensing. Recent technological evolution provided very high spatial resolution (VHR) multitemporal optical satellite images showing high spatial correlation among pixels and requiring an effective modeling of spatial context to accurately capture change information. Here, we propose a novel unsupervised context-sensitive framework—deep change vector analysis (DCVA)—for CD in multitemporal VHR images that exploits convolutional neural network (CNN) features. To have an unsupervised system, DCVA starts from a suboptimal pretrained multilayered CNN for obtaining deep features that can model spatial relationships among neighboring pixels and thus complex objects. An automatic feature selection strategy is employed layerwise to select features emphasizing both high and low prior probability change information. Selected features from multiple layers are combined into a deep feature hypervector providing a multiscale scene representation. The use of the same pretrained CNN for semantic segmentation of single images enables us to obtain coherent multitemporal deep feature hypervectors that can be compared pixelwise to obtain deep change vectors that also model spatial context information. Deep change vectors are analyzed based on their magnitude to identify changed pixels. Then, deep change vectors corresponding to identified changed pixels are binarized to obtain compressed binary deep change vectors that preserve information about the direction (kind) of change. Changed pixels are analyzed for multiple CD based on the binary features, thus implicitly using the spatial information. Experimental results on multitemporal data sets of Worldview-2, Pleiades, and Quickbird images confirm the effectiveness of the proposed method.

310 citations
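The core comparison step of DCVA, pixelwise differencing of deep feature hypervectors followed by magnitude thresholding, can be sketched with numpy. The CNN feature extraction, layerwise feature selection, and the paper's thresholding rule are abstracted away; a simple percentile threshold is used here purely for illustration.

import numpy as np

def deep_change_map(feat_t1, feat_t2, percentile=95):
    # feat_t1, feat_t2: (H, W, D) deep feature hypervectors for the two dates.
    deep_change_vectors = feat_t2 - feat_t1
    magnitude = np.linalg.norm(deep_change_vectors, axis=-1)
    threshold = np.percentile(magnitude, percentile)
    return magnitude > threshold                # boolean change mask

rng = np.random.default_rng(1)
f1 = rng.normal(size=(64, 64, 128))             # stand-ins for pretrained-CNN features
f2 = f1.copy()
f2[20:30, 20:30] += 3.0                         # inject a synthetic "change"
print(deep_change_map(f1, f2).sum())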


Posted Content
TL;DR: The PointRend (Point-based Rendering) neural network module is presented: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm that enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches.
Abstract: We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at this https URL.

298 citations
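The adaptive point selection at the heart of PointRend can be illustrated with a short PyTorch snippet: the points whose coarse mask probabilities are closest to 0.5 are the most uncertain ones, and only those are re-predicted by the point head. This is a simplified sketch of the idea, not the released implementation.

import torch

def select_uncertain_points(coarse_logits, num_points):
    # coarse_logits: (B, 1, H, W) coarse mask logits from an existing segmentation model.
    probs = coarse_logits.sigmoid().flatten(1)        # (B, H*W)
    uncertainty = -(probs - 0.5).abs()                # largest where p is near 0.5
    return uncertainty.topk(num_points, dim=1).indices

coarse = torch.randn(2, 1, 28, 28)
print(select_uncertain_points(coarse, num_points=49).shape)   # torch.Size([2, 49])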


Proceedings ArticleDOI
15 Jun 2019
TL;DR: 3D-SIS is introduced, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans that leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction.
Abstract: We introduce 3D-SIS, a novel neural network architecture for 3D semantic instance segmentation in commodity RGB-D scans. The core idea of our method is to jointly learn from both geometric and color signals, thus enabling accurate instance predictions. Rather than operating solely on 2D frames, we observe that most computer vision applications have multi-view RGB-D input available, which we leverage to construct an approach for 3D instance segmentation that effectively fuses together these multi-modal inputs. Our network leverages high-resolution RGB input by associating 2D images with the volumetric grid based on the pose alignment of the 3D reconstruction. For each image, we first extract 2D features for each pixel with a series of 2D convolutions; we then backproject the resulting feature vector to the associated voxel in the 3D grid. This combination of 2D and 3D feature learning allows significantly higher accuracy in object detection and instance segmentation than state-of-the-art alternatives. We show results on both synthetic and real-world public benchmarks, achieving an improvement in mAP of over 13 on real-world data.

297 citations


Proceedings ArticleDOI
14 Nov 2019
TL;DR: Experimental results obtained in the framework of RS image scene classification problems show that a shallow Convolutional Neural Network architecture trained on the BigEarthNet provides much higher accuracy compared to a state-of-the-art CNN model pre-trained on the ImageNet.
Abstract: This paper presents the BigEarthNet, a new large-scale multi-label Sentinel-2 benchmark archive. The BigEarthNet consists of 590,326 Sentinel-2 image patches, each of which is a section of i) 120×120 pixels for 10 m bands; ii) 60×60 pixels for 20 m bands; and iii) 20×20 pixels for 60 m bands. Unlike most of the existing archives, each image patch is annotated by multiple land-cover classes (i.e., multi-labels) that are provided by the CORINE Land Cover database of the year 2018 (CLC 2018). The BigEarthNet is significantly larger than the existing archives in remote sensing (RS) and thus is much more convenient to use as a training source in the context of deep learning. This paper first addresses the limitations of the existing archives and then describes the properties of the BigEarthNet. Experimental results obtained in the framework of RS image scene classification problems show that a shallow Convolutional Neural Network (CNN) architecture trained on the BigEarthNet provides much higher accuracy compared to a state-of-the-art CNN model pre-trained on the ImageNet (which is a very popular large-scale benchmark archive in computer vision). The BigEarthNet opens up promising directions to advance operational RS applications and research in massive Sentinel-2 image archives.

295 citations



Proceedings ArticleDOI
01 Oct 2019
TL;DR: Pix2Pose as discussed by the authors predicts the 3D coordinates of each object pixel without textured models and then uses these pixel-wise predictions to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations.
Abstract: Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images.

280 citations
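The final pose-recovery stage described above maps naturally onto OpenCV's PnP-with-RANSAC solver. The sketch below assumes the pixelwise 3D-coordinate prediction and a reliability mask are already given by the network; variable names are placeholders.

import numpy as np
import cv2

def pose_from_coordinates(pred_xyz, mask, K):
    # pred_xyz: (H, W, 3) predicted object coordinates per pixel.
    # mask: (H, W) boolean map of pixels considered reliable (e.g., low predicted error).
    # K: (3, 3) camera intrinsic matrix.
    v, u = np.nonzero(mask)
    image_points = np.stack([u, v], axis=1).astype(np.float64)   # 2D pixel locations
    object_points = pred_xyz[mask].astype(np.float64)            # matching 3D coordinates
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None)
    if not ok:
        return None, None
    return rvec, tvec                        # rotation (Rodrigues vector) and translation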


Proceedings ArticleDOI
15 Jun 2019
TL;DR: A cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions, and compared to a direct audio-to-image approach, this approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content.
Abstract: We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.
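The dynamically adjustable pixel-wise loss can be sketched as an attention-weighted L1 term: pixels in audiovisual-correlated regions (e.g., around the mouth) receive larger weights, so jitter elsewhere is penalized less. How the attention map is produced is abstracted away, and the weighting scheme below is illustrative rather than the paper's exact formulation.

import torch

def attention_weighted_l1(pred, target, attention):
    # pred, target: (B, C, H, W) frames; attention: (B, 1, H, W) weights in [0, 1].
    per_pixel = (pred - target).abs()
    weights = 1.0 + attention                # boost attended regions
    return (weights * per_pixel).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
attention = torch.rand(2, 1, 64, 64)
print(attention_weighted_l1(pred, target, attention).item())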

Journal ArticleDOI
TL;DR: A pixel-level detection method for identifying road cracks in black-box images using a deep convolutional encoder-decoder network is proposed, and its performance is evaluated with recall, precision, and intersection over union at the pixel level.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A novel and end-to-end Alignment Generative Adversarial Network (AlignGAN) for the RGB-IR RE-ID task, which consists of a pixel generator, a feature generator and a joint discriminator that is able to not only alleviate the cross-modality and intra- modality variations, but also learn identity-consistent features.
Abstract: RGB-Infrared (IR) person re-identification is an important and challenging task due to large cross-modality variations between RGB and IR images. Most conventional approaches aim to bridge the cross-modality gap with feature alignment by feature representation learning. Different from existing methods, in this paper, we propose a novel and end-to-end Alignment Generative Adversarial Network (AlignGAN) for the RGB-IR RE-ID task. The proposed model enjoys several merits. First, it can exploit pixel alignment and feature alignment jointly. To the best of our knowledge, this is the first work to model the two alignment strategies jointly for the RGB-IR RE-ID problem. Second, the proposed model consists of a pixel generator, a feature generator and a joint discriminator. By playing a min-max game among the three components, our model is able to not only alleviate the cross-modality and intra-modality variations, but also learn identity-consistent features. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favourably against state-of-the-art methods. Especially, on SYSU-MM01 dataset, our model can achieve an absolute gain of 15.4% and 12.9% in terms of Rank-1 and mAP.

Posted Content
TL;DR: The proposed Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object, achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.
Abstract: We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu can produce high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: An efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), is proposed, equipped with a low computational-cost segmentation head and a learnable post-processing.
Abstract: Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work first synthesizes a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network, then uses the synthesized flow fields to guide the propagation of pixels to fill up the missing regions in the video.
Abstract: Video inpainting, which aims at filling in missing regions in a video, remains challenging due to the difficulty of preserving the precise spatial and temporal coherence of video contents. In this work we propose a novel flow-guided video inpainting approach. Rather than filling in the RGB pixels of each frame directly, we consider video inpainting as a pixel propagation problem. We first synthesize a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network, then use the synthesized flow fields to guide the propagation of pixels to fill up the missing regions in the video. Specifically, the Deep Flow Completion network follows a coarse-to-fine refinement strategy to complete the flow fields, while their quality is further improved by hard flow example mining. Following the guide of the completed flow fields, the missing video regions can be filled up precisely. Our method is evaluated on the DAVIS and YouTube-VOS datasets qualitatively and quantitatively, achieving state-of-the-art performance in terms of inpainting quality and speed.
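Flow-guided propagation itself is simple once the flow field is completed: a missing pixel in frame t is filled by following the flow into frame t+1 and copying the pixel found there. The numpy sketch below shows this single-step case with nearest-neighbor rounding; the flow completion network and multi-step propagation are not shown.

import numpy as np

def propagate_from_next_frame(frame_t, mask_t, frame_next, flow_t_to_next):
    # frame_t, frame_next: (H, W, 3); mask_t: (H, W) True where pixels are missing;
    # flow_t_to_next: (H, W, 2) completed flow (dx, dy) from frame t to frame t+1.
    h, w = mask_t.shape
    filled = frame_t.copy()
    ys, xs = np.nonzero(mask_t)
    tx = np.clip(np.round(xs + flow_t_to_next[ys, xs, 0]).astype(int), 0, w - 1)
    ty = np.clip(np.round(ys + flow_t_to_next[ys, xs, 1]).astype(int), 0, h - 1)
    filled[ys, xs] = frame_next[ty, tx]      # copy the corresponding pixels forward
    return filled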

Journal ArticleDOI
TL;DR: A superpixel-based fast FCM clustering algorithm, implemented with a histogram parameter on the superpixel image, is proposed; it is significantly faster and more robust than state-of-the-art clustering algorithms for color image segmentation.
Abstract: A great number of improved fuzzy c-means (FCM) clustering algorithms have been widely used for grayscale and color image segmentation. However, most of them are time-consuming and unable to provide desired segmentation results for color images due to two reasons. The first one is that the incorporation of local spatial information often causes a high computational complexity due to the repeated distance computation between clustering centers and pixels within a local neighboring window. The other one is that a regular neighboring window usually breaks up the real local spatial structure of images and thus leads to a poor segmentation. In this work, we propose a superpixel-based fast FCM clustering algorithm that is significantly faster and more robust than state-of-the-art clustering algorithms for color image segmentation. To obtain better local spatial neighborhoods, we first define a multiscale morphological gradient reconstruction operation to obtain a superpixel image with accurate contour. In contrast to traditional neighboring window of fixed size and shape, the superpixel image provides better adaptive and irregular local spatial neighborhoods that are helpful for improving color image segmentation. Second, based on the obtained superpixel image, the original color image is simplified efficiently and its histogram is computed easily by counting the number of pixels in each region of the superpixel image. Finally, we implement FCM with histogram parameter on the superpixel image to obtain the final segmentation result. Experiments performed on synthetic images and real images demonstrate that the proposed algorithm provides better segmentation results and takes less time than state-of-the-art clustering algorithms for color image segmentation.
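The histogram trick is what makes the algorithm fast: FCM is run on the (few) superpixel mean colors, each weighted by its region size, instead of on every pixel. A compact numpy sketch of that weighted FCM update follows; superpixel generation via morphological gradient reconstruction is assumed done, and the initialization and stopping rule are simplified.

import numpy as np

def weighted_fcm(colors, counts, n_clusters=3, m=2.0, n_iter=50, eps=1e-8):
    # colors: (R, 3) mean color of each superpixel region;
    # counts: (R,) number of pixels per region (the histogram weights).
    rng = np.random.default_rng(0)
    centers = colors[rng.choice(len(colors), n_clusters, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(colors[:, None, :] - centers[None, :, :], axis=-1) + eps
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)          # fuzzy memberships (R, C)
        w = counts[:, None] * u ** m                      # weight by region size
        centers = (w.T @ colors) / w.sum(axis=0)[:, None]
    return u.argmax(axis=1), centers                      # hard labels per region

rng = np.random.default_rng(2)
region_colors = rng.random((200, 3))                      # e.g., mean RGB of 200 superpixels
region_sizes = rng.integers(10, 500, 200).astype(float)
labels, centers = weighted_fcm(region_colors, region_sizes)
print(labels.shape, centers.shape)                        # (200,) (3, 3)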

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A novel loss function, named Adaptive Wing loss, is proposed that is able to adapt its shape to different types of ground truth heatmap pixels, penalizing loss more on foreground pixels and less on background pixels.
Abstract: Heatmap regression with a deep network has become one of the mainstream approaches to localize facial landmarks. However, the loss function for heatmap regression is rarely studied. In this paper, we analyze the ideal loss function properties for heatmap regression in face alignment problems. Then we propose a novel loss function, named Adaptive Wing loss, that is able to adapt its shape to different types of ground truth heatmap pixels. This adaptability penalizes loss more on foreground pixels while less on background pixels. To address the imbalance between foreground and background pixels, we also propose Weighted Loss Map, which assigns high weights on foreground and difficult background pixels to help training process focus more on pixels that are crucial to landmark localization. To further improve face alignment accuracy, we introduce boundary prediction and CoordConv with boundary coordinates. Extensive experiments on different benchmarks, including COFW, 300W and WFLW, show our approach outperforms the state-of-the-art by a significant margin on various evaluation metrics. Besides, the Adaptive Wing loss also helps other heatmap regression tasks.
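The Weighted Loss Map component is straightforward to sketch: foreground heatmap pixels and a dilated band of nearby background pixels receive a higher weight so training focuses on locations that matter for localization. The threshold, dilation size, and weight below are illustrative, and the Adaptive Wing formula itself is deliberately not reproduced here; a plain weighted L2 stands in for it.

import numpy as np
from scipy.ndimage import binary_dilation

def weighted_loss_map(gt_heatmap, threshold=0.2, dilation=3, weight=10.0):
    # gt_heatmap: (H, W) ground-truth heatmap in [0, 1].
    foreground = gt_heatmap >= threshold
    expanded = binary_dilation(foreground, iterations=dilation)   # include difficult background
    return np.where(expanded, weight, 1.0)

def weighted_l2(pred, gt, loss_map):
    return np.mean(loss_map * (pred - gt) ** 2)   # stand-in for the Adaptive Wing loss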

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed edge-aware generative adversarial networks (Ea-GANs) outperform multiple state-of-the-art methods for cross-modality MR image synthesis in both qualitative and quantitative measures.
Abstract: Magnetic resonance (MR) imaging is a widely used medical imaging protocol that can be configured to provide different contrasts between the tissues in human body. By setting different scanning parameters, each MR imaging modality reflects the unique visual characteristic of scanned body part, benefiting the subsequent analysis from multiple perspectives. To utilize the complementary information from multiple imaging modalities, cross-modality MR image synthesis has aroused increasing research interest recently. However, most existing methods only focus on minimizing pixel/voxel-wise intensity difference but ignore the textural details of image content structure, which affects the quality of synthesized images. In this paper, we propose edge-aware generative adversarial networks (Ea-GANs) for cross-modality MR image synthesis. Specifically, we integrate edge information, which reflects the textural structure of image content and depicts the boundaries of different objects in images, to reduce this gap. Corresponding to different learning strategies, two frameworks are proposed, i.e., a generator-induced Ea-GAN (gEa-GAN) and a discriminator-induced Ea-GAN (dEa-GAN). The gEa-GAN incorporates the edge information via its generator, while the dEa-GAN further does this from both the generator and the discriminator so that the edge similarity is also adversarially learned. In addition, the proposed Ea-GANs are 3D-based and utilize hierarchical features to capture contextual information. The experimental results demonstrate that the proposed Ea-GANs, especially the dEa-GAN, outperform multiple state-of-the-art methods for cross-modality MR image synthesis in both qualitative and quantitative measures. Moreover, the dEa-GAN also shows excellent generality to generic image synthesis tasks on benchmark datasets about facades, maps, and cityscapes.
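The edge-aware ingredient can be sketched with Sobel filtering: edge maps are extracted from the real and synthesized images and an L1 term on those maps supplements the usual intensity loss, so boundaries are penalized explicitly. The sketch is 2D for brevity (the paper's networks are 3D) and the weighting is illustrative.

import torch
import torch.nn.functional as F

def sobel_edges(img):
    # img: (B, 1, H, W) single-channel image; returns a gradient-magnitude edge map.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def edge_aware_l1(fake, real, edge_weight=1.0):
    # Intensity L1 plus an L1 on the Sobel edge maps.
    return (fake - real).abs().mean() + edge_weight * (
        sobel_edges(fake) - sobel_edges(real)).abs().mean()

fake, real = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(edge_aware_l1(fake, real).item())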

Proceedings ArticleDOI
TL;DR: A novel pose estimation method that predicts the 3D coordinates of each object pixel without textured models, and a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose.
Abstract: Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images.

Journal ArticleDOI
TL;DR: Simulation results verify the effectiveness and reliability of the proposed image compression and encryption algorithm with considerable compression and security performance.
Abstract: A linear image encryption system is vulnerable to chosen-plaintext attacks. To overcome this weakness and reduce the correlation among pixels of the encrypted image, an effective image compression and encryption algorithm based on a chaotic system and compressive sensing is proposed. The original image is first permuted by the Arnold transform to reduce the block effect in the compression process, and then the resulting image is compressed and re-encrypted by compressive sensing simultaneously. Moreover, a bitwise XOR operation based on the chaotic system is performed on the measurements to change the pixel values, and a pixel scrambling method is employed to disturb the positions of pixels. Besides, the keys used in the chaotic systems are related to the plaintext image. Simulation results verify the effectiveness and reliability of the proposed image compression and encryption algorithm with considerable compression and security performance.
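Two of the ingredients, the Arnold permutation and a chaotic XOR of pixel values, are easy to illustrate with numpy. The sketch below uses the standard Arnold cat map and a logistic-map keystream with arbitrary toy parameters; the compressive-sensing measurement step and the plaintext-dependent key schedule are omitted.

import numpy as np

def arnold_permute(img, rounds=1):
    # Arnold cat map on a square (N, N) image: position (x, y) -> (x + y, x + 2y) mod N.
    n = img.shape[0]
    out = img.copy()
    for _ in range(rounds):
        x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        out = out[(x + y) % n, (x + 2 * y) % n]
    return out

def logistic_keystream(length, x0=0.3141, mu=3.99):
    # Logistic map x <- mu * x * (1 - x), quantized to bytes (toy key parameters).
    xs = np.empty(length)
    x = x0
    for i in range(length):
        x = mu * x * (1 - x)
        xs[i] = x
    return np.floor(xs * 256).astype(np.uint8)

img = (np.arange(64 * 64) % 256).astype(np.uint8).reshape(64, 64)
scrambled = arnold_permute(img, rounds=3)                                  # position scrambling
cipher = scrambled ^ logistic_keystream(scrambled.size).reshape(64, 64)    # value diffusion
print(cipher.dtype, cipher.shape)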

Proceedings ArticleDOI
16 Jun 2019
TL;DR: The application of super-resolution techniques to satellite imagery, and the effects of these techniques on object detection algorithm performance are explored, as well as the performance of object detection as a function of native resolution and object pixel size.
Abstract: We explore the application of super-resolution techniques to satellite imagery, and the effects of these techniques on object detection algorithm performance. Specifically, we enhance satellite imagery beyond its native resolution, and test if we can identify various types of vehicles, planes, and boats with greater accuracy than native resolution. Using the Very Deep Super-Resolution (VDSR) framework and a custom Random Forest Super-Resolution (RFSR) framework we generate enhancement levels of 2x, 4x, and 8x over five distinct resolutions ranging from 30 cm to 4.8 meters. Using both native and super-resolved data, we then train several custom detection models using the SIMRDWN object detection framework. SIMRDWN combines a number of popular object detection algorithms (e.g. SSD, YOLO) into a unified framework that is designed to rapidly detect objects in large satellite images. This approach allows us to quantify the effects of super-resolution techniques on object detection performance across multiple classes and resolutions. We also quantify the performance of object detection as a function of native resolution and object pixel size. For our test set we note that performance degrades from mean average precision (mAP) = 0.53 at 30 cm resolution, down to mAP = 0.11 at 4.8 m resolution. Super-resolving native 30 cm imagery to 15 cm yields the greatest benefit; a 13-36% improvement in mAP. Super-resolution is less beneficial at coarser resolutions, though still provides a small improvement in performance.

Posted Content
TL;DR: This paper proposes an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing.
Abstract: Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This work presents an inter-frame compression approach for neural video coding that can seamlessly build up on different existing neural image codecs and proposes to compute residuals directly in latent space instead of in pixel space to reuse the same image compression network for both key frames and intermediate frames.
Abstract: While there are many deep learning based approaches for single image compression, the field of end-to-end learned video coding has remained much less explored. Therefore, in this work we present an inter-frame compression approach for neural video coding that can seamlessly build up on different existing neural image codecs. Our end-to-end solution performs temporal prediction by optical flow based motion compensation in pixel space. The key insight is that we can increase both decoding efficiency and reconstruction quality by encoding the required information into a latent representation that directly decodes into motion and blending coefficients. In order to account for remaining prediction errors, residual information between the original image and the interpolated frame is needed. We propose to compute residuals directly in latent space instead of in pixel space as this allows to reuse the same image compression network for both key frames and intermediate frames. Our extended evaluation on different datasets and resolutions shows that the rate-distortion performance of our approach is competitive with existing state-of-the-art codecs.

Journal ArticleDOI
TL;DR: In this letter, a constant false alarm rate (CFAR) detector is used for object recognition, and a neural network with a hybrid algorithm of CNN and multilayer perceptron (CNN–MLP) is suggested for image classification.
Abstract: Ship detection in SAR images is widely used for marine monitoring. SAR technology enables monitoring of the intended areas regardless of atmospheric conditions or image acquisition time. In recent years, with advancements in convolutional neural networks (CNNs), one of the best-known approaches in deep learning, the use of deep image features has increased, and CNNs have increasingly been used for SAR image segmentation. The presence of clutter edges, multiple interfering targets, speckle, and sea-level clutter causes false alarms and false detections in detector algorithms. In this letter, constant false alarm rate (CFAR) detection is used for object recognition. This algorithm processes the image pixel by pixel and, based on the statistical information of its neighboring pixels, detects the targeted pixels. Afterward, a neural network with a hybrid algorithm of CNN and multilayer perceptron (CNN–MLP) is suggested for image classification. In this proposal, the algorithm is trained with real SAR images from the Sentinel-1 and RADARSAT-2 satellites and performs better on object classification than the state of the art.
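Cell-averaging CFAR, the pixel-by-pixel detection scheme referred to above, is simple to sketch on a 1-D range profile: each cell under test is compared against a threshold scaled from the average power of its training neighbors, with guard cells excluded. Window sizes and the scale factor below are illustrative.

import numpy as np

def ca_cfar(signal, num_train=8, num_guard=2, scale=3.0):
    # Cell-averaging CFAR over a 1-D power signal; returns boolean detections.
    n = len(signal)
    detections = np.zeros(n, dtype=bool)
    half = num_train // 2 + num_guard
    for i in range(half, n - half):
        left = signal[i - half: i - num_guard]            # training cells (left side)
        right = signal[i + num_guard + 1: i + half + 1]   # training cells (right side)
        noise_level = np.concatenate([left, right]).mean()
        detections[i] = signal[i] > scale * noise_level
    return detections

rng = np.random.default_rng(0)
profile = rng.exponential(1.0, 200)        # clutter-like background
profile[60] = 30.0                         # a strong target
print(np.nonzero(ca_cfar(profile))[0])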

Journal ArticleDOI
TL;DR: A first-hand classification of region-based fusion methods is carried out, and a comprehensive list of objective fusion evaluation metrics is highlighted to compare the existing methods.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: The potential of event camera-based conditional generative adversarial networks to create images/videos from an adjustable portion of the event data stream is unlocked, and the results are evaluated by comparing them with the intensity images captured on the same pixel grid-line of events.
Abstract: Event cameras have a lot of advantages over traditional cameras, such as low latency, high temporal resolution, and high dynamic range. However, since the outputs of event cameras are sequences of asynchronous events over time rather than actual intensity images, existing algorithms cannot be directly applied. Therefore, there is a demand to generate intensity images from events for other tasks. In this paper, we unlock the potential of event camera-based conditional generative adversarial networks to create images/videos from an adjustable portion of the event data stream. The stacks of space-time coordinates of events are used as inputs and the network is trained to reproduce images based on the spatio-temporal intensity changes. The usefulness of event cameras for generating high dynamic range (HDR) images even in extreme illumination conditions, as well as non-blurred images under rapid motion, is also shown. In addition, the possibility of generating very high frame rate videos is demonstrated, theoretically up to 1 million frames per second (FPS), since the temporal resolution of event cameras is about 1 microsecond. The proposed methods are evaluated by comparing the results with the intensity images captured on the same pixel grid-line of events using online available real datasets and synthetic datasets produced by the event camera simulator.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: An interactive image segmentation algorithm, which accepts user-annotations about a target object and the background, is proposed and the backpropagating refinement scheme (BRS) is developed, which corrects the mislabeled pixels in the initial result.
Abstract: An interactive image segmentation algorithm, which accepts user-annotations about a target object and the background, is proposed in this work. We convert user-annotations into interaction maps by measuring distances of each pixel to the annotated locations. Then, we perform the forward pass in a convolutional neural network, which outputs an initial segmentation map. However, the user-annotated locations can be mislabeled in the initial result. Therefore, we develop the backpropagating refinement scheme (BRS), which corrects the mislabeled pixels. Experimental results demonstrate that the proposed algorithm outperforms the conventional algorithms on four challenging datasets. Furthermore, we demonstrate the generality and applicability of BRS in other computer vision tasks, by transforming existing convolutional neural networks into user-interactive ones.
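Converting user clicks into interaction maps, as described above, amounts to a distance transform: each pixel stores its Euclidean distance to the nearest annotated location, computed separately for object clicks and background clicks. The truncation value below is illustrative.

import numpy as np
from scipy.ndimage import distance_transform_edt

def interaction_map(clicks, shape, truncate=255.0):
    # clicks: list of (row, col) annotated locations; shape: (H, W).
    seeds = np.ones(shape, dtype=bool)
    for r, c in clicks:
        seeds[r, c] = False                          # zeros at the clicked pixels
    dist = distance_transform_edt(seeds)             # distance to the nearest click
    return np.minimum(dist, truncate)

fg_map = interaction_map([(40, 60), (50, 70)], (128, 128))   # object clicks
bg_map = interaction_map([(5, 5)], (128, 128))               # background clicks
print(fg_map.shape, fg_map.min(), bg_map.max())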

Journal ArticleDOI
Mengya Zhang1, Guangluan Xu1, Keming Chen1, Menglong Yan1, Xian Sun1 
TL;DR: This letter presents a novel supervised change detection method based on a deep siamese semantic network framework, which is trained using an improved triplet loss function for optical aerial images, and produces results comparable to, or even better than, the state-of-the-art methods in terms of F-measure.
Abstract: This letter presents a novel supervised change detection method based on a deep siamese semantic network framework, which is trained using an improved triplet loss function for optical aerial images. The proposed framework can not only extract features directly from image pairs, which include multiscale information and are more abstract as well as robust, but also enhance the interclass separability and the intraclass inseparability by learning semantic relations. The feature vectors of pixel pairs with the same label are closer together, while the feature vectors of pixels with different labels are farther from each other. Moreover, we use the distance of the feature map to detect the changes on the difference map between the image pair. A binarized change map can be obtained by a simple threshold. Experiments on an optical aerial image data set validate that the proposed approach produces results comparable to, or even better than, the state-of-the-art methods in terms of F-measure.
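A minimal PyTorch sketch of the detection side follows: per-pixel embeddings from the siamese branches are compared by Euclidean distance and thresholded into a binary change map, while during training a triplet-style margin loss shapes the embedding space. The margin and threshold are illustrative, and torch's built-in TripletMarginLoss stands in for the paper's improved triplet loss.

import torch

def change_map_from_features(feat_t1, feat_t2, threshold=1.0):
    # feat_t1, feat_t2: (B, D, H, W) per-pixel embeddings from the siamese network.
    distance = torch.norm(feat_t1 - feat_t2, dim=1)   # (B, H, W) difference map
    return distance > threshold                       # binary change map

# Training-time sketch: pull same-label pixel pairs together, push different-label pairs apart.
triplet = torch.nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randn(16, 64) for _ in range(3))
loss = triplet(anchor, positive, negative)
print(loss.item())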