
Showing papers on "Upsampling" published in 2021


Book ChapterDOI
27 Sep 2021
TL;DR: TransBTS, as proposed in this paper, is a novel network based on an encoder-decoder structure that, for the first time, exploits the Transformer within a 3D CNN for MRI brain tumor segmentation.
Abstract: Transformer, which can benefit from global (long-range) information modeling using self-attention mechanisms, has been successful in natural language processing and 2D image classification recently. However, both local and global features are crucial for dense prediction tasks, especially for 3D medical image segmentation. In this paper, we for the first time exploit Transformer in 3D CNN for MRI Brain Tumor Segmentation and propose a novel network named TransBTS based on the encoder-decoder structure. To capture the local 3D context information, the encoder first utilizes 3D CNN to extract the volumetric spatial feature maps. Meanwhile, the feature maps are reformed elaborately for tokens that are fed into Transformer for global feature modeling. The decoder leverages the features embedded by Transformer and performs progressive upsampling to predict the detailed segmentation map. Extensive experimental results on both BraTS 2019 and 2020 datasets show that TransBTS achieves comparable or higher results than previous state-of-the-art 3D methods for brain tumor segmentation on 3D MRI scans. The source code is available at https://github.com/Wenxuan-1119/TransBTS.
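Below is a minimal PyTorch sketch of the encoder-Transformer-decoder bridge the abstract describes (a simplification for illustration, not the released TransBTS code): volumetric CNN features are flattened into tokens for global modelling, reshaped back into a 3D grid, and progressively upsampled to the segmentation map. Module names, channel sizes, and the number of stages are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBridge3D(nn.Module):
    """Flatten 3D CNN features into tokens, model them globally, reshape back."""
    def __init__(self, channels=128, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feat):                      # feat: (B, C, D, H, W)
        b, c, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, D*H*W, C)
        tokens = self.transformer(tokens)         # global (long-range) modelling
        return tokens.transpose(1, 2).reshape(b, c, d, h, w)

class ProgressiveDecoder3D(nn.Module):
    """Upsample step by step (2x per stage) towards the segmentation map."""
    def __init__(self, channels=128, num_classes=4, stages=3):
        super().__init__()
        blocks, ch = [], channels
        for _ in range(stages):
            blocks.append(nn.Sequential(
                nn.Conv3d(ch, ch // 2, 3, padding=1), nn.ReLU(inplace=True)))
            ch //= 2
        self.blocks = nn.ModuleList(blocks)
        self.head = nn.Conv3d(ch, num_classes, 1)

    def forward(self, x):
        for blk in self.blocks:
            x = F.interpolate(x, scale_factor=2, mode='trilinear',
                              align_corners=False)
            x = blk(x)
        return self.head(x)

# toy usage: an encoder feature map of an MRI volume
feat = torch.randn(1, 128, 16, 16, 16)
out = ProgressiveDecoder3D()(TransformerBridge3D()(feat))
print(out.shape)  # torch.Size([1, 4, 128, 128, 128])
```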

306 citations


Journal ArticleDOI
TL;DR: This work proposes a local-global fusion model for fixation prediction on an RGB-D image that combines global and local information through a content-aware fusion module (CAFM) structure.
Abstract: Many RGB-D visual attention models have been proposed with diverse fusion models; thus, the main challenge lies in the differences in the results between the different models. To address this challenge, we propose a local-global fusion model for fixation prediction on an RGB-D image; this method combines global and local information through a content-aware fusion module (CAFM) structure. First, it comprises a channel-based upsampling block for exploiting global contextual information and scaling up this information to the same resolution as the input. Second, our Deconv block contains a contrast feature module to utilize multilevel local features stage-by-stage for superior local feature representation. The experimental results demonstrate that the proposed model exhibits competitive performance on two databases.

117 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Wang et al. as mentioned in this paper proposed a novel network capable of real-time dehazing of 4K images on a single GPU, consisting of three deep CNNs: the first extracts haze-relevant features at a reduced resolution of the hazy input and then fits locally-affine models in the bilateral space.
Abstract: Convolutional neural networks (CNNs) have achieved significant success in the single image dehazing task. Unfortunately, most existing deep dehazing models have high computational complexity, which hinders their application to high-resolution images, especially for UHD (ultra-high-definition) or 4K resolution images. To address the problem, we propose a novel network capable of real-time dehazing of 4K images on a single GPU, which consists of three deep CNNs. The first CNN extracts haze-relevant features at a reduced resolution of the hazy input and then fits locally-affine models in the bilateral space. Another CNN is used to learn multiple full-resolution guidance maps corresponding to the learned bilateral model. As a result, the feature maps with high-frequency can be reconstructed by multi-guided bilateral upsampling. Finally, the third CNN fuses the high-quality feature maps into a dehazed image. In addition, we create a large-scale 4K image dehazing dataset to support the training and testing of compared models. Experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art dehazing approaches on various benchmarks.

80 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: This work proposes to disentangle the task based on its multi-objective nature and formulate two cascaded sub-networks, a dense generator and a spatial refiner, and designs a pair of local and global refinement units in the spatial refiner to evolve a coarse feature map.
Abstract: Point clouds produced by 3D scanning are often sparse, non-uniform, and noisy. Recent upsampling approaches aim to generate a dense point set, while achieving both distribution uniformity and proximity-to-surface, and possibly amending small holes, all in a single network. After revisiting the task, we propose to disentangle the task based on its multi-objective nature and formulate two cascaded sub-networks, a dense generator and a spatial refiner. The dense generator infers a coarse but dense output that roughly describes the underlying surface, while the spatial refiner further fine-tunes the coarse output by adjusting the location of each point. Specifically, we design a pair of local and global refinement units in the spatial refiner to evolve a coarse feature map. Also, in the spatial refiner, we regress a per-point offset vector to further adjust the coarse outputs at a fine scale. Extensive qualitative and quantitative results on both synthetic and real-scanned datasets demonstrate the superiority of our method over state-of-the-art methods. The code is publicly available at https://github.com/liruihui/Dis-PU.

78 citations


Journal Article
Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu
TL;DR: Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
Abstract: This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
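A minimal sketch of duration-based Gaussian upsampling in the spirit of the description above (my simplification, not the paper's implementation): each phoneme encoding is spread over output frames with Gaussian weights centred at its predicted position, with per-phoneme standard deviations. Function and variable names are assumptions.

```python
import torch

def gaussian_upsample(encodings, durations, sigmas):
    """
    encodings: (B, N, C) phoneme-level encoder outputs
    durations: (B, N) predicted durations in frames (float)
    sigmas:    (B, N) predicted ranges (std devs) per phoneme
    returns    (B, T, C) frame-level features, T = rounded total duration
    """
    ends = torch.cumsum(durations, dim=1)             # (B, N) segment end times
    centers = ends - 0.5 * durations                   # (B, N) segment centres
    total = int(torch.round(durations.sum(dim=1)).max().item())
    t = torch.arange(total, dtype=encodings.dtype,
                     device=encodings.device) + 0.5    # (T,) frame positions
    # Gaussian weight of every frame for every phoneme, normalised over phonemes
    dist = t.view(1, -1, 1) - centers.unsqueeze(1)     # (B, T, N)
    logits = -0.5 * (dist / sigmas.unsqueeze(1)) ** 2 - torch.log(sigmas).unsqueeze(1)
    weights = torch.softmax(logits, dim=-1)            # (B, T, N)
    return weights @ encodings                         # (B, T, C)

# toy usage: 3 phonemes, 8-dim encodings, total duration of 10 frames
enc = torch.randn(1, 3, 8)
dur = torch.tensor([[3.0, 5.0, 2.0]])
sig = torch.ones(1, 3)
print(gaussian_upsample(enc, dur, sig).shape)  # torch.Size([1, 10, 8])
```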

71 citations


Journal ArticleDOI
TL;DR: This paper proposes a three-stream self-attention network (TSNet) for indoor semantic segmentation comprising two asymmetric input streams (asymmetric encoder structure) and a cross-modal distillation stream with a self-Attention module.
Abstract: This article proposes a three-stream self-attention network (TSNet) for indoor semantic segmentation comprising two asymmetric input streams (asymmetric encoder structure) and a cross-modal distillation stream with a self-attention module. The two asymmetric input streams are ResNet34 for the red-green-blue (RGB) stream and VGGNet16 for the depth stream. Accompanying the RGB and depth streams, a cross-modal distillation stream with a self-attention module extracts new RGB plus depth features in each level in the bottom-up path. In addition, while using bilinear upsampling to recover the spatial resolution of the feature map, we incorporated the feature information of both the RGB flow and the depth flow through the self-attention module. We constructed the NYU Depth V2 dataset to evaluate the TSNet and achieved results comparable to those of current state-of-the-art methods.

69 citations


Journal ArticleDOI
TL;DR: This paper proposes a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sets of neighbors and the corresponding weights adaptively for each pixel, and shows that the weighted averaging process with sparsely sampled 3 × 3 kernels outperforms the state of the art by a significant margin in all cases.
Abstract: Joint image filters are used to transfer structural details from a guidance picture used as a prior to a target image, in tasks such as enhancing spatial resolution and suppressing noise. Previous methods based on convolutional neural networks (CNNs) combine nonlinear activations of spatially-invariant kernels to estimate structural details and regress the filtering result. In this paper, we instead learn explicitly sparse and spatially-variant kernels. We propose a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sets of neighbors and the corresponding weights adaptively for each pixel. The filtering result is then computed as a weighted average. We also propose a fast version of DKN that runs about seventeen times faster for an image of size 640 × 480. We demonstrate the effectiveness and flexibility of our models on the tasks of depth map upsampling, saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. In particular, we show that the weighted averaging process with sparsely sampled 3 × 3 kernels outperforms the state of the art by a significant margin in all cases.
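A hedged sketch of the spatially-variant weighted-averaging idea (simplified to a fixed dilated 3×3 neighbourhood rather than the learned per-pixel offsets of the actual DKN): a small CNN predicts per-pixel weights from the guidance image, and the target is filtered as a weighted average of its neighbours. All module names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseKernelFilter(nn.Module):
    """Per-pixel weighted average over a dilated 3x3 neighbourhood.

    The weights are predicted from the guidance image, so the filter is
    spatially variant, unlike an ordinary convolution."""
    def __init__(self, guide_channels=3, k=3, dilation=4):
        super().__init__()
        self.k, self.dilation = k, dilation
        self.weight_net = nn.Sequential(
            nn.Conv2d(guide_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, k * k, 3, padding=1))

    def forward(self, target, guide):
        # target: (B, 1, H, W) e.g. a coarsely upsampled depth map
        # guide:  (B, 3, H, W) high-resolution RGB guidance
        b, _, h, w = target.shape
        weights = torch.softmax(self.weight_net(guide), dim=1)   # (B, k*k, H, W)
        pad = self.dilation * (self.k // 2)
        neigh = F.unfold(target, self.k, dilation=self.dilation,
                         padding=pad)                            # (B, k*k, H*W)
        neigh = neigh.view(b, self.k * self.k, h, w)
        return (weights * neigh).sum(dim=1, keepdim=True)        # weighted average

# toy usage for guided depth map filtering
depth = torch.randn(1, 1, 64, 64)
rgb = torch.randn(1, 3, 64, 64)
print(SparseKernelFilter()(depth, rgb).shape)  # torch.Size([1, 1, 64, 64])
```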

68 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: HITNet as discussed by the authors uses a differentiable 2D geometric propagation and warping mechanism to infer disparity hypotheses and achieves state-of-the-art accuracy for real-time stereo matching.
Abstract: This paper presents HITNet, a novel neural network architecture for real-time stereo matching. Contrary to many recent neural network approaches that operate on a full cost volume and rely on 3D convolutions, our approach does not explicitly build a volume and instead relies on a fast multi-resolution initialization step, differentiable 2D geometric propagation and warping mechanisms to infer disparity hypotheses. To achieve a high level of accuracy, our network not only geometrically reasons about disparities but also infers slanted plane hypotheses allowing to more accurately perform geometric warping and upsampling operations. Our architecture is inherently multi-resolution allowing the propagation of information across different levels. Multiple experiments prove the effectiveness of the proposed approach at a fraction of the computation required by state-of-the-art methods. At the time of writing, HITNet ranks 1st-3rd on all the metrics published on the ETH3D website for two view stereo, ranks 1st on most of the metrics amongst all the end-to-end learning approaches on Middlebury-v3, ranks 1st on the popular KITTI 2012 and 2015 benchmarks among the published methods faster than 100 ms.

65 citations


Proceedings ArticleDOI
20 Jun 2021
Abstract: Convolution is one of the basic building blocks of CNN architectures. Despite its common use, standard convolution has two main shortcomings: it is content-agnostic and computation-heavy. Dynamic filters are content-adaptive, while further increasing the computational overhead. Depth-wise convolution is a lightweight variant, but it usually leads to a drop in CNN performance or requires a larger number of channels. In this work, we propose the Decoupled Dynamic Filter (DDF) that can simultaneously tackle both of these shortcomings. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. This decomposition considerably reduces the number of parameters and limits computational costs to the same level as depth-wise convolution. Meanwhile, we observe a significant boost in performance when replacing standard convolution with DDF in classification networks. ResNet50/101 improve by 1.9% and 1.3% in top-1 accuracy, while their computational costs are reduced by nearly half. Experiments on the detection and joint upsampling networks also demonstrate the superior performance of the DDF upsampling variant (DDF-Up) in comparison with standard convolution and specialized content-adaptive layers. The project page with code is available.
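A minimal sketch of the decoupling idea (an illustration under assumed names, not the official DDF code, and omitting the filter normalisation the paper applies): a per-pixel spatial filter and a per-channel filter are predicted separately and combined multiplicatively, so the full per-pixel depthwise filter bank is never predicted directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledDynamicFilter(nn.Module):
    """Content-adaptive depthwise filtering from two small predictions:
    a spatial filter per pixel and a k*k filter per channel."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.spatial_branch = nn.Conv2d(channels, k * k, 1)        # (B, k*k, H, W)
        self.channel_branch = nn.Sequential(                       # (B, C*k*k)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels * k * k))

    def forward(self, x):                                          # (B, C, H, W)
        b, c, h, w = x.shape
        spatial = self.spatial_branch(x).unsqueeze(1)              # (B, 1, k*k, H, W)
        channel = self.channel_branch(x).view(b, c, self.k * self.k, 1, 1)
        kernel = spatial * channel                                 # (B, C, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)         # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        return (kernel * patches).sum(dim=2)                       # depthwise response

# toy usage
x = torch.randn(2, 16, 32, 32)
print(DecoupledDynamicFilter(16)(x).shape)  # torch.Size([2, 16, 32, 32])
```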

60 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: This paper proposes NodeShuffle, a point upsampling module that uses a Graph Convolutional Network (GCN) to better encode local point information from point neighborhoods and that can be incorporated into any point cloud upsampling pipeline.
Abstract: The effectiveness of learning-based point cloud upsampling pipelines heavily relies on the upsampling modules and feature extractors used therein. For the point upsampling module, we propose a novel model called NodeShuffle, which uses a Graph Convolutional Network (GCN) to better encode local point information from point neighborhoods. NodeShuffle is versatile and can be incorporated into any point cloud upsampling pipeline. Extensive experiments show how NodeShuffle consistently improves state-of-the-art upsampling methods. For feature extraction, we also propose a new multi-scale point feature extractor, called Inception DenseGCN. By aggregating features at multiple scales, this feature extractor enables further performance gain in the final upsampled point clouds. We combine Inception DenseGCN with NodeShuffle into a new point upsampling pipeline called PU-GCN. PU-GCN sets new state-of-the-art performance with far fewer parameters and more efficient inference. Our code is publicly available at https://github.com/guochengqian/PU-GCN.
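A minimal sketch of the "shuffle" step (a simplification, not the PU-GCN code): per-point features are expanded r-fold and then reshaped into r times as many points before coordinates are regressed. NodeShuffle performs the expansion with a graph convolution over point neighbourhoods; a shared point-wise MLP is used here only to keep the sketch self-contained, and all names are assumptions.

```python
import torch
import torch.nn as nn

class FeatureShuffleUpsampler(nn.Module):
    """Expand per-point features r-fold and 'shuffle' them into r*N points."""
    def __init__(self, in_channels=64, r=4):
        super().__init__()
        self.r = r
        self.expand = nn.Sequential(
            nn.Conv1d(in_channels, in_channels * r, 1), nn.ReLU(inplace=True),
            nn.Conv1d(in_channels * r, in_channels * r, 1))
        self.to_xyz = nn.Conv1d(in_channels, 3, 1)     # coordinate regression

    def forward(self, feats):                          # (B, C, N)
        b, c, n = feats.shape
        x = self.expand(feats)                         # (B, C*r, N)
        # split the expanded channels into r groups and lay them out as new points
        x = x.view(b, c, self.r, n).permute(0, 1, 3, 2).reshape(b, c, n * self.r)
        return self.to_xyz(x)                          # (B, 3, r*N) dense points

# toy usage: upsample 256 points by a factor of 4
feats = torch.randn(1, 64, 256)
print(FeatureShuffleUpsampler()(feats).shape)  # torch.Size([1, 3, 1024])
```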

51 citations


Journal ArticleDOI
TL;DR: In this article, a Multi-scale Dense Cross Network (MDCN) is proposed to make full use of multi-scale features and learn the inter-scale correlation between different upsampling factors.
Abstract: Convolutional neural networks have been proven to be of great benefit for single-image super-resolution (SISR). However, previous works do not make full use of multi-scale features and ignore the inter-scale correlation between different upsampling factors, resulting in sub-optimal performance. Instead of blindly increasing the depth of the network, we are committed to mining image features and learning the inter-scale correlation between different upsampling factors. To achieve this, we propose a Multi-scale Dense Cross Network (MDCN), which achieves great performance with fewer parameters and less execution time. MDCN consists of multi-scale dense cross blocks (MDCBs), a hierarchical feature distillation block (HFDB), and a dynamic reconstruction block (DRB). Among them, MDCB aims to detect multi-scale features and maximize the use of the image feature flow at different scales, HFDB focuses on adaptively recalibrating channel-wise feature responses to achieve feature distillation, and DRB attempts to reconstruct SR images with different upsampling factors in a single model. It is worth noting that all these modules can run independently, which means they can be selectively plugged into any CNN model to improve model performance. Extensive experiments show that MDCN achieves competitive results in SISR, especially in the reconstruction task with multiple upsampling factors. The code is provided at https://github.com/MIVRC/MDCN-PyTorch.

Posted Content
Prafulla Dhariwal, Alex Nichol
TL;DR: In this paper, a series of architecture ablations and classifier guidance are used to trade off diversity for fidelity using gradients from a classifier, achieving an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet 256$\times$256, and 7.72 on ImageNet 512$\times$512.
Abstract: We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet 256$\times$256, and 7.72 on ImageNet 512$\times$512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256$\times$256 and 3.85 on ImageNet 512$\times$512. We release our code at this https URL
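A minimal sketch of the classifier-guidance step (an illustration of the general idea, not the released code): the denoising mean is shifted by the scaled, variance-weighted gradient of the classifier's log-probability for the target class. In practice the classifier is trained on noisy images and also receives the timestep; the toy classifier below is a placeholder.

```python
import torch

def classifier_guided_mean(mean, variance, x_t, y, classifier, scale=1.0):
    """Shift the diffusion model's predicted mean towards samples the
    classifier judges more likely to belong to class y.

    mean, variance: (B, C, H, W) Gaussian parameters predicted by the diffusion model
    x_t:            (B, C, H, W) current noisy sample
    y:              (B,) target class labels
    classifier:     returns class logits for x_t (noise-aware in the real setting)
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]   # d log p(y|x_t) / d x_t
    # gradient ascent on log p(y|x), scaled by the per-pixel variance
    return mean + scale * variance * grad

# toy usage with a hypothetical stand-in classifier
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x_t = torch.randn(2, 3, 8, 8)
mean, var = torch.randn(2, 3, 8, 8), torch.ones(2, 3, 8, 8) * 0.1
y = torch.tensor([1, 7])
print(classifier_guided_mean(mean, var, x_t, y, classifier).shape)
```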

Journal ArticleDOI
TL;DR: A deep convolutional network built on the mature Gaussian–Laplacian pyramid for pansharpening (LPPNet) is presented, in which each pyramid level is handled by a spatial subnetwork in a divide-and-conquer way to make the network more efficient.
Abstract: Hyperspectral (HS) pansharpening aims to create a pansharpened image that integrates the spatial details of the panchromatic (PAN) image and the spectral content of the HS image. In this article, we present a deep convolutional network within the mature Gaussian-Laplacian pyramid for pansharpening (LPPNet). The overall structure of LPPNet is a cascade of the Laplacian pyramid dense network with a similar structure at each pyramid level. Following the general idea of multiresolution analysis (MRA), the subband residuals of the desired HS images are extracted from the PAN image and injected into the upsampled HS image to reconstruct the high-resolution HS images level by level. Applying the mature Laplace pyramid decomposition technique to the convolution neural network (CNN) can simplify the pansharpening problem into several pyramid-level learning problems so that the pansharpening problem can be solved with a shallow CNN with fewer parameters. Specifically, the Laplacian pyramid technology is used to decompose the image into different levels that can differentiate large- and small-scale details, and each level is handled by a spatial subnetwork in a divide-and-conquer way to make the network more efficient. Experimental results show that the proposed LPPNet method performs favorably against some state-of-the-art pansharpening methods in terms of objective indexes and subjective visual appearance.
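A short sketch of the Laplacian pyramid decomposition and level-by-level reconstruction that this design builds on (generic pyramid code, not the LPPNet network itself; average pooling stands in for the Gaussian blur):

```python
import torch
import torch.nn.functional as F

def build_laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass (detail) levels plus a low-res base."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, 2)                       # blur + decimate
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear',
                           align_corners=False)
        pyramid.append(current - up)                          # band-pass residual
        current = down
    pyramid.append(current)                                   # low-frequency base
    return pyramid

def reconstruct_from_pyramid(pyramid):
    """Invert the decomposition: upsample the base, add back detail level by level."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=residual.shape[-2:], mode='bilinear',
                                align_corners=False) + residual
    return current

# toy usage: reconstruction is exact up to floating-point error
img = torch.randn(1, 4, 64, 64)        # e.g. a 4-band image patch
pyr = build_laplacian_pyramid(img)
print((reconstruct_from_pyramid(pyr) - img).abs().max())   # ~1e-7
```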

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This work shows that high frequency Fourier spectrum decay discrepancies are not inherent characteristics of existing CNN-based generative models, and that such features are not robust for synthetic image detection.
Abstract: CNN-based generative modelling has evolved to produce synthetic images indistinguishable from real images in the RGB pixel space. Recent works have observed that CNN-generated images share a systematic shortcoming in replicating high frequency Fourier spectrum decay attributes. Furthermore, these works have successfully exploited this systematic shortcoming to detect CNN-generated images, reporting up to 99% accuracy across multiple state-of-the-art GAN models. In this work, we investigate the validity of assertions claiming that CNN-generated images are unable to achieve high frequency spectral decay consistency. We meticulously construct a counterexample space of high frequency spectral decay consistent CNN-generated images emerging from our handcrafted experiments using DCGAN, LSGAN, WGAN-GP and StarGAN, where we empirically show that this frequency discrepancy can be avoided by a minor architecture change in the last upsampling operation. We subsequently use images from this counterexample space to successfully bypass the recently proposed forensics detector, which leverages high frequency Fourier spectrum decay attributes for CNN-generated image detection. Through this study, we show that high frequency Fourier spectrum decay discrepancies are not inherent characteristics of existing CNN-based generative models, contrary to the belief of some existing work, and such features are not robust for synthetic image detection. Our results prompt re-thinking of using high frequency Fourier spectrum decay attributes for CNN-generated image detection. Code and models are available at https://keshik6.github.io/Fourier-Discrepancies-CNN-Detection/

Proceedings ArticleDOI
01 Jun 2021
TL;DR: Xu et al. as mentioned in this paper presented a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid, which can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet.
Abstract: Real-time performance of stereo matching networks is important for many applications, such as automatic driving, robot navigation and augmented reality (AR). Although significant progress has been made in stereo matching networks in recent years, it is still challenging to balance real-time performance and accuracy. In this paper, we present a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid. The slicing layer is parameter-free, which allows us to obtain a high quality cost volume of high resolution from a low-resolution cost volume under the guide of the learned guidance map efficiently. The proposed cost volume upsampling module can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet. The resulting networks are accelerated several times while maintaining comparable accuracy. Furthermore, we design a real-time network (named BGNet) based on this module, which outperforms existing published real-time deep stereo matching networks, as well as some complex networks on the KITTI stereo datasets. The code is available at https://github.com/YuhuaXu/BGNet.
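A hedged sketch of the parameter-free slicing operation in the spirit of bilateral-grid slicing (BGNet slices a cost volume; a generic feature grid is used here): the low-resolution 3D grid is sampled at full resolution with trilinear interpolation, using pixel coordinates for two axes and the learned guidance value for the third. All names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guidance):
    """
    grid:     (B, C, D, Hg, Wg) low-resolution bilateral grid (D = guidance bins)
    guidance: (B, 1, H, W) full-resolution guidance map with values in [0, 1]
    returns   (B, C, H, W) full-resolution features sliced from the grid.
              The operation has no learnable parameters.
    """
    b, _, h, w = guidance.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=grid.device),
        torch.linspace(-1, 1, w, device=grid.device), indexing='ij')
    xs = xs.expand(b, h, w)
    ys = ys.expand(b, h, w)
    zs = guidance.squeeze(1) * 2 - 1                         # map [0,1] -> [-1,1]
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)  # (B, 1, H, W, 3)
    sliced = F.grid_sample(grid, coords, mode='bilinear',
                           align_corners=True)               # (B, C, 1, H, W)
    return sliced.squeeze(2)

# toy usage: slice an 8-bin grid up to full resolution
grid = torch.randn(1, 16, 8, 32, 64)      # coarse grid (e.g. from a cost volume)
guide = torch.rand(1, 1, 256, 512)        # learned full-resolution guidance
print(slice_bilateral_grid(grid, guide).shape)  # torch.Size([1, 16, 256, 512])
```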

Journal ArticleDOI
TL;DR: This paper proposes an improved 3D object detection method based on a two-stage detector called the Improved Point-Voxel Region Convolutional Neural Network (IPV-RCNN), which contains online training for data augmentation, upsampling convolution and k-means clustering for the bounding box to achieve 3D detection tasks from raw point clouds.
Abstract: Recently, 3D object detection based on deep learning has achieved impressive performance in complex indoor and outdoor scenes. Among the methods, the two-stage detection method performs the best; however, this method still needs improved accuracy and efficiency, especially for small size objects or autonomous driving scenes. In this paper, we propose an improved 3D object detection method based on a two-stage detector called the Improved Point-Voxel Region Convolutional Neural Network (IPV-RCNN). Our proposed method contains online training for data augmentation, upsampling convolution and k-means clustering for the bounding box to achieve 3D detection tasks from raw point clouds. The evaluation results on the KITTI 3D dataset show that the IPV-RCNN achieved a 96% mAP, which is 3% more accurate than the state-of-the-art detectors.

Posted Content
TL;DR: CANINE as discussed by the authors is a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a new attentional change detection network based on a Siamese U-shaped structure (SUACDNet), which focuses on the global information, difference information, and similarity information of bitemporal images, respectively.

Posted Content
TL;DR: Zhang et al. as mentioned in this paper proposed a channel enhancement feature pyramid network (CE-FPN) with three simple yet effective modules to alleviate the loss of semantical information due to channel reduction.
Abstract: Feature pyramid network (FPN) has been an effective framework to extract multi-scale features in object detection. However, current FPN-based methods mostly suffer from the intrinsic flaw of channel reduction, which brings about the loss of semantical information. And the miscellaneous fused feature maps may cause serious aliasing effects. In this paper, we present a novel channel enhancement feature pyramid network (CE-FPN) with three simple yet effective modules to alleviate these problems. Specifically, inspired by sub-pixel convolution, we propose a sub-pixel skip fusion method to perform both channel enhancement and upsampling. Instead of the original 1x1 convolution and linear upsampling, it mitigates the information loss due to channel reduction. Then we propose a sub-pixel context enhancement module for extracting more feature representations, which is superior to other context methods due to the utilization of rich channel information by sub-pixel convolution. Furthermore, a channel attention guided module is introduced to optimize the final integrated features on each level, which alleviates the aliasing effect only with a few computational burdens. Our experiments show that CE-FPN achieves competitive performance compared to state-of-the-art FPN-based detectors on MS COCO benchmark.
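A minimal illustration of the sub-pixel idea referred to above (contrasting the two upsampling routes only, not the full CE-FPN module): instead of a 1×1 convolution that discards channels followed by bilinear upsampling, sub-pixel convolution (PixelShuffle) rearranges channels into space, so the rich channel information contributes directly to the upsampled map.

```python
import torch
import torch.nn as nn

# Conventional FPN lateral step: reduce channels with a 1x1 conv, then
# bilinearly upsample the top-down feature before fusion.
conventional = nn.Sequential(
    nn.Conv2d(1024, 256, kernel_size=1),
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))

# Sub-pixel alternative: keep the rich channels and trade them for resolution.
# 1024 channels = 256 * 2 * 2, so PixelShuffle(2) yields 256 channels at 2x size.
subpixel = nn.PixelShuffle(upscale_factor=2)

c5 = torch.randn(1, 1024, 16, 16)           # a deep, low-resolution feature map
print(conventional(c5).shape)                # torch.Size([1, 256, 32, 32])
print(subpixel(c5).shape)                    # torch.Size([1, 256, 32, 32])
```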

Journal ArticleDOI
TL;DR: In this paper, a meta-subnetwork is learned to adjust the weights of the residual graph convolution (RGC) blocks dynamically, and a farthest sampling block is adopted to sample different numbers of points.
Abstract: Point cloud upsampling is vital for the quality of the mesh in three-dimensional reconstruction. Recent research on point cloud upsampling has achieved great success due to the development of deep learning. However, the existing methods regard point cloud upsampling of different scale factors as independent tasks. Thus, the methods need to train a specific model for each scale factor, which is both inefficient and impractical for storage and computation in real applications. To address this limitation, in this work, we propose a novel method called "Meta-PU", the first to support point cloud upsampling of arbitrary scale factors with a single model. In the Meta-PU method, besides the backbone network consisting of residual graph convolution (RGC) blocks, a meta-subnetwork is learned to adjust the weights of the RGC blocks dynamically, and a farthest sampling block is adopted to sample different numbers of points. Together, these two blocks enable our Meta-PU to continuously upsample the point cloud with arbitrary scale factors by using only a single model. In addition, the experiments reveal that training on multiple scales simultaneously is mutually beneficial. Thus, Meta-PU even outperforms the existing methods trained for a specific scale factor only.

Journal ArticleDOI
TL;DR: The recognition results and the comparison with the other target detectors demonstrate the effectiveness of the proposed YOLOv4 structure and the method of data preprocessing.
Abstract: The YOLOv4 neural network is employed for underwater target recognition. To improve the accuracy and speed of recognition, the structure of YOLOv4 is modified by replacing the upsampling module with a deconvolution module and by incorporating depthwise separable convolution into the network. Moreover, the training set used in the YOLO network is preprocessed by using a modified mosaic augmentation, in which the gray world algorithm is used to derive two images when performing mosaic augmentation. The recognition results and the comparison with the other target detectors demonstrate the effectiveness of the proposed YOLOv4 structure and the method of data preprocessing. According to both subjective and objective evaluation, the proposed target recognition strategy can effectively improve the accuracy and speed of underwater target recognition and reduce the requirement of hardware performance as well.
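A hedged sketch of the two structural modifications described (illustrative modules with assumed channel sizes, not the paper's network definition): a learnable transposed convolution in place of the parameter-free upsampling layer, and a depthwise separable convolution in place of a standard convolution.

```python
import torch
import torch.nn as nn

# Replace nearest/bilinear upsampling with a learnable 2x transposed convolution.
upsample_fixed = nn.Upsample(scale_factor=2, mode='nearest')
upsample_learned = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)

# Depthwise separable convolution: a per-channel spatial filter followed by a
# 1x1 pointwise mix, with far fewer parameters than a dense 3x3 convolution.
def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1))

x = torch.randn(1, 256, 20, 20)
print(upsample_fixed(x).shape, upsample_learned(x).shape)   # both (1, 256, 40, 40)

dense = nn.Conv2d(256, 512, 3, padding=1)
separable = depthwise_separable(256, 512)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(dense), params(separable))    # ~1.18M vs ~0.13M parameters
```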

Journal ArticleDOI
TL;DR: Based on the excellent adaptability of deep neural networks (DNNs) and the structured modeling capabilities of probabilistic graphical models, the cascaded fully-convolutional network (CFCN) is proposed to improve the performance of water body detection in high-resolution SAR images.
Abstract: The water body detection in high-resolution synthetic aperture radar (SAR) images is a challenging task due to the changing interference caused by multiple imaging conditions and complex land backgrounds. Inspired by the excellent adaptability of deep neural networks (DNNs) and the structured modeling capabilities of probabilistic graphical models, the cascaded fully-convolutional network (CFCN) is proposed to improve the performance of water body detection in high-resolution SAR images. First, for the resolution loss caused by convolutions with large stride in traditional convolutional neural network (CNN), the fully-convolutional upsampling pyramid networks (UPNs) are proposed to suppress this loss and realize pixel-wise water body detection. Then considering blurred water boundary, the fully-convolutional conditional random fields (FC-CRFs) are introduced to UPNs, which reduce computational complexity and lead to the automatic learning of Gaussian kernels in CRFs and the higher boundary accuracy. Furthermore, to eliminate the inefficient training caused by imbalanced categorical distribution in the training data set, a novel variable focal loss (VFL) function is proposed, which replaces the constant weighting factor of focal loss with the frequency-dependent factor. The proposed methods can not only improve the pixel accuracy and boundary accuracy but also perform well in detection robustness and speed. Results of GaoFen-3 SAR images are presented to validate the proposed approaches.
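A hedged sketch of a "variable" focal loss in which the usual constant weighting factor is replaced by a factor derived from class frequencies (this is one reading of the abstract, not the paper's exact definition; names and the frequency estimate are assumptions):

```python
import torch
import torch.nn.functional as F

def variable_focal_loss(logits, targets, gamma=2.0, eps=1e-6):
    """Binary focal loss whose weighting factor depends on class frequency.

    logits, targets: (B, 1, H, W); targets are {0,1} water / background masks.
    The under-represented class (e.g. thin water bodies) is weighted up and the
    frequent class weighted down, instead of using a fixed alpha."""
    probs = torch.sigmoid(logits)
    pos_freq = targets.float().mean().clamp(eps, 1 - eps)   # fraction of positives
    alpha_pos, alpha_neg = 1.0 - pos_freq, pos_freq          # frequency-dependent weights
    pt = torch.where(targets.bool(), probs, 1.0 - probs)     # prob of the true class
    alpha = torch.where(targets.bool(), alpha_pos, alpha_neg)
    loss = -alpha * (1.0 - pt) ** gamma * torch.log(pt.clamp_min(eps))
    return loss.mean()

# toy usage on a highly imbalanced mask (~5% positive pixels)
logits = torch.randn(2, 1, 64, 64)
targets = (torch.rand(2, 1, 64, 64) > 0.95).float()
print(variable_focal_loss(logits, targets))
```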

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an adaptive feature pyramid network based on the feature pyramid network to alleviate feature misalignment and loss of details, which includes two major designs, i.e., adaptive feature upsampling and adaptive feature fusion.
Abstract: In general object detection, scale variation is always a big challenge. At present, feature pyramid networks are employed in numerous methods to alleviate the problems caused by large scale range of objects in object detection, which makes use of multi-level features extracted from the backbone for top-down upsampling and fusion to acquire a set of multi-scale depth image features. However, the feature pyramid network proposed by Ghiasi et al. adopts a simple fusion method, which fails to consider the fusion feature context, and therefore, it is difficult to acquire good features. In addition, the fusion of multi-scale features directly by traditional upsampling is prone to feature misalignment and loss of details. In this paper, an adaptive feature pyramid network is proposed based on the feature pyramid network to alleviate the foregoing potential problems, which includes two major designs, i.e., adaptive feature upsampling and adaptive feature fusion. The adaptive feature upsampling aims to predict a group of sampling points of each pixel through some models, and constitute feature representation of the pixel by feature combination of sampling points, while adaptive feature fusion is to construct pixel-level fusion weights between fusion features through attention mechanism. The experimental results verified the effectiveness of the method proposed in this paper. On the public object detection dataset MS-COCO test-dev, Faster R-CNN model achieved performance improvement of 1.2 AP by virtue of the adaptive feature pyramid network, and FCOS model could achieve performance improvement of 1.0 AP. What’s more, the experiments also validated that the adaptive feature pyramid network proposed herein was more accurate for object localization.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a green channel prior (GCP) to guide the feature extraction and feature upsampling of the whole image for joint denoising and demosaicking.
Abstract: Denoising and demosaicking are essential yet correlated steps to reconstruct a full color image from the raw color filter array (CFA) data. By learning a deep convolutional neural network (CNN), significant progress has been achieved to perform denoising and demosaicking jointly. However, most existing CNN-based joint denoising and demosaicking (JDD) methods work on a single image while assuming additive white Gaussian noise, which limits their performance on real-world applications. In this work, we study the JDD problem for real-world burst images, namely JDD-B. Considering the fact that the green channel has twice the sampling rate and better quality than the red and blue channels in CFA raw data, we propose to use this green channel prior (GCP) to build a GCP-Net for the JDD-B task. In GCP-Net, the GCP features extracted from green channels are utilized to guide the feature extraction and feature upsampling of the whole image. To compensate for the shift between frames, the offset is also estimated from GCP features to reduce the impact of noise. Our GCP-Net can preserve more image structures and details than other JDD methods while removing noise. Experiments on synthetic and real-world noisy images demonstrate the effectiveness of GCP-Net quantitatively and qualitatively.

Journal ArticleDOI
TL;DR: Flexible-PU as discussed by the authors proposes an end-to-end learning-based framework to generate dense point clouds from given sparse point clouds to model the underlying geometric structures of objects/scenes.
Abstract: This paper addresses the problem of generating dense point clouds from given sparse point clouds to model the underlying geometric structures of objects/scenes. To tackle this challenging issue, we propose a novel end-to-end learning-based framework. Specifically, by taking advantage of the linear approximation theorem, we first formulate the problem explicitly, which boils down to determining the interpolation weights and high-order approximation errors. Then, we design a lightweight neural network to adaptively learn unified and sorted interpolation weights as well as the high-order refinements, by analyzing the local geometry of the input point cloud. The proposed method can be interpreted by the explicit formulation, and thus is more memory-efficient than existing ones. In sharp contrast to the existing methods that work only for a pre-defined and fixed upsampling factor, the proposed framework only requires a single neural network with one-time training to handle various upsampling factors within a typical range, which is highly desired in real-world applications. In addition, we propose a simple yet effective training strategy to drive such a flexible ability. In addition, our method can handle non-uniformly distributed and noisy data well. Extensive experiments on both synthetic and real-world data demonstrate the superiority of the proposed method over state-of-the-art methods both quantitatively and qualitatively. The code will be publicly available at https://github.com/ninaqy/Flexible-PU .

Journal ArticleDOI
TL;DR: Guided upsampling and background suppression not only improve counting performance but also enable explainable output visualization; the TasselNetV3 series is introduced.
Abstract: Fast and accurate plant counting tools affect revolution in modern agriculture. Agricultural practitioners, however, expect the output of the tools to be not only accurate but also explainable. Such explainability often refers to the ability to infer which instance is counted. One intuitive way is to generate a bounding box for each instance. Nevertheless, compared with counting by detection, plant counts can be inferred more directly in the local count framework, while one thing reproaching this paradigm is its poor explainability of output visualization. In particular, we find that the poor explainability becomes a bottleneck limiting the counting performance. To address this, we explore the idea of guided upsampling and background suppression where a novel upsampling operator is proposed to allow count redistribution, and segmentation decoders with different fusion strategies are investigated to suppress background, respectively. By integrating them into our previous counting model TasselNetV2, we introduce TasselNetV3 series: TasselNetV3-Lite and TasselNetV3-Seg. We validate the TasselNetV3 series on three public plant counting data sets and a new unmanned aircraft vehicle (UAV)-based data set, covering maize tassels counting, wheat ears counting, and rice plants counting. Extensive results show that guided upsampling and background suppression not only improve counting performance but also enable explainable visualization. Aside from state-of-the-art performance, we have several interesting observations: 1) a limited-receptive-field counter in most cases outperforms a large-receptive-field one; 2) it is sufficient to generate empirical segmentation masks from dotted annotations; 3) middle fusion is a good choice to integrate foreground-background a priori knowledge; and 4) decoupling the learning of counting and segmentation matters.
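A hedged sketch of the count-redistribution idea behind guided upsampling (my simplification, not the TasselNetV3 operator): each low-resolution local count is spread over its r×r high-resolution block in proportion to a guidance map, so block sums, and hence the total count, are preserved. Function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def redistribute_counts(count_map, guidance, r=8, eps=1e-8):
    """
    count_map: (B, 1, h, w) local counts at low resolution
    guidance:  (B, 1, r*h, r*w) non-negative guidance (e.g. foreground evidence)
    returns    (B, 1, r*h, r*w) high-resolution counts whose r x r block sums
               equal the original local counts.
    """
    block_sum = F.avg_pool2d(guidance, r) * (r * r)           # guidance mass per block
    block_sum = F.interpolate(block_sum, scale_factor=r, mode='nearest')
    weights = guidance / (block_sum + eps)                    # sums to 1 inside a block
    counts_up = F.interpolate(count_map, scale_factor=r, mode='nearest')
    return weights * counts_up

# toy usage: the total count is unchanged by the redistribution
counts = torch.rand(1, 1, 4, 4) * 3
guide = torch.rand(1, 1, 32, 32)
out = redistribute_counts(counts, guide)
print(counts.sum().item(), out.sum().item())   # equal up to floating-point error
```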

Journal ArticleDOI
TL;DR: A simple and effective hybrid atrous convolutional network (HACNet) is proposed, which maintains the same spatial resolution throughout the whole architecture and can retain more spatial precision in prediction.
Abstract: Automated pixel-level crack detection is one of the essential tasks in the field of defect inspection. Deep convolutional neural networks, typically using encoder–decoder architectures, have been successfully applied to many crack detection scenes in recent works. However, encoder–decoder networks commonly rely on downsampling and upsampling operations and have a large number of parameters, which may influence the accuracy of crack prediction due to the cracks usually have long, narrow sizes, and the labeled training set is always limited. To address these issues, we propose a simple and effective hybrid atrous convolutional network (HACNet). HACNet maintains the same spatial resolution throughout the whole architecture. It can retain more spatial precision in prediction. HACNet uses atrous convolutions with the proper dilation rates to enlarge the receptive field and a hybrid approach connecting these convolutions to aggregate multiscale features. The resulting architecture can achieve accurate segmentation with relatively few parameters. Evaluations on the public CFD data set, CrackTree206 data set, Deepcrack data set (DCD), and Yang et al. Crack data set (YCD) demonstrate that our method can obtain promising results, compared with other recent approaches. Evaluation on self-collected images and SDNET2018 data set illustrates the good potential of HACNet for practical applications.
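A minimal sketch of a hybrid atrous block in the spirit of the description above (an illustration with assumed channel sizes and dilation rates, not the paper's exact connectivity): parallel 3×3 convolutions with different dilation rates, all at the input resolution, are aggregated to enlarge the receptive field without any downsampling or upsampling.

```python
import torch
import torch.nn as nn

class HybridAtrousBlock(nn.Module):
    """Aggregate multi-scale context while keeping a single spatial resolution."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d)                  # same output size
            for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.fuse(multi_scale))             # resolution unchanged

# toy usage: the feature map keeps its spatial size throughout
x = torch.randn(1, 64, 128, 128)
print(HybridAtrousBlock(64)(x).shape)   # torch.Size([1, 64, 128, 128])
```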

Journal ArticleDOI
Chaojun Shi, Yatong Zhou, Bo Qiu, Dongjiao Guo, Mengci Li
TL;DR: Compared with current state-of-the-art deep-learning-based cloud image segmentation algorithms, CloudU-Net demonstrates better segmentation performance for daytime and nighttime cloud images.
Abstract: Cloud segmentation is one of the hot tasks in the field of weather forecast, environmental monitoring, site selection for observatory, and other areas. In this letter, we proposed a new deep convolutional neural network architecture called CloudU-Net for daytime and nighttime cloud images’ segmentation. The net consists of dilated convolution, activation, batch normalization (BN), max pooling, upsampling, skip connection, and fully connected conditional random field (CRF) layers. The benefits of the net architecture are four aspects: First, the dilated convolution increases the receptive field of the filters to obtain more information of the context without increasing the extra amount of computation and the extra number of parameters. Second, the BN layer increases the speed of network training and prevents over-fitting. Third, the fully connected CRF optimizes the output of the front end of the architecture, and finally gets better segmentation results. Finally, the enhanced optimizer Lookahead improves the learning stability and speeds up model convergence. Compared with the current deep-learning-based state-of-the-art cloud images’ segmentation algorithms, the CloudU-Net demonstrates better segmentation performance for daytime and nighttime cloud images.

Proceedings ArticleDOI
10 Jan 2021
TL;DR: Two modified neural networks based on the dual path multi-scale fusion network (SFANet) and SegNet are proposed for accurate and efficient crowd counting; both models are end-to-end trainable.
Abstract: In this paper, we propose two modified neural networks based on dual path multi-scale fusion networks (SFANet) and SegNet for accurate and efficient crowd counting. Inspired by SFANet, the first model, which is named M-SFANet, is attached with atrous spatial pyramid pooling (ASPP) and context-aware module (CAN). The encoder of M-SFANet is enhanced with ASPP containing parallel atrous convolutional layers with different sampling rates and hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the CAN module which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, M-SFANet‘s decoder has dual paths, for density map and attention map generation. The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet. This change provides a faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module in order to not increase the complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on five crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could improve state-of-the-art crowd counting methods. Codes are available at https://github.com/Pongpisit-Thanasutives/Variations-of-SFANet-for-Crowd-Counting.
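A minimal illustration of the decoder change described above: bilinear upsampling interpolates a smooth, dense map, whereas max unpooling places each value back at the location recorded by the paired max-pooling layer, which avoids extra parameters and preserves activation positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 8, 32, 32)
pooled, indices = pool(x)                      # indices remember argmax locations

# SegNet-style decoder step: restore resolution using the stored indices
unpooled = unpool(pooled, indices)             # sparse map, values at argmax spots

# SFANet-style decoder step: smooth, dense bilinear interpolation instead
bilinear = F.interpolate(pooled, scale_factor=2, mode='bilinear',
                         align_corners=False)

print(unpooled.shape, bilinear.shape)          # both torch.Size([1, 8, 32, 32])
```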

Journal ArticleDOI
TL;DR: A ladder-style DenseNet-based architecture is proposed that features high modelling power, efficient upsampling, and inherent spatial efficiency, which is unlocked with checkpointing.
Abstract: Recent progress of deep image classification models provides great potential for improving related computer vision tasks. However, the transition to semantic segmentation is hampered by strict memory limitations of contemporary GPUs. The extent of feature map caching required by convolutional backprop poses significant challenges even for moderately sized Pascal images, while requiring careful architectural considerations when input resolution is in the megapixel range. To address these concerns, we propose a novel ladder-style DenseNet-based architecture which features high modelling power, efficient upsampling, and inherent spatial efficiency which we unlock with checkpointing. The resulting models deliver high performance and allow training at megapixel resolution on commodity hardware. The presented experimental results outperform the state-of-the-art in terms of prediction accuracy and execution speed on Cityscapes, VOC 2012, CamVid and ROB 2018 datasets. Source code at https://github.com/ivankreso/LDN .
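A minimal sketch of the checkpointing trick used to unlock spatial efficiency (generic PyTorch gradient checkpointing, not the authors' training code): activations inside a wrapped block are not cached during the forward pass but recomputed during backprop, trading compute for memory so that megapixel inputs fit on commodity GPUs. Module names and sizes are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(nn.Module):
    """Run a convolutional block without caching its intermediate activations."""
    def __init__(self, channels=64, layers=4):
        super().__init__()
        self.block = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=False))
            for _ in range(layers)])

    def forward(self, x):
        if self.training and x.requires_grad:
            # activations inside self.block are recomputed during backward
            return checkpoint(self.block, x, use_reentrant=False)
        return self.block(x)

# toy usage on a large input
model = CheckpointedStage().train()
x = torch.randn(1, 64, 256, 256, requires_grad=True)
model(x).mean().backward()
print(x.grad.shape)   # torch.Size([1, 64, 256, 256])
```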