
Showing papers on "Kernel (image processing)" published in 2018


Proceedings ArticleDOI
12 Mar 2018
TL;DR: DUC is designed to generate pixel-level predictions that capture and decode the detailed information generally missing in bilinear upsampling, and a hybrid dilated convolution (HDC) framework is proposed for the encoding phase.
Abstract: Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIOU on the test set at the time of submission. We have also achieved state-of-the-art results on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC.

1,358 citations
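
A minimal numpy sketch of the operation behind DUC/HDC, assuming a single-channel feature map; the rate cycle 1, 2, 5 is only an example of a schedule with no common factor greater than 1, not the paper's exact configuration:

```python
import numpy as np

def dilated_conv2d(x, w, rate):
    """Naive 'valid' 2-D dilated convolution (single channel): the kernel
    taps are spread `rate` pixels apart, enlarging the receptive field
    without adding parameters."""
    kh, kw = w.shape
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1   # effective kernel size
    out = np.zeros((x.shape[0] - eh + 1, x.shape[1] - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + eh:rate, j:j + ew:rate] * w)
    return out

# HDC idea (sketch): stacking layers whose dilation rates share no common
# factor > 1 lets the union of taps cover the receptive field densely,
# avoiding the periodic holes ("gridding") of a constant rate like 2, 2, 2.
x = np.random.randn(64, 64)
w = np.random.randn(3, 3)
for rate in (1, 2, 5):
    x = dilated_conv2d(x, w, rate)
```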


Proceedings ArticleDOI
18 Jun 2018
TL;DR: Densely connected Atrous Spatial Pyramid Pooling (DenseASPP) is proposed, which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size.
Abstract: Semantic image segmentation is a basic street scene understanding task in autonomous driving, where each pixel in a high resolution image is categorized into a set of semantic labels. Unlike other scenarios, objects in the autonomous driving scene exhibit very large scale changes, which poses great challenges for high-level feature representation in the sense that multi-scale information must be correctly encoded. To remedy this problem, atrous convolution [14] was introduced to generate features with larger receptive fields without sacrificing spatial resolution. Built upon atrous convolution, Atrous Spatial Pyramid Pooling (ASPP) [2] was proposed to concatenate multiple atrous-convolved features using different dilation rates into a final feature representation. Although ASPP is able to generate multi-scale features, we argue the feature resolution in the scale-axis is not dense enough for the autonomous driving scenario. To this end, we propose Densely connected Atrous Spatial Pyramid Pooling (DenseASPP), which connects a set of atrous convolutional layers in a dense way, such that it generates multi-scale features that not only cover a larger scale range, but also cover that scale range densely, without significantly increasing the model size. We evaluate DenseASPP on the street scene benchmark Cityscapes [4] and achieve state-of-the-art performance.

1,208 citations
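
A toy numpy sketch of the dense connectivity pattern, assuming single-channel features, an elementwise sum as a stand-in for the channel concatenation used in the paper, and an illustrative schedule of growing rates:

```python
import numpy as np

def atrous(x, w, rate):
    """Single-channel dilated ('atrous') convolution with 'same' zero padding."""
    kh, kw = w.shape
    ph, pw = (kh - 1) * rate // 2, (kw - 1) * rate // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + (kh - 1) * rate + 1:rate,
                                  j:j + (kw - 1) * rate + 1:rate] * w)
    return out

feat = np.random.randn(32, 32)
outputs = [feat]
for rate in (3, 6, 12, 18, 24):       # illustrative increasing rate schedule
    w = np.random.randn(3, 3) * 0.1
    # dense connectivity: every branch sees all earlier outputs, so later
    # branches compose smaller rates into many intermediate receptive fields,
    # covering the scale range densely
    outputs.append(atrous(sum(outputs), w, rate))
multi_scale = np.stack(outputs)       # the densely covered multi-scale stack
```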


Proceedings ArticleDOI
01 Jun 2018
TL;DR: In this article, a generic classification network equipped with convolutional blocks of different dilation rates was designed to produce dense and reliable object localization maps and effectively benefit both weakly- and semi-supervised semantic segmentation.
Abstract: Despite the remarkable progress, weakly supervised segmentation approaches are still inferior to their fully supervised counterparts. We observe that the performance gap mainly comes from their limited ability to learn to produce high-quality dense object localization maps from image-level supervision. To mitigate such a gap, we revisit the dilated convolution [1] and reveal how it can be utilized in a novel way to effectively overcome this critical limitation of weakly supervised segmentation approaches. Specifically, we find that varying dilation rates can effectively enlarge the receptive fields of convolutional kernels and, more importantly, transfer the surrounding discriminative information to non-discriminative object regions, promoting the emergence of these regions in the object localization maps. Then, we design a generic classification network equipped with convolutional blocks of different dilation rates. It can produce dense and reliable object localization maps and effectively benefit both weakly- and semi-supervised semantic segmentation. Despite its apparent simplicity, our proposed approach obtains superior performance over the state of the art. In particular, it achieves 60.8% and 67.6% mIoU scores on the Pascal VOC 2012 test set in the weakly supervised (only image-level labels are available) and semi-supervised (1,464 segmentation masks are available) settings, which set the new state of the art.

514 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: Pointwise convolution is a new convolution operator that can be applied at each point of a point cloud, and can yield competitive accuracy in both semantic segmentation and object recognition tasks.
Abstract: Deep learning with 3D data such as reconstructed point clouds and CAD models has received great research interest recently. However, the capability of using point clouds with convolutional neural networks has so far not been fully explored. In this paper, we present a convolutional neural network for semantic segmentation and object recognition with 3D point clouds. At the core of our network is point-wise convolution, a new convolution operator that can be applied at each point of a point cloud. Our fully convolutional network design, while being surprisingly simple to implement, can yield competitive accuracy in both semantic segmentation and object recognition tasks.

496 citations
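
A rough numpy sketch of the general point-wise convolution idea, assuming neighbors within a radius are binned into the cells of a 3x3x3 kernel grid by relative offset (the paper's exact binning and normalization may differ):

```python
import numpy as np

def pointwise_conv(points, feats, w, radius):
    """For each point, neighbors within `radius` are assigned to kernel
    cells by their relative offset; each cell's weight multiplies the
    features falling into it, and the results are summed."""
    out = np.zeros(len(points))
    for i in range(len(points)):
        d = points - points[i]
        mask = np.linalg.norm(d, axis=1) <= radius
        # map offsets in [-radius, radius] to cell indices {0, 1, 2}
        cells = np.clip(((d[mask] / radius + 1) * 1.5).astype(int), 0, 2)
        for (a, b, c), f in zip(cells, feats[mask]):
            out[i] += w[a, b, c] * f
    return out

pts = np.random.rand(100, 3)
f = np.random.randn(100)
w = np.random.randn(3, 3, 3) * 0.1    # one learnable weight per kernel cell
y = pointwise_conv(pts, f, w, radius=0.2)
```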


Journal ArticleDOI
TL;DR: Evaluation of PCNN on three central point cloud learning benchmarks shows it convincingly outperforms competing point cloud learning methods, and the vast majority of methods working with more informative shape representations such as surfaces and/or normals.
Abstract: This paper presents Point Convolutional Neural Networks (PCNN): a novel framework for applying convolutional neural networks to point clouds. The framework consists of two operators: extension and restriction, mapping point cloud functions to volumetric functions and vice versa. A point cloud convolution is defined by pull-back of the Euclidean volumetric convolution via an extension-restriction mechanism. The point cloud convolution is computationally efficient, invariant to the order of points in the point cloud, robust to different samplings and varying densities, and translation invariant; that is, the same convolution kernel is used at all points. PCNN generalizes image CNNs and allows readily adapting their architectures to the point cloud setting. Evaluation of PCNN on three central point cloud learning benchmarks shows that it convincingly outperforms competing point cloud learning methods, and the vast majority of methods working with more informative shape representations such as surfaces and/or normals.

407 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: Two new operations to improve PointNet with a more efficient exploitation of local structures are presented: one focuses on local 3D geometric structures, and the other exploits local high-dimensional feature structures by recursive feature aggregation on a nearest-neighbor graph computed from 3D positions.
Abstract: Unlike on images, semantic learning on 3D point clouds using a deep network is challenging due to the naturally unordered data structure. Among existing works, PointNet has achieved promising results by directly learning on point sets. However, it does not take full advantage of a point's local neighborhood, which contains fine-grained structural information that turns out to be helpful towards better semantic learning. In this regard, we present two new operations to improve PointNet with a more efficient exploitation of local structures. The first one focuses on local 3D geometric structures. In analogy to a convolution kernel for images, we define a point-set kernel as a set of learnable 3D points that jointly respond to a set of neighboring data points according to their geometric affinities measured by kernel correlation, adapted from a similar technique for point cloud registration. The second one exploits local high-dimensional feature structures by recursive feature aggregation on a nearest-neighbor graph computed from 3D positions. Experiments show that our network can efficiently capture local information and robustly achieve better performance on major datasets. Our code is available at http://www.merl.com/research/license#KCNet

397 citations
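
A small numpy sketch of the kernel-correlation response described in the abstract, assuming a Gaussian affinity between neighbor offsets and learnable kernel points (sigma is a bandwidth picked here for illustration):

```python
import numpy as np

def kernel_correlation(neighbor_offsets, kernel_pts, sigma=0.1):
    """Sum of Gaussian affinities over all (neighbor, kernel point) pairs;
    a high response means the local geometry matches the learned pattern."""
    diff = neighbor_offsets[:, None, :] - kernel_pts[None, :, :]   # (N, K, 3)
    return np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2)).sum()

neigh = np.random.randn(16, 3) * 0.05   # offsets of a point's neighbors
kpts = np.random.randn(8, 3) * 0.05     # learnable 3-D kernel points
response = kernel_correlation(neigh, kpts)
```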


Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this paper, a convolutional neural network architecture is proposed for predicting spatially varying kernels that can both align and denoise frames, and a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima.
Abstract: We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.

387 citations
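
A numpy sketch of how spatially varying predicted kernels are applied, assuming the network outputs one K x K kernel per frame per output pixel (shapes and the normalization below are illustrative):

```python
import numpy as np

def apply_per_pixel_kernels(burst, kernels):
    """`burst` is (F, H, W) frames; `kernels` is (F, K, K, H, W). Each
    output pixel is the sum of its own K x K filter applied to every
    frame, which lets a single operator both align and denoise."""
    F, H, W = burst.shape
    K = kernels.shape[1]
    r = K // 2
    pad = np.pad(burst, ((0, 0), (r, r), (r, r)))
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(pad[:, y:y + K, x:x + K] * kernels[:, :, :, y, x])
    return out

F, H, W, K = 4, 32, 32, 5
burst = np.random.rand(F, H, W)
kernels = np.random.rand(F, K, K, H, W)
kernels /= kernels.sum(axis=(0, 1, 2), keepdims=True)   # normalize per pixel
denoised = apply_per_pixel_kernels(burst, kernels)
```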


Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate the failure of the semi-group principle in modeling real-world problems and illustrate the importance of non-commutative and non-associative operators, under which the Caputo-Fabrizio and Atangana-Baleanu fractional operators fall.
Abstract: To answer some issues raised about the concept of fractional differentiation and integration based on the exponential and Mittag-Leffler laws, we present, in this paper, fundamental differences between the power law, exponential decay, and Mittag-Leffler law and their possible applications in nature. We demonstrate the failure of the semi-group principle in modeling real-world problems. We use natural phenomena to illustrate the importance of non-commutative and non-associative operators, under which the Caputo-Fabrizio and Atangana-Baleanu fractional operators fall. We present statistical properties of the generator for each fractional derivative, including the Riemann-Liouville, Caputo-Fabrizio and Atangana-Baleanu ones. The Atangana-Baleanu and Caputo-Fabrizio fractional derivatives show crossover properties for the mean-square displacement, while the Riemann-Liouville one is scale invariant. Their probability distributions also show a Gaussian to non-Gaussian crossover, with the difference that the Caputo-Fabrizio kernel has a steady state within the transition. Only the Atangana-Baleanu kernel shows a crossover for the waiting time distribution from stretched exponential to power law. A new criterion was suggested, namely the Atangana-Gomez fractional bracket, that helps describe the energy needed by a fractional derivative to characterize a 2-pletic manifold. Based on these properties, we classified fractional derivatives into three categories: weak, mild and strong fractional differential and integral operators. We presented some applications of fractional differential operators to describe real-world problems and we proved, with numerical simulations, that the Riemann-Liouville power-law derivative provides a description of real-world problems with much additional information, which can be seen as noise or error due to the specific memory properties of its power-law kernel. The Caputo-Fabrizio derivative is less noisy, while the Atangana-Baleanu fractional derivative provides an excellent description, due to its Mittag-Leffler memory, able to distinguish between dynamical systems taking place at different scales without steady state. The study suggests that the properties of associativity and commutativity or the semi-group principle are simply irrelevant in fractional calculus. Properties of classical derivatives were established for ordinary calculus with no memory effect, and it is a failure of mathematical investigation to attempt to describe more complex natural phenomena using the same notions.

368 citations
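
For reference, the three memory kernels being contrasted have the following standard textbook forms for order 0 < alpha < 1 (a summary of common definitions, not notation taken from this paper; M(alpha) and B(alpha) are normalization functions and E_alpha is the Mittag-Leffler function):

```latex
% Caputo derivative (power-law kernel):
D^{\alpha}_{C} f(t) = \frac{1}{\Gamma(1-\alpha)} \int_0^t f'(\tau)\,(t-\tau)^{-\alpha}\,\mathrm{d}\tau

% Caputo-Fabrizio derivative (exponential-decay kernel):
D^{\alpha}_{CF} f(t) = \frac{M(\alpha)}{1-\alpha} \int_0^t f'(\tau)\,\exp\!\left(-\frac{\alpha (t-\tau)}{1-\alpha}\right) \mathrm{d}\tau

% Atangana-Baleanu derivative (Mittag-Leffler kernel):
D^{\alpha}_{ABC} f(t) = \frac{B(\alpha)}{1-\alpha} \int_0^t f'(\tau)\, E_{\alpha}\!\left(-\frac{\alpha (t-\tau)^{\alpha}}{1-\alpha}\right) \mathrm{d}\tau
```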


Proceedings ArticleDOI
02 Feb 2018
TL;DR: Conv-KNRM uses Convolutional Neural Networks to represent n-grams of various lengths and soft matches them in a unified embedding space and is utilized by the kernel pooling and learning-to-rank layers to generate the final ranking score.
Abstract: This paper presents Conv-KNRM, a Convolutional Kernel-based Neural Ranking Model that models n-gram soft matches for ad-hoc search. Instead of exactly matching query and document n-grams, Conv-KNRM uses Convolutional Neural Networks to represent n-grams of various lengths and soft-matches them in a unified embedding space. The n-gram soft matches are then utilized by the kernel pooling and learning-to-rank layers to generate the final ranking score. Conv-KNRM can be learned end-to-end and fully optimized from user feedback. The learned model's generalizability is investigated by testing how well it performs in a related domain with small amounts of training data. Experiments on English search logs, Chinese search logs, and TREC Web track tasks demonstrated consistent advantages of Conv-KNRM over prior neural IR methods and feature-based methods.

329 citations
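
A numpy sketch of the kernel-pooling step that turns an n-gram similarity matrix into ranking features, following the K-NRM formulation the model builds on (the kernel means and shared bandwidth below are illustrative choices):

```python
import numpy as np

def kernel_pooling(sim, mus, sigma=0.1):
    """`sim` is the (query n-grams x doc n-grams) cosine-similarity matrix.
    Each RBF kernel softly counts matches near its mean; the counts are
    log-summed over query terms into one feature per kernel."""
    k = np.exp(-(sim[:, :, None] - mus[None, None, :])**2 / (2 * sigma**2))
    soft_tf = k.sum(axis=1)                                   # (Q, K) soft counts
    return np.log(np.clip(soft_tf, 1e-10, None)).sum(axis=0)  # (K,) features

sim = np.random.uniform(-1, 1, (5, 50))   # toy similarities of CNN n-gram embeddings
mus = np.linspace(-0.9, 1.0, 11)          # kernel means; mu = 1 approximates exact match
phi = kernel_pooling(sim, mus)            # fed to the learning-to-rank layer
```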


Proceedings ArticleDOI
Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, Liang Lin
18 Jun 2018
TL;DR: This work proposes a Cycle-in-Cycle network structure with generative adversarial networks (GAN) as the basic component to tackle the single image super-resolution problem in a more general case that the low-/high-resolution pairs and the down-sampling process are unavailable.
Abstract: We consider the single image super-resolution problem in a more general case that the low-/high-resolution pairs and the down-sampling process are unavailable. Different from traditional super-resolution formulation, the low-resolution input is further degraded by noises and blurring. This complicated setting makes supervised learning and accurate kernel estimation impossible. To solve this problem, we resort to unsupervised learning without paired data, inspired by the recent successful image-to-image translation applications. With generative adversarial networks (GAN) as the basic component, we propose a Cycle-in-Cycle network structure to tackle the problem within three steps. First, the noisy and blurry input is mapped to a noise-free low-resolution space. Then the intermediate image is up-sampled with a pre-trained deep model. Finally, we fine-tune the two modules in an end-to-end manner to get the high-resolution output. Experiments on NTIRE2018 datasets demonstrate that the proposed unsupervised method achieves comparable results as the state-of-the-art supervised models.

306 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: A two-stream Faster R-CNN network is proposed and trained end-to-end to detect the tampered regions in a manipulated image.
Abstract: Image manipulation detection is different from traditional semantic object detection because it pays more attention to tampering artifacts than to image content, which suggests that richer features need to be learned. We propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions given a manipulated image. One of the two streams is an RGB stream whose purpose is to extract features from the RGB image input to find tampering artifacts like strong contrast difference, unnatural tampered boundaries, and so on. The other is a noise stream that leverages the noise features extracted from a steganalysis rich model filter layer to discover the noise inconsistency between authentic and tampered regions. We then fuse features from the two streams through a bilinear pooling layer to further incorporate spatial co-occurrence of these two modalities. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods with robustness to resizing and compression.
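
A sketch of the kind of fixed high-pass filtering behind the noise stream, assuming SciPy is available and using one widely cited SRM kernel (the 5x5 "KV" filter); the paper's actual filter bank is taken from the steganalysis rich model:

```python
import numpy as np
from scipy.signal import convolve2d

# A common SRM-style high-pass kernel; residuals like this suppress image
# content and expose local noise statistics that differ in tampered regions.
KV = np.array([[-1,  2,  -2,  2, -1],
               [ 2, -6,   8, -6,  2],
               [-2,  8, -12,  8, -2],
               [ 2, -6,   8, -6,  2],
               [-1,  2,  -2,  2, -1]], dtype=float) / 12.0

img = np.random.rand(128, 128)                     # stand-in grayscale image
noise_residual = convolve2d(img, KV, mode="same")  # input to the noise stream
```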

Journal ArticleDOI
TL;DR: A sharp error estimate reflecting the regularity of the solution is obtained for a simple L1 scheme with the help of a discrete fractional Gronwall inequality and a global consistency error analysis.
Abstract: Stability and convergence of the L1 formula on nonuniform time grids are studied for solving linear reaction-subdiffusion equations with the Caputo derivative. A discrete fractional Gronwall inequality is developed for the nonuniform L1 formula by introducing a discrete convolution kernel of the Riemann-Liouville fractional integral. To simplify the consistency analysis of the nonuniform L1 formula, we bound the local truncation error in a discrete convolution form and consider a global convolution error involving the discrete Riemann-Liouville integral kernel. With the help of the discrete fractional Gronwall inequality and the global consistency error analysis, a sharp error estimate reflecting the regularity of the solution is obtained for a simple L1 scheme. Numerical examples are provided to verify the sharpness of the error analysis.
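
A short Python sketch of the nonuniform L1 formula itself, assuming a graded mesh, a standard choice near the initial weak singularity (the grading exponent r below is illustrative):

```python
import numpy as np
from math import gamma

def l1_caputo(u, t, alpha):
    """Nonuniform L1 approximation of the Caputo derivative of order
    alpha in (0, 1) at the final grid point: u' is taken piecewise
    constant on each subinterval, so the weakly singular convolution
    integral has a closed form on each piece."""
    n = len(t) - 1
    acc = 0.0
    for k in range(1, n + 1):
        du = (u[k] - u[k - 1]) / (t[k] - t[k - 1])
        acc += du * ((t[n] - t[k - 1])**(1 - alpha) - (t[n] - t[k])**(1 - alpha))
    return acc / gamma(2 - alpha)

alpha, T, N, r = 0.5, 1.0, 64, 2.0
t = T * (np.arange(N + 1) / N)**r    # graded mesh, dense near t = 0
u = t**alpha                         # toy solution with typical limited regularity
approx = l1_caputo(u, t, alpha)      # exact value for this u is gamma(1 + alpha)
```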

Journal ArticleDOI
TL;DR: The proposed end-to-end fast dense spectral–spatial convolution (FDSSC) framework for HSI classification achieved state-of-the-art performance compared with existing deep-learning-based methods while significantly reducing the training time.
Abstract: Recent research shows that deep-learning-derived methods based on a deep convolutional neural network have high accuracy when applied to hyperspectral image (HSI) classification, but long training times. To reduce the training time and improve accuracy, in this paper we propose an end-to-end fast dense spectral-spatial convolution (FDSSC) framework for HSI classification. The FDSSC framework uses different convolutional kernel sizes to extract spectral and spatial features separately, and the "valid" convolution method to reduce the high dimensionality. Densely connected structures, in which the input of each convolution consists of the outputs of all previous convolutional layers, were used for deep learning of features, leading to extremely accurate classification. To increase speed and prevent overfitting, the FDSSC framework uses a dynamic learning rate, parametric rectified linear units, batch normalization, and dropout layers. These attributes enable the FDSSC framework to achieve accuracy within as few as 80 epochs. The experimental results show that with the Indian Pines, Kennedy Space Center, and University of Pavia datasets, the proposed FDSSC framework achieved state-of-the-art performance compared with existing deep-learning-based methods while significantly reducing the training time.

Book ChapterDOI
08 Sep 2018
TL;DR: Depth-aware CNN is presented by introducing two intuitive, flexible and effective operations: depth-aware convolution and depth-aware average pooling, both of which can be easily integrated into existing CNNs.
Abstract: Convolutional neural networks (CNN) are limited by the lack of capability to handle geometric information due to the fixed grid kernel structure. The availability of depth data enables progress in RGB-D semantic segmentation with CNNs. State-of-the-art methods either use depth as additional images or process spatial information in 3D volumes or point clouds. These methods suffer from high computation and memory cost. To address these issues, we present Depth-aware CNN by introducing two intuitive, flexible and effective operations: depth-aware convolution and depth-aware average pooling. By leveraging depth similarity between pixels in the process of information propagation, geometry is seamlessly incorporated into CNN. Without introducing any additional parameters, both operators can be easily integrated into existing CNNs. Extensive experiments and ablation studies on challenging RGB-D semantic segmentation benchmarks validate the effectiveness and flexibility of our approach.
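
A small numpy sketch of depth-aware convolution, assuming the depth-similarity weighting exp(-alpha * |depth difference|) described by the authors (the constant alpha here is a value picked for illustration):

```python
import numpy as np

def depth_aware_conv(x, depth, w, alpha=8.0):
    """Each kernel tap is re-weighted by its depth similarity to the
    center pixel, so same-surface pixels dominate and taps across a
    depth discontinuity are suppressed; no new learnable parameters."""
    kh, kw = w.shape
    rh, rw = kh // 2, kw // 2
    xp = np.pad(x, ((rh, rh), (rw, rw)))
    dp = np.pad(depth, ((rh, rh), (rw, rw)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            sim = np.exp(-alpha * np.abs(dp[i:i + kh, j:j + kw] - depth[i, j]))
            out[i, j] = np.sum(w * sim * xp[i:i + kh, j:j + kw])
    return out

x = np.random.rand(32, 32)        # one feature channel
d = np.random.rand(32, 32)        # the registered depth map
y = depth_aware_conv(x, d, np.random.randn(3, 3))
```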

Journal ArticleDOI
TL;DR: MCCNN represents the convolution kernel itself as a multilayer perceptron, phrases convolution as a Monte Carlo integration problem, uses this notion to combine information from multiple samplings at different levels, and uses Poisson disk sampling as a scalable means of hierarchical point cloud learning.
Abstract: Deep learning systems extensively use convolution operations to process input data. Though convolution is clearly defined for structured data such as 2D images or 3D volumes, this is not true for other data types such as sparse point clouds. Previous techniques have developed approximations to convolutions for restricted conditions. Unfortunately, their applicability is limited and they cannot be used for general point clouds. We propose an efficient and effective method to learn convolutions for non-uniformly sampled point clouds, as they are obtained with modern acquisition techniques. Learning is enabled by four key novelties: first, representing the convolution kernel itself as a multilayer perceptron; second, phrasing convolution as a Monte Carlo integration problem; third, using this notion to combine information from multiple samplings at different levels; and fourth, using Poisson disk sampling as a scalable means of hierarchical point cloud learning. The key idea across all these contributions is to guarantee adequate consideration of the underlying non-uniform sample distribution function from a Monte Carlo perspective. To make the proposed concepts applicable to real-world tasks, we furthermore propose an efficient implementation which significantly reduces the GPU memory required during the training process. By employing our method in hierarchical network architectures we can outperform most of the state-of-the-art networks on established point cloud segmentation, classification and normal estimation benchmarks. Furthermore, in contrast to most existing approaches, we also demonstrate the robustness of our method with respect to sampling variations, even when training with uniformly sampled data only. To support the direct application of these concepts, we provide a ready-to-use TensorFlow implementation of these layers at https://github.com/viscom-ulm/MCCNN.
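
A toy sketch of the Monte Carlo view of point cloud convolution, assuming a fixed random two-layer MLP as the stand-in for the learnable kernel and a naive neighbor-count density estimate:

```python
import numpy as np

def mc_conv(points, feats, center, radius, kernel_mlp, density):
    """Estimate the continuous convolution integral at `center` from
    non-uniform samples: each neighbor's kernel response is divided by
    its sample density p(y|x), the Monte Carlo correction that makes
    the estimate robust to varying sampling."""
    d = np.linalg.norm(points - center, axis=1)
    idx = np.where(d <= radius)[0]
    acc = 0.0
    for i in idx:
        offset = (points[i] - center) / radius       # normalized offset in the ball
        acc += kernel_mlp(offset) * feats[i] / density[i]
    return acc / max(len(idx), 1)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 3)), rng.normal(size=8)
kernel_mlp = lambda o: float(np.tanh(W1 @ o) @ W2)   # stand-in learnable kernel

pts = rng.random((200, 3))
f = rng.normal(size=200)
dens = np.array([np.sum(np.linalg.norm(pts - p, axis=1) < 0.1) for p in pts], float)
y = mc_conv(pts, f, pts[0], 0.1, kernel_mlp, dens)
```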

Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper proposes a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples, and shows that this model can be formulated as a neural network and thus can be efficiently solved.
Abstract: In this paper, we analyze the spatial information of deep features, and propose two complementary regressions for robust visual tracking. First, we propose a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples. We show that this model can be formulated as a neural network and thus can be efficiently solved. Second, we propose a fully convolutional neural network with spatially regularized kernels, through which the filter kernel corresponding to each output channel is forced to focus on a specific region of the target. Distance transform pooling is further exploited to determine the effectiveness of each output channel of the convolution layer. The outputs from the kernelized ridge regression model and the fully convolutional neural network are combined to obtain the ultimate response. Experimental results on two benchmark datasets validate the effectiveness of the proposed method.
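
A minimal sketch of the closed-form kernelized ridge regression underlying the first branch, with an ordinary RBF Gram matrix standing in for the paper's weighted patch-pair similarity kernel:

```python
import numpy as np

def krr_fit(K, y, lam=1e-3):
    """Dual solution of kernel ridge regression: alpha = (K + lam*I)^{-1} y;
    the response for a sample is its kernel row times alpha."""
    return np.linalg.solve(K + lam * np.eye(len(K)), y)

X = np.random.randn(50, 16)                       # toy training samples
y = np.random.randn(50)                           # regression targets
K = np.exp(-np.sum((X[:, None] - X[None])**2, axis=-1))   # stand-in Gram matrix
alpha = krr_fit(K, y)
response = K[0] @ alpha                           # tracking response for sample 0
```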

Proceedings ArticleDOI
18 Jun 2018
TL;DR: In this article, the authors leverage the sparsity structure of computation masks and propose a novel tiling-based sparse convolution algorithm for LiDAR-based 3D object detection.
Abstract: Conventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers - this incurs a high computational cost for real-time applications. For many problems such as object detection and semantic segmentation, we are able to obtain a low-cost computation mask, either from a priori problem knowledge, or from a low-resolution segmentation network. We show that such computation masks can be used to reduce computation in the high-resolution main network. Variants of sparse activation CNNs have previously been explored on small-scale tasks and showed no degradation in terms of object classification accuracy, but often measured gains in terms of theoretical FLOPs without realizing a practical speedup when compared to highly optimized dense convolution implementations. In this work, we leverage the sparsity structure of computation masks and propose a novel tiling-based sparse convolution algorithm. We verified the effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we report significant wall-clock speed-ups compared to dense convolution without noticeable loss of accuracy.
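
A numpy sketch of the tiling idea, assuming a naive per-pixel convolution inside each tile; a real implementation gathers active tiles (with halos) into a batch for one dense convolution call, which is where the wall-clock speedup comes from:

```python
import numpy as np

def sparse_tiled_conv(x, w, mask, tile=8):
    """Run the convolution only on tiles whose computation mask has any
    active pixel; fully inactive tiles are skipped outright."""
    H, W = x.shape
    r = w.shape[0] // 2
    xp = np.pad(x, r)
    out = np.zeros_like(x)
    for ty in range(0, H, tile):
        for tx in range(0, W, tile):
            if not mask[ty:ty + tile, tx:tx + tile].any():
                continue                            # whole tile skipped
            for i in range(ty, min(ty + tile, H)):
                for j in range(tx, min(tx + tile, W)):
                    out[i, j] = np.sum(xp[i:i + 2 * r + 1, j:j + 2 * r + 1] * w)
    return out

x = np.random.rand(64, 64)
w = np.random.randn(3, 3)
mask = np.zeros((64, 64), dtype=bool)
mask[16:40, 16:40] = True                           # e.g. from a low-res segmentation
y = sparse_tiled_conv(x, w, mask)
```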

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper proposes a predictive early activation technique, dubbed SnaPEA, which offers up to 63% speedup and 49% energy reduction across the convolution layers with no loss in classification accuracy.
Abstract: Deep Convolutional Neural Networks (CNNs) perform billions of operations for classifying a single input. To reduce these computations, this paper offers a solution that leverages a combination of runtime information and the algorithmic structure of CNNs. Specifically, in numerous modern CNNs, the outputs of compute-heavy convolution operations are fed to activation units that output zero if their input is negative. By exploiting this unique algorithmic property, we propose a predictive early activation technique, dubbed SnaPEA. This technique cuts the computation of convolution operations short if it determines that the output will be negative. SnaPEA can operate in two distinct modes, exact and predictive. In the exact mode, with no loss in classification accuracy, SnaPEA statically re-orders the weights based on their signs and periodically performs a single-bit sign check on the partial sum. Once the partial sum drops below zero, the rest of the computations can simply be ignored, since the output value will be zero in any case. In the predictive mode, which trades classification accuracy for larger savings, SnaPEA speculatively cuts the computation short even earlier than the exact mode. To control the accuracy, we develop a multi-variable optimization algorithm that thresholds the degree of speculation. As such, the proposed algorithm exposes a knob to gracefully navigate the trade-offs between classification accuracy and computation reduction. Compared to a state-of-the-art CNN accelerator, SnaPEA in the exact mode yields, on average, 28% speedup and 16% energy reduction in various modern CNNs without affecting their classification accuracy. With 3% loss in classification accuracy, on average, 67.8% of the convolutional layers can operate in the predictive mode. The average speedup and energy saving of these layers are 2.02x and 1.89x, respectively. The benefits grow to a maximum of 3.59x speedup and 3.14x energy reduction. Compared to static pruning approaches, which are complementary to the dynamic approach of SnaPEA, our proposed technique offers up to 63% speedup and 49% energy reduction across the convolution layers with no loss in classification accuracy.
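
A scalar-level numpy sketch of the exact mode, assuming non-negative (post-ReLU) input activations, so that once the positive-weight terms are exhausted the partial sum can only decrease:

```python
import numpy as np

def snapea_dot(weights, acts):
    """Exact-mode SnaPEA on one pre-activation dot product: weights are
    reordered positives-first; in the negative phase, the first time the
    partial sum drops below zero the ReLU output is already known to be
    zero, so the remaining multiply-accumulates are skipped."""
    order = np.argsort(weights < 0, kind="stable")   # non-negative weights first
    pos_done = int(np.sum(weights >= 0))
    s = 0.0
    for n, idx in enumerate(order, 1):
        s += weights[idx] * acts[idx]
        if n >= pos_done and s < 0:
            return 0.0, n                            # early exit: ReLU(negative) = 0
    return max(s, 0.0), len(weights)

acts = np.abs(np.random.randn(64))                   # post-ReLU inputs are >= 0
w = np.random.randn(64)
out, macs_used = snapea_dot(w, acts)                 # macs_used <= 64
```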

Journal ArticleDOI
TL;DR: A theoretical analysis of convergence rates of kernel-predicting architectures is presented, shedding light on why kernel prediction performs better than synthesizing the colors directly, complementing the empirical evidence presented in this and previous works.
Abstract: We present a modular convolutional architecture for denoising rendered images. We expand on the capabilities of kernel-predicting networks by combining them with a number of task-specific modules, and optimizing the assembly using an asymmetric loss. The source-aware encoder---the first module in the assembly---extracts low-level features and embeds them into a common feature space, enabling quick adaptation of a trained network to novel data. The spatial and temporal modules extract abstract, high-level features for kernel-based reconstruction, which is performed at three different spatial scales to reduce low-frequency artifacts. The complete network is trained using a class of asymmetric loss functions that are designed to preserve details and provide the user with a direct control over the variance-bias trade-off during inference. We also propose an error-predicting module for inferring reconstruction error maps that can be used to drive adaptive sampling. Finally, we present a theoretical analysis of convergence rates of kernel-predicting architectures, shedding light on why kernel prediction performs better than synthesizing the colors directly, complementing the empirical evidence presented in this and previous works. We demonstrate that our networks attain results that compare favorably to state-of-the-art methods in terms of detail preservation, low-frequency noise removal, and temporal stability on a variety of production and academic datasets.

Journal ArticleDOI
TL;DR: In this paper, the existence and uniqueness of the solution of nonlinear fractional differential equations with a Mittag-Leffler nonsingular kernel were studied, and two numerical methods to solve this problem were designed, with their stability and error estimates investigated by discretizing the convolution integral and using Gronwall's inequality.
Abstract: The purpose of this paper is to study the existence and uniqueness of the solution of nonlinear fractional differential equations with a Mittag-Leffler nonsingular kernel. Two numerical methods to solve this problem are designed, and their stability and error estimates are investigated by discretizing the convolution integral and using Gronwall's inequality. Finally, the theoretical results are verified by using five illustrative examples.

Journal ArticleDOI
TL;DR: This paper proposes to learn a deep convolutional neural network for extracting sharp edges from blurred images; the resulting algorithm does not require any coarse-to-fine strategy or edge selection, thereby significantly simplifying kernel estimation and reducing the computational load.
Abstract: The success of the state-of-the-art deblurring methods mainly depends on the restoration of sharp edges in a coarse-to-fine kernel estimation process. In this paper, we propose to learn a deep convolutional neural network for extracting sharp edges from blurred images. Motivated by the success of the existing filtering-based deblurring methods, the proposed model consists of two stages: suppressing extraneous details and enhancing sharp edges. We show that the two-stage model simplifies the learning process and effectively restores sharp edges. Facilitated by the learned sharp edges, the proposed deblurring algorithm does not require any coarse-to-fine strategy or edge selection, thereby significantly simplifying kernel estimation and reducing the computational load. Extensive experimental results on challenging blurry images demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods on both synthetic and real-world images in terms of visual quality and run-time.

Journal ArticleDOI
TL;DR: The convolution in convolution (CiC) method replaces the dense shallow multilayer perceptron (MLP) that network in network (NiN) uses in place of the linear filter with a sparse shallow MLP.
Abstract: Network in network (NiN) is an effective instance and an important extension of the deep convolutional neural network, consisting of alternating convolutional layers and pooling layers. Instead of using a linear filter for convolution, NiN utilizes a shallow multilayer perceptron (MLP), a nonlinear function, to replace the linear filter. Because of the power of MLP and 1×1 convolutions in the spatial domain, NiN has a stronger ability of feature representation and hence results in better recognition performance. However, MLP itself consists of fully connected layers that give rise to a large number of parameters. In this paper, we propose to replace the dense shallow MLP with a sparse shallow MLP. One or more layers of the sparse shallow MLP are sparsely connected in the channel dimension or channel-spatial domain. The proposed method is implemented by applying unshared convolution across the channel dimension and applying shared convolution across the spatial dimension in some computational layers. The proposed method is called convolution in convolution (CiC). The experimental results on the CIFAR10 dataset, augmented CIFAR10 dataset, and CIFAR100 dataset demonstrate the effectiveness of the proposed CiC method.

Book ChapterDOI
08 Sep 2018
TL;DR: The SDC module for video frame prediction with spatially-displaced convolution inherits the merits of both vector-based and kernel-based approaches, while ameliorating their respective disadvantages.
Abstract: We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches rely on resampling past frames, guided by a learned future optical flow, or on direct generation of pixels. Resampling based on flow is insufficient because it cannot deal with disocclusions. Generative models currently lead to blurry results. Recent approaches synthesize a pixel by convolving input patches with a predicted kernel; however, their memory requirement increases with kernel size. Here, we present a spatially-displaced convolution (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel and synthesize a pixel by applying the kernel at a displaced location in the source image, defined by the predicted motion vector. Our approach inherits the merits of both vector-based and kernel-based approaches, while ameliorating their respective disadvantages. We train our model on 428K unlabelled 1080p video game frames. Our approach produces state-of-the-art results, achieving an SSIM score of 0.904 on high-definition YouTube-8M videos and 0.918 on Caltech Pedestrian videos. Our model handles large motion effectively and synthesizes crisp frames with consistent motion.
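
A numpy sketch of applying the SDC output, assuming per-pixel (v, u) motion vectors and K x K kernels, and using nearest-neighbor displacement instead of the sub-pixel sampling a real implementation would use:

```python
import numpy as np

def sdc_predict(src, kernels, flow):
    """Each output pixel applies its own K x K kernel to a patch of the
    source frame centered at the location shifted by its predicted
    motion vector, combining vector-based and kernel-based synthesis."""
    H, W = src.shape
    K = kernels.shape[0]
    r = K // 2
    m = int(np.ceil(np.abs(flow).max())) + r + 1
    pad = np.pad(src, m)
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            v, u = flow[:, y, x]
            cy, cx = int(round(y + v)) + m, int(round(x + u)) + m
            out[y, x] = np.sum(pad[cy - r:cy + r + 1, cx - r:cx + r + 1]
                               * kernels[:, :, y, x])
    return out

H = W = 32
K = 5
src = np.random.rand(H, W)
flow = np.random.uniform(-3, 3, (2, H, W))          # predicted (v, u) per pixel
kern = np.random.rand(K, K, H, W)
kern /= kern.sum(axis=(0, 1), keepdims=True)        # normalized per-pixel kernels
next_frame = sdc_predict(src, kern, flow)
```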

Proceedings ArticleDOI
01 Oct 2018
TL;DR: Comparisons show that KE-CNN has promising results for brain tumor classification on a dataset consisting of three types of brain tumors, including meningioma, glioma and pituitary tumor, in T1-weighted contrast-enhanced MRI images.
Abstract: Tumor identification is one of the main and most influential factors in determining the type of treatment, the treatment process, the success rate of treatment, and the follow-up of the disease. Convolutional neural networks are one of the most important and practical classes of deep learning and feed-forward neural networks, and are highly applicable for analyzing visual imagery. CNNs learn the features extracted by the convolution and max-pooling layers. Extreme Learning Machines (ELM) are a kind of learning algorithm consisting of one or more layers of hidden nodes. These networks are used in various fields such as classification and regression. Using a CNN, this paper extracts hidden features from images. Then a kernel ELM (KELM) classifies the images based on these extracted features. In this work, we use a dataset to evaluate the effectiveness of our proposed method, consisting of three types of brain tumors, including meningioma, glioma and pituitary tumor, in T1-weighted contrast-enhanced MRI (CE-MRI) images. The results of this ensemble of CNN and KELM (KE-CNN) are compared with different classifiers such as Support Vector Machine, Radial Basis Function, and some other classifiers. These comparisons show that KE-CNN has promising results for brain tumor classification.
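
A minimal numpy sketch of the kernel ELM classifier applied to CNN features, using the standard closed-form KELM solution with an RBF kernel (feature shapes and hyperparameters below are toy choices):

```python
import numpy as np

def kelm_train(X, T, C=1.0, gam=1.0):
    """Kernel ELM output weights in closed form: beta = (I/C + K)^{-1} T,
    so the classifier needs no iterative training."""
    K = np.exp(-gam * np.sum((X[:, None] - X[None])**2, axis=-1))
    return np.linalg.solve(np.eye(len(X)) / C + K, T)

def kelm_predict(Xtr, beta, x, gam=1.0):
    k = np.exp(-gam * np.sum((Xtr - x)**2, axis=-1))
    return k @ beta                                  # class scores

feats = np.random.randn(30, 64)                      # CNN-extracted features (toy)
labels = np.eye(3)[np.random.randint(0, 3, 30)]      # one-hot: 3 tumor classes
beta = kelm_train(feats, labels)
scores = kelm_predict(feats, beta, feats[0])
pred = int(np.argmax(scores))
```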

Journal ArticleDOI
TL;DR: The convolutional neural network is better than traditional classifiers, gives better overall accuracy than five state-of-the-art approaches, and is effective in remote sensing image segmentation.
Abstract: Image segmentation is an important application of polarimetric synthetic aperture radar. This study aimed to create an 11-layer deep convolutional neural network for this task. The Pauli decomposition formed the RGB image and was used as the input. We created an 11-layer convolutional neural network (CNN). L-band data over the San Francisco bay area and C-band data over the Flevoland area were employed as the dataset. For the San Francisco bay PSAR image, our method achieved an overall accuracy of 97.32%, which was at least 2% superior to four state-of-the-art approaches. We provided the confusion matrix over the test area, and the kernel visualization. We compared max pooling and average pooling. We validated by experiment that four convolution layers perform the best. Besides, our method gave better results than AlexNet. The GPU yields a 173× acceleration on the training samples, and a 181× acceleration on the test samples, compared to a standard CPU. For the Flevoland PSAR image, our 11-layer CNN also gives better overall accuracy than five state-of-the-art approaches. The convolutional neural network is better than traditional classifiers and is effective in remote sensing image segmentation.
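
A short sketch of forming the Pauli RGB input from a scattering matrix, using the conventional channel assignment (stated as an assumption; papers vary in scaling and channel order):

```python
import numpy as np

# Conventional Pauli RGB: R = |S_HH - S_VV|, G = 2|S_HV|, B = |S_HH + S_VV|
S_hh = np.random.randn(64, 64) + 1j * np.random.randn(64, 64)
S_vv = np.random.randn(64, 64) + 1j * np.random.randn(64, 64)
S_hv = np.random.randn(64, 64) + 1j * np.random.randn(64, 64)
pauli_rgb = np.stack([np.abs(S_hh - S_vv),
                      2 * np.abs(S_hv),
                      np.abs(S_hh + S_vv)], axis=-1) / np.sqrt(2)
# pauli_rgb is the 3-channel image fed to the 11-layer CNN
```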

Proceedings ArticleDOI
18 Jun 2018
TL;DR: A data-driven discriminative prior is proposed to distinguish whether an input image is clear or not, and can be embedded into the maximum a posteriori (MAP) framework for blind image deblurring.
Abstract: We present an effective blind image deblurring method based on a data-driven discriminative prior. Our work is motivated by the fact that a good image prior should favor clear images over blurred ones. In this work, we formulate the image prior as a binary classifier which can be achieved by a deep convolutional neural network (CNN). The learned prior is able to distinguish whether an input image is clear or not. Embedded into the maximum a posteriori (MAP) framework, it helps blind deblurring in various scenarios, including natural, face, text, and low-illumination images. However, it is difficult to optimize the deblurring method with the learned image prior as it involves a non-linear CNN. Therefore, we develop an efficient numerical approach based on the half-quadratic splitting method and the gradient descent algorithm to solve the proposed model. Furthermore, the proposed model can be easily extended to non-uniform deblurring. Both qualitative and quantitative experimental results show that our method performs favorably against state-of-the-art algorithms as well as domain-specific image deblurring approaches.

Journal ArticleDOI
TL;DR: The objective of this work is to detect shadows in images by posing this as the problem of labeling image regions, where each region corresponds to a group of superpixels, and training a kernel Least-Squares SVM for separating shadow and non-shadow regions.
Abstract: The objective of this work is to detect shadows in images. We pose this as the problem of labeling image regions, where each region corresponds to a group of superpixels. To predict the label of each region, we train a kernel Least-Squares Support Vector Machine (LSSVM) for separating shadow and non-shadow regions. The parameters of the kernel and the classifier are jointly learned to minimize the leave-one-out cross validation error. Optimizing the leave-one-out cross validation error is typically difficult, but it can be done efficiently in our framework. Experiments on two challenging shadow datasets, UCF and UIUC, show that our region classifier outperforms more complex methods. We further enhance the performance of the region classifier by embedding it in a Markov Random Field (MRF) framework and adding pairwise contextual cues. This leads to a method that outperforms the state-of-the-art for shadow detection. In addition we propose a new method for shadow removal based on region relighting. For each shadow region we use a trained classifier to identify a neighboring lit region of the same material. Given a pair of lit-shadow regions we perform a region relighting transformation based on histogram matching of luminance values between the shadow region and the lit region. Once a shadow is detected, we demonstrate that our shadow removal approach produces results that outperform the state of the art by evaluating our method using a publicly available benchmark dataset.
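
A numpy sketch of the kernel least-squares SVM training step, which reduces to one linear system over the Gram matrix (that linearity is also what makes leave-one-out error evaluation cheap); gamma below is a regularization weight picked for illustration:

```python
import numpy as np

def lssvm_train(K, y, gam=1.0):
    """Solve the LSSVM dual system  [[0, 1^T], [1, K + I/gam]] [b; a] = [0; y]
    for the bias b and dual weights a."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gam
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]

X = np.random.randn(40, 10)                         # toy region features
y = np.sign(np.random.randn(40))                    # shadow / non-shadow labels
K = np.exp(-0.5 * np.sum((X[:, None] - X[None])**2, axis=-1))
b, a = lssvm_train(K, y)
score = K[0] @ a + b                                # decision value for region 0
```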

Proceedings ArticleDOI
04 Apr 2018
TL;DR: This paper presents a modularized building block, IGC-V2: interleaved structured sparse convolutions, which generalizes interleaved group convolutions to the product of more structured sparse kernels, further eliminating redundancy.
Abstract: In this paper, we study the problem of designing efficient convolutional neural network architectures with the interest in eliminating the redundancy in convolution kernels. In addition to structured sparse kernels, low-rank kernels and the product of low-rank kernels, the product of structured sparse kernels, which is a framework for interpreting the recently developed interleaved group convolutions (IGC) and its variants (e.g., Xception), has been attracting increasing interest. Motivated by the observation that the convolutions contained in a group convolution in IGC can be further decomposed in the same manner, we present a modularized building block, IGC-V2: interleaved structured sparse convolutions. It generalizes interleaved group convolutions, which are composed of two structured sparse kernels, to the product of more structured sparse kernels, further eliminating the redundancy. We present the complementary condition and the balance condition to guide the design of structured sparse kernels, obtaining a balance among three aspects: model size, computation complexity and classification accuracy. Experimental results demonstrate the advantage on the balance among these three aspects compared to interleaved group convolutions and Xception, and competitive performance compared to other state-of-the-art architecture design methods.
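
A toy numpy sketch of the "product of structured sparse kernels" idea for the channel-mixing (1x1) part: block-diagonal group mixes separated by channel permutations, so every output channel stays connected to every input channel while the parameter count shrinks:

```python
import numpy as np

def group_mix(x, blocks):
    """Block-diagonal (group) 1x1 convolution applied to a channel vector."""
    outs, c = [], 0
    for B in blocks:
        g = B.shape[1]
        outs.append(B @ x[c:c + g])
        c += g
    return np.concatenate(outs)

def interleave(x, groups):
    """Channel permutation between successive structured sparse kernels."""
    return x.reshape(groups, -1).T.reshape(-1)

C, g = 16, 4
x = np.random.randn(C)
# two sparse factors: parameters drop from C*C to 2 * (C/g) * g*g, while
# interleaving (the complementary condition) preserves full connectivity
for _ in range(2):
    blocks = [np.random.randn(g, g) for _ in range(C // g)]
    x = interleave(group_mix(x, blocks), C // g)
```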

Journal ArticleDOI
TL;DR: Experiments on two remote sensing image datasets illustrate that the Hellinger kernel, PCA, and two aggregate strategies improve classification performance, and the deeply local descriptors outperform the features extracted from fully connected layers.
Abstract: The extraction of features from the fully connected layer of a convolutional neural network (CNN) model is widely used for image representation. However, the features obtained by the convolutional layers are seldom investigated due to their high dimensionality and lack of global representation. In this study, we explore the uses of local description and feature encoding for deeply convolutional features. Given an input image, the image pyramid is constructed, and different pretrained CNNs are applied to each image scale to extract convolutional features. Deeply local descriptors can be obtained by concatenating the convolutional features in each spatial position. Hellinger kernel and principal component analysis (PCA) are introduced to improve the distinguishable capabilities of the deeply local descriptors. The Hellinger kernel causes the distance measure to be sensitive to small feature values, and the PCA helps reduce feature redundancy. In addition, two aggregate strategies are proposed to form global image representations from the deeply local descriptors. The first strategy aggregates the descriptors of different CNNs by Fisher encoding, and the second strategy concatenates the Fisher vectors of different CNNs. Experiments on two remote sensing image datasets illustrate that the Hellinger kernel, PCA, and two aggregate strategies improve classification performance. Moreover, the deeply local descriptors outperform the features extracted from fully connected layers.
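
A small sketch of the Hellinger (square-root) normalization applied to a descriptor before PCA, the step that makes the distance measure sensitive to small feature values:

```python
import numpy as np

def hellinger_map(v, eps=1e-12):
    """Explicit feature map of the Hellinger kernel: L1-normalize, take
    signed square roots, then L2-normalize; small components are expanded
    relative to large ones."""
    v = v / (np.abs(v).sum() + eps)
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)

desc = np.random.rand(512)          # a deeply local descriptor (toy)
phi = hellinger_map(desc)           # PCA then reduces remaining redundancy
```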

Journal ArticleDOI
TL;DR: A CNN hardware accelerator that exploits the zero-value property to achieve significant performance and energy improvements is proposed.
Abstract: Editor’s note: It has been observed that the majority of the kernel weights and input activations in the state-of-the-art convolution neural networks (CNNs) have zero values. This article proposes a CNN hardware accelerator that exploits this property to achieve significant performance and energy improvements. —Mustafa Ozdal, Bilkent University
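
A scalar-level sketch of the zero-skipping principle the accelerator exploits in hardware: schedule only multiply-accumulates whose weight and activation are both non-zero:

```python
import numpy as np

def zero_skipping_mac(weights, acts):
    """Issue a MAC only for pairs where both operands are non-zero; with
    post-ReLU activations and sparse kernels, most pairs are skipped,
    saving both cycles and energy."""
    nz = (weights != 0) & (acts != 0)
    return float(weights[nz] @ acts[nz]), int(nz.sum())   # result, MACs issued

w = np.random.randn(128) * (np.random.rand(128) > 0.6)    # sparse kernel weights
a = np.maximum(np.random.randn(128), 0)                   # post-ReLU activations
out, macs = zero_skipping_mac(w, a)
```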