
Showing papers by "Sheng Tang" published in 2018


Posted Content
TL;DR: This work proposes the Context Guided Network (CGNet), a light-weight and efficient network for semantic segmentation built from Context Guided (CG) blocks, which captures contextual information in all stages of the network.
Abstract: The demand for applying semantic segmentation models on mobile devices has been increasing rapidly. Current state-of-the-art networks have an enormous number of parameters and are hence unsuitable for mobile devices, while other small-memory-footprint models follow the spirit of classification networks and ignore the inherent characteristics of semantic segmentation. To tackle this problem, we propose a novel Context Guided Network (CGNet), which is a light-weight and efficient network for semantic segmentation. We first propose the Context Guided (CG) block, which learns the joint feature of both the local feature and the surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop CGNet, which captures contextual information in all stages of the network and is specially tailored for increasing segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and save memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing or multi-scale testing, the proposed CGNet achieves 64.8% mean IoU on Cityscapes with less than 0.5 M parameters. The source code for the complete system can be found at this https URL.
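For a concrete sense of the CG block described above, here is a minimal PyTorch sketch assembled only from the abstract's description (local branch, surrounding-context branch, joint feature, global-context refinement); the layer choices, channel split, and names (f_loc, f_sur, f_glo) are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class CGBlockSketch(nn.Module):
    """Rough sketch of a Context Guided block as described above: a local
    branch, a surrounding-context branch (dilated conv), a joint feature,
    and a global-context refinement stage."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        half = channels // 2
        self.f_loc = nn.Conv2d(channels, half, 3, padding=1, bias=False)
        self.f_sur = nn.Conv2d(channels, half, 3, padding=dilation,
                               dilation=dilation, bias=False)
        self.bn_act = nn.Sequential(nn.BatchNorm2d(channels), nn.PReLU(channels))
        # global-context stage: squeeze-and-excitation style channel weighting
        self.f_glo = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                   nn.Conv2d(channels, channels, 1),
                                   nn.Sigmoid())

    def forward(self, x):
        loc = self.f_loc(x)                       # local feature
        sur = self.f_sur(x)                       # surrounding context
        joint = self.bn_act(torch.cat([loc, sur], dim=1))
        return joint * self.f_glo(joint)          # refine with global context
```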

151 citations


Proceedings Article
27 Apr 2018
TL;DR: This paper proposes an iterative approach named Auto-balanced Filter Pruning, in which the network is pre-trained in an innovative auto-balanced way that transfers the representational capacity of its convolutional layers to a fraction of the filters, the redundant filters are pruned, and the network is then re-trained to restore accuracy.
Abstract: In recent years, considerable research effort has been devoted to compression techniques for convolutional neural networks (CNNs). Many works so far have focused on CNN connection pruning methods, which produce sparse parameter tensors in convolutional or fully-connected layers. It has been demonstrated in several studies that even simple methods can effectively eliminate connections of a CNN. However, since these methods make parameter tensors only sparser but no smaller, the compression may not transfer directly to acceleration without support from specially designed hardware. In this paper, we propose an iterative approach named Auto-balanced Filter Pruning, where we pre-train the network in an innovative auto-balanced way to transfer the representational capacity of its convolutional layers to a fraction of the filters, prune the redundant ones, then re-train it to restore the accuracy. In this way, a smaller version of the original network is learned and the floating-point operations (FLOPs) are reduced. By applying this method to several common CNNs, we show that a large portion of the filters can be discarded without an obvious accuracy drop, leading to a significant reduction of computational burden. Concretely, we reduce the inference cost of LeNet-5 on MNIST, and of VGG-16 and ResNet-56 on CIFAR-10, by 95.1%, 79.7% and 60.9%, respectively.
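The prune-then-retrain step can be illustrated with a generic filter-pruning sketch; the auto-balanced pre-training itself (the paper's key contribution) is not specified in the abstract, so the snippet below only shows an assumed "keep the strongest filters by L1 norm" step with hypothetical names.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep the filters with the largest L1 norms and drop the rest.
    This is only the generic 'prune the redundant filters' step; the paper's
    auto-balanced pre-training that concentrates capacity into a fraction of
    the filters happens before this and is not shown here."""
    with torch.no_grad():
        importance = conv.weight.abs().sum(dim=(1, 2, 3))   # L1 norm per filter
        n_keep = max(1, int(keep_ratio * conv.out_channels))
        keep_idx = torch.argsort(importance, descending=True)[:n_keep]
        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding,
                           bias=conv.bias is not None)
        pruned.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep_idx])
    return pruned  # downstream layers must be sliced to accept n_keep inputs
```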

103 citations


Journal ArticleDOI
TL;DR: The proposed GLA method can generate more relevant image description sentences and achieve the state-of-the-art performance on the well-known Microsoft COCO caption dataset with several popular evaluation metrics.
Abstract: In recent years, the task of automatically generating image descriptions has attracted a lot of attention in the field of artificial intelligence. Benefiting from the development of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), many approaches based on the CNN-RNN framework have been proposed to solve this task and have achieved remarkable progress. However, two problems remain to be tackled, since most existing methods use only the image-level representation. One problem is object missing, in which some important objects may be missing when generating the image description; the other is misprediction, in which an object may be recognized in a wrong category. In this paper, to address these two problems, we propose a new method called global–local attention (GLA) for generating image descriptions. The proposed GLA model utilizes an attention mechanism to integrate object-level features with the image-level feature. In this manner, our model can selectively pay attention to objects and context information concurrently. Therefore, our proposed GLA method can generate more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset under several popular evaluation metrics: CIDEr, METEOR, ROUGE-L and BLEU-1, 2, 3, 4.
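The snippet below is one plausible reading of the global-local idea (additive attention over object features, fused with the image-level feature before decoding), not the paper's exact formulation; dimensions, module names, and the fusion layer are assumptions.

```python
import torch
import torch.nn as nn

class GlobalLocalAttentionSketch(nn.Module):
    """Toy version of the idea described above: attend over object-level
    features, then fuse the attended object context with the image-level
    (global) feature before it is fed to the language decoder."""
    def __init__(self, obj_dim=2048, img_dim=2048, hid_dim=512, out_dim=512):
        super().__init__()
        self.w_obj = nn.Linear(obj_dim, hid_dim)
        self.w_hid = nn.Linear(out_dim, hid_dim)      # decoder hidden state
        self.score = nn.Linear(hid_dim, 1)
        self.fuse = nn.Linear(obj_dim + img_dim, out_dim)

    def forward(self, obj_feats, img_feat, dec_hidden):
        # obj_feats: (N, obj_dim) object features, img_feat: (img_dim,)
        e = self.score(torch.tanh(self.w_obj(obj_feats) + self.w_hid(dec_hidden)))
        alpha = torch.softmax(e, dim=0)               # attention over objects
        obj_ctx = (alpha * obj_feats).sum(dim=0)      # local (object) context
        return self.fuse(torch.cat([obj_ctx, img_feat]))  # global-local fusion
```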

98 citations


Book ChapterDOI
16 Sep 2018
TL;DR: A pulmonary nodule detection framework that can achieve high sensitivity with few candidates is proposed, together with a novel Attention 3D-CNN that efficiently utilizes contextual information to remove the overwhelming majority of false positives.
Abstract: Automated pulmonary nodule detection plays an important role in lung cancer diagnosis. In this paper, we propose a pulmonary nodule detection framework that can achieve high sensitivity with few candidates. First, the Feature Pyramid Network (FPN), which leverages multi-level features, is applied to detect nodule candidates that cover almost all true positives. Then redundant candidates are removed by a simple but effective Conditional 3-Dimensional Non-Maximum Suppression (Conditional 3D-NMS). Moreover, a novel Attention 3D CNN (Attention 3D-CNN), which efficiently utilizes contextual information, is proposed to further remove the overwhelming majority of false positives. The proposed method yields a sensitivity of 95.8% at 2 false positives per scan on the LUng Nodule Analysis 2016 (LUNA16) dataset, which is competitive with the current published state-of-the-art methods.
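As a rough illustration of the candidate-merging step, below is plain 3D non-maximum suppression over cube-shaped nodule candidates; the "conditional" part of the paper's Conditional 3D-NMS is not described in the abstract and is therefore omitted, and the candidate format (center plus side length) is an assumption.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned cubes given as (z, y, x, side_length)."""
    half_a, half_b = a[3] / 2.0, b[3] / 2.0
    overlap = 1.0
    for i in range(3):
        lo = max(a[i] - half_a, b[i] - half_b)
        hi = min(a[i] + half_a, b[i] + half_b)
        overlap *= max(0.0, hi - lo)
    union = a[3] ** 3 + b[3] ** 3 - overlap
    return overlap / union

def nms_3d(candidates, scores, iou_thresh=0.1):
    """Plain 3D non-maximum suppression over nodule candidates: keep the
    highest-scoring candidate and drop any remaining candidate that overlaps
    it above the threshold, then repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        rest = [i for i in order[1:]
                if iou_3d(candidates[best], candidates[i]) < iou_thresh]
        order = np.array(rest, dtype=int)
    return keep
```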

50 citations


Proceedings Article
01 Jan 2018
TL;DR: This is the first work to examine the influence of the attributes themselves and to propose using a refined attribute set for ZSL; since it focuses on selecting good attributes, the approach can be combined with any attribute-based ZSL method to augment its performance.
Abstract: Zero-shot learning (ZSL) is regarded as an effective way to construct classification models for target classes that have no labeled samples available. The basic framework is to transfer knowledge from (different) auxiliary source classes that have sufficient labeled samples, using attributes shared by the target and source classes as a bridge. Attributes play an important role in ZSL, but they have not gained sufficient attention in recent years. Previous works mostly assume attributes are perfect and treat each attribute equally. However, as shown in this paper, different attributes have different properties, such as their class distribution, variance, and entropy, which may have considerable impact on ZSL accuracy if treated equally. Based on this observation, in this paper we propose to use a subset of attributes, instead of the whole set, for building ZSL models. The attribute selection is conducted by considering the information amount and predictability under a novel joint optimization framework. To our knowledge, this is the first work that notices the influence of attributes themselves and proposes to use a refined attribute set for ZSL. Since our approach focuses on selecting good attributes for ZSL, it can be combined with any attribute-based ZSL approach so as to augment its performance. Experiments on four ZSL benchmarks demonstrate that our approach can improve zero-shot classification accuracy and yield state-of-the-art results.
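A toy rendering of the attribute-selection idea follows. The paper poses this as a joint optimization over information amount and predictability, whereas the sketch simply ranks attributes by a combined score; both criteria used here (binary entropy of the class-attribute matrix, a supplied predictability score) are assumptions.

```python
import numpy as np

def select_attributes(class_attr, predictability, k):
    """Toy stand-in for the attribute-selection idea above: rank attributes
    by a combination of 'information amount' (here, binary entropy of the
    attribute across classes) and 'predictability' (here assumed to be the
    held-out accuracy of a per-attribute classifier), then keep the top k.
    The paper formulates this as a joint optimization, not a simple ranking."""
    # class_attr: (num_classes, num_attrs) binary class-attribute matrix
    p = class_attr.mean(axis=0).clip(1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # information amount
    score = entropy * predictability                        # combine both criteria
    return np.argsort(score)[::-1][:k]                      # indices of kept attributes
```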

26 citations


Posted Content
Tianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, Jintao Li
TL;DR: A novel Kronecker convolution is proposed, which adopts the Kronecker product to expand the standard convolutional kernel so as to take into account the partial features neglected by atrous convolutions, together with a Tree-structured Feature Aggregation module that follows a recursive expansion rule to form a hierarchical structure.
Abstract: Most existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect partial information. To tackle this issue, we first propose a novel Kronecker convolution, which adopts the Kronecker product to expand the standard convolutional kernel so as to take into account the partial features neglected by atrous convolutions. Therefore, it can capture partial information and enlarge the receptive field of filters simultaneously without introducing extra parameters. Secondly, we propose a Tree-structured Feature Aggregation (TFA) module, which follows a recursive rule to expand and form a hierarchical structure. Thus, it can naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design Tree-structured Kronecker Convolutional Networks (TKCN), which employ the Kronecker convolution and the TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the effectiveness of our proposed approach. We make the code and the trained model publicly available at this https URL.
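The Kronecker expansion of a convolutional kernel can be sketched as below; the specific pattern matrix the paper uses to retain information inside the dilation holes is not given in the abstract, so the pattern here is an assumed placeholder and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def kronecker_conv2d(x, weight, pattern, bias=None):
    """Convolve x with a kernel expanded by a Kronecker product.
    weight: (out_c, in_c, k, k) standard kernel; pattern: (r, r) small matrix.
    A pattern with a single non-zero corner entry reproduces a dilated kernel,
    while denser patterns re-introduce the 'partial information' the abstract
    says atrous convolution ignores. The concrete pattern is an assumption."""
    out_c, in_c, k, _ = weight.shape
    r = pattern.shape[0]
    # Kronecker-expand every k x k spatial kernel into a (k*r) x (k*r) kernel
    expanded = torch.einsum('oikl,rs->oikrls', weight, pattern)
    expanded = expanded.reshape(out_c, in_c, k * r, k * r)
    return F.conv2d(x, expanded, bias=bias, padding=(k * r - 1) // 2)
```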

22 citations


Proceedings ArticleDOI
15 Oct 2018
TL;DR: The Style Separation and Synthesis Generative Adversarial Network (S3-GAN) is proposed to simultaneously implement style separation and style synthesis on object photographs of specific categories.
Abstract: Style synthesis has attracted great interest recently, while few works focus on its dual problem, "style separation". In this paper, we propose the Style Separation and Synthesis Generative Adversarial Network (S3-GAN) to simultaneously implement style separation and style synthesis on object photographs of specific categories. Based on the assumption that the object photographs lie on a manifold, and that the contents and styles are independent, we employ S3-GAN to build mappings between the manifold and a latent vector space for separating and synthesizing the contents and styles. The S3-GAN consists of an encoder network, a generator network, and an adversarial network. The encoder network performs style separation by mapping an object photograph to a latent vector, whose two halves represent the content and style, respectively. The generator network performs style synthesis by taking a concatenated vector as input, which contains the style half-vector of the style target image and the content half-vector of the content target image. Once the images are obtained from the generator network, an adversarial network is imposed to generate more photo-realistic images. Experiments on the CelebA and UT Zappos 50K datasets demonstrate that the S3-GAN has the capacity for style separation and synthesis simultaneously, and can capture various styles in a single model.
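The separation-and-swap step can be pictured with the following sketch; the half-and-half latent split comes directly from the abstract, but the encoder/generator call signatures and shapes are assumptions.

```python
import torch

def swap_style(encoder, generator, content_img, style_img):
    """Sketch of the separation/synthesis step described above: encode both
    images, split each latent vector into a content half and a style half,
    then generate from the content half of one image and the style half of
    the other. `encoder` and `generator` stand for the trained S3-GAN
    networks; their exact interfaces here are assumptions."""
    z_content = encoder(content_img)              # latent of the content target
    z_style = encoder(style_img)                  # latent of the style target
    half = z_content.shape[-1] // 2
    z_mix = torch.cat([z_content[..., :half],     # content half
                       z_style[..., half:]],      # style half
                      dim=-1)
    return generator(z_mix)
```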

19 citations


Journal ArticleDOI
TL;DR: This paper proposes to divide the background category into multiple implicit sub-categories to explicitly differentiate the diverse patterns within it, and uses dilated convolution, which is widely used in the semantic segmentation task, for efficient and valuable context information extraction.
Abstract: In this paper, we focus on improving the proposal classification stage in the object detection task and present implicit negative sub-categorization and sink diversion to lift the performance by strengthening the loss function in this stage. First, based on the observation that the “background” class is generally very diverse and thus challenging to handle as a single indiscriminative class in existing state-of-the-art methods, we propose to divide the background category into multiple implicit sub-categories to explicitly differentiate diverse patterns within it. Second, since the ground truth class inevitably has low-value probability scores for certain images, we propose to add a “sink” class and divert the probabilities of wrong classes to this class when necessary, such that the ground truth label will still have a higher probability than other wrong classes even though it has low probability output. Additionally, we propose to use dilated convolution, which is widely used in the semantic segmentation task, for efficient and valuable context information extraction. Extensive experiments on the PASCAL VOC 2007 and 2012 datasets show that our proposed methods, based on a Faster R-CNN implementation, can achieve state-of-the-art mAPs, i.e., 84.1% and 82.6%, respectively, and obtain a 2.5% improvement on ILSVRC DET compared with ResNet.
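One plausible way to realize the implicit background sub-categories is sketched below, where the background sub-class logits are pooled with log-sum-exp before the cross-entropy so that any sub-category may claim a background proposal; this is an interpretation of the abstract, not the authors' exact loss, and the "sink" diversion mechanism is not modeled.

```python
import torch
import torch.nn.functional as F

def subcat_background_loss(logits, targets, num_fg, num_bg_sub):
    """Illustration of the 'implicit background sub-categories' idea:
    the classifier outputs num_fg foreground logits plus num_bg_sub
    background logits, and a background proposal is counted as correct if
    any of its sub-category logits wins, via log-sum-exp pooling.
    targets: labels in [0, num_fg], where num_fg denotes 'background'."""
    fg_logits = logits[:, :num_fg]
    bg_logit = torch.logsumexp(logits[:, num_fg:num_fg + num_bg_sub],
                               dim=1, keepdim=True)   # pool the sub-categories
    merged = torch.cat([fg_logits, bg_logit], dim=1)  # background = class num_fg
    return F.cross_entropy(merged, targets)
```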

16 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: A High Resolution Feature Recovering (HRFR) framework is proposed to accelerate a given parsing network, speeding up scene parsing inference by about 3.0x from 1/2 down-sampled input with negligible accuracy reduction.
Abstract: Both accuracy and speed are equally important in urban scene parsing. Most of the existing methods mainly focus on improving parsing accuracy, ignoring the problem of low inference speed due to large-sized input and high-resolution feature maps. To tackle this issue, we propose a High Resolution Feature Recovering (HRFR) framework to accelerate a given parsing network. A Super-Resolution Recovering module is employed to recover features of the large original-sized image from features of the down-sampled input. Therefore, our framework can combine the advantages of (1) the fast speed of networks with down-sampled input and (2) the high accuracy of networks with large original-sized input. Additionally, we employ auxiliary intermediate supervision and boundary region re-weighting to facilitate the optimization of the network. Extensive experiments on the two challenging Cityscapes and CamVid datasets well demonstrate the effectiveness of the proposed HRFR framework, which can accelerate the scene parsing inference process by about 3.0x from 1/2 down-sampled input with negligible accuracy reduction.
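A minimal sketch of what a Super-Resolution Recovering module could look like is given below, using a sub-pixel (PixelShuffle) upsampler to recover 2x-resolution features from the features of a half-sized input; the exact layer layout is not specified in the abstract and is assumed here.

```python
import torch.nn as nn

class FeatureRecoveringSketch(nn.Module):
    """Sketch of a Super-Resolution Recovering module in the spirit of the
    description above: take feature maps computed from a 1/2 down-sampled
    image and upsample them 2x with a sub-pixel convolution so that they
    approximate the features of the original-sized input."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.recover = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),          # rearrange channels into space
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, low_res_feats):
        return self.recover(low_res_feats)
```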

10 citations


Journal ArticleDOI
TL;DR: This work proposes an efficient hierarchical BoW (HBoW) that achieves large visual words through quantization with a compact vocabulary instead of a large one, together with a soft BoW assignment method so that the proposed HBoW can tolerate different word selections for similar patches.
Abstract: The bag-of-words (BoW) model has been widely regarded as one of the most successful algorithms for content-based image-related tasks, such as large-scale image retrieval, classification, and object categorization. Large visual words acquired by BoW quantization through large vocabularies or codebooks have been receiving much attention in the past years. However, not only the construction of a large vocabulary but also the quantization process imposes a heavy burden in terms of time and memory complexity. In order to tackle this issue, we propose an efficient hierarchical BoW (HBoW) to achieve large visual words through quantization with a compact vocabulary instead of a large one. Our vocabulary is very compact since it is composed of only two small dictionaries, which are learned through segmental sparse decomposition of local features. To generate the BoW with large size, we first divide the local features into two halves and use the two small dictionaries to compute their sparse codes. Then, we map the two indices of the maximum elements of the two sparse codes to a large set of visual words, based upon the fact that data with similar properties will share the same base weighted with the largest sparse coefficient. To further make similar patches have a higher probability of selecting the same dictionary base and thus obtaining similar BoW vectors, we propose a novel collaborative dictionary learning method that imposes a similarity regularization factor together with row-sparsity regularization across data instances during group sparse coding. Additionally, based on the index combination of the top-2 largest sparse codes of local descriptors, we propose a soft BoW assignment method so that our proposed HBoW can tolerate different word selections for similar patches. By employing the inverted file structure built through our HBoW, the K-nearest neighbors (KNN) can be efficiently retrieved. After incorporating our fast KNN search into the SVM-KNN classification method, our HBoW can be used for efficient image classification and logo recognition. Experiments on several well-known datasets show that our approach is effective for large-scale image classification and retrieval.
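The two-dictionary quantization can be sketched as follows. A real implementation would use the paper's segmental sparse decomposition; the snippet below approximates the sparse-coding step with a simple correlation argmax so that the index-combination idea stays visible, and the function name and dictionary layout are assumptions.

```python
import numpy as np

def hierarchical_bow_word(descriptor, dict_a, dict_b):
    """Toy version of the HBoW quantization described above: split a local
    descriptor into two halves, find the dominant atom of each half in its
    small dictionary, and combine the two indices into one large visual word.
    dict_a: (Ka, d/2) and dict_b: (Kb, d/2) hold dictionary atoms as rows."""
    half = len(descriptor) // 2
    a = np.abs(dict_a @ descriptor[:half])        # response of first-half atoms
    b = np.abs(dict_b @ descriptor[half:])        # response of second-half atoms
    i, j = int(np.argmax(a)), int(np.argmax(b))
    return i * dict_b.shape[0] + j                # one of Ka * Kb visual words
```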

Proceedings ArticleDOI
01 Sep 2018
TL;DR: A pipeline based on an adaptive balancing loss (ABL) for image captioning, which re-weighs the loss of each category dynamically over the training process, can improve accuracy and increase the diversity of generated descriptions by adaptively reducing the losses of well-classified and frequent categories and increasing the losses of under-classified and infrequent categories.
Abstract: Recently, most pioneering works proposed for the image captioning task have been based on supervised learning. These approaches are heavily dependent on labeled training data. Through careful observation, we note that these approaches suffer from the problem of class imbalance (CIB), which can lead to performance degradation and limit the diversity of generated sentences. In this paper, to address this problem, we propose a pipeline based on an adaptive balancing loss (ABL) for image captioning, which re-weighs the loss of each category dynamically over the training process. Our proposed method can improve accuracy and increase the diversity of generated descriptions by adaptively reducing the losses of well-classified and frequent categories and increasing the losses of under-classified and infrequent categories. We conduct experiments on the well-known MS COCO caption dataset to evaluate the performance of the proposed method. The results show that our approach achieves competitive performance compared to state-of-the-art methods and can generate more accurate and diverse captions.
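The abstract does not give the ABL formula, so the snippet below is only an assumed illustration of dynamic per-category re-weighting: a running class weight grows for poorly classified (typically infrequent) words, shrinks for easy ones, and scales the token-level cross-entropy. The update rule and names are hypothetical, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def adaptive_balanced_loss(logits, targets, class_weight, momentum=0.99):
    """Illustration of dynamic re-weighting for captioning losses.
    class_weight: a (vocab_size,) buffer carried across training steps;
    it is nudged up for hard (low-probability) target words and decays
    toward small values for easy ones."""
    ce = F.cross_entropy(logits, targets, reduction='none')   # per-token loss
    with torch.no_grad():
        probs = F.softmax(logits, dim=1)
        correct_prob = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        # harder targets push their class weight up, easier ones pull it down
        new_w = class_weight[targets] * momentum + (1.0 - correct_prob) * (1 - momentum)
        class_weight[targets] = new_w
    return (class_weight[targets] * ce).mean()
```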

Proceedings Article
01 Jan 2018
TL;DR: An Inverse Reinforcement Learning (IRL) based learning and thinking strategy for sequence generation that can fill in the space that has not been exposed during training with a better policy than the original RNN model is proposed.
Abstract: In order to alleviate the exposure bias problem caused by the discrepancy between training and testing strategies, we propose an Inverse Reinforcement Learning (IRL) based learning and thinking strategy for sequence generation. First, a task-agnostic reward is learned to evaluate the appropriateness of the generated tokens at each time step, using the knowledge of the ground-truth token and the current RNN model. With this reward, a deep SARSA network is then designed to meditate among the whole space. Therefore, it can fill in the space that has not been exposed during training with a better policy than the original RNN model. Sequence generation experiments on various text corpora show significant improvements over strong baselines and demonstrate the effectiveness of our method.
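A hypothetical one-step deep SARSA objective in this spirit is shown below: the Q-network regresses toward r + gamma * Q(s', a'), where a' is the action actually taken at the next step and r comes from the learned (IRL) reward; shapes and the q_net interface are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sarsa_loss(q_net, state, action, reward, next_state, next_action, gamma=0.99):
    """One-step SARSA regression target for token-level sequence generation.
    q_net(state) is assumed to return per-token Q-values of shape
    (batch, vocab_size); action/next_action are the token ids actually taken."""
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(next_state).gather(1, next_action.unsqueeze(1)).squeeze(1)
        target = reward + gamma * q_next          # reward comes from the learned IRL model
    return F.mse_loss(q_sa, target)
```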