scispace - formally typeset

Author

Gang Sun

Other affiliations: SenseTime
Bio: Gang Sun is an academic researcher from the Chinese Academy of Sciences. The author has contributed to research in the topics of Feature (computer vision) and Convolutional neural network. The author has an h-index of 7, and has co-authored 12 publications receiving 8,592 citations. Previous affiliations of Gang Sun include SenseTime.
Papers

Posted Content
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, +1 more (2 institutions)
Abstract: While the use of bottom-up local operators in convolutional neural networks (CNNs) matches well some of the statistics of natural images, it may also prevent such models from capturing contextual long-range feature interactions. In this work, we propose a simple, lightweight approach for better context exploitation in CNNs. We do so by introducing a pair of operators: gather, which efficiently aggregates feature responses from a large spatial extent, and excite, which redistributes the pooled information to local features. The operators are cheap, both in terms of number of added parameters and computational complexity, and can be integrated directly in existing architectures to improve their performance. Experiments on several datasets show that gather-excite can bring benefits comparable to increasing the depth of a CNN at a fraction of the cost. For example, we find ResNet-50 with gather-excite operators is able to outperform its 101-layer counterpart on ImageNet with no additional learnable parameters. We also propose a parametric gather-excite operator pair which yields further performance gains, relate it to the recently-introduced Squeeze-and-Excitation Networks, and analyse the effects of these changes to the CNN feature activation statistics.

7 citations
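The gather-excite pair described in the abstract can be illustrated with a minimal, parameter-free sketch in plain Python: gather is taken here as average pooling over each channel's full spatial extent, and excite as a sigmoid gate redistributed to every spatial position. The function name and the choice of a global extent are illustrative assumptions; the paper also considers smaller extents and a parametric variant.

```python
import math

def gather_excite(feature_map):
    """Parameter-free gather-excite on a C x H x W feature map (nested lists).

    gather: average-pool each channel over its full spatial extent.
    excite: squash the pooled context with a sigmoid and rescale every
    spatial position of that channel by the resulting gate.
    """
    out = []
    for channel in feature_map:
        # gather: aggregate responses over the whole spatial extent
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)
        # excite: redistribute the pooled context as a multiplicative gate
        gate = 1.0 / (1.0 + math.exp(-pooled))
        out.append([[v * gate for v in row] for row in channel])
    return out

# one 2x2 channel with uniformly strong responses -> gate close to 1
gated = gather_excite([[[2.0, 2.0], [2.0, 2.0]]])
```

Because the gate is computed from pooled context rather than learned weights, this variant adds no learnable parameters, matching the ResNet-50 result quoted above.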


Journal ArticleDOI
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, +1 more (2 institutions)
18 Jun 2018
TL;DR: This work proposes a novel architectural unit, which is term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ~25 percent. Models and code are available at https://github.com/hujie-frank/SENet .

7,070 citations
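The squeeze-excitation-recalibrate computation described above can be sketched in plain Python for a C-channel feature map: global average pooling (squeeze), a two-layer bottleneck with ReLU then sigmoid (excitation), and channel-wise rescaling. The weights `w1` and `w2` below are illustrative placeholders standing in for trained parameters, and the reduction ratio is implied by their shapes.

```python
import math

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation on a C x H x W feature map (nested lists).

    squeeze: global average pooling gives one descriptor per channel.
    excitation: a two-layer bottleneck (ReLU then sigmoid) turns the
    descriptors into per-channel gates that rescale the input.
    """
    C = len(feature_map)
    # squeeze: C spatial maps -> C scalars
    z = [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # excitation: FC (C -> C/r) + ReLU, then FC (C/r -> C) + sigmoid
    hidden = [max(0.0, sum(w1[j][i] * z[i] for i in range(C)))
              for j in range(len(w1))]
    gates = [1.0 / (1.0 + math.exp(-sum(w2[c][j] * hidden[j]
                                        for j in range(len(hidden)))))
             for c in range(C)]
    # recalibrate: scale each channel by its learned gate
    return [[[v * gates[c] for v in row] for row in feature_map[c]]
            for c in range(C)]

# example: 2 channels (2x2 each), bottleneck reduced to one hidden unit
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[0.5, 0.5]]        # illustrative FC weights, C -> C/r
w2 = [[1.0], [-1.0]]     # illustrative FC weights, C/r -> C
recalibrated = se_block(fmap, w1, w2)
```

The gates depend on all channel descriptors jointly, which is how the block models the interdependencies between channels that the abstract refers to.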


Proceedings Article
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, +1 more (2 institutions)
01 Jan 2018
Abstract: While the use of bottom-up local operators in convolutional neural networks (CNNs) matches well some of the statistics of natural images, it may also prevent such models from capturing contextual long-range feature interactions. In this work, we propose a simple, lightweight approach for better context exploitation in CNNs. We do so by introducing a pair of operators: gather, which efficiently aggregates feature responses from a large spatial extent, and excite, which redistributes the pooled information to local features. The operators are cheap, both in terms of number of added parameters and computational complexity, and can be integrated directly in existing architectures to improve their performance. Experiments on several datasets show that gather-excite can bring benefits comparable to increasing the depth of a CNN at a fraction of the cost. For example, we find ResNet-50 with gather-excite operators is able to outperform its 101-layer counterpart on ImageNet with no additional learnable parameters. We also propose a parametric gather-excite operator pair which yields further performance gains, relate it to the recently-introduced Squeeze-and-Excitation Networks, and analyse the effects of these changes to the CNN feature activation statistics.

197 citations


Posted Content
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, +1 more (2 institutions)
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission which won first place and reduced the top-5 error to 2.251%, surpassing the winning entry of 2016 by a relative improvement of ~25%. Models and code are available at this https URL.

641 citations


Proceedings ArticleDOI
Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, +1 more (2 institutions)
01 Jun 2016
TL;DR: A key volume mining deep framework to identify key volumes and conduct classification simultaneously and an effective yet simple "unsupervised key volume proposal" method for high quality volume sampling are proposed.
Abstract: Recently, deep learning approaches have demonstrated remarkable progress for action recognition in videos. Most existing deep frameworks treat every volume, i.e. spatial-temporal video clip, equally, and directly assign a video label to all volumes sampled from it. However, within a video, discriminative actions may occur sparsely in a few key volumes, and most other volumes are irrelevant to the labeled action category. Training with a large proportion of irrelevant volumes will hurt performance. To address this issue, we propose a key volume mining deep framework to identify key volumes and conduct classification simultaneously. Specifically, our framework is optimized in an alternating manner integrated into the forward and backward stages of Stochastic Gradient Descent (SGD). In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates network parameters with the help of these mined key volumes. In addition, we propose "Stochastic out" to model key volumes from multiple modalities, and an effective yet simple "unsupervised key volume proposal" method for high-quality volume sampling. Our experiments show that action recognition performance can be significantly improved by mining key volumes, and we achieve state-of-the-art performance on HMDB51 and UCF101 (93.1%).

243 citations
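The forward-pass mining step described above can be caricatured as selecting, for each video, the sampled volumes whose scores for the labelled class are highest; only those would then drive the parameter update in the backward pass, instead of every sampled volume. This is a simplified sketch under that assumption, omitting the "Stochastic out" multi-modality modelling and the unsupervised proposal step.

```python
def mine_key_volumes(volume_scores, label, top_k=1):
    """Forward-pass key-volume mining (simplified sketch).

    volume_scores: list of per-volume class-score vectors for one video.
    label: index of the video-level action class.
    Returns the indices of the top_k volumes scoring highest for the
    labelled class; only these contribute to the backward pass.
    """
    ranked = sorted(range(len(volume_scores)),
                    key=lambda i: volume_scores[i][label],
                    reverse=True)
    return ranked[:top_k]

# three volumes, two classes: only volume 1 strongly supports class 0
keys = mine_key_volumes([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], label=0)
```

Filtering the loss to these indices is what keeps the many irrelevant volumes from diluting the gradient, which is the failure mode the abstract identifies.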


Cited by

Journal ArticleDOI
Jiang He, Qiangqiang Yuan, Jie Li, Liangpei Zhang (1 institution)
Abstract: Spectral super-resolution is a very important technique for obtaining hyperspectral images from only multispectral images, which can effectively address the high acquisition cost and low spatial resolution of hyperspectral images. However, in practice, multispectral channels or images captured by the same sensor often have different spatial resolutions, which brings a severe challenge to spectral super-resolution. This paper proposes a universal spectral super-resolution network based on physical optimization unfolding (PoNet) for arbitrary multispectral images, including single-resolution and cross-scale multispectral images. Furthermore, two new strategies are proposed to make full use of the spectral information, namely, cross-dimensional channel attention and cross-depth feature fusion. Experimental results on five data sets show the superiority and stability of PoNet in addressing all spectral super-resolution situations.

Journal ArticleDOI
Yunhan Kim, Kyumin Na, Byeng D. Youn (1 institution)
Abstract: This research proposes a newly designed convolutional neural network (CNN) for gearbox fault diagnostics. A conventional CNN is a deep-learning model that offers distinctive performance for analyzing two-dimensional image data. To exploit this ability, prior work has been developed using time–frequency analysis, which derives image-like data that is fed into the CNN model. However, the existing time–frequency analysis approach employs fixed basis functions that are limited in their ability to capture fault-related signals in the image. To address this challenge, we propose a health-adaptive time-scale representation (HTSR) embedded CNN (HTSR-CNN). The proposed HTSR approach is designed to exploit the concept of TSR, which is informed by the physics of the time and frequency characteristics induced by the fault-related signals. Instead of using fixed basis functions, the HTSR is constructed using multiscale convolutional filters that behave like the adaptive basis functions. These multiscale filters are effectively learned to include the enriched fault-related information in the HTSR through end-to-end learning of the HTSR-CNN model. The performance of the proposed HTSR-CNN is validated by examining two case studies: vibration signals from a two-stage spur gearbox and vibration signals from a planetary gearbox. From the case study results, the proposed HTSR-CNN method is found to have superior performance for gearbox fault diagnostics, as compared to existing CNN-based fault diagnostic methods.

Journal ArticleDOI
Abstract: The so-called "attention mechanisms" in Deep Neural Networks (DNNs) denote an automatic adaptation of DNNs to capture representative features for a given classification task and related data. Such attention mechanisms operate both globally, by reinforcing feature channels, and locally, by stressing features in each feature map. Channel and feature importance are learnt in the global end-to-end DNN training process. In this paper, we present a study and propose a method with a different approach, adding supplementary visual data alongside the training images. We use human visual attention maps obtained independently through psycho-visual experiments, in both task-driven and free-viewing conditions, or from powerful models for the prediction of visual attention maps. We add these visual attention maps as new data alongside the images, thus introducing human visual attention into the DNN training, and compare it with both global and local automatic attention mechanisms. Experimental results show that known attention mechanisms in DNNs behave much like human visual attention, but the proposed approach nevertheless allows faster convergence and better performance in image classification tasks.

Journal ArticleDOI
Shuanlong Niu, Bin Li, Xinggang Wang, Songping He, +1 more (1 institution)
Abstract: Surface defect segmentation is very important for the quality inspection of industrial production and is an important pattern recognition problem. Although deep learning (DL) has achieved remarkable results in surface defect segmentation, most of these results have been obtained by using massive images with pixel-level annotations, which are difficult to obtain at industrial sites. This paper proposes a weakly supervised defect segmentation method based on the dynamic templates generated by an improved cycle-consistent generative adversarial network (CycleGAN) trained by image-level annotations. To generate better templates for defects with weak signals, we propose a defect attention module by applying the defect residual for the discriminator to strengthen the elimination of defect regions and suppress changes in the background. A defect cycle-consistent loss is designed by adding structural similarity (SSIM) to the original L1 loss to include the grayscale and structural features; the proposed loss can better model the inner structure of defects. After obtaining the defect-free template, a defect segmentation map can easily be obtained through a simple image comparison and threshold segmentation. Experiments show that the proposed method is both efficient and effective, significantly outperforms other weakly supervised methods, and achieves performance that is comparable or even superior to that of supervised methods on three industrial datasets (intersection over union (IoU) on the DAGM 2007, KSD, and CCSD datasets of 78.28%, 59.43%, and 68.83%, respectively). The proposed method can also be employed as a semiautomatic annotation tool combined with active learning.
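The final step the abstract describes, comparing an image against its generated defect-free template and thresholding the difference, reduces to a few lines. This sketch assumes grayscale images as nested lists of floats; the function name and the threshold value are illustrative placeholders, and the template would come from the trained CycleGAN.

```python
def segment_defects(image, template, threshold=0.2):
    """Binary defect mask from an image and its defect-free template.

    A pixel is marked as defective (1) when its absolute deviation
    from the template exceeds the threshold, otherwise background (0).
    """
    return [[1 if abs(p - t) > threshold else 0
             for p, t in zip(img_row, tpl_row)]
            for img_row, tpl_row in zip(image, template)]

# a single bright pixel deviating from a flat template is flagged
mask = segment_defects([[0.1, 0.9], [0.1, 0.1]],
                       [[0.1, 0.1], [0.1, 0.1]])
```

This is why the quality of the generated template matters so much in the paper: the segmentation itself is deliberately simple, so any background change the generator introduces shows up directly as a false detection.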

Journal ArticleDOI
Wenmeng Yu, Hua Xu (1 institution)
Abstract: Previous research on Facial Expression Recognition (FER) assisted by facial landmarks mainly focused on single-task learning or hard-parameter sharing based multi-task learning. However, soft-parameter sharing based methods have not been explored in this area. Therefore, this paper adopts Facial Landmark Detection (FLD) as the auxiliary task and explores new multi-task learning strategies for FER. First, three classical multi-task structures, including Hard-Parameter Sharing (HPS), Cross-Stitch Network (CSN), and Partially Shared Multi-task Convolutional Neural Network (PS-MCNN), are used to verify the advantages of multi-task learning for FER. Then, we propose a new end-to-end Co-attentive Multi-task Convolutional Neural Network (CMCNN), which is composed of the Channel Co-Attention Module (CCAM) and the Spatial Co-Attention Module (SCAM). Functionally, the CCAM generates the channel co-attention scores by capturing the inter-dependencies of different channels between FER and FLD tasks. The SCAM combines the max- and average-pooling operations to formulate the spatial co-attention scores. Finally, we conduct extensive experiments on four widely used benchmark facial expression databases, including RAF, SFEW2, CK+, and Oulu-CASIA. Extensive experimental results show that our approach achieves better performance than single-task and multi-task baselines, fully validating the effectiveness and generalizability of multi-task learning.

Network Information
Related Authors (5)
Li Shen

72 papers, 11.3K citations

91% related
Shuhui Wang

153 papers, 2.4K citations

88% related
Qingming Huang

603 papers, 16.6K citations

84% related
Enhua Wu

266 papers, 10.3K citations

80% related
Zhouchen Lin

417 papers, 19.3K citations

80% related
Performance
Metrics

Author's H-index: 7

No. of papers from the Author in previous years
Year    Papers
2018    3
2017    1
2016    1
2015    4
2014    1
2013    2