
Author

Enhua Wu

Bio: Enhua Wu is an academic researcher from the Chinese Academy of Sciences. The author has contributed to research in topics including rendering (computer graphics) and polygon meshes. The author has an h-index of 24 and has co-authored 266 publications receiving 10,340 citations. Previous affiliations of Enhua Wu include the University of Macau and Academia Sinica.
Papers

Proceedings Article
Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, +2 more (4 institutions)
06 Dec 2021
Abstract: The transformer is a new kind of neural architecture that encodes input data into powerful features via the attention mechanism. Visual transformers first divide the input image into several local patches and then compute both their representations and their relationships. Since natural images are highly complex, with abundant detail and color information, this granularity of patch division is not fine enough to excavate features of objects at different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building high-performance visual transformers, and we explore a new architecture, namely Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and propose to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word is calculated with the other words in the given visual sentence at negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture; e.g., we achieve 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at this https URL, and the MindSpore code is at this https URL.
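
As a rough illustration of the word/sentence mechanism (a minimal sketch under assumptions, not the authors' released code; class name, head counts, and dimensions are invented), a TNT-style block runs one attention over the "words" inside each patch and another over the patch "sentences":

# Minimal sketch of a TNT-style block, assuming 16x16 patches split into
# 4x4 words; names and dimensions are illustrative, not the official code.
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    def __init__(self, word_dim=24, sentence_dim=384, words_per_patch=16):
        super().__init__()
        # Inner attention: words attend to words within the same visual sentence.
        self.inner_attn = nn.MultiheadAttention(word_dim, num_heads=4, batch_first=True)
        # Project the concatenated word features into the sentence embedding.
        self.word_to_sentence = nn.Linear(word_dim * words_per_patch, sentence_dim)
        # Outer attention: sentences (patches) attend to each other.
        self.outer_attn = nn.MultiheadAttention(sentence_dim, num_heads=6, batch_first=True)

    def forward(self, words, sentences):
        # words: (batch * num_patches, words_per_patch, word_dim)
        # sentences: (batch, num_patches, sentence_dim)
        words = words + self.inner_attn(words, words, words)[0]
        b, n, _ = sentences.shape
        # Aggregate word features into their sentence representation.
        sentences = sentences + self.word_to_sentence(words.reshape(b, n, -1))
        sentences = sentences + self.outer_attn(sentences, sentences, sentences)[0]
        return words, sentences

The inner attention stays cheap because the word dimension is small relative to the sentence dimension, which matches the abstract's claim of negligible extra cost.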

13 citations


Journal ArticleDOI
Kai Han, Yunhe Wang, Chang Xu, Chunjing Xu, +2 more (3 institutions)
Abstract: This paper introduces versatile filters for constructing the efficient convolutional neural networks widely used in visual recognition tasks. Given the demand for efficient deep learning techniques running on cost-effective hardware, a number of methods have been developed to learn compact neural networks. Most of these works aim to slim down filters in different ways, e.g., by investigating small, sparse, or quantized filters. In contrast, we treat filters from an additive perspective: a series of secondary filters can be derived from a primary filter with the help of binary masks. These secondary filters all inherit from the primary filter without occupying more storage, but once unfolded in computation they can significantly enhance the capability of the filter by integrating information extracted from different receptive fields. Besides spatial versatile filters, we additionally investigate versatile filters from the channel perspective. Binary masks can be further customized for different primary filters under orthogonal constraints. We conduct a theoretical analysis of network complexity and introduce an efficient convolution scheme. Experimental results on benchmark datasets and neural networks demonstrate that our versatile filters achieve accuracy comparable to that of the original filters while requiring less memory and computational cost.
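
A hedged sketch of the additive idea (illustrative only; the actual masks and constraints in the paper differ): secondary filters are obtained by masking nested spatial regions of a single stored primary filter, so only the primary filter occupies memory:

# Illustrative sketch: derive spatial "secondary" filters from one stored
# primary filter via binary masks (nested receptive fields). Not the
# authors' exact scheme; masks and the convolution layout are assumptions.
import torch
import torch.nn.functional as F

def versatile_conv(x, primary, num_masks=3):
    # primary: (out_ch, in_ch, k, k) -- the only weights that are stored.
    k = primary.shape[-1]  # assumes an odd kernel size, e.g., 5 or 7
    outputs = []
    for i in range(num_masks):
        # Binary mask keeping a centered (k - 2*i) x (k - 2*i) window,
        # so each secondary filter sees a smaller receptive field.
        mask = torch.zeros_like(primary)
        mask[..., i:k - i, i:k - i] = 1.0
        outputs.append(F.conv2d(x, primary * mask, padding=k // 2))
    # Concatenate responses from the different receptive fields.
    return torch.cat(outputs, dim=1)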

Proceedings Article
Jiaming Liu, Ming Lu, Kaixin Chen, Xiaoqi Li, +6 more (4 institutions)
18 Aug 2021
Abstract: Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of video delivery greatly depends on Internet bandwidth. Recently, deep neural networks (DNNs) have been utilized to improve the quality of video delivery. These methods divide a video into chunks and stream low-resolution (LR) video chunks together with corresponding content-aware models to the client; the client then runs model inference to super-resolve the LR chunks. Consequently, a large number of models must be streamed to deliver a single video. In this paper, we first carefully study the relations between the models of different chunks, and then design a joint training framework along with a Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. With our method, each video chunk requires less than 1% of the original parameters to be streamed, while achieving even better SR performance. We conduct extensive experiments across various SR backbones, video lengths, and scaling factors to demonstrate the advantages of our method. Our method can also be viewed as a new approach to video coding: our preliminary experiments achieve better video quality than the commercial H.264 and H.265 standards under the same storage cost, showing the great potential of the proposed method. Code is available at this https URL.
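
As a rough sketch of the idea (class and parameter names are assumptions, not the released code), a CaFM-style layer lets all chunks share one backbone while each chunk streams only a small per-channel modulation:

# Minimal sketch of content-aware feature modulation: a shared backbone
# layer plus a tiny per-chunk channel-wise modulation. Names and shapes
# are illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn

class CaFMConvSketch(nn.Module):
    def __init__(self, channels, num_chunks):
        super().__init__()
        # Shared 3x3 convolution: trained once, streamed once per video.
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)
        # Per-chunk channel scale and bias: the only parameters that differ
        # between chunks, hence the <1% streaming overhead claimed above.
        self.scale = nn.Parameter(torch.ones(num_chunks, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(num_chunks, channels, 1, 1))

    def forward(self, x, chunk_id):
        y = self.shared(x)
        return y * self.scale[chunk_id] + self.bias[chunk_id]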

1 citation


Journal ArticleDOI
Zhihua Chen, Jun Qiu, Bin Sheng, Ping Li, +2 more (5 institutions)
Abstract: Due to complex environmental factors and varied parking scenes, automatic parking faces more stringent requirements than manual parking. Existing auto-parking technology is based on either the space dimension or the plane dimension: the former usually ignores the ground parking spot lines, which may cause parking at a wrong position, while the latter often spends a lot of time on object classification, which may decrease the algorithm's applicability. In this paper, we propose a generative parking spot detection algorithm that uses a multi-clue recovery model to reconstruct parking spots. In the proposed method, we first decompose the parking spot geometrically to mark the locations of its corners, and then use a micro-target recognition network to find these corners in ground images taken by the car's cameras. We then use the multi-clue model to correct the full pairing map so that the true parking spot can be recovered reliably. The proposed algorithm is compared with several existing algorithms, and the experimental results show that it achieves higher accuracy, reaching more than 80% in most test cases.
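
To make the pairing step concrete, here is a heavily simplified, hypothetical sketch: detected corners are paired into candidate spot edges when their separation matches a plausible entrance width (the paper's multi-clue recovery model is considerably more involved, and the width bounds here are invented):

# Hypothetical sketch of corner pairing for parking spot recovery:
# keep corner pairs whose distance matches a plausible spot entrance
# width. The paper's multi-clue recovery model is more sophisticated.
import math

def pair_corners(corners, min_width=2.0, max_width=3.5):
    # corners: list of (x, y) positions in meters from the corner detector.
    pairs = []
    for i in range(len(corners)):
        for j in range(i + 1, len(corners)):
            (x1, y1), (x2, y2) = corners[i], corners[j]
            d = math.hypot(x2 - x1, y2 - y1)
            if min_width <= d <= max_width:  # plausible entrance edge
                pairs.append((corners[i], corners[j]))
    return pairs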

1 citation


Cited by

Journal ArticleDOI
Jiang He, Qiangqiang Yuan, Jie Li, Liangpei Zhang (1 institution)
Abstract: Spectral super-resolution is an important technique for obtaining hyperspectral images from multispectral images alone, and it can effectively mitigate the high acquisition cost and low spatial resolution of hyperspectral imaging. In practice, however, multispectral channels or images captured by the same sensor often have different spatial resolutions, which poses a severe challenge for spectral super-resolution. This paper proposes PoNet, a universal spectral super-resolution network based on physical optimization unfolding, for arbitrary multispectral images, including single-resolution and cross-scale multispectral images. Furthermore, two new strategies are proposed to make full use of the spectral information: cross-dimensional channel attention and cross-depth feature fusion. Experimental results on five datasets show the superiority and stability of PoNet in addressing all of these spectral super-resolution settings.
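
As an illustrative sketch of physical optimization unfolding (a generic, textbook unfolding scheme under assumptions, not PoNet's exact formulation), each learned stage alternates a data-fidelity gradient step on the physical model y = Ax with a learned prior network:

# Generic deep-unfolding sketch for spectral super-resolution: alternate a
# gradient step on the physical model y = A x (spectral downsampling) with
# a learned prior. This is a standard unfolding scheme, not PoNet itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnfoldingStage(nn.Module):
    def __init__(self, hsi_bands, msi_bands):
        super().__init__()
        # A: spectral response mapping hyperspectral -> multispectral bands.
        self.A = nn.Conv2d(hsi_bands, msi_bands, 1, bias=False)
        self.step = nn.Parameter(torch.tensor(0.1))  # learned step size
        self.prior = nn.Sequential(                  # learned proximal operator
            nn.Conv2d(hsi_bands, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, hsi_bands, 3, padding=1))

    def forward(self, x, y):
        # Data-fidelity gradient step: x <- x - step * A^T (A x - y).
        residual = self.A(x) - y
        grad = F.conv2d(residual, self.A.weight.transpose(0, 1))
        x = x - self.step * grad
        return x + self.prior(x)  # residual prior refinement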

Journal ArticleDOI
Yunhan Kim, Kyumin Na, Byeng D. Youn (1 institution)
Abstract: This research proposes a newly designed convolutional neural network (CNN) for gearbox fault diagnostics. A conventional CNN is a deep-learning model that offers distinctive performance for analyzing two-dimensional image data. To exploit this ability, prior work has used time–frequency analysis to derive image-like data that is fed into the CNN model. However, the existing time–frequency analysis approaches employ fixed basis functions, which are limited in their ability to capture fault-related signals in the image. To address this challenge, we propose a health-adaptive time-scale representation (HTSR) embedded CNN (HTSR-CNN). The HTSR is designed to exploit the concept of a time-scale representation (TSR), which is informed by the physics of the time and frequency characteristics induced by fault-related signals. Instead of using fixed basis functions, the HTSR is constructed from multiscale convolutional filters that behave like adaptive basis functions. These multiscale filters learn to embed enriched fault-related information in the HTSR through end-to-end training of the HTSR-CNN model. The performance of the proposed HTSR-CNN is validated in two case studies: vibration signals from a two-stage spur gearbox and vibration signals from a planetary gearbox. The case study results show that the proposed HTSR-CNN method offers superior performance for gearbox fault diagnostics compared to existing CNN-based fault diagnostic methods.
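
An illustrative sketch of the core idea (generic multiscale 1-D convolutions acting as adaptive basis functions; kernel sizes and channel counts are assumptions, not the paper's configuration):

# Sketch of a learned time-scale representation: a bank of 1-D convolutions
# with different kernel sizes acts as adaptive basis functions, and their
# stacked responses form the image-like input to the diagnostic CNN.
import torch
import torch.nn as nn

class LearnedTSR(nn.Module):
    def __init__(self, kernel_sizes=(16, 64, 256)):
        super().__init__()
        self.banks = nn.ModuleList(
            [nn.Conv1d(1, 8, k, padding=k // 2) for k in kernel_sizes])

    def forward(self, signal):
        # signal: (batch, 1, time) raw vibration measurement.
        maps = [torch.abs(bank(signal)) for bank in self.banks]
        # Crop to a common length, then stack scales along the channel axis
        # to form a 2-D, image-like representation for the CNN.
        t = min(m.shape[-1] for m in maps)
        return torch.cat([m[..., :t] for m in maps], dim=1)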

Journal ArticleDOI
Abstract: The so-called "attention mechanisms" in deep neural networks (DNNs) denote an automatic adaptation of DNNs to capture representative features for a specific classification task and its data. Such attention mechanisms act both globally, by reinforcing feature channels, and locally, by stressing features within each feature map. Channel and feature importance are learned in the global end-to-end training of the DNN. In this paper, we present a study and propose a method with a different approach: adding supplementary visual data alongside the training images. We use human visual attention maps obtained independently from psycho-visual experiments, in both task-driven and free-viewing conditions, or from powerful models for predicting visual attention maps. We add these visual attention maps as new data alongside the images, thereby introducing human visual attention into DNN training, and compare this approach with both global and local automatic attention mechanisms. Experimental results show that known attention mechanisms in DNNs behave much like human visual attention, but the proposed approach still allows faster convergence and better performance in image classification tasks.
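
One simple way to inject such maps (a hypothetical sketch, not necessarily the authors' method) is to spatially reweight intermediate feature maps by the resized human attention map:

# Hypothetical sketch: inject a human visual attention map by spatially
# reweighting intermediate CNN features. This illustrates the general idea
# of adding attention maps alongside images, not the paper's exact method.
import torch
import torch.nn.functional as F

def apply_human_attention(features, attention_map):
    # features: (batch, channels, h, w); attention_map: (batch, 1, H, W)
    # obtained from psycho-visual experiments or a saliency model.
    att = F.interpolate(attention_map, size=features.shape[-2:],
                        mode="bilinear", align_corners=False)
    att = att / (att.amax(dim=(-2, -1), keepdim=True) + 1e-6)  # scale to [0, 1]
    return features * (1.0 + att)  # emphasize attended regions, keep the rest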

Journal ArticleDOI
Shuanlong Niu, Bin Li, Xinggang Wang, Songping He, +1 more (1 institution)
Abstract: Surface defect segmentation is very important for the quality inspection of industrial production and is an important pattern recognition problem. Although deep learning (DL) has achieved remarkable results in surface defect segmentation, most of these results rely on massive images with pixel-level annotations, which are difficult to obtain at industrial sites. This paper proposes a weakly supervised defect segmentation method based on dynamic templates generated by an improved cycle-consistent generative adversarial network (CycleGAN) trained with image-level annotations. To generate better templates for defects with weak signals, we propose a defect attention module that applies the defect residual to the discriminator, strengthening the elimination of defect regions while suppressing changes to the background. A defect cycle-consistent loss is designed by adding structural similarity (SSIM) to the original L1 loss so as to cover both grayscale and structural features; the proposed loss can better model the inner structure of defects. After obtaining the defect-free template, a defect segmentation map can easily be obtained through simple image comparison and threshold segmentation. Experiments show that the proposed method is both efficient and effective: it significantly outperforms other weakly supervised methods and achieves performance comparable or even superior to that of supervised methods on three industrial datasets (intersection over union (IoU) of 78.28%, 59.43%, and 68.83% on the DAGM 2007, KSD, and CCSD datasets, respectively). The proposed method can also be employed as a semiautomatic annotation tool when combined with active learning.
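
The defect cycle-consistent loss described above mixes L1 with SSIM; a rough sketch follows (the uniform 11x11 window and the 0.5 weighting are assumptions, the paper may window and weight differently):

# Sketch of a cycle-consistency loss mixing L1 with structural similarity
# (SSIM), as described in the abstract above.
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    # Mean SSIM over local windows, approximated with uniform averaging.
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def defect_cycle_loss(reconstructed, original, alpha=0.5):
    # Grayscale fidelity (L1) plus structural fidelity (1 - SSIM).
    return alpha * F.l1_loss(reconstructed, original) + \
           (1 - alpha) * (1 - ssim(reconstructed, original))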

Journal ArticleDOI
Wenmeng Yu, Hua Xu (1 institution)
Abstract: Previous research on Facial Expression Recognition (FER) assisted by facial landmarks has mainly focused on single-task learning or multi-task learning with hard parameter sharing; soft-parameter-sharing methods have not been explored in this area. This paper therefore adopts Facial Landmark Detection (FLD) as the auxiliary task and explores new multi-task learning strategies for FER. First, three classical multi-task structures, including Hard-Parameter Sharing (HPS), the Cross-Stitch Network (CSN), and the Partially Shared Multi-task Convolutional Neural Network (PS-MCNN), are used to verify the advantages of multi-task learning for FER. Then, we propose a new end-to-end Co-attentive Multi-task Convolutional Neural Network (CMCNN), composed of a Channel Co-Attention Module (CCAM) and a Spatial Co-Attention Module (SCAM). Functionally, the CCAM generates channel co-attention scores by capturing the inter-dependencies between the channels of the FER and FLD tasks, while the SCAM combines max- and average-pooling operations to formulate spatial co-attention scores. Finally, we conduct extensive experiments on four widely used benchmark facial expression databases: RAF, SFEW2, CK+, and Oulu-CASIA. The results show that our approach achieves better performance than both single-task and multi-task baselines, fully validating the effectiveness and generalizability of multi-task learning.
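
A rough sketch of channel co-attention between the two task branches (the dimensions and the gating rule are assumptions, not the paper's exact CCAM):

# Illustrative sketch of a channel co-attention module between FER and FLD
# feature branches: channel gates for each task are computed from the
# globally pooled features of both tasks together.
import torch
import torch.nn as nn

class ChannelCoAttentionSketch(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Joint descriptor from both tasks -> per-task channel gates.
        self.fer_gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.fld_gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, fer_feat, fld_feat):
        # fer_feat, fld_feat: (batch, channels, h, w)
        joint = torch.cat([fer_feat.mean(dim=(-2, -1)),
                           fld_feat.mean(dim=(-2, -1))], dim=1)
        fer_w = self.fer_gate(joint)[..., None, None]  # (batch, channels, 1, 1)
        fld_w = self.fld_gate(joint)[..., None, None]
        return fer_feat * fer_w, fld_feat * fld_w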

Network Information
Related Authors (5)
Xuehui Liu: 54 papers, 706 citations (90% related)
Hanqiu Sun: 171 papers, 1.9K citations (85% related)
Wen Wu: 41 papers, 582 citations (84% related)
Bin Sheng: 223 papers, 2.1K citations (77% related)
Sheng Li: 54 papers, 514 citations (76% related)
Performance
Metrics

Author's h-index: 24

No. of papers from the author in previous years:

Year    Papers
2021    11
2020    17
2019    11
2018    11
2017    9
2016    11