Author
Lingxiao Yang
Other affiliations: National University of Defense Technology, Chinese Academy of Sciences
Bio: Lingxiao Yang is an academic researcher from Hong Kong Polytechnic University. The author has contributed to research in topics: Convolutional neural network & Cognitive neuroscience of visual object recognition. The author has an h-index of 4 and has co-authored 9 publications receiving 81 citations. Previous affiliations of Lingxiao Yang include National University of Defense Technology & Chinese Academy of Sciences.
Papers
01 Oct 2019
TL;DR: A dynamic feature selection operation to select new pixels in a feature map for each refined anchor received from the ARM, and a bidirectional feature fusion module by combining features from early and deep layers to enhance the representation ability of selected feature pixels.
Abstract: The design of anchors is critical to the performance of one-stage detectors. Recently, the anchor refinement module (ARM) has been proposed to adjust the initialization of default anchors, providing the detector a better anchor reference. However, this module brings another problem: all pixels in a feature map have the same receptive field while the anchors associated with each pixel have different positions and sizes. This discordance may lead to a less effective detector. In this paper, we present a dynamic feature selection operation to select new pixels in a feature map for each refined anchor received from the ARM. The pixels are selected based on the new anchor position and size so that the receptive field of these pixels can fit the anchor areas well, which makes the detector, especially the regression part, much easier to optimize. Furthermore, to enhance the representation ability of selected feature pixels, we design a bidirectional feature fusion module by combining features from early and deep layers. Extensive experiments on both PASCAL VOC and COCO demonstrate the effectiveness of our dynamic anchor feature selection (DAFS) operation. For the case of high IoU threshold, our DAFS can improve the mAP by a large margin.
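The abstract above reports gains "for the case of high IoU threshold". As a reminder of what that threshold controls, here is a minimal sketch of box IoU and a threshold check. It is not taken from the DAFS paper; the `(x1, y1, x2, y2)` box format and the helper name `is_true_positive` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width and height clamp to zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Hypothetical helper: a prediction matches if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


gt = (0, 0, 10, 10)
pred = (2, 2, 10, 10)  # slightly mislocalized prediction
print(iou(pred, gt))
print(is_true_positive(pred, gt, 0.5), is_true_positive(pred, gt, 0.75))
```

The same prediction counts as a match at a threshold of 0.5 but not at 0.75, which is why tighter box regression matters most at high IoU thresholds.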
61 citations
19 Oct 2017
TL;DR: A novel algorithm, namely Deep Location-Specific Tracking, is proposed, which decomposes the tracking problem into a localization task and a classification task, and trains an individual network for each task.
Abstract: Convolutional Neural Network (CNN) based methods have shown significant performance gains in the problem of visual tracking in recent years. Due to many uncertain changes of objects online, such as abrupt motion, background clutter and large deformation, visual tracking is still a challenging task. We propose a novel algorithm, namely Deep Location-Specific Tracking, which decomposes the tracking problem into a localization task and a classification task, and trains an individual network for each task. The localization network exploits the information in the current frame and provides a specific location to improve the probability of successful tracking, while the classification network finds the target among many examples generated around the target location in the previous frame, as well as the one estimated from the localization network in the current frame. CNN based trackers often have a massive number of trainable parameters, and are prone to over-fitting to some particular object states, leading to less precision or tracking drift. We address this problem by learning a classification network based on 1 × 1 convolution and global average pooling. Extensive experimental results on popular benchmark datasets show that the proposed tracker achieves competitive results without using additional tracking videos for fine-tuning. The code is available at https://github.com/ZjjConan/DLST
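To illustrate why a head built from 1 × 1 convolution and global average pooling has few trainable parameters, here is a minimal pure-Python sketch. It is not the DLST implementation (see the linked repository for that); the single-layer arrangement and the function names `conv1x1`, `global_average_pool`, `classify` and `param_count` are assumptions for illustration.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution.

    feature_map: [C_in][H][W], weights: [C_out][C_in], bias: [C_out].
    Each output pixel is a linear map of the input channels at that pixel.
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[bias[k] + sum(row[c] * feature_map[c][y][x] for c in range(c_in))
          for x in range(w)]
         for y in range(h)]
        for k, row in enumerate(weights)
    ]


def global_average_pool(feature_map):
    """Average each channel over all spatial positions: [C][H][W] -> [C]."""
    return [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def classify(feature_map, weights, bias):
    """One score per output channel (assumed single-layer head)."""
    return global_average_pool(conv1x1(feature_map, weights, bias))


def param_count(c_in, c_out):
    """Parameters of the head; note it does not depend on H or W."""
    return c_out * (c_in + 1)


features = [[[1, 2], [3, 4]], [[0, 0], [0, 0]]]  # 2 channels, 2x2 map
print(classify(features, [[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0]))
print(param_count(2, 2))
```

Because the parameter count is independent of the spatial size, such a head is much smaller than a fully connected classifier applied to the flattened feature map.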
29 citations
01 Sep 2017
TL;DR: A method to discover discriminative elements based on deep Convolutional Neural Networks (CNNs), namely Part-based CNN (P-CNN), which acts as the role of encoding module in part-based representation, is presented.
Abstract: Mid-level element based representations have been proven to be very effective for visual recognition. We present a method to discover discriminative elements based on deep Convolutional Neural Networks (CNNs), namely Part-based CNN (P-CNN), which acts as the role of encoding module in part-based representation. The P-CNN can be attached at arbitrary layer of a pre-trained CNN and be trained using image-level labels. The training of P-CNN essentially corresponds to the optimization and selection of discriminative mid-level visual elements. For an input image, the output of P-CNN is naturally the part-based coding and can be directly used for image recognition. By applying P-CNN to multiple layers of a pretrained CNN, more diverse visual elements can be obtained for visual recognitions. Experiments are conducted on two recognition tasks and their results demonstrate the effectiveness of the proposed method.
10 citations
TL;DR: A multi-label learning framework for identifying multiple materials of a real-world object surface without a segmentation for each of them and finds that there are potential correlations between materials and that correlations are relevant to object category.
Abstract: We present a multi-label material recognition framework. Object-specific DAGs are better to encode the correlations of material labels. Object recognition can provide a semantic cue to enhance the material recognition. A real-world object surface often consists of multiple materials. Recognizing surface materials is important because it significantly benefits understanding the quality and functionality of the object. However, identifying multiple materials on a surface from a single photograph is very challenging because different materials are often interweaved together and hard to be segmented for separate identification. To address this problem, we present a multi-label learning framework for identifying multiple materials of a real-world object surface without a segmentation for each of them. We find that there are potential correlations between materials and that correlations are relevant to object category. For example, a surface of a monitor likely consists of plastic and glass rather than wood or stone. It motivates us to learn the correlations of material labels locally on each semantic object cluster. To this end, samples are semantically grouped according to their object categories. For each group of samples, we employ a Directed Acyclic Graph (DAG) to encode the conditional dependencies of material labels. These object-specific DAGs are then used for assisting the inference of surface materials. The key enabler of the proposed method is that the object recognition provides a semantic cue for material recognition by formulating an object-specific DAG learning. We test our method on the ALOT database and show consistent improvements over the state-of-the-arts.
7 citations
TL;DR: A part-level CNN architecture, namely Part-based CNN (P-CNN), which acts as a role of encoding module in a part-based representation model, which can be attached at arbitrary layer of a pre-trained CNN and be trained using image-level labels.
Abstract: Mid-level element based representations have been proven to be very effective for visual recognition. This paper presents a method to discover discriminative mid-level visual elements based on deep Convolutional Neural Networks (CNNs). We present a part-level CNN architecture, namely Part-based CNN (P-CNN), which acts as a role of encoding module in a part-based representation model. The P-CNN can be attached at arbitrary layer of a pre-trained CNN and be trained using image-level labels. The training of P-CNN essentially corresponds to the optimization and selection of discriminative mid-level visual elements. For an input image, the output of P-CNN is naturally the part-based coding and can be directly used for image recognition. By applying P-CNN to multiple layers of a pre-trained CNN, more diverse visual elements can be obtained for visual recognitions. We validate the proposed P-CNN on several visual recognition tasks, including scene categorization, action classification and multi-label object recognition. Extensive experiments demonstrate the competitive performance of P-CNN in comparison with state-of-the-arts.
6 citations
Cited by
Posted Content
TL;DR: This work uses new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100.
Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at this https URL
5,709 citations
14 Jun 2020
TL;DR: Cross Stage Partial Network (CSPNet) as discussed by the authors integrates feature maps from the beginning and the end of a network stage to mitigate the problem of duplicate gradient information within network optimization.
Abstract: Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet.
1,991 citations
TL;DR: A comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing; social network analysis; and natural language processing is presented, followed by the in-depth analysis on pivoting and groundbreaking advances in deep learning applications.
Abstract: The field of machine learning is witnessing its golden era as deep learning slowly becomes the leader in this domain. Deep learning uses multiple layers to represent the abstractions of data to build computational models. Some key enabler deep learning algorithms such as generative adversarial networks, convolutional neural networks, and model transfers have completely changed our perception of information processing. However, there exists an aperture of understanding behind this tremendously fast-paced domain, because it was never previously represented from a multiscope perspective. The lack of core understanding renders these powerful methods as black-box machines that inhibit development at a fundamental level. Moreover, deep learning has repeatedly been perceived as a silver bullet to all stumbling blocks in machine learning, which is far from the truth. This article presents a comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing; social network analysis; and natural language processing, followed by the in-depth analysis on pivoting and groundbreaking advances in deep learning applications. It was also undertaken to review the issues faced in deep learning such as unsupervised learning, black-box models, and online learning and to illustrate how these challenges can be transformed into prolific future research avenues.
824 citations
14 Jun 2020
TL;DR: Zhang et al. as discussed by the authors proposed Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object, which significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them.
Abstract: Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, no matter regressing from a box or a point. This shows that how to select positive and negative training samples is important for current object detectors. Then, we propose an Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our aforementioned analysis and conclusions. With the newly introduced ATSS, we improve state-of-the-art detectors by a large margin to 50.7% AP without introducing any overhead. The code is available at https://github.com/sfzhang15/ATSS.
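The abstract does not spell out the selection rule, so the following is only a simplified sketch of ATSS as it is commonly described: for each ground-truth box, take the k anchors whose centers are closest to the box center, set the IoU threshold to the mean plus the standard deviation of the candidates' IoUs, and keep candidates above the threshold whose centers lie inside the box. The single pyramid level, the use of the population standard deviation, and all function names are assumptions made here; the authors' repository linked above is the reference implementation.

```python
from statistics import mean, pstdev


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def atss_positives(gt, anchors, k=9):
    """Return (indices of positive anchors, adaptive IoU threshold) for one gt box."""
    gcx, gcy = center(gt)

    def sq_dist(i):
        cx, cy = center(anchors[i])
        return (cx - gcx) ** 2 + (cy - gcy) ** 2

    # Step 1: candidates are the k anchors with the nearest centers.
    candidates = sorted(range(len(anchors)), key=sq_dist)[:k]
    # Step 2: the threshold adapts to the IoU statistics of the candidates.
    ious = [iou(anchors[i], gt) for i in candidates]
    threshold = mean(ious) + pstdev(ious)
    # Step 3: keep candidates above the threshold with centers inside the gt box.
    positives = []
    for i, value in zip(candidates, ious):
        cx, cy = center(anchors[i])
        inside = gt[0] < cx < gt[2] and gt[1] < cy < gt[3]
        if value >= threshold and inside:
            positives.append(i)
    return sorted(positives), threshold


gt_box = (0, 0, 10, 10)
anchor_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (5, 5, 15, 15), (20, 20, 30, 30)]
print(atss_positives(gt_box, anchor_boxes, k=4))
```

All anchors not returned as positives would be treated as negatives for this ground-truth box, so no fixed IoU threshold has to be tuned by hand.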
643 citations
Posted Content
TL;DR: An Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them.
Abstract: Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, no matter regressing from a box or a point. This shows that how to select positive and negative training samples is important for current object detectors. Then, we propose an Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our aforementioned analysis and conclusions. With the newly introduced ATSS, we improve state-of-the-art detectors by a large margin to 50.7% AP without introducing any overhead. The code is available at this https URL
564 citations