Other affiliations: University of Science and Technology of China
Bio: Xu Shen is an academic researcher from Alibaba Group. He has contributed to research on convolutional neural networks and image retrieval, has an h-index of 11, and has co-authored 27 publications receiving 474 citations. Previous affiliations of Xu Shen include University of Science and Technology of China.
01 Jun 2019
TL;DR: This paper provides a simple and uniform way to quantize both weights and activations by formulating quantization as a differentiable non-linear function, which sheds new light on the interpretation of neural network quantization.
Abstract: Although deep neural networks are highly effective, their high computational and memory costs severely hinder their applications to portable devices. As a consequence, low-bit quantization, which converts a full-precision neural network into a low-bitwidth integer version, has been an active and promising research topic. Existing methods formulate the low-bit quantization of networks as an approximation or optimization problem. Approximation-based methods confront the gradient mismatch problem, while optimization-based methods are only suitable for quantizing weights and can introduce high computational cost during the training stage. In this paper, we provide a simple and uniform way for weights and activations quantization by formulating it as a differentiable non-linear function. The quantization function is represented as a linear combination of several Sigmoid functions with learnable biases and scales that could be learned in a lossless and end-to-end manner via continuous relaxation of the steepness of Sigmoid functions. Extensive experiments on image classification and object detection tasks show that our quantization networks outperform state-of-the-art methods. We believe that the proposed method will shed new light on the interpretation of neural network quantization.
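The learnable staircase described above can be sketched in a few lines. The biases, scales, and temperature values below are illustrative stand-ins for the learned parameters, not the paper's actual settings:

```python
import math

def soft_quantize(x, biases, scales, temperature):
    """Differentiable quantization as a sum of shifted, scaled sigmoids.
    As `temperature` grows, each sigmoid approaches a hard step, so the
    whole function approaches a true staircase quantizer."""
    return sum(s / (1.0 + math.exp(-temperature * (x - b)))
               for b, s in zip(biases, scales))

# Unit steps at 0.5, 1.5, 2.5: a toy 2-bit-style quantizer.
biases, scales = [0.5, 1.5, 2.5], [1.0, 1.0, 1.0]
soft = soft_quantize(2.0, biases, scales, temperature=5.0)    # smooth, trainable
hard = soft_quantize(2.0, biases, scales, temperature=500.0)  # near-exact round
```

Annealing the temperature during training lets gradients flow through the smooth version while the final function behaves like the hard quantizer, which is how the gradient mismatch of approximation-based methods is avoided.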
15 Jun 2019
TL;DR: An attribute-driven method for feature disentangling and frame re-weighting for video-based person re-identification that outperforms existing state-of-the-art approaches.
Abstract: Video-based person re-identification plays an important role in surveillance video analysis, expanding image-based methods by learning features of multiple frames. Most existing methods fuse features by temporal average-pooling, without exploring the different frame weights caused by various viewpoints, poses, and occlusions. In this paper, we propose an attribute-driven method for feature disentangling and frame re-weighting. The features of single frames are disentangled into groups of sub-features, each corresponding to specific semantic attributes. The sub-features are re-weighted by the confidence of attribute recognition and then aggregated along the temporal dimension as the final representation. By means of this strategy, the most informative regions of each frame are enhanced and contribute to a more discriminative sequence representation. Extensive ablation studies demonstrate the effectiveness of feature disentangling as well as temporal re-weighting. The experimental results on the iLIDS-VID, PRID-2011 and MARS datasets demonstrate that our proposed method outperforms existing state-of-the-art approaches.
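A minimal sketch of the confidence-weighted temporal aggregation described above, assuming the per-attribute sub-features and attribute-recognition confidences have already been computed for each frame:

```python
def aggregate_sequence(frame_subfeatures, frame_confidences):
    """frame_subfeatures[t][g] is the sub-feature vector of attribute
    group g in frame t; frame_confidences[t][g] is the matching
    attribute-recognition confidence. Returns one sequence-level vector
    per attribute group: a confidence-weighted temporal average, so
    frames where an attribute is clearly visible dominate its group."""
    n_frames = len(frame_subfeatures)
    n_groups = len(frame_subfeatures[0])
    aggregated = []
    for g in range(n_groups):
        total_w = sum(frame_confidences[t][g] for t in range(n_frames))
        dim = len(frame_subfeatures[0][g])
        vec = [sum(frame_confidences[t][g] * frame_subfeatures[t][g][d]
                   for t in range(n_frames)) / total_w
               for d in range(dim)]
        aggregated.append(vec)
    return aggregated

# Two frames, one attribute group: the second frame has 3x confidence.
seq_feat = aggregate_sequence([[[1.0, 0.0]], [[3.0, 0.0]]], [[1.0], [3.0]])
```

This replaces plain temporal average-pooling: an occluded frame with a low confidence contributes little to the final sequence representation.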
05 Jan 2015
TL;DR: A deep convolutional neural network that “understands” images well is adopted for photo aesthetic quality assessment; such networks have proved effective in many computer vision problems and require no human effort in the design of features.
Abstract: Photo quality assessment from the view of human aesthetics, which tries to classify images into the categories of good and bad, has drawn a lot of attention in the computer vision field. Up to now, experts have proposed many methods to deal with this problem. Most of those methods are based on the design of hand-crafted features. However, due to the complexity and subjectivity of human aesthetic activities, it is difficult to describe and model all the factors that affect the photo aesthetic quality. Therefore, those methods achieve only limited success. On the other hand, deep convolutional neural networks have proved effective in many computer vision problems and do not need human effort in the design of features. In this paper, we try to adopt a deep convolutional neural network that “understands” images well to conduct the photo aesthetic quality assessment. Firstly, we implement a deep convolutional neural network which has eight layers and millions of parameters. Then, to “teach” this network enough knowledge about images, we train it on ImageNet, which is one of the largest available image databases. Next, for each given image, we take the activations of the last layer of the neural network as its aesthetic feature. The experimental results on two large and reliable image aesthetic quality assessment datasets prove the effectiveness of our method.
15 Oct 2018
TL;DR: For the task of person re-identification, the local convolutional neural networks (Local CNN) can outperform state-of-the-art methods consistently on three large-scale benchmarks, including Market-1501, CUHK03, and DukeMTMC-ReID.
Abstract: Recent works have shown that person re-identification can be substantially improved by introducing attention mechanisms, which allow learning both global and local representations. However, all these works learn global and local features in separate branches. As a consequence, the interaction/boosting of global and local information are not allowed, except in the final feature embedding layer. In this paper, we propose local operations as a generic family of building blocks for synthesizing global and local information in any layer. This building block can be inserted into any convolutional networks with only a small amount of prior knowledge about the approximate locations of local parts. For the task of person re-identification, even with only one local block inserted, our local convolutional neural networks (Local CNN) can outperform state-of-the-art methods consistently on three large-scale benchmarks, including Market-1501, CUHK03, and DukeMTMC-ReID.
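The abstract describes local blocks whose defining property is that global and local information interact inside the block rather than only at the final embedding. A loose illustrative sketch of that idea, with a simple elementwise fusion standing in for the learned operation in the paper:

```python
def local_block(global_feat, part_feats):
    """Illustrative local-block fusion: per-part local features are
    averaged and added into the global feature vector, so global and
    local information mix inside the block. The elementwise sum here is
    a hypothetical stand-in for the paper's learned local operations."""
    dim = len(global_feat)
    part_avg = [sum(p[d] for p in part_feats) / len(part_feats)
                for d in range(dim)]
    return [g + a for g, a in zip(global_feat, part_avg)]

# A global vector fused with two local part vectors (e.g. head, torso crops).
fused = local_block([1.0, 2.0], [[1.0, 1.0], [3.0, 1.0]])
```

Because such a block maps a feature vector to a vector of the same size, it can in principle be inserted after any layer of an existing network, which matches the "insert into any convolutional network" claim above.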
TL;DR: The proposed continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout and has the property of avoiding the co-adaptation of feature detectors, which suggests that it can extract more independent feature detectors for model averaging in the test stage.
Abstract: Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. Current explanations of dropout include bagging, naive Bayes, regularization, and sex in evolution. According to the activation patterns of neurons in the human brain, when faced with different situations, the firing rates of neurons are random and continuous, not binary as current dropout does. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we demonstrate that continuous dropout has the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging in the test stage. We introduce the proposed continuous dropout to a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance. The code is available at: https://github.com/jasonustc/caffe-multigpu/tree/dropout.
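The core change from binary to continuous dropout can be sketched directly: the 0/1 Bernoulli mask is replaced by a continuous random rate per unit. The uniform distribution below is one of the continuous choices consistent with the description above; the exact distribution used in the paper may differ:

```python
import random

def continuous_dropout(activations, mask_mean=0.5, training=True, rng=random):
    """Continuous dropout sketch: instead of multiplying each unit by a
    Bernoulli 0/1 mask, each unit is scaled by a continuous random rate
    drawn from Uniform(0, 1), mimicking graded neural firing rates.
    At test time, activations are scaled by the mask mean, mirroring the
    model-averaging rescaling of standard dropout."""
    if not training:
        return [a * mask_mean for a in activations]
    return [a * rng.uniform(0.0, 1.0) for a in activations]

rng = random.Random(0)
train_out = continuous_dropout([1.0] * 5, rng=rng)       # random graded scaling
test_out = continuous_dropout([2.0, 4.0], training=False)  # deterministic mean scaling
```

Unlike binary dropout, no unit is ever fully silenced or fully kept during training; every unit always receives a gradient, scaled by its sampled rate.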
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
01 Sep 2005
TL;DR: A powerful AGW baseline is designed, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks, and a new evaluation metric (mINP) is introduced, indicating the cost of finding all the correct matches, which provides an additional criterion to evaluate the Re-ID system for real applications.
Abstract: Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under the closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for four different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost of finding all the correct matches, which provides an additional criterion to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.
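The mINP metric mentioned above can be sketched from its stated intent (the cost of finding all correct matches): for each query, the penalty depends on the rank of the hardest, i.e. last-retrieved, correct match. A small illustration, assuming each ranking is given as a 0/1 hit list in ranked order:

```python
def mean_inp(rankings):
    """mINP sketch: for each query, take the rank R_hard of the last
    correct match and the number of true matches |G| in the ranked
    gallery; the inverse negative penalty is INP = |G| / R_hard, and
    mINP averages INP over all queries. INP is 1 only when every true
    match appears before any false one."""
    inps = []
    for ranked in rankings:
        positions = [i + 1 for i, hit in enumerate(ranked) if hit]
        r_hard = positions[-1]              # rank of hardest correct match
        inps.append(len(positions) / r_hard)
    return sum(inps) / len(inps)

# Query 1: both matches at ranks 1-2 (INP = 1); query 2: a false match
# splits the true ones, so INP = 2/3.
score = mean_inp([[1, 1, 0], [1, 0, 1]])
```

Unlike CMC or mAP, this penalizes a single hard mismatch buried deep in the list, which matches the "cost of finding all the correct matches" framing.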
18 Jun 2018
TL;DR: The proposed SA-Siam, which adds a channel attention mechanism to its semantic branch, outperforms all other real-time trackers by a large margin on the OTB-2013/50/100 benchmarks.
Abstract: Observing that Semantic features learned in an image classification task and Appearance features learned in a similarity matching task complement each other, we build a twofold Siamese network, named SA-Siam, for real-time object tracking. SA-Siam is composed of a semantic branch and an appearance branch. Each branch is a similarity-learning Siamese network. An important design choice in SA-Siam is to separately train the two branches to keep the heterogeneity of the two types of features. In addition, we propose a channel attention mechanism for the semantic branch. Channel-wise weights are computed according to the channel activations around the target position. While the inherited architecture from SiamFC allows our tracker to operate beyond real-time, the twofold design and the attention mechanism significantly improve the tracking performance. The proposed SA-Siam outperforms all other real-time trackers by a large margin on OTB-2013/50/100 benchmarks.
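The channel attention idea above (weights computed from channel activations around the target) can be illustrated with a simple normalization over pooled per-channel responses. The softmax here is a hypothetical stand-in for the small learned network the paper uses:

```python
import math

def channel_attention(channel_responses):
    """Sketch of target-aware channel attention: each channel's weight
    comes from its pooled activation around the target position, passed
    through a numerically stable softmax so channels that fire strongly
    on the target are emphasized. The pooling-plus-softmax choice is an
    illustrative assumption, not the paper's exact attention module."""
    m = max(channel_responses)
    exps = [math.exp(r - m) for r in channel_responses]
    z = sum(exps)
    return [e / z for e in exps]

# Three channels: the second responds most strongly near the target.
weights = channel_attention([0.2, 2.0, 0.1])
```

The resulting weights rescale the semantic feature channels before matching, so object-relevant channels dominate the similarity score.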
TL;DR: A coattention mechanism using a deep neural network (DNN) architecture is proposed to jointly learn the attentions for both the image and the question, which reduces irrelevant features effectively and obtains more discriminative features for image and question representations.
Abstract: Visual question answering (VQA) is challenging, because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multimodal feature fusion that is able to capture the complex interactions between multimodal features; and 3) automatic answer prediction that is able to consider the complex correlations between multiple diverse answers for the same question. For fine-grained image and question representations, a “coattention” mechanism is developed using a deep neural network (DNN) architecture to jointly learn the attentions for both the image and the question, which can allow us to reduce the irrelevant features effectively and obtain more discriminative features for image and question representations. For multimodal feature fusion, a generalized multimodal factorized high-order pooling approach (MFH) is developed to achieve more effective fusion of multimodal features by exploiting their correlations sufficiently, which can further result in superior VQA performance as compared with the state-of-the-art approaches. For answer prediction, the Kullback–Leibler divergence is used as the loss function to achieve precise characterization of the complex correlations between multiple diverse answers with the same or similar meaning, which can allow us to achieve a faster convergence rate and obtain slightly better accuracy on answer prediction. A DNN architecture is designed to integrate all these aforementioned modules into a unified model for achieving superior VQA performance. With an ensemble of our MFH models, we achieved state-of-the-art performance on the large-scale VQA datasets and won the runner-up position in the VQA Challenge 2017.
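The KL-divergence answer loss mentioned above can be written out directly. The soft target distribution is assumed to be built from multiple annotators' answers (e.g. answer frequencies), which is what lets similar answers share probability mass instead of competing as one-hot labels:

```python
import math

def kl_answer_loss(target, predicted, eps=1e-12):
    """KL(target || predicted) over candidate answers: `target` is a
    soft distribution built from multiple annotators' answers and
    `predicted` is the model's softmax output. The loss is zero only
    when the model reproduces the full answer distribution, so
    near-synonymous answers are not penalized as hard errors."""
    return sum(t * math.log(t / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

# Annotators split 50/50 between two valid answers; the model leans
# toward the second, so the loss is small but positive.
loss = kl_answer_loss([0.5, 0.5], [0.25, 0.75])
exact = kl_answer_loss([1.0, 0.0], [1.0, 0.0])  # matching distributions
```

Compared with cross-entropy against a single hard label, this target never forces probability away from an answer that some annotators gave.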