scispace - formally typeset
Search or ask a question
Author

Ruimao Zhang

Other affiliations: Sun Yat-sen University, SenseTime
Bio: Ruimao Zhang is an academic researcher from The Chinese University of Hong Kong. The author has contributed to research in topics: Normalization (statistics) & Feature (computer vision). The author has an hindex of 19, co-authored 76 publications receiving 1802 citations. Previous affiliations of Ruimao Zhang include Sun Yat-sen University & SenseTime.

Papers published on a yearly basis

Papers
More filters
Journal ArticleDOI
TL;DR: This paper proposes a novel active learning (AL) framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner and incorporates deep convolutional neural networks into AL.
Abstract: Recent successes in learning-based image classification, however, heavily rely on the large number of annotated training samples, which may require considerable human effort. In this paper, we propose a novel active learning (AL) framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner. Our approach advances the existing AL methods in two aspects. First, we incorporate deep convolutional neural networks into AL. Through the properly designed framework, the feature representation and the classifier can be simultaneously updated with progressively annotated informative samples. Second, we present a cost-effective sample selection strategy to improve the classification performance with less manual annotations. Unlike traditional methods focusing on only the uncertain samples of low prediction confidence, we especially discover the large amount of high-confidence samples from the unlabeled set for feature learning. Specifically, these high-confidence samples are automatically selected and iteratively assigned pseudolabels. We thus call our framework cost-effective AL (CEAL) standing for the two advantages. Extensive experiments demonstrate that the proposed CEAL framework can achieve promising results on two challenging image classification data sets, i.e., face recognition on the cross-age celebrity face recognition data set database and object categorization on Caltech-256.

581 citations

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images, where they pose hashing learning as a problem of regularized similarity learning.
Abstract: Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of output hashing codes are preset in the most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. In particular, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between the matched pairs and the mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce the adjacency consistency, i.e., images of similar appearances should have similar codes. The deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted, so that we can manipulate the code lengths by truncating the insignificant bits. Our framework outperforms state-of-the-arts on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes well preserve the discriminative powers with shorter code lengths.

457 citations

Proceedings ArticleDOI
15 Jun 2019
TL;DR: DeepFashion2 as discussed by the authors proposes a Mask R-CNN to solve clothes detection, pose estimation, segmentation, and retrieval tasks in an end-to-end fashion dataset.
Abstract: Understanding fashion images has been advanced by benchmarks with rich annotations such as DeepFashion, whose labels include clothing categories, landmarks, and consumer-commercial image pairs. However, DeepFashion has nonnegligible issues such as single clothing-item per image, sparse landmarks (4∼8 only), and no per-pixel masks, making it had significant gap from real-world scenarios. We fill in the gap by presenting DeepFashion2 to address these issues. It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items where each item has rich annotations such as style, scale, view- point, occlusion, bounding box, dense landmarks (e.g. 39 for ‘long sleeve outwear’ and 15 for ‘vest’), and masks. There are also 873K Commercial-Consumer clothes pairs. The annotations of DeepFashion2 are much larger than its counterparts such as 8× of FashionAI Global Challenge. A strong baseline is proposed, called Match R- CNN, which builds upon Mask R-CNN to solve the above four tasks in an end-to-end manner. Extensive evaluations are conducted with different criterions in Deep- Fashion2. DeepFashion2 Dataset will be released at : https://github.com/switchablenorms/DeepFashion2

261 citations

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes a novel visual try-on network, namely Adaptive Content Generating and Preserving Network (ACGPN), which can generate photo-realistic images with much better perceptual quality and richer fine-details.
Abstract: Image visual try-on aims at transferring a target clothes image onto a reference person, and has become a hot topic in recent years. Prior arts usually focus on preserving the character of a clothes image (e.g. texture, logo, embroidery) when warping it to arbitrary human pose. However, it remains a big challenge to generate photo-realistic try-on images when large occlusions and human poses are presented in the reference person. To address this issue, we propose a novel visual try-on network, namely Adaptive Content Generating and Preserving Network (ACGPN). In particular, ACGPN first predicts semantic layout of the reference image that will be changed after try-on (e.g.long sleeve shirt→arm, arm→jacket), and then determines whether its image content needs to be generated or preserved according to the predicted semantic layout, leading to photo-realistic try-on and rich clothes details. ACGPN generally involves three major modules. First, a semantic layout generation module utilizes semantic segmentation of the reference image to progressively predict the desired semantic layout after try-on. Second, a clothes warping module warps clothes image according to the generated semantic layout, where a second-order difference constraint is introduced to stabilize the warping process during training.Third, an inpainting module for content fusion integrates all information (e.g. reference image, semantic layout, warped clothes) to adaptively produce each semantic part of human body. In comparison to the state-of-the-art methods, ACGPN can generate photo-realistic images with much better perceptual quality and richer fine-details.

168 citations

Proceedings Article
28 Jun 2018
TL;DR: Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network, is proposed, which will help ease the usage and understand the normalization techniques in deep learning.
Abstract: We address a learning-to-normalize problem by proposing Switchable Normalization (SN), which learns to select different normalizers for different normalization layers of a deep neural network. SN employs three distinct scopes to compute statistics (means and variances) including a channel, a layer, and a minibatch. SN switches between them by learning their importance weights in an end-to-end manner. It has several good properties. First, it adapts to various network architectures and tasks (see Fig.1). Second, it is robust to a wide range of batch sizes, maintaining high performance even when small minibatch is presented (e.g. 2 images/GPU). Third, SN does not have sensitive hyper-parameter, unlike group normalization that searches the number of groups as a hyper-parameter. Without bells and whistles, SN outperforms its counterparts on various challenging benchmarks, such as ImageNet, COCO, CityScapes, ADE20K, and Kinetics. Analyses of SN are also presented. We hope SN will help ease the usage and understand the normalization techniques in deep learning. The code of SN has been made available in this https URL.

143 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.
Abstract: Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.

1,897 citations

Journal ArticleDOI
TL;DR: The LSTM cell and its variants are reviewed and their variants are explored to explore the learning capacity of the LSTm cell and the L STM networks are divided into two broad categories:LSTM-dominated networks and integrated LSTS networks.
Abstract: Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video. However, RNNs consisting of sigma cells or tanh cells are...

1,561 citations

Posted Content
TL;DR: The history of person re-identification and its relationship with image classification and instance retrieval is introduced and two new re-ID tasks which are much closer to real-world applications are described and discussed.
Abstract: Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet under-developed issues.

984 citations

Proceedings ArticleDOI
26 Apr 2016
TL;DR: This work presents a pipeline for learning deep feature representations from multiple domains with Convolutional Neural Networks with CNNs and proposes a Domain Guided Dropout algorithm to improve the feature learning procedure.
Abstract: Learning generic and robust feature representations with data from multiple domains for the same problem is of great value, especially for the problems that have multiple datasets but none of them are large enough to provide abundant data variations. In this work, we present a pipeline for learning deep feature representations from multiple domains with Convolutional Neural Networks (CNNs). When training a CNN with data from all the domains, some neurons learn representations shared across several domains, while some others are effective only for a specific one. Based on this important observation, we propose a Domain Guided Dropout algorithm to improve the feature learning procedure. Experiments show the effectiveness of our pipeline and the proposed algorithm. Our methods on the person re-identification problem outperform stateof-the-art methods on multiple datasets by large margins.

852 citations