Author

Yang Hua

Bio: Yang Hua is an academic researcher from Queen's University Belfast. The author has contributed to research topics including computer science and deep learning, has an h-index of 16, and has co-authored 69 publications receiving 1,240 citations. Previous affiliations of Yang Hua include the French Institute for Research in Computer Science and Automation and Panasonic.


Papers
Proceedings ArticleDOI
25 Feb 2019
TL;DR: This work identifies two limitations of existing ranking-motivated structured losses, proposes a novel ranked list loss to address both, and learns a hypersphere for each class in order to preserve the similarity structure inside it.
Abstract: The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity information among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To improve this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we present two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The samples are split into a positive set and a negative set. Our objective is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intra-class data distribution might be lost. In contrast, we propose to learn a hypersphere for each class in order to preserve the similarity structure inside it. Our extensive experiments show that the proposed method achieves state-of-the-art performance on three widely used benchmarks.

230 citations
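To make the set-based objective above concrete, here is a minimal, hedged PyTorch sketch of a ranked-list-style loss, not the authors' implementation: positives are pulled inside an inner radius (alpha minus a margin) and negatives are pushed beyond the outer radius alpha, which is one way of preserving an intra-class hypersphere rather than collapsing each class to a point. The hyper-parameter values and the uniform averaging over violating pairs are illustrative assumptions.

```python
import torch

def ranked_list_style_loss(embeddings, labels, alpha=1.2, margin=0.4):
    """Sketch of a set-based margin loss: for every query, positives farther than
    (alpha - margin) are pulled in and negatives closer than alpha are pushed out.
    Assumes each class appears at least twice in the batch."""
    dist = torch.cdist(embeddings, embeddings)                 # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    pos_mask, neg_mask = same & ~eye, ~same
    pos_loss = torch.relu(dist - (alpha - margin))[pos_mask].mean()   # keep positives inside the hypersphere
    neg_loss = torch.relu(alpha - dist)[neg_mask].mean()              # keep negatives outside by a margin
    return pos_loss + neg_loss

# Toy usage with L2-normalised embeddings, as is common in deep metric learning.
emb = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
lbl = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(ranked_list_style_loss(emb, lbl))
```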

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This paper proposes the Contextualized Support Vector Machine (Context-SVM) and an iterative training procedure in which object classification and detection mutually boost each other, achieving state-of-the-art performance on the classification and detection tasks of the PASCAL Visual Object Classes Challenge (VOC) 2007 and 2010.
Abstract: In this paper, we investigate how to iteratively and mutually boost object classification and detection by taking the outputs from one task as the context of the other one. First, instead of intuitive feature and context concatenation or post-processing with context, the so-called Contextualized Support Vector Machine (Context-SVM) is proposed, where the context takes the responsibility of dynamically adjusting the classification hyperplane, and thus a context-adaptive classifier is achieved. Then, an iterative training procedure is presented. In each step, Context-SVM, associated with the output context from one task (object classification or detection), is instantiated to boost the performance for the other task, whose augmented outputs are then further used to improve the former task by Context-SVM. The proposed solution is evaluated on the object classification and detection tasks of the PASCAL Visual Object Classes Challenge (VOC) 2007 and 2010, and achieves state-of-the-art performance.

193 citations

Journal ArticleDOI
TL;DR: This paper proposes an adaptive context modeling method (Context-SVM) and an iterative boosting procedure that achieve state-of-the-art performance on the object classification and detection tasks of the PASCAL Visual Object Classes Challenge (VOC) 2007 and 2010 and the SUN09 data set.
Abstract: We investigate how to iteratively and mutually boost object classification and detection performance by taking the outputs from one task as the context of the other one. While context models have been quite popular, previous works mainly concentrate on co-occurrence relationship within classes and few of them focus on contextualization from a top-down perspective, i.e. high-level task context. In this paper, our system adopts a new method for adaptive context modeling and iterative boosting. First, the contextualized support vector machine (Context-SVM) is proposed, where the context takes the role of dynamically adjusting the classification score based on the sample ambiguity, and thus the context-adaptive classifier is achieved. Then, an iterative training procedure is presented. In each step, Context-SVM, associated with the output context from one task (object classification or detection), is instantiated to boost the performance for the other task, whose augmented outputs are then further used to improve the former task by Context-SVM. The proposed solution is evaluated on the object classification and detection tasks of PASCAL Visual Object Classes Challenge (VOC) 2007, 2010 and SUN09 data sets, and achieves the state-of-the-art performance.

148 citations
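The two ingredients described above, a context term that adjusts scores according to sample ambiguity and an alternating training loop between the two tasks, can be sketched as follows. This is a hedged illustration using scikit-learn's LinearSVC on synthetic data, not the paper's Context-SVM: the exponential ambiguity weighting and the way the context enters the score are assumptions made for the example, and for brevity the SVMs themselves are retrained on the raw features only, whereas in the paper the context directly reshapes the classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def context_adjusted(base_scores, context_scores, lam=0.5):
    """Context nudges the score most where the base classifier is ambiguous
    (score near the decision boundary); confident samples are barely changed."""
    ambiguity = np.exp(-np.abs(base_scores))
    return base_scores + lam * ambiguity * context_scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y_cls = (X[:, 0] > 0).astype(int)            # toy image-classification labels
y_det = (X[:, 1] > 0).astype(int)            # toy detection labels (surrogate)
cls_ctx = np.zeros(len(X))                   # context fed to the classification task
det_ctx = np.zeros(len(X))                   # context fed to the detection task

for step in range(3):                        # iterative mutual boosting between the two tasks
    cls_scores = context_adjusted(LinearSVC(max_iter=5000).fit(X, y_cls).decision_function(X), cls_ctx)
    det_ctx = cls_scores                     # classification output becomes detection context
    det_scores = context_adjusted(LinearSVC(max_iter=5000).fit(X, y_det).decision_function(X), det_ctx)
    cls_ctx = det_scores                     # detection output becomes classification context
```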

Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper formulates tracking as a proposal selection task and makes two contributions: introducing novel proposals estimated from the geometric transformations undergone by the object to build a rich candidate set for predicting the object location, and devising a novel selection strategy based on multiple cues.
Abstract: Tracking-by-detection approaches are some of the most successful object trackers in recent years. Their success is largely determined by the detector model they learn initially and then update over time. However, under challenging conditions where an object can undergo transformations, e.g., severe rotation, these methods are found to be lacking. In this paper, we address this problem by formulating it as a proposal selection task and making two contributions. The first one is introducing novel proposals estimated from the geometric transformations undergone by the object, and building a rich candidate set for predicting the object location. The second one is devising a novel selection strategy using multiple cues, i.e., detection score and edgeness score computed from state-of-the-art object edges and motion boundaries. We extensively evaluate our approach on the Visual Object Tracking (VOT) 2014 challenge and online tracking benchmark datasets, and show the best performance.

117 citations
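The selection step described above, ranking a rich candidate set by multiple cues, can be sketched as follows. This is a hedged NumPy illustration: the min-max normalisation and the linear weighting of the detection and edgeness cues are assumptions for the example, not the paper's exact selection strategy.

```python
import numpy as np

def select_proposal(boxes, detection_scores, edgeness_scores, w_det=0.7, w_edge=0.3):
    """Rank candidate boxes (e.g. detector proposals plus proposals generated from
    estimated geometric transformations) by a weighted combination of cues and
    return the best one. The cue weights are illustrative assumptions."""
    norm = lambda s: (s - s.min()) / (np.ptp(s) + 1e-8)       # min-max normalise each cue
    combined = w_det * norm(detection_scores) + w_edge * norm(edgeness_scores)
    return boxes[int(np.argmax(combined))]

# Toy usage: 5 candidate boxes in (x, y, w, h) format with random cue scores.
boxes = np.array([[10, 10, 50, 80], [12, 11, 52, 78], [8, 9, 48, 82], [30, 40, 50, 80], [11, 10, 49, 81]])
best = select_proposal(boxes, np.random.rand(5), np.random.rand(5))
```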

Proceedings ArticleDOI
16 Jun 2012
TL;DR: A hierarchical matching framework with so-called side information is introduced for image classification based on the bag-of-words representation, and two exemplar algorithms are designed using two types of side information: an object confidence map derived from object detection priors and a visual saliency map derived from within-image contexts.
Abstract: In this work, we introduce a hierarchical matching framework with so-called side information for image classification based on the bag-of-words representation. Each image is expressed as a bag of orderless pairs, each of which includes a local feature vector encoded over a visual dictionary and its corresponding side information from priors or contexts. The side information is used for hierarchical clustering of the encoded local features. A hierarchical matching kernel is then derived as the weighted sum of the similarities over the encoded features pooled within clusters at different levels. Finally, the new kernel is integrated with popular machine learning algorithms for classification. This framework is quite general and flexible; other practical and powerful algorithms can easily be designed by using it as a template and exploiting particular side information for hierarchical clustering of the encoded local features. To tackle the latent spatial mismatch issues in spatial pyramid matching (SPM), we design two exemplar algorithms based on two types of side information: an object confidence map and a visual saliency map, obtained from object detection priors and within-image contexts respectively. Extensive experiments on the Caltech-UCSD Birds 200, Oxford Flowers 17 and 102, PASCAL VOC 2007, and PASCAL VOC 2010 databases show state-of-the-art performance for these two exemplar algorithms.

96 citations
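The hierarchical matching kernel described above can be illustrated with a small sketch: each local feature code carries a scalar piece of side information (e.g. a saliency value in [0, 1]), codes are grouped into progressively finer bins of that side information, pooled per bin, and the bin-wise similarities are summed with per-level weights. The binning scheme, max pooling, and level weights below are assumptions chosen for illustration, not the paper's exact construction.

```python
import numpy as np

def side_info_matching_kernel(codes_a, side_a, codes_b, side_b,
                              levels=(1, 2, 4), weights=(0.25, 0.25, 0.5)):
    """Hierarchical matching driven by side information: at each level the side
    information (assumed to lie in [0, 1]) is split into `level` bins, codes in
    corresponding bins are max-pooled and compared, and the similarities are
    summed with the level weight."""
    sim = 0.0
    dim = codes_a.shape[1]
    for level, w in zip(levels, weights):
        cells_a = np.clip((side_a * level).astype(int), 0, level - 1)
        cells_b = np.clip((side_b * level).astype(int), 0, level - 1)
        for c in range(level):
            pa = codes_a[cells_a == c].max(axis=0) if (cells_a == c).any() else np.zeros(dim)
            pb = codes_b[cells_b == c].max(axis=0) if (cells_b == c).any() else np.zeros(dim)
            sim += w * float(pa @ pb)          # similarity of pooled codes in matching cells
    return sim

# Toy usage: encoded local features over a 256-word dictionary, with a per-feature
# saliency value as side information.
rng = np.random.default_rng(0)
k = side_info_matching_kernel(rng.random((100, 256)), rng.random(100),
                              rng.random((120, 256)), rng.random(120))
```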


Cited by
Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; the challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008 to 2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations
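Among the evaluation tools listed above, the bootstrap test for deciding whether the gap between two algorithms is significant can be sketched generically as follows; the use of per-image scores as the resampling unit and the number of resampling rounds are assumptions made for illustration, not necessarily the organisers' exact procedure.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, rounds=10000, seed=0):
    """Resample test images with replacement and measure how often the sign of the
    mean performance difference flips; a small flip rate suggests the observed
    difference between the two algorithms is significant."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    idx = rng.integers(0, len(a), size=(rounds, len(a)))      # bootstrap image indices
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    observed = a.mean() - b.mean()
    flip_rate = float(np.mean(np.sign(diffs) != np.sign(observed)))
    return observed, flip_rate
```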

Posted Content
TL;DR: A series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13, suggests that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
Abstract: Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine-grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g., jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.

4,033 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: In this paper, features extracted from the OverFeat network are used as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets.
Abstract: Recent results indicate that the generic descriptors extracted from convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network, which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine-grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods except on the sculptures dataset. The results are achieved using a linear SVM classifier (or L2 distance in the case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g., jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.

3,346 citations
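The recipe evaluated in this paper, generic CNN features fed to a linear SVM, can be sketched as follows. OverFeat itself is not packaged with torchvision, so a pretrained ResNet-18 is used purely as a stand-in backbone; the layer choice, the 512-dimensional feature size, and the toy data are assumptions for illustration.

```python
import torch
from torchvision import models
from sklearn.svm import LinearSVC

# Pretrained ImageNet backbone used as a fixed, generic feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()     # drop the ImageNet classifier head, keep 512-d features
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: float tensor of shape (N, 3, 224, 224), already resized and normalised."""
    return backbone(images).cpu().numpy()

# Toy usage: random tensors stand in for a real target dataset (e.g. a scene or
# fine-grained dataset); a linear SVM is trained on the extracted features.
images = torch.randn(16, 3, 224, 224)
labels = [0, 1] * 8
clf = LinearSVC(C=1.0, max_iter=5000).fit(extract_features(images), labels)
```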

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work designs a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset, and shows that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification.
Abstract: Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.

3,316 citations
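The transfer strategy summarised above, keeping the ImageNet-trained layers and learning only a new task-specific layer on the smaller target dataset, can be sketched as follows; the ResNet-18 backbone, the freezing policy, and the optimiser settings are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
from torch import nn
from torchvision import models

NUM_TARGET_CLASSES = 20   # e.g. the number of PASCAL VOC object classes

# Reuse ImageNet-trained layers, freeze them, and train only a new adaptation head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                                  # keep the transferred representation fixed
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)   # new task-specific layer, trainable

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One toy training step; random tensors stand in for the small annotated target set.
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, NUM_TARGET_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```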