Author

Yi Yu

Bio: Yi Yu is an academic researcher from the National Institute of Informatics. The author has contributed to research in topics: Lyrics & Feature learning. The author has an h-index of 24 and has co-authored 128 publications receiving 1,931 citations. Previous affiliations of Yi Yu include Nara Women's University & the National University of Singapore.


Papers
Journal ArticleDOI
TL;DR: Both similarity and dissimilarity cues are used in a ranking optimization framework for person reidentification, and a ranking aggregation algorithm is proposed to enhance the detection of similarity and dissimilarity, based on the assumption that the true match should be similar to the probe under different baseline methods.
Abstract: Person reidentification is a key technique for matching different persons observed in nonoverlapping camera views. Many researchers treat it as a special object-retrieval problem, where ranking optimization plays an important role. Existing ranking optimization methods mainly utilize the similarity relationship between the probe and gallery images to optimize the original ranking list, but seldom consider the important dissimilarity relationship. In this paper, we propose to use both similarity and dissimilarity cues in a ranking optimization framework for person reidentification. Its core idea is that the true match should not only be similar to those strongly similar galleries of the probe, but also be dissimilar to those strongly dissimilar galleries of the probe. Furthermore, motivated by the philosophy of multiview verification, a ranking aggregation algorithm is proposed to enhance the detection of similarity and dissimilarity, based on the following assumption: the true match should be similar to the probe in different baseline methods. In other words, if a gallery image is strongly similar to the probe in one method while simultaneously strongly dissimilar to the probe in another method, it is probably a wrong match for the probe. Extensive experiments conducted on public benchmark datasets, and comparisons with different baseline methods, demonstrate the clear superiority of the proposed ranking optimization method.
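The core idea lends itself to a compact sketch. Below is a minimal NumPy illustration, not the authors' exact formulation: a gallery's distance to the probe is adjusted downward when it sits close to the probe's strongly similar galleries and upward when it sits close to the strongly dissimilar ones. The function name, the mean-distance cues, and the weight alpha are assumptions made for illustration.

```python
import numpy as np

def optimize_ranking(d_probe, d_gallery, k_sim=5, k_dis=5, alpha=0.5):
    """Refine probe-to-gallery distances with similarity AND dissimilarity cues.

    d_probe   : (n,)   original distances from the probe to each gallery image
    d_gallery : (n, n) pairwise distances among the gallery images
    """
    order = np.argsort(d_probe)
    strong_sim = order[:k_sim]    # galleries strongly similar to the probe
    strong_dis = order[-k_dis:]   # galleries strongly dissimilar to it

    # A true match should be close to the strongly similar galleries ...
    sim_cue = d_gallery[:, strong_sim].mean(axis=1)
    # ... and far from the strongly dissimilar ones.
    dis_cue = d_gallery[:, strong_dis].mean(axis=1)

    return d_probe + alpha * (sim_cue - dis_cue)

# Toy usage with random features standing in for person descriptors.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(20, 8))
probe = rng.normal(size=8)
d_probe = np.linalg.norm(gallery - probe, axis=1)
d_gallery = np.linalg.norm(gallery[:, None] - gallery[None, :], axis=2)
print(np.argsort(optimize_ranking(d_probe, d_gallery))[:5])  # refined top 5
```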

183 citations

Journal ArticleDOI
TL;DR: This paper proposes a data-driven distance metric (DDDM) method, re-exploiting the training data to adjust the metric for each query-gallery pair, with a significant improvement over three baseline metric learning methods.
Abstract: Person re-identification, which aims to identify images of the same person from various cameras configured in different places, has attracted much attention in the multimedia retrieval community. In this problem, choosing a proper distance metric is a crucial aspect, and many classic methods utilize a single, uniformly learned metric. However, their performance is limited because they ignore the zero-shot and fine-grained characteristics present in real person re-identification applications. In this paper, we investigate two consistencies across two cameras, namely cross-view support consistency and cross-view projection consistency. The philosophy behind them is that, in spite of visual changes between two images of the same person under two camera views, the support sets in their respective views are highly consistent, and after being projected to the same view, their context sets are also highly consistent. Based on the above phenomena, we propose a data-driven distance metric (DDDM) method that re-exploits the training data to adjust the metric for each query-gallery pair. Experiments conducted on three public datasets have validated the effectiveness of the proposed method, with a significant improvement over three baseline metric learning methods. In particular, on the public VIPeR dataset, the proposed method achieves an accuracy rate of 42.09% at rank-1, which outperforms the state-of-the-art methods by 4.29%.
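To make "cross-view support consistency" concrete, here is a loose NumPy sketch of that single intuition, not the published DDDM algorithm: a query-gallery pair is penalized when its two support sets, drawn from row-aligned training data of the two camera views, disagree. The function names and the inflation factor beta are illustrative assumptions.

```python
import numpy as np

def support_set(x, train_feats, k=10):
    """Indices of x's k nearest training samples: its 'support set'."""
    d = np.linalg.norm(train_feats - x, axis=1)
    return set(np.argsort(d)[:k])

def pair_adaptive_distance(q, g, train_a, train_b, k=10, beta=0.5):
    """Inflate the base distance of a query-gallery pair whose cross-view
    support sets disagree. train_a and train_b hold the same training
    identities seen from camera A and camera B, row-aligned by identity."""
    consistency = len(support_set(q, train_a, k) &
                      support_set(g, train_b, k)) / k
    base = float(np.linalg.norm(q - g))
    return base * (1.0 + beta * (1.0 - consistency))

# Toy usage: view B is a perturbed copy of view A, so a matched pair
# shares most of its support set and keeps a small adjusted distance.
rng = np.random.default_rng(1)
train_a = rng.normal(size=(50, 16))
train_b = train_a + 0.1 * rng.normal(size=(50, 16))
print(pair_adaptive_distance(train_a[0], train_b[0], train_a, train_b))
```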

162 citations

Journal ArticleDOI
TL;DR: This paper proposes a new weakly supervised image segmentation model focused on learning the semantic associations between superpixel sets (called graphlets here), and presents a hierarchical Bayesian network to capture the semantic associations between post-embedding graphlets.
Abstract: Weakly supervised image segmentation is an important yet challenging task in image processing and pattern recognition fields. It is defined as follows: in the training stage, semantic labels are given only at the image level, without regard to their specific object/scene locations within the image. Given a test image, the goal is to predict the semantics of every pixel/superpixel. In this paper, we propose a new weakly supervised image segmentation model, focusing on learning the semantic associations between superpixel sets (graphlets in this paper). In particular, we first extract graphlets from each image, where a graphlet is a small-sized graph that measures the potential of multiple spatially neighboring superpixels (i.e., the probability of these superpixels sharing a common semantic label, such as the sky or the sea). To compare different-sized graphlets and to incorporate image-level labels, a manifold embedding algorithm is designed to transform all graphlets into equal-length feature vectors. Finally, we present a hierarchical Bayesian network to capture the semantic associations between post-embedding graphlets, based on which the semantics of each superpixel is inferred. Experimental results demonstrate that: 1) our approach performs competitively compared with the state-of-the-art approaches on three public data sets and 2) considerable performance enhancement is achieved when using our approach on segmentation-based photo cropping and image categorization.
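As a hedged sketch of just the first step, the snippet below enumerates graphlets, connected sets of spatially neighboring superpixels, from a region-adjacency graph. The networkx-based brute-force enumeration is an assumption for illustration; the paper's manifold embedding and Bayesian network stages are not shown.

```python
import networkx as nx
from itertools import combinations

def extract_graphlets(rag, max_size=3):
    """Enumerate graphlets: connected subsets of spatially neighboring
    superpixels, up to max_size nodes. Brute force for clarity; a
    practical system would sample rather than enumerate every subset."""
    graphlets = []
    for size in range(1, max_size + 1):
        for nodes in combinations(rag.nodes, size):
            if nx.is_connected(rag.subgraph(nodes)):
                graphlets.append(frozenset(nodes))
    return graphlets

# Toy region-adjacency graph: four superpixels in a 2x2 grid.
rag = nx.Graph([(0, 1), (0, 2), (1, 3), (2, 3)])
print(sorted(sorted(g) for g in extract_graphlets(rag)))
```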

146 citations

Journal ArticleDOI
TL;DR: This is the first study to use deep architectures for learning the temporal correlation between audio and lyrics; it involves a two-branch deep neural network for the audio and text (lyrics) modalities, with two significant contributions made in the audio branch.
Abstract: Deep cross-modal learning has successfully demonstrated excellent performance in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and lyrics, should be taken into account. Motivated by the inherently temporal structure of music, we set out to learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture involving two-branch deep neural networks for the audio modality and the text modality (lyrics). Data in different modalities are converted to the same canonical space, where intermodal canonical correlation analysis is utilized as an objective function to calculate the similarity of temporal structures. This is the first study that uses deep architectures for learning the temporal correlation between audio and lyrics. A pretrained Doc2Vec model followed by fully connected layers is used to represent lyrics. Two significant contributions are made in the audio branch: (i) we propose an end-to-end network to learn the cross-modal correlation between audio and lyrics, where feature extraction and correlation learning are performed simultaneously and a joint representation is learned with temporal structures taken into account; (ii) for feature extraction, we further represent an audio signal by a short sequence of local summaries (VGG16 features) and apply a recurrent neural network to compute a compact feature that better captures the temporal structure of music audio. Experimental results, using audio to retrieve lyrics or using lyrics to retrieve audio, verify the effectiveness of the proposed deep correlation learning architectures in cross-modal music retrieval.
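The intermodal CCA objective at the heart of such an architecture can be written down in a few lines. The NumPy sketch below computes the total canonical correlation between two batches of paired branch outputs, the quantity a deep CCA network maximizes (its negative serves as the training loss); the regularization constant and function name are illustrative assumptions, and a real implementation would express this in a differentiable framework.

```python
import numpy as np

def cca_objective(H1, H2, eps=1e-4):
    """Total canonical correlation between two views, e.g. the audio and
    lyrics branch outputs for n paired samples.

    H1, H2 : (n, d) matrices of paired network outputs.
    """
    assert H1.shape == H2.shape  # equal branch output dims assumed here
    n, d = H1.shape
    H1c, H2c = H1 - H1.mean(axis=0), H2 - H2.mean(axis=0)
    S12 = H1c.T @ H2c / (n - 1)
    S11 = H1c.T @ H1c / (n - 1) + eps * np.eye(d)   # regularized
    S22 = H2c.T @ H2c / (n - 1) + eps * np.eye(d)   # covariances

    def inv_sqrt(S):  # S^(-1/2) via the eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()  # trace norm of T

# Two correlated "modalities" score higher than independent ones.
rng = np.random.default_rng(2)
audio = rng.normal(size=(64, 10))
lyrics = 0.8 * audio + 0.2 * rng.normal(size=(64, 10))
print(cca_objective(audio, lyrics))
print(cca_objective(audio, rng.normal(size=(64, 10))))
```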

84 citations

Journal ArticleDOI
TL;DR: In this paper, a novel deep learning model, category-based deep canonical correlation analysis, is proposed for fine-grained venue discovery from heterogeneous social multimodal data, where data in different modalities are projected into the same space via deep networks.
Abstract: In this work, travel destinations and business locations are taken as venues. Discovering a venue from a photograph is very important for visual context-aware applications. Unfortunately, few efforts have paid attention to complicated real images such as venue photographs generated by users. Our goal is fine-grained venue discovery from heterogeneous social multimodal data. To this end, we propose a novel deep learning model, category-based deep canonical correlation analysis. Given a photograph as input, this model performs: 1) exact venue search (find the venue where the photograph was taken) and 2) group venue search (find relevant venues that have the same category as the photograph), using the cross-modal correlation between the input photograph and the textual descriptions of venues. In this model, data in different modalities are projected into the same space via deep networks. Pairwise correlation (between different-modality data from the same venue) for exact venue search and category-based correlation (between different-modality data from different venues with the same category) for group venue search are jointly optimized. Because a photograph cannot fully reflect the rich text description of a venue, the number of photographs per venue in the training phase is increased to capture more aspects of a venue. We build a new venue-aware multimodal data set by integrating Wikipedia featured articles and Foursquare venue photographs. Experimental results on this data set confirm the feasibility of the proposed method. Moreover, the evaluation over another publicly available data set confirms that the proposed method outperforms the state of the art for cross-modal retrieval between image and text.
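Once photo and venue-text embeddings live in the shared space, both search modes reduce to nearest-neighbor lookups. The sketch below is a simplified assumption of the retrieval stage only: the paper learns the category relationship during training, whereas here the category is merely predicted by a top-k vote, and all names and parameters are illustrative.

```python
import numpy as np

def venue_search(photo_vec, venue_vecs, venue_cats, k=5):
    """Exact and group venue search in a learned shared space.

    photo_vec  : (d,)   embedding of the query photograph
    venue_vecs : (m, d) text embeddings of the candidate venues
    venue_cats : (m,)   category label of each venue
    """
    sims = venue_vecs @ photo_vec / (
        np.linalg.norm(venue_vecs, axis=1) * np.linalg.norm(photo_vec))
    ranked = np.argsort(-sims)                 # cosine-similarity ranking
    exact = ranked[0]                          # 1) exact venue search
    cats, counts = np.unique(venue_cats[ranked[:k]], return_counts=True)
    group = np.where(venue_cats == cats[counts.argmax()])[0]
    return exact, group                        # 2) group venue search

# Toy usage with random embeddings and three venue categories.
rng = np.random.default_rng(3)
venues = rng.normal(size=(30, 12))
cats = rng.integers(0, 3, size=30)
photo = venues[7] + 0.05 * rng.normal(size=12)  # a photo of venue 7
print(venue_search(photo, venues, cats))
```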

81 citations


Cited by

Proceedings Article
01 Jan 1994
TL;DR: The main focus in MUCKE is on cleaning large-scale Web image corpora and on proposing image representations that are closer to the human interpretation of images.
Abstract: MUCKE aims to mine a large volume of images, to structure them conceptually, and to use this conceptual structuring to improve large-scale image retrieval. The last decade witnessed important progress concerning low-level image representations. However, a number of problems need to be solved in order to unleash the full potential of image mining in applications. The central problem with low-level representations is the mismatch between them and the human interpretation of image content. This problem shows up, for instance, in the inability of existing descriptors to capture spatial relationships between the concepts they represent, or in their inability to convey an explanation of why two images are similar in a content-based image retrieval framework. We start by assessing existing local descriptors for image classification and by proposing the use of co-occurrence matrices to better capture spatial relationships in images. The main focus in MUCKE is on cleaning large-scale Web image corpora and on proposing image representations that are closer to the human interpretation of images. Consequently, we introduce methods that tackle these two problems and compare the results to state-of-the-art methods. Note: some aspects of this deliverable are withheld at this time as they are pending review. Please contact the authors for a preview.

2,134 citations

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper proposes a k-reciprocal encoding method to re-rank re-ID results; the hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match.
Abstract: When considering person re-identification (re-ID) as a retrieval process, re-ranking is a critical step for improving its accuracy. Yet in the re-ID community, limited effort has been devoted to re-ranking, especially to fully automatic, unsupervised solutions. In this paper, we propose a k-reciprocal encoding method to re-rank the re-ID results. Our hypothesis is that if a gallery image is similar to the probe in the k-reciprocal nearest neighbors, it is more likely to be a true match. Specifically, given an image, a k-reciprocal feature is calculated by encoding its k-reciprocal nearest neighbors into a single vector, which is used for re-ranking under the Jaccard distance. The final distance is computed as a combination of the original distance and the Jaccard distance. Our re-ranking method does not require any human interaction or any labeled data, so it is applicable to large-scale datasets. Experiments on the large-scale Market-1501, CUHK03, MARS, and PRW datasets confirm the effectiveness of our method.
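A simplified sketch of this re-ranking, following the definitions above (k-reciprocal sets, Jaccard distance, weighted combination with the original distance), is given below. It deliberately omits the paper's neighbor-set expansion and local query expansion refinements, and the parameter values are illustrative.

```python
import numpy as np

def k_reciprocal_rerank(dist, k=5, lam=0.3):
    """Simplified k-reciprocal re-ranking.

    dist : (n, n) original pairwise distances over probe + gallery,
           with a zero diagonal so each image is its own 1st neighbor.
    """
    n = dist.shape[0]
    knn = np.argsort(dist, axis=1)[:, :k + 1]   # each row contains itself
    # i and j are k-reciprocal neighbors iff each is in the other's k-NN.
    recip = [set(j for j in knn[i] if i in knn[j]) for i in range(n)]

    jaccard = np.empty_like(dist)
    for i in range(n):
        for j in range(n):
            jaccard[i, j] = 1.0 - (len(recip[i] & recip[j]) /
                                   len(recip[i] | recip[j]))
    return lam * dist + (1.0 - lam) * jaccard   # final re-ranking distance

# Toy usage: re-rank the gallery for image 0.
rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
print(np.argsort(k_reciprocal_rerank(D)[0])[:5])
```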

1,306 citations

Posted Content
TL;DR: The history of person re-identification and its relationship with image classification and instance retrieval are introduced, and two new re-ID tasks that are much closer to real-world applications are described and discussed.
Abstract: Person re-identification (re-ID) has become increasingly popular in the community due to its application and research significance. It aims at spotting a person of interest in other cameras. In the early days, hand-crafted algorithms and small-scale evaluation were predominantly reported. Recent years have witnessed the emergence of large-scale datasets and deep learning systems which make use of large data volumes. Considering different tasks, we classify most current re-ID methods into two classes, i.e., image-based and video-based; in both tasks, hand-crafted and deep learning systems will be reviewed. Moreover, two new re-ID tasks which are much closer to real-world applications are described and discussed, i.e., end-to-end re-ID and fast re-ID in very large galleries. This paper: 1) introduces the history of person re-ID and its relationship with image classification and instance retrieval; 2) surveys a broad selection of the hand-crafted systems and the large-scale methods in both image- and video-based re-ID; 3) describes critical future directions in end-to-end re-ID and fast retrieval in large galleries; and 4) finally briefs some important yet under-developed issues.

984 citations