Author

Yashaswi Verma

Bio: Yashaswi Verma is an academic researcher from Indian Institute of Technology, Jodhpur. The author has contributed to research in topics: Automatic image annotation & Image retrieval. The author has an h-index of 9 and has co-authored 19 publications receiving 511 citations. Previous affiliations of Yashaswi Verma include Indian Institute of Science & International Institute of Information Technology, Hyderabad.

Papers
Book ChapterDOI
07 Oct 2012
TL;DR: 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, is proposed; it performs comparably to the current state of the art on three challenging image annotation datasets and shows significant improvements after metric learning.
Abstract: Automatic image annotation aims at predicting a set of textual labels for an image that describe its semantics. These are usually taken from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is high variance in the number of images corresponding to different labels ("class-imbalance"). Additionally, due to the limitations of manual annotation, a significant number of available images are not annotated with all of their relevant labels ("weak-labelling"). These two issues adversely affect the performance of most existing image annotation models. In this work, we propose 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm that addresses these two issues in the image annotation task. The first step of 2PKNN uses "image-to-label" similarities, while the second step uses "image-to-image" similarities, thus combining the benefits of both. Since the performance of nearest-neighbour based methods greatly depends on how features are compared, we also propose a metric learning framework over 2PKNN that jointly learns weights for multiple features as well as distances. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label prediction. For scalability, we implement it by alternating between stochastic sub-gradient descent and projection steps. Extensive experiments demonstrate that, though conceptually simple, 2PKNN alone performs comparably to the current state of the art on three challenging image annotation datasets, and shows significant improvements after metric learning.
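As a rough illustration of the two-step neighbourhood idea summarised above (not the authors' implementation), the sketch below scores labels for a query image from a label-balanced neighbourhood; the feature matrix, binary annotation matrix, and the exponential distance weighting are illustrative assumptions.

```python
import numpy as np

def two_phase_knn(query, X_train, Y_train, K1=5):
    """Score every label for `query` with a 2PKNN-style two-step procedure.

    query   : (d,) feature vector of the test image
    X_train : (n_images, d) training feature matrix
    Y_train : (n_images, n_labels) binary annotation matrix
    K1      : per-label neighbourhood size used in the first step
    """
    dists = np.linalg.norm(X_train - query, axis=1)   # image-to-image distances
    scores = np.zeros(Y_train.shape[1])
    for label in range(Y_train.shape[1]):
        # Step 1: label-specific ("image-to-label") neighbourhood -- the K1
        # training images tagged with this label that are closest to the query.
        members = np.where(Y_train[:, label] == 1)[0]
        if members.size == 0:
            continue
        nearest = members[np.argsort(dists[members])[:K1]]
        # Step 2: accumulate label relevance from image-to-image similarity,
        # here an exponentially decaying function of distance.
        scores[label] = np.exp(-dists[nearest]).sum()
    return scores  # higher score = label judged more relevant
```

The top-scoring labels would then be assigned to the query image; in the paper, the plain Euclidean distance used here is replaced by a metric learned jointly over multiple features.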

168 citations

Proceedings Article
22 Jul 2012
TL;DR: This paper addresses the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions, and presents a generic method which benefits from all three sources simultaneously, and is capable of constructing novel descriptions.
Abstract: In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual clues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method which benefits from all three of these sources (i.e., visual clues, corpus statistics, and available descriptions) simultaneously, and is capable of constructing novel descriptions. Our approach works on syntactically and linguistically motivated phrases extracted from the human descriptions. Experimental evaluations demonstrate that our formulation mostly generates lucid and semantically correct descriptions, and significantly outperforms previous methods on automatic evaluation metrics. One of the significant advantages of our approach is that we can generate multiple interesting descriptions for an image. Unlike any previous work, we also test the applicability of our method on a large dataset containing complex images with rich descriptions.
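A deliberately simplified, hypothetical sketch of the final composition step implied above: assuming phrase candidates have already been extracted from human descriptions of similar images and ranked by relevance, a description is assembled from the top candidate of each slot. The slot names and the template are assumptions for illustration, not the paper's actual formulation.

```python
def compose_description(ranked_phrases):
    """ranked_phrases: dict with keys like 'subject', 'verb', 'scene', each
    mapping to a list of candidate phrases sorted by decreasing relevance."""
    subject = ranked_phrases["subject"][0]
    verb = ranked_phrases["verb"][0]
    scene = ranked_phrases["scene"][0]
    return f"{subject.capitalize()} {verb} {scene}."

candidates = {
    "subject": ["a brown dog", "a small animal"],
    "verb": ["is running on", "is playing in"],
    "scene": ["the grass", "a field"],
}
print(compose_description(candidates))  # -> "A brown dog is running on the grass."
```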

126 citations

Proceedings ArticleDOI
01 Jan 2013
TL;DR: A performance comparison of several image annotation methods, including the proposed KSVM-VT, is presented; the prefix 'K' denotes kernelization using a chi-squared kernel.
Abstract: Table 1: Performance comparison among different methods on three benchmark datasets; each cell reads precision / recall / F1 / N+. The prefix 'K' corresponds to kernelization using the chi-squared kernel.

Method             Dataset 1               Dataset 2               Dataset 3
MBRM [1]           0.24/0.25/0.245/122     0.18/0.19/0.185/209     0.24/0.23/0.235/233
JEC [3]            0.27/0.32/0.293/139     0.22/0.25/0.234/224     0.28/0.29/0.285/250
TagProp-ML [2]     0.31/0.37/0.337/146     0.49/0.20/0.284/213     0.48/0.25/0.329/227
TagProp-σML [2]    0.33/0.42/0.370/160     0.39/0.27/0.319/239     0.46/0.35/0.398/266
KSVM               0.29/0.43/0.346/174     0.30/0.28/0.290/256     0.43/0.27/0.332/266
KSVM-VT (Ours)     0.32/0.42/0.363/179     0.33/0.32/0.325/259     0.47/0.29/0.359/268

70 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper introduces a part-based representation for a pair of images that specifically compares corresponding parts, and associates with each part a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute.
Abstract: The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation for a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve significant improvements in relative attribute prediction accuracy. Additionally, it is also shown to improve relative-feedback based interactive image search.
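For context, the Ranking SVM baseline referred to above can be sketched as learning a weight vector that scores ordered image pairs with a margin. The version below is a minimal illustrative implementation (a plain global ranker, not the proposed part-based model), with hyper-parameters chosen arbitrarily.

```python
import numpy as np

def rank_svm(X, ordered_pairs, C=1.0, lr=0.01, epochs=200):
    """Learn w such that w.(x_i - x_j) > 1 whenever image i shows "more" of the
    attribute than image j.

    X             : (n, d) image feature matrix
    ordered_pairs : list of (i, j) index pairs with i ranked above j
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = w.copy()                      # gradient of the L2 regulariser
        for i, j in ordered_pairs:
            diff = X[i] - X[j]
            if w @ diff < 1.0:               # hinge loss active for this pair
                grad -= C * diff
        w -= lr * grad
    return w

# Usage: images with a larger w @ x are predicted to have "more" of the attribute.
```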

62 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper studies two complementary cross-modal prediction tasks: predicting text given an image (“Im2Text”), and predicting image(s) given a piece of text (‘Text2Im’), and proposes a novel Structural SVM based unified formulation for these two tasks.
Abstract: Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both tasks.
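As a hedged sketch of the second representation mentioned above (explicitly learned multi-modal correlations via CCA), the snippet below projects paired image and text features into a shared space and ranks gallery images for a text query by cosine similarity. It stands in for, and is much simpler than, the paper's Structural SVM formulation; the feature extractors and dimensionalities are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_shared_space(img_feats, txt_feats, dim=64):
    """img_feats: (n, d_img), txt_feats: (n, d_txt) for n paired training examples."""
    cca = CCA(n_components=dim)
    cca.fit(img_feats, txt_feats)
    return cca

def text2im(cca, query_txt, img_gallery):
    """Rank unannotated gallery images for a single text query of shape (1, d_txt)."""
    # Project both modalities into the learned correlated space.
    img_c, txt_c = cca.transform(img_gallery,
                                 np.repeat(query_txt, len(img_gallery), axis=0))
    img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
    txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
    scores = np.sum(img_c * txt_c, axis=1)   # cosine similarity per gallery image
    return np.argsort(-scores)               # most relevant image indices first
```

The Im2Text direction would follow the same pattern with the roles of the two modalities swapped.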

52 citations


Cited by
Proceedings ArticleDOI
07 Jun 2015
TL;DR: A novel paradigm for evaluating image descriptions that uses human consensus is proposed and a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources is evaluated.
Abstract: Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric that captures consensus, and two new datasets, PASCAL-50S and ABSTRACT-50S, that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as part of the MS COCO evaluation server to enable systematic evaluation and benchmarking.
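A simplified, hypothetical sketch of the consensus idea behind such a metric: represent sentences as TF-IDF-weighted n-gram vectors and score a candidate by its average cosine similarity to the human references. The actual CIDEr/CIDEr-D metrics additionally combine several n-gram lengths and apply stemming and a length-based penalty.

```python
from collections import Counter
import math

def ngrams(sentence, n):
    """Return a Counter of the sentence's n-grams (simple whitespace tokenisation)."""
    toks = sentence.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def tfidf_cosine(a, b, doc_freq, n_docs):
    """Cosine similarity between two n-gram Counters under TF-IDF weighting.
    doc_freq: Counter of how many reference sets contain each n-gram."""
    def vec(c):
        return {g: tf * math.log(n_docs / (1 + doc_freq[g])) for g, tf in c.items()}
    va, vb = vec(a), vec(b)
    dot = sum(va[g] * vb.get(g, 0.0) for g in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def consensus_score(candidate, references, doc_freq, n_docs, n=1):
    """Average TF-IDF cosine similarity of the candidate to each human reference."""
    cand = ngrams(candidate, n)
    return sum(tfidf_cosine(cand, ngrams(r, n), doc_freq, n_docs)
               for r in references) / len(references)
```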

3,504 citations

Journal ArticleDOI
TL;DR: This work proposes to use the visual denotations of linguistic expressions to define novel denotational similarity metrics, which are shown to be at least as beneficial as distributional similarities for two tasks that require semantic inference.
Abstract: We propose to use the visual denotations of linguistic expressions (i.e. the set of images they describe) to define novel denotational similarity metrics, which we show to be at least as beneficial as distributional similarities for two tasks that require semantic inference. To compute these denotational similarities, we construct a denotation graph, i.e. a subsumption hierarchy over constituents and their denotations, based on a large corpus of 30K images and 150K descriptive captions.
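The denotational similarity described above can be illustrated with a minimal sketch: two expressions are similar to the extent that the sets of images they describe overlap. The overlap measure (Jaccard) and the toy denotation sets below are assumptions chosen for illustration, not the paper's exact metric.

```python
def denotational_similarity(expr_a, expr_b, denotation):
    """denotation: dict mapping an expression to the set of image ids it describes."""
    a, b = denotation[expr_a], denotation[expr_b]
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy denotations: image ids are hypothetical.
denotation = {
    "a dog running": {1, 2, 3, 7},
    "an animal in motion": {2, 3, 7, 9, 11},
}
print(denotational_similarity("a dog running", "an animal in motion", denotation))  # 0.5
```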

2,026 citations

Journal ArticleDOI
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

1,945 citations

Posted Content
TL;DR: The Microsoft COCO Caption dataset and evaluation server are described and several popular metrics, including BLEU, METEOR, ROUGE and CIDEr are used to score candidate captions.
Abstract: In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent, human-generated captions will be provided. To ensure consistency in the evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.
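As a small, hedged example of scoring a candidate caption against multiple human references, the snippet below computes BLEU with NLTK; the actual evaluation server uses its own scorers and also reports METEOR, ROUGE and CIDEr. The example captions are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Multiple human reference captions for one image (tokenised).
references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a brown horse along the shore".split(),
]
candidate = "a man rides a horse on the beach".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```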

1,691 citations

Proceedings Article
07 May 2015
TL;DR: The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
Abstract: In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvements over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html.
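A minimal, hypothetical PyTorch sketch of the multimodal-layer idea described above: a word RNN and precomputed CNN image features are fused to predict the next word. Layer sizes, the GRU choice, and the additive tanh fusion are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalRNN(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, embed_dim=256,
                 hidden_dim=256, mm_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, mm_dim)         # project CNN features
        self.txt_proj = nn.Linear(hidden_dim, mm_dim)      # project RNN states
        self.out = nn.Linear(mm_dim, vocab_size)           # next-word scores

    def forward(self, img_feats, word_ids):
        """img_feats: (B, img_dim) precomputed CNN features; word_ids: (B, T)."""
        h, _ = self.rnn(self.embed(word_ids))              # (B, T, hidden_dim)
        # Multimodal layer: fuse the language state with the image representation.
        mm = torch.tanh(self.txt_proj(h) + self.img_proj(img_feats).unsqueeze(1))
        return self.out(mm)                                # (B, T, vocab_size) logits

# Usage: start from a begin-of-sentence token, take the argmax (or sample) of the
# last time step's logits, append the word, and repeat to generate a caption.
```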

1,203 citations