Home
/
Authors
/
Emanuele Coviello

Author

Emanuele Coviello

Other affiliations: University of California, Amazon.com, University of Padua

Bio: Emanuele Coviello is an academic researcher from University of California, San Diego. The author has contributed to research in topics: Generative model & Hidden Markov model. The author has an hindex of 13, co-authored 23 publications receiving 1803 citations. Previous affiliations of Emanuele Coviello include University of California & Amazon.com.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

A new approach to cross-modal multimedia retrieval

[...]

Nikhil Rasiwasia¹, Jose Costa Pereira¹, Emanuele Coviello¹, Gabriel Doyle¹, Gert R. G. Lanckriet¹, Roger Levy¹, Nuno Vasconcelos¹ - Show less +3 more•Institutions (1)

University of California, San Diego¹

25 Oct 2010

TL;DR: It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy and are shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.

...read moreread less

Abstract: The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.

...read moreread less

1,284 citations

Journal Article•DOI•

On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval

[...]

Jose Costa Pereira¹, Emanuele Coviello¹, Gabriel Doyle¹, Nikhil Rasiwasia², Gert R. G. Lanckriet¹, Roger Levy¹, Nuno Vasconcelos¹ - Show less +3 more•Institutions (2)

University of California, San Diego¹, Yahoo!²

01 Mar 2014-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A mathematical formulation equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities is proposed, finding that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.

...read moreread less

Abstract: The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.

...read moreread less

371 citations

Journal Article•DOI•

Clustering Dynamic Textures with the Hierarchical EM Algorithm for Modeling Video

[...]

Adeel Mumtaz¹, Emanuele Coviello², Gert R. G. Lanckriet², Antoni B. Chan¹•Institutions (2)

City University of Hong Kong¹, University of California, San Diego²

01 Jul 2013-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper derives a new algorithm for clustering DT models that is based on the hierarchical EM algorithm, and demonstrates the efficacy of the clustering algorithm on several applications in motion analysis, including hierarchical motion clustering, semantic motion annotation, and bag-of-systems codebook generation.

...read moreread less

Abstract: Dynamic texture (DT) is a probabilistic generative model, defined over space and time, that represents a video as the output of a linear dynamical system (LDS). The DT model has been applied to a wide variety of computer vision problems, such as motion segmentation, motion classification, and video registration. In this paper, we derive a new algorithm for clustering DT models that is based on the hierarchical EM algorithm. The proposed clustering algorithm is capable of both clustering DTs and learning novel DT cluster centers that are representative of the cluster members in a manner that is consistent with the underlying generative probabilistic model of the DT. We also derive an efficient recursive algorithm for sensitivity analysis of the discrete-time Kalman smoothing filter, which is used as the basis for computing expectations in the E-step of the HEM algorithm. Finally, we demonstrate the efficacy of the clustering algorithm on several applications in motion analysis, including hierarchical motion clustering, semantic motion annotation, and learning bag-of-systems (BoS) codebooks for dynamic texture recognition.

...read moreread less

70 citations

Journal Article•DOI•

Time Series Models for Semantic Music Annotation

[...]

Emanuele Coviello¹, Antoni B. Chan², Gert R. G. Lanckriet¹•Institutions (2)

University of California, San Diego¹, City University of Hong Kong²

01 Jul 2011-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A novel approach to automatic music annotation and retrieval that captures temporal aspects as well as timbral content, and a novel, efficient, and hierarchical expectation-maximization algorithm for DTM (HEM-DTM) is used to summarize the common information shared by DTMs modeling individual songs associated with a tag.

...read moreread less

Abstract: Many state-of-the-art systems for automatic music tagging model music based on bag-of-features representations which give little or no account of temporal dynamics, a key characteristic of the audio signal. We describe a novel approach to automatic music annotation and retrieval that captures temporal (e.g., rhythmical) aspects as well as timbral content. The proposed approach leverages a recently proposed song model that is based on a generative time series model of the musical content-the dynamic texture mixture (DTM) model-that treats fragments of audio as the output of a linear dynamical system. To model characteristic temporal dynamics and timbral content at the tag level, a novel, efficient, and hierarchical expectation-maximization (EM) algorithm for DTM (HEM-DTM) is used to summarize the common information shared by DTMs modeling individual songs associated with a tag. Experiments show learning the semantics of music benefits from modeling temporal dynamics.

...read moreread less

58 citations

Journal Article•DOI•

Clustering hidden Markov models with variational HEM

[...]

Emanuele Coviello¹, Antoni B. Chan², Gert R. G. Lanckriet¹•Institutions (2)

University of California, San Diego¹, City University of Hong Kong²

01 Jan 2014-Journal of Machine Learning Research

TL;DR: A novel algorithm to cluster HMMs based on the hierarchical EM (HEM) algorithm, which effectively leverages large amounts of data when learning annotation models by using an efficient hierarchical estimation procedure, which reduces learning times and memory requirements, while improving model robustness through better regularization.

...read moreread less

Abstract: The hidden Markov model (HMM) is a widely-used generative model that copes with sequential data, assuming that each observation is conditioned on the state of a hidden Markov chain. In this paper, we derive a novel algorithm to cluster HMMs based on the hierarchical EM (HEM) algorithm. The proposed algorithm i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a "cluster center", that is, a novel HMM that is representative for the group, in a manner that is consistent with the underlying generative model of the HMM. To cope with intractable inference in the E-step, the HEM algorithm is formulated as a variational optimization problem, and efficiently solved for the HMM case by leveraging an appropriate variational approximation. The benefits of the proposed algorithm, which we call variational HEM (VHEM), are demonstrated on several tasks involving time-series data, such as hierarchical clustering of motion capture sequences, and automatic annotation and retrieval of music and of online hand-writing data, showing improvements over current methods. In particular, our variational HEM algorithm effectively leverages large amounts of data when learning annotation models by using an efficient hierarchical estimation procedure, which reduces learning times and memory requirements, while improving model robustness through better regularization.

...read moreread less

54 citations

1
2
3
4
…
5

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Multimodal Machine Learning: A Survey and Taxonomy

[...]

Tadas Baltrusaitis¹, Chaitanya Ahuja², Louis-Philippe Morency²•Institutions (2)

Microsoft¹, Carnegie Mellon University²

01 Feb 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.

...read moreread less

Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together Multimodal machine learning aims to build models that can process and relate information from multiple modalities It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research

...read moreread less

1,945 citations

Journal Article•DOI•

Framing image description as a ranking task: data, models and evaluation metrics

[...]

Micah Hodosh¹, Peter Young¹, Julia Hockenmaier¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 May 2013-Journal of Artificial Intelligence Research

TL;DR: This paper proposed to frame sentence-based image annotation as the task of ranking a given pool of captions and showed that the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions, is emphasized.

...read moreread less

Abstract: The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. We introduce a number of systems that perform quite well on this task, even though they are only based on features that can be obtained with minimal supervision. Our results clearly indicate the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions. We also perform an in-depth comparison of human and automatic evaluation metrics for this task, and propose strategies for collecting human judgments cheaply and on a very large scale, allowing us to augment our collection with additional relevance judgments of which captions describe which image. Our analysis shows that metrics that consider the ranked list of results for each query image or sentence are significantly more robust than metrics that are based on a single response per query. Moreover, our study suggests that the evaluation of ranking-based image description systems may be fully automated.

...read moreread less

991 citations

Journal Article•DOI•

Visual Domain Adaptation: A survey of recent advances

[...]

Vishal M. Patel¹, Raghuraman Gopalan², Ruonan Li³, Rama Chellappa•Institutions (3)

University of Maryland, College Park¹, AT&T Labs², Harvard University³

02 Apr 2015-IEEE Signal Processing Magazine

TL;DR: A survey of domain adaptation methods for visual recognition discusses the merits and drawbacks of existing domain adaptation approaches and identifies promising avenues for research in this rapidly evolving field.

...read moreread less

Abstract: In pattern recognition and computer vision, one is often faced with scenarios where the training data used to learn a model have different distribution from the data on which the model is applied. Regardless of the cause, any distributional change that occurs after learning a classifier can degrade its performance at test time. Domain adaptation tries to mitigate this degradation. In this article, we provide a survey of domain adaptation methods for visual recognition. We discuss the merits and drawbacks of existing domain adaptation approaches and identify promising avenues for research in this rapidly evolving field.

...read moreread less

871 citations

Journal Article•DOI•

Anomaly Detection and Localization in Crowded Scenes

[...]

Weixin Li¹, Vijay Mahadevan¹, Nuno Vasconcelos¹•Institutions (1)

University of California, San Diego¹

01 Jan 2014-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal and spatial anomalies is proposed, based on a video representation that accounts for both appearance and dynamics, using a set of mixture of dynamic textures models.

...read moreread less

Abstract: The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal and spatial anomalies is proposed. The proposed detector is based on a video representation that accounts for both appearance and dynamics, using a set of mixture of dynamic textures models. These models are used to implement 1) a center-surround discriminant saliency detector that produces spatial saliency scores, and 2) a model of normal behavior that is learned from training data and produces temporal saliency scores. Spatial and temporal anomaly maps are then defined at multiple spatial scales, by considering the scores of these operators at progressively larger regions of support. The multiscale scores act as potentials of a conditional random field that guarantees global consistency of the anomaly judgments. A data set of densely crowded pedestrian walkways is introduced and used to evaluate the proposed anomaly detector. Experiments on this and other data sets show that the latter achieves state-of-the-art anomaly detection results.

...read moreread less

844 citations

Proceedings Article•DOI•

Generalized Multiview Analysis: A discriminative latent space

[...]

Abhishek Sharma¹, Abhishek Kumar¹, Hal Daumé¹, David W. Jacobs¹•Institutions (1)

University of Maryland, College Park¹

16 Jun 2012

TL;DR: GMA solves a joint, relaxed QCQP over different feature spaces to obtain a single (non)linear subspace and is a supervised extension of Canonical Correlational Analysis (CCA), which is useful for cross-view classification and retrieval.

...read moreread less

Abstract: This paper presents a general multi-view feature extraction approach that we call Generalized Multiview Analysis or GMA. GMA has all the desirable properties required for cross-view classification and retrieval: it is supervised, it allows generalization to unseen classes, it is multi-view and kernelizable, it affords an efficient eigenvalue based solution and is applicable to any domain. GMA exploits the fact that most popular supervised and unsupervised feature extraction techniques are the solution of a special form of a quadratic constrained quadratic program (QCQP), which can be solved efficiently as a generalized eigenvalue problem. GMA solves a joint, relaxed QCQP over different feature spaces to obtain a single (non)linear subspace. Intuitively, GMA is a supervised extension of Canonical Correlational Analysis (CCA), which is useful for cross-view classification and retrieval. The proposed approach is general and has the potential to replace CCA whenever classification or retrieval is the purpose and label information is available. We outperform previous approaches for textimage retrieval on Pascal and Wiki text-image data. We report state-of-the-art results for pose and lighting invariant face recognition on the MultiPIE face dataset, significantly outperforming other approaches.

...read moreread less

733 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse