Author

Xavier Anguera Miro

Bio: Xavier Anguera Miro is an academic researcher from Telefónica. The author has contributed to research in topics: Speaker diarisation & Voice analysis. The author has an h-index of 4 and has co-authored 10 publications receiving 680 citations.

Papers
Journal ArticleDOI
TL;DR: Presents an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identifies important areas for future research.
Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state of the art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.
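
The NIST Rich Transcription evaluations mentioned above score systems by diarization error rate (DER): the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. As a rough illustration, here is a minimal frame-based sketch of the metric in Python; it is an approximation (fixed 10 ms frames, no scoring collar), not the official NIST md-eval scorer, and all names are our own.

```python
"""Minimal frame-based DER sketch (illustrative, not the NIST scorer)."""
import numpy as np
from scipy.optimize import linear_sum_assignment

Segment = tuple[float, float, str]  # (start_sec, end_sec, speaker_label)

def to_frames(segments: list[Segment], n_frames: int, step: float) -> list[set]:
    """Mark which speakers are active in each fixed-length frame."""
    frames = [set() for _ in range(n_frames)]
    for start, end, spk in segments:
        for i in range(int(start / step), min(n_frames, int(np.ceil(end / step)))):
            frames[i].add(spk)
    return frames

def der(reference: list[Segment], hypothesis: list[Segment], step: float = 0.01) -> float:
    total = max(end for _, end, _ in reference + hypothesis)
    n = int(np.ceil(total / step))
    ref = to_frames(reference, n, step)
    hyp = to_frames(hypothesis, n, step)

    # Find the one-to-one speaker mapping that maximizes frame overlap
    # (Hungarian algorithm on the negated overlap counts).
    ref_spk = sorted({s for f in ref for s in f})
    hyp_spk = sorted({s for f in hyp for s in f})
    overlap = np.zeros((len(ref_spk), len(hyp_spk)))
    for f_r, f_h in zip(ref, hyp):
        for r in f_r:
            for h in f_h:
                overlap[ref_spk.index(r), hyp_spk.index(h)] += 1
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {hyp_spk[c]: ref_spk[r] for r, c in zip(rows, cols)}

    # Per-frame decomposition into missed speech, false alarm, confusion.
    miss = fa = confusion = ref_time = 0
    for f_r, f_h in zip(ref, hyp):
        mapped = {mapping.get(h, h) for h in f_h}
        ref_time += len(f_r)
        miss += max(0, len(f_r) - len(f_h))
        fa += max(0, len(f_h) - len(f_r))
        confusion += min(len(f_r), len(f_h)) - len(f_r & mapped)
    return (miss + fa + confusion) / ref_time
```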

706 citations

Journal ArticleDOI
TL;DR: The first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches.
Abstract: The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the rich transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This paper therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The paper also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection and beamforming.
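
Of the preprocessing steps listed above, speech activity detection is the simplest to illustrate. The sketch below is a deliberately naive energy-threshold detector, not the model-based detector the ICSI system actually uses; the window, step, and threshold values are illustrative assumptions.

```python
"""Toy energy-based speech activity detector (illustrative only)."""
import numpy as np

def speech_activity(signal: np.ndarray, rate: int,
                    win: float = 0.025, step: float = 0.010,
                    threshold_db: float = -35.0) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame energy lies within
    threshold_db of the loudest frame (assumed values, not tuned)."""
    w, s = int(win * rate), int(step * rate)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, s)]
    energy_db = 10 * np.log10([np.mean(f.astype(float) ** 2) + 1e-12 for f in frames])
    active = energy_db > (energy_db.max() + threshold_db)

    # Collapse consecutive active frames into (start, end) segments.
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * step
        elif not a and start is not None:
            segments.append((start, i * step))
            start = None
    if start is not None:
        segments.append((start, len(active) * step))
    return segments
```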

49 citations

Patent
16 Oct 2009
TL;DR: A multimodal video copy detection method that extracts independent audio and video fingerprints representing changes in the content and offers two alternative copy detection strategies.
Abstract: This invention proposes a multimodal detection of video copies. It first extracts independent audio and video fingerprints representing the changes in the content. It then proposes two alternative copy detection strategies. The full-query matching considers that the query video appears entirely in the queried video. The partial-query matching considers that only part of the query appears. Either for the full query or for each subsegment in the partial-query algorithm, the cross-correlation with phase transform is computed between all signature pairs and accumulated to form a fused cross-correlation signal. In the full-query algorithm, the best alignment candidates are retrieved and a normalized scalar product is used to obtain a final matching score. In the partial query, a histogram is created with optimum alignments for each subsegment and only the best ones are considered and further processed as in the full-query. A threshold is used to determine whether a copy exists.
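
The alignment step at the heart of the patent, cross-correlation with phase transform (often called GCC-PHAT), can be sketched compactly. The code below only illustrates that step and the fusion of per-pair correlation signals; fingerprint extraction and the full-/partial-query logic are not reproduced, the function names are our own, and the fusion assumes equally long fingerprint pairs.

```python
"""Sketch of phase-transform cross-correlation and score fusion."""
import numpy as np

def gcc_phat(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cross-correlate two 1-D signals, whitening spectral magnitudes so
    that only phase (relative alignment) contributes to the peak."""
    n = len(a) + len(b) - 1
    A, B = np.fft.rfft(a, n), np.fft.rfft(b, n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12   # phase transform: discard magnitude
    return np.fft.irfft(cross, n)

def fused_best_lag(pairs: list[tuple[np.ndarray, np.ndarray]]) -> int:
    """Accumulate per-pair GCC-PHAT signals (e.g., one audio pair and one
    video pair of equal length) and return the lag of the fused peak."""
    fused = sum(gcc_phat(a, b) for a, b in pairs)
    lag = int(np.argmax(fused))
    # Indices past the midpoint correspond to wrapped negative lags.
    return lag if lag < len(fused) // 2 else lag - len(fused)
```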

21 citations

Patent
Xavier Anguera Miro
17 Dec 2013
TL;DR: In this paper, the authors used an improved algorithm partially based on dynamic time warping and information retrieval techniques, while solving the problems (such as computational complexity and memory requirements) observed in these matching techniques.
Abstract: Method, system and computer program for determining matching between two time series. They use an improved algorithm partially based on Dynamic Time Warping and Information Retrieval techniques, but solving the problems (such as computational complexity and memory requirements) observed in these matching techniques.
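
For context, the Dynamic Time Warping baseline that the patent improves on fits in a few lines. This is the textbook quadratic dynamic-programming formulation, whose O(n·m) time and memory cost is exactly what the patented method is designed to avoid; it is not the improved algorithm itself, and it assumes 1-D series with an absolute-difference local distance.

```python
"""Textbook DTW distance: the quadratic baseline, not the patented method."""
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """O(len(x) * len(y)) time and memory, which is the complexity the
    patent's indexing-based approach is designed to reduce."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance between samples
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```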

15 citations

Patent
15 Nov 2013
TL;DR: A method in which, when a voice communication matches keyword-based restriction criteria stored in a query database, a voice analysis unit spots keywords spoken by either user and an intonation analyser indicates the emotional state of the calling and/or called user when a keyword is spotted.
Abstract: The method comprises: a calling user requesting a voice communication with a called user through a communication service, which sends the voice stream generated in the communication to a voice analysis unit. When the communication matches a restriction criterion based on keywords, and a segmentation for those keywords, specified by a query manager and stored in a query database, the voice analysis unit analyses the content of the received voice stream to capture data: a keyword analyser spots at least one keyword, spoken by either user in the communication, that matches the restriction criterion, and an intonation analyser processes the captured data to indicate the emotional state of the calling user and/or the called user when that keyword is spotted.
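
The matching logic of the claim can be paraphrased as a small data flow: spot registered keywords in the recognized word stream, then attach an emotion label at each hit. The sketch below is a hypothetical rendering; none of the names come from the patent, and the speech recognizer and intonation analyser are assumed to exist as black boxes.

```python
"""Hypothetical sketch of the keyword-spotting/intonation step; all names invented."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Query:
    keywords: set[str]          # terms registered by the query manager

@dataclass
class SpottedEvent:
    keyword: str
    time_sec: float
    emotional_state: str        # label produced by the intonation analyser

def analyse(words: list[tuple[str, float]],
            query: Query,
            intonation_at: Callable[[float], str]) -> list[SpottedEvent]:
    """Spot registered keywords in a (word, timestamp) stream and tag
    each hit with the speaker's emotional state at that instant."""
    return [SpottedEvent(w, t, intonation_at(t))
            for w, t in words if w.lower() in query.keywords]
```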

3 citations


Cited by
Journal ArticleDOI
11 Dec 2015-PLOS ONE
TL;DR: In this paper, the authors present pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization.
Abstract: Audio information plays a rather important role in the increasing digital content that is available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automations and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analysis of online videos for content-based recommendation), etc. This paper presents pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization. pyAudioAnalysis is licensed under the Apache License and is available at GitHub (https://github.com/tyiannak/pyAudioAnalysis/). Here we present the theoretical background behind the wide range of implemented methodologies, along with evaluation metrics for some of the methods. pyAudioAnalysis has already been used in several audio analysis research applications: smart-home functionalities through audio event detection, speech emotion recognition, depression classification based on audio-visual features, music segmentation, multimodal content-based movie recommendation and health applications (e.g. monitoring eating habits). The feedback provided from all these particular audio applications has led to practical enhancement of the library.
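
To give a flavor of the library, here is its basic short-term feature-extraction call, following the usage shown in the project README (module names have changed across library versions, and "sample.wav" is a placeholder path):

```python
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

rate, signal = audioBasicIO.read_audio_file("sample.wav")   # placeholder path
signal = audioBasicIO.stereo_to_mono(signal)                # collapse multi-channel input
features, names = ShortTermFeatures.feature_extraction(
    signal, rate, int(0.050 * rate), int(0.025 * rate))     # 50 ms windows, 25 ms step
print(features.shape)   # (n_features, n_frames) matrix
print(names[:3])        # e.g. zero-crossing rate, energy, energy entropy
```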

362 citations

Journal ArticleDOI
07 Feb 2013
TL;DR: Illustrates behavioral informatics applications of signal processing techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion.
Abstract: The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

286 citations

Journal ArticleDOI
01 Jun 2018
TL;DR: A wide survey of publicly available datasets suitable for data-driven learning of dialogue systems, discussing important characteristics of these datasets and how they can be used to learn diverse dialogue strategies.
Abstract: During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.

239 citations

Proceedings ArticleDOI
01 Dec 2014
TL;DR: A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Abstract: Speaker diarization via unsupervised i-vector clustering has gained popularity in recent years. In this approach, i-vectors are extracted from short clips of speech segmented from a larger multi-speaker conversation and organized into speaker clusters, typically according to their cosine score. In this paper, we propose a system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring, a method already frequently utilized in speaker recognition tasks, and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion. We also demonstrate that denser sampling in the i-vector space with overlapping temporal segments provides a gain in the diarization task. We test our system on the CALLHOME conversational telephone speech corpus, which includes multiple languages and a varying number of speakers, and we show that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
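
The cosine-scoring baseline the authors compare against amounts to agglomerative clustering of i-vectors with a score threshold as the stopping criterion, roughly as sketched below. This is not the paper's PLDA-calibrated system; the i-vector matrix is assumed to be given, and the threshold and all names are illustrative.

```python
"""Agglomerative i-vector clustering with cosine scoring (baseline sketch)."""
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_ivectors(ivectors: np.ndarray, stop_score: float = 0.5) -> list[int]:
    """Repeatedly merge the highest-scoring pair of clusters until no
    pair scores above the threshold; returns a cluster id per segment."""
    clusters = {i: [i] for i in range(len(ivectors))}
    while len(clusters) > 1:
        ids = list(clusters)
        best, pair = -np.inf, None
        for pos, ci in enumerate(ids):
            for cj in ids[pos + 1:]:
                # Score cluster pairs by the cosine of their mean i-vectors.
                s = cosine(ivectors[clusters[ci]].mean(axis=0),
                           ivectors[clusters[cj]].mean(axis=0))
                if s > best:
                    best, pair = s, (ci, cj)
        if best < stop_score:   # unsupervised stopping criterion
            break
        ci, cj = pair
        clusters[ci] += clusters.pop(cj)
    labels = np.empty(len(ivectors), dtype=int)
    for label, members in enumerate(clusters.values()):
        labels[members] = label
    return labels.tolist()
```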

226 citations