Author

Xavier Anguera Miro

Bio: Xavier Anguera Miro is an academic researcher from Telefónica. The author has contributed to research in topics: Speaker diarisation & Voice analysis. The author has an h-index of 4 and has co-authored 10 publications receiving 680 citations.

Papers
Journal ArticleDOI
TL;DR: Presents an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identifies important areas for future research.
Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become a key technology for many tasks, such as navigation, retrieval, or higher level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state of the art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.
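
The NIST Rich Transcription evaluations mentioned above score systems by diarization error rate (DER): the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker. As a rough illustration, here is a minimal frame-based sketch of the metric in Python; it is an approximation (fixed 10 ms frames, no scoring collar), not the official NIST md-eval scorer, and all names are our own.

```python
"""Minimal frame-based DER sketch (illustrative, not the NIST scorer)."""
import numpy as np
from scipy.optimize import linear_sum_assignment

Segment = tuple[float, float, str]  # (start_sec, end_sec, speaker_label)

def to_frames(segments: list[Segment], n_frames: int, step: float) -> list[set]:
    """Mark which speakers are active in each fixed-length frame."""
    frames = [set() for _ in range(n_frames)]
    for start, end, spk in segments:
        for i in range(int(start / step), min(n_frames, int(np.ceil(end / step)))):
            frames[i].add(spk)
    return frames

def der(reference: list[Segment], hypothesis: list[Segment], step: float = 0.01) -> float:
    total = max(end for _, end, _ in reference + hypothesis)
    n = int(np.ceil(total / step))
    ref = to_frames(reference, n, step)
    hyp = to_frames(hypothesis, n, step)

    # Find the one-to-one speaker mapping that maximizes frame overlap
    # (Hungarian algorithm on the negated overlap counts).
    ref_spk = sorted({s for f in ref for s in f})
    hyp_spk = sorted({s for f in hyp for s in f})
    overlap = np.zeros((len(ref_spk), len(hyp_spk)))
    for f_r, f_h in zip(ref, hyp):
        for r in f_r:
            for h in f_h:
                overlap[ref_spk.index(r), hyp_spk.index(h)] += 1
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {hyp_spk[c]: ref_spk[r] for r, c in zip(rows, cols)}

    # Per-frame decomposition into missed speech, false alarm, confusion.
    miss = fa = confusion = ref_time = 0
    for f_r, f_h in zip(ref, hyp):
        mapped = {mapping.get(h, h) for h in f_h}
        ref_time += len(f_r)
        miss += max(0, len(f_r) - len(f_h))
        fa += max(0, len(f_h) - len(f_r))
        confusion += min(len(f_r), len(f_h)) - len(f_r & mapped)
    return (miss + fa + confusion) / ref_time
```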

706 citations

Journal ArticleDOI
TL;DR: The first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches.
Abstract: The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the rich transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This paper therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The paper also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection and beamforming.
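
Of the preprocessing steps listed above, speech activity detection is the simplest to illustrate. The sketch below is a deliberately naive energy-threshold detector, not the model-based detector the ICSI system actually uses; the window, step, and threshold values are illustrative assumptions.

```python
"""Toy energy-based speech activity detector (illustrative only)."""
import numpy as np

def speech_activity(signal: np.ndarray, rate: int,
                    win: float = 0.025, step: float = 0.010,
                    threshold_db: float = -35.0) -> list[tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame energy lies within
    threshold_db of the loudest frame (assumed values, not tuned)."""
    w, s = int(win * rate), int(step * rate)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, s)]
    energy_db = 10 * np.log10([np.mean(f.astype(float) ** 2) + 1e-12 for f in frames])
    active = energy_db > (energy_db.max() + threshold_db)

    # Collapse consecutive active frames into (start, end) segments.
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * step
        elif not a and start is not None:
            segments.append((start, i * step))
            start = None
    if start is not None:
        segments.append((start, len(active) * step))
    return segments
```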

49 citations

Patent
16 Oct 2009
TL;DR: A multimodal video copy detection method that extracts independent audio and video fingerprints representing changes in the content and offers two alternative copy detection strategies.
Abstract: This invention proposes a multimodal detection of video copies. It first extracts independent audio and video fingerprints representing the changes in the content. It then proposes two alternative copy detection strategies. The full-query matching considers that the query video appears entirely in the queried video. The partial-query matching considers that only part of the query appears. Either for the full query or for each subsegment in the partial-query algorithm, the cross-correlation with phase transform is computed between all signature pairs and accumulated to form a fused cross-correlation signal. In the full-query algorithm, the best alignment candidates are retrieved and a normalized scalar product is used to obtain a final matching score. In the partial query, a histogram is created with optimum alignments for each subsegment and only the best ones are considered and further processed as in the full-query. A threshold is used to determine whether a copy exists.
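
The alignment step at the heart of the patent, cross-correlation with phase transform (often called GCC-PHAT), can be sketched compactly. The code below only illustrates that step and the fusion of per-pair correlation signals; fingerprint extraction and the full-/partial-query logic are not reproduced, the function names are our own, and the fusion assumes equally long fingerprint pairs.

```python
"""Sketch of phase-transform cross-correlation and score fusion."""
import numpy as np

def gcc_phat(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cross-correlate two 1-D signals, whitening spectral magnitudes so
    that only phase (relative alignment) contributes to the peak."""
    n = len(a) + len(b) - 1
    A, B = np.fft.rfft(a, n), np.fft.rfft(b, n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12   # phase transform: discard magnitude
    return np.fft.irfft(cross, n)

def fused_best_lag(pairs: list[tuple[np.ndarray, np.ndarray]]) -> int:
    """Accumulate per-pair GCC-PHAT signals (e.g., one audio pair and one
    video pair of equal length) and return the lag of the fused peak."""
    fused = sum(gcc_phat(a, b) for a, b in pairs)
    lag = int(np.argmax(fused))
    # Indices past the midpoint correspond to wrapped negative lags.
    return lag if lag < len(fused) // 2 else lag - len(fused)
```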

21 citations

Patent
Xavier Anguera Miro
17 Dec 2013
TL;DR: In this paper, the authors used an improved algorithm partially based on dynamic time warping and information retrieval techniques, while solving the problems (such as computational complexity and memory requirements) observed in these matching techniques.
Abstract: Method, system and computer program for determining matching between two time series. They use an improved algorithm partially based on Dynamic Time Warping and Information Retrieval techniques, but solving the problems (such as computational complexity and memory requirements) observed in these matching techniques.
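
For context, the Dynamic Time Warping baseline that the patent improves on fits in a few lines. This is the textbook quadratic dynamic-programming formulation, whose O(n·m) time and memory cost is exactly what the patented method is designed to avoid; it is not the improved algorithm itself, and it assumes 1-D series with an absolute-difference local distance.

```python
"""Textbook DTW distance: the quadratic baseline, not the patented method."""
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """O(len(x) * len(y)) time and memory, which is the complexity the
    patent's indexing-based approach is designed to reduce."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])          # local distance between samples
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```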

15 citations

Patent
15 Nov 2013
TL;DR: A method in which, when a voice communication matches keyword-based restriction criteria stored in a query database, a voice analysis unit spots keywords spoken by either user and an intonation analyser indicates the emotional state of the calling and/or called user when a keyword is spotted.
Abstract: The method comprises: a calling user requesting a voice communication with a called user through a communication service, which sends the voice stream generated in the communication to a voice analysis unit. When the communication matches a restriction criterion based on keywords, and a segmentation for those keywords, specified by a query manager and stored in a query database, the voice analysis unit analyses the content of the received voice stream to capture data: a keyword analyser spots at least one keyword, spoken by either user in the communication, that matches the restriction criterion, and an intonation analyser processes the captured data to indicate the emotional state of the calling user and/or the called user when that keyword is spotted.
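
The matching logic of the claim can be paraphrased as a small data flow: spot registered keywords in the recognized word stream, then attach an emotion label at each hit. The sketch below is a hypothetical rendering; none of the names come from the patent, and the speech recognizer and intonation analyser are assumed to exist as black boxes.

```python
"""Hypothetical sketch of the keyword-spotting/intonation step; all names invented."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Query:
    keywords: set[str]          # terms registered by the query manager

@dataclass
class SpottedEvent:
    keyword: str
    time_sec: float
    emotional_state: str        # label produced by the intonation analyser

def analyse(words: list[tuple[str, float]],
            query: Query,
            intonation_at: Callable[[float], str]) -> list[SpottedEvent]:
    """Spot registered keywords in a (word, timestamp) stream and tag
    each hit with the speaker's emotional state at that instant."""
    return [SpottedEvent(w, t, intonation_at(t))
            for w, t in words if w.lower() in query.keywords]
```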

3 citations


Cited by
Journal ArticleDOI
11 Dec 2015-PLOS ONE
TL;DR: In this paper, the authors present pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization.
Abstract: Audio information plays a rather important role in the increasing digital content that is available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automations and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analysis of online videos for content-based recommendation), etc. This paper presents pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization. pyAudioAnalysis is licensed under the Apache License and is available at GitHub (https://github.com/tyiannak/pyAudioAnalysis/). Here we present the theoretical background behind the wide range of implemented methodologies, along with evaluation metrics for some of the methods. pyAudioAnalysis has already been used in several audio analysis research applications: smart-home functionalities through audio event detection, speech emotion recognition, depression classification based on audio-visual features, music segmentation, multimodal content-based movie recommendation and health applications (e.g. monitoring eating habits). The feedback provided from all these particular audio applications has led to practical enhancement of the library.
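
To give a flavor of the library, here is its basic short-term feature-extraction call, following the usage shown in the project README (module names have changed across library versions, and "sample.wav" is a placeholder path):

```python
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

rate, signal = audioBasicIO.read_audio_file("sample.wav")   # placeholder path
signal = audioBasicIO.stereo_to_mono(signal)                # collapse multi-channel input
features, names = ShortTermFeatures.feature_extraction(
    signal, rate, int(0.050 * rate), int(0.025 * rate))     # 50 ms windows, 25 ms step
print(features.shape)   # (n_features, n_frames) matrix
print(names[:3])        # e.g. zero-crossing rate, energy, energy entropy
```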

362 citations

Journal ArticleDOI
07 Feb 2013
TL;DR: Illustrates behavioral informatics applications of signal processing techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion.
Abstract: The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

286 citations

Journal ArticleDOI
01 Jun 2018
TL;DR: A wide survey of publicly available datasets suitable for data-driven learning of dialogue systems, discussing important characteristics of these datasets and how they can be used to learn diverse dialogue strategies.
Abstract: During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.

239 citations

Proceedings ArticleDOI
01 Dec 2014
TL;DR: A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Abstract: Speaker diarization via unsupervised i-vector clustering has gained popularity in recent years. In this approach, i-vectors are extracted from short clips of speech segmented from a larger multi-speaker conversation and organized into speaker clusters, typically according to their cosine score. In this paper, we propose a system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring, a method already frequently utilized in speaker recognition tasks, and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion. We also demonstrate that denser sampling in the i-vector space with overlapping temporal segments provides a gain in the diarization task. We test our system on the CALLHOME conversational telephone speech corpus, which includes multiple languages and a varying number of speakers, and we show that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
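
The cosine-scoring baseline the authors compare against amounts to agglomerative clustering of i-vectors with a score threshold as the stopping criterion, roughly as sketched below. This is not the paper's PLDA-calibrated system; the i-vector matrix is assumed to be given, and the threshold and all names are illustrative.

```python
"""Agglomerative i-vector clustering with cosine scoring (baseline sketch)."""
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_ivectors(ivectors: np.ndarray, stop_score: float = 0.5) -> list[int]:
    """Repeatedly merge the highest-scoring pair of clusters until no
    pair scores above the threshold; returns a cluster id per segment."""
    clusters = {i: [i] for i in range(len(ivectors))}
    while len(clusters) > 1:
        ids = list(clusters)
        best, pair = -np.inf, None
        for pos, ci in enumerate(ids):
            for cj in ids[pos + 1:]:
                # Score cluster pairs by the cosine of their mean i-vectors.
                s = cosine(ivectors[clusters[ci]].mean(axis=0),
                           ivectors[clusters[cj]].mean(axis=0))
                if s > best:
                    best, pair = s, (ci, cj)
        if best < stop_score:   # unsupervised stopping criterion
            break
        ci, cj = pair
        clusters[ci] += clusters.pop(cj)
    labels = np.empty(len(ivectors), dtype=int)
    for label, members in enumerate(clusters.values()):
        labels[members] = label
    return labels.tolist()
```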

226 citations