Institution

Vocapia Research

Company · Orsay, France
About: Vocapia Research is a company based in Orsay, France. It is known for research contributions in speaker recognition and language models. Its 17 authors have published 37 publications, which have received 542 citations. The organization is also known as Vecsys Research.

Papers
Proceedings Article · DOI
01 Dec 2013
TL;DR: Two techniques are shown to yield improved Keyword Spotting (KWS) performance when using the ATWV/MTWV performance measures, which resulted in the highest performance for the official surprise language evaluation for the IARPA-funded Babel project in April 2013.
Abstract: We present two techniques that are shown to yield improved Keyword Spotting (KWS) performance when using the ATWV/MTWV performance measures: (i) score normalization, where the scores of different keywords become commensurate with each other and correspond more closely to the probability of being correct than raw posteriors do; and (ii) system combination, where the detections of multiple systems are merged and their scores are interpolated with weights optimized using MTWV as the maximization criterion. Both score normalization and system combination yield significant gains in ATWV/MTWV, sometimes on the order of 8-10 points (absolute), in five different languages. A variant of these methods resulted in the highest performance for the official surprise language evaluation for the IARPA-funded Babel project in April 2013.

104 citations
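ATWV and MTWV, the measures named in the keyword-spotting abstract above, are term-weighted values. The sketch below computes TWV at a fixed threshold and MTWV as its maximum over thresholds, following the standard NIST definition TWV(θ) = 1 − (1/K) Σ_k [P_miss(k, θ) + β·P_FA(k, θ)], with β = 999.9 and P_FA(k, θ) = N_FA(k, θ) / (T − N_true(k)), where T is the amount of speech in seconds. It is a minimal illustration, not the scoring tool used in the paper: the `Detection` type and the function names are hypothetical, and it assumes every detection has already been aligned to the reference (at most one hit per reference occurrence).

```python
from dataclasses import dataclass

BETA = 999.9  # false-alarm weight used in NIST STD 2006 and IARPA Babel scoring


@dataclass
class Detection:
    keyword: str
    score: float
    correct: bool  # True if the hit aligns with a reference occurrence


def twv(detections, n_true, speech_seconds, threshold, beta=BETA):
    """Term-weighted value when every detection scoring >= threshold is accepted.

    n_true maps keyword -> number of reference occurrences. Keywords with no
    reference occurrence are skipped, as in the NIST definition.
    """
    keywords = [k for k, n in n_true.items() if n > 0]
    cost = 0.0
    for k in keywords:
        accepted = [d for d in detections if d.keyword == k and d.score >= threshold]
        n_hit = sum(1 for d in accepted if d.correct)
        n_fa = len(accepted) - n_hit
        p_miss = 1.0 - n_hit / n_true[k]
        # False alarms are normalized by the number of non-target trials,
        # taken as one trial per second of speech minus the true occurrences.
        p_fa = n_fa / (speech_seconds - n_true[k])
        cost += p_miss + beta * p_fa
    return 1.0 - cost / len(keywords)


def mtwv(detections, n_true, speech_seconds, beta=BETA):
    """Return (maximum TWV, threshold achieving it) over all observed scores.

    The extra infinite threshold corresponds to accepting no detections (TWV = 0).
    """
    thresholds = sorted({d.score for d in detections}) + [float("inf")]
    return max(
        (twv(detections, n_true, speech_seconds, t, beta), t) for t in thresholds
    )
```

ATWV is the same quantity evaluated at the threshold the system actually committed to, whereas MTWV picks the best threshold after the fact. Because one global threshold is applied to all keywords, making scores commensurate across keywords (the goal of the score normalization described above) directly affects how close ATWV can get to MTWV.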

Proceedings Article
09 Sep 2012
TL;DR: Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient.
Abstract: We propose an approach for unsupervised speaker identification in TV broadcast videos, by combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchor speakers.

37 citations
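The abstract above does not spell out its "task-adapted variant" of TF-IDF. Purely as a hypothetical illustration, the sketch below scores each (speaker cluster, overlaid name) pair with a plain TF-IDF analogue: TF is the share of a cluster's total name co-occurrence time spent with that name, and IDF is log(number of clusters / number of clusters the name co-occurs with). Each cluster then takes its highest-scoring name. The input format, `overlap`, and `propagate_names` are assumptions for illustration, not the authors' method.

```python
import math
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def propagate_names(turns, ocr_names):
    """Name speaker clusters with a plain TF-IDF analogue (hypothetical variant).

    turns: list of (cluster_id, start, end) speech turns from diarization.
    ocr_names: list of (name, start, end) on-screen name appearances.
    Returns {cluster_id: name} for clusters that receive a name.
    """
    cooc = defaultdict(float)  # (cluster, name) -> co-occurrence duration
    for cluster, t_start, t_end in turns:
        for name, n_start, n_end in ocr_names:
            d = overlap(t_start, t_end, n_start, n_end)
            if d > 0:
                cooc[(cluster, name)] += d

    clusters = {c for c, _, _ in turns}
    cluster_total = defaultdict(float)  # TF denominator per cluster
    doc_freq = defaultdict(int)  # number of clusters each name co-occurs with
    for (c, n), d in cooc.items():
        cluster_total[c] += d
        doc_freq[n] += 1

    assignment = {}
    for c in clusters:
        best_name, best_score = None, 0.0
        for (c2, n), d in cooc.items():
            if c2 != c:
                continue
            tf = d / cluster_total[c]
            # IDF is 0 when a name co-occurs with every cluster, so it is never chosen.
            idf = math.log(len(clusters) / doc_freq[n])
            if tf * idf > best_score:
                best_name, best_score = n, tf * idf
        if best_name is not None:
            assignment[c] = best_name
    return assignment
```

The IDF factor discounts text that is on screen regardless of who is talking (a channel logo or programme title, say), so clusters that only co-occur with such text stay unnamed rather than receiving a misleading label.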

Proceedings Article · DOI
27 Aug 2011
TL;DR: Different architectures for cross-show speaker diarization are compared: the obvious concatenation of all shows, a hybrid system combining a local clustering stage followed by a global clustering stage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly.
Abstract: Acoustic speaker diarization is investigated for situations where a collection of shows from the same source needs to be processed. In this case, the same speaker should receive the same label across all shows. We compare different architectures for cross-show speaker diarization: the obvious concatenation of all shows, a hybrid system combining a local clustering stage followed by a global clustering stage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly. The latter system is the best suited to real application scenarios. These three strategies were compared to a baseline single-show system on a set of 46 ten-minute samples of British English scientific podcasts.

35 citations
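The incremental architecture described above can be sketched as follows, under deliberately simplified assumptions: each local cluster from a single-show diarization is summarized by a fixed-length vector and a duration, clusters are compared to global speaker models by cosine similarity against a fixed threshold, and a matched global model is updated as a duration-weighted mean. The abstract does not specify the speaker models or distance measure, so these choices, the threshold value, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def incremental_cross_show(shows, threshold=0.8):
    """Assign show-independent speaker IDs, processing shows in the given order.

    shows: list of {local_label: (vector, duration_seconds)}, one dict per show.
    Returns a list with one {local_label: global_id} dict per show.
    """
    models = []  # global speaker models: [mean_vector, total_duration]
    mappings = []
    for show in shows:
        mapping, used = {}, set()
        for label, (vec, dur) in show.items():
            best_id, best_sim = None, threshold
            for gid, (mean, _) in enumerate(models):
                if gid in used:  # two local clusters of one show stay distinct
                    continue
                sim = cosine(vec, mean)
                if sim >= best_sim:
                    best_id, best_sim = gid, sim
            if best_id is None:
                # No sufficiently similar known speaker: create a new global model.
                models.append([list(vec), dur])
                best_id = len(models) - 1
            else:
                # Update the matched model as a duration-weighted running mean.
                mean, total = models[best_id]
                new_total = total + dur
                models[best_id] = [
                    [(m * total + x * dur) / new_total for m, x in zip(mean, vec)],
                    new_total,
                ]
            used.add(best_id)
            mapping[label] = best_id
        mappings.append(mapping)
    return mappings
```

Unlike concatenating all shows, this only ever needs the current show plus the accumulated global models, which is why an incremental scheme fits a stream of new episodes arriving over time.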

Book Chapter · DOI
21 Jun 2017
TL;DR: An intelligent embodied conversation agent with linguistic, social and emotional competence constructed around an ontology-based knowledge model that allows for flexible reasoning-driven dialogue planning, instead of using predefined dialogue scripts.
Abstract: We present an intelligent embodied conversation agent with linguistic, social and emotional competence. Unlike the vast majority of state-of-the-art conversation agents, the proposed agent is constructed around an ontology-based knowledge model that allows for flexible reasoning-driven dialogue planning, instead of using predefined dialogue scripts. It is further complemented by multimodal communication analysis and generation modules and a search engine for the retrieval of multimedia background content from the web needed for conducting a conversation on a given topic. The evaluation of the first prototype of the agent shows a high degree of acceptance by the users with respect to its trustworthiness, naturalness, etc. The individual technologies are being further improved in the second prototype.

33 citations

Proceedings Article · DOI
08 Sep 2016
TL;DR: A Divide-and-Conquer (D&C) method is introduced to quickly and successfully train an RNN-based multi-language classifier that outperforms classical LID techniques and combines very well with a phonotactic system.
Abstract: This paper describes the design of an acoustic language recognition system based on BLSTM that can discriminate closely related languages and dialects of the same language. We introduce a Divide-and-Conquer (D&C) method to quickly and successfully train an RNN-based multi-language classifier. Experiments compare this approach to the straightforward training of the same RNN, as well as to two widely used LID techniques: a phonotactic system using DNN acoustic models and an i-vector system. Results are reported on two different data sets: the 14 languages of NIST LRE07 and the 20 closely related languages and dialects of NIST OpenLRE15. In addition to reporting the NIST Cavg metric which served as the primary metric for the LRE07 and OpenLRE15 evaluations, the EER and LER are provided. When used with BLSTM, the D&C training scheme significantly outperformed the classical training method for multi-class RNNs. On the OpenLRE15 data set, this method also outperforms classical LID techniques and combines very well with a phonotactic system.

31 citations
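Cavg, the primary metric cited in the language recognition abstract above, is an average detection cost. A minimal sketch of the closed-set form defined for NIST LRE07, assuming C_miss = C_FA = 1 and P_target = 0.5: Cavg = (1/N_L) Σ_{L_T} [C_miss·P_target·P_miss(L_T) + Σ_{L_N ≠ L_T} C_FA·P_nontarget·P_FA(L_T, L_N)], with P_nontarget = (1 − P_target)/(N_L − 1). The trial format and the `cavg` name are assumptions, and the sketch does not reproduce any evaluation-specific scoring details of OpenLRE15.

```python
def cavg(trials, languages, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Average detection cost over a closed set of languages.

    trials: list of (true_language, decisions), where decisions maps every
    target language to True (accept) or False (reject) for that test segment.
    """
    n = len(languages)
    p_nontarget = (1.0 - p_target) / (n - 1)
    by_lang = {lang: [d for t, d in trials if t == lang] for lang in languages}
    total = 0.0
    for lt in languages:
        # Miss: a segment of language lt rejected by the lt detector.
        target_trials = by_lang[lt]
        p_miss = sum(1 for d in target_trials if not d[lt]) / len(target_trials)
        cost = c_miss * p_target * p_miss
        for ln in languages:
            if ln == lt:
                continue
            # False alarm: a segment of language ln accepted by the lt detector.
            nontarget_trials = by_lang[ln]
            p_fa = sum(1 for d in nontarget_trials if d[lt]) / len(nontarget_trials)
            cost += c_fa * p_nontarget * p_fa
        total += cost
    return total / n
```

For example, with two languages where the English detector misses one of two English segments and falsely accepts one of two French segments, while the French detector makes no errors, the function returns 0.25.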


Network Information
Related Institutions (5)
Institut Eurécom
3.7K papers, 113K citations

85% related

Institute for Infocomm Research Singapore
7.9K papers, 212.2K citations

84% related

Télécom ParisTech
7.7K papers, 191.4K citations

82% related

NTT DoCoMo
8.6K papers, 160.5K citations

82% related

Mitsubishi Electric Research Laboratories
3.8K papers, 131.6K citations

82% related

Performance Metrics
Number of papers from the institution in previous years

Year    Papers
2021    1
2019    1
2018    1
2017    5
2016    8
2015    3