Author

Nicolas Obin

Bio: Nicolas Obin is an academic researcher from IRCAM. The author has contributed to research on prosody and speech synthesis. The author has an h-index of 15 and has co-authored 74 publications receiving 654 citations. Previous affiliations of Nicolas Obin include Centre national de la recherche scientifique & University of Paris.


Papers
Proceedings Article
26 May 2014
TL;DR: Describes the deliverable of the Rhapsodie project: a syntactic and prosodic treebank of spoken French, composed of 57 short speech samples amounting to 33000 words, orthographically and phonetically transcribed.
Abstract: The main objective of the Rhapsodie project (ANR Rhapsodie 07 Corp-030-01) was to define rich, explicit, and reproducible schemes for the annotation of prosody and syntax in different genres (± spontaneous, ± planned, face-to-face interviews vs. broadcast, etc.), in order to study the prosody/syntax/discourse interface in spoken French, and their roles in the segmentation of speech into discourse units (Lacheret, Kahane, & Pietrandrea forthcoming). We here describe the deliverable, a syntactic and prosodic treebank of spoken French, composed of 57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and 33000 words), orthographically and phonetically transcribed. The transcriptions and the annotations are all aligned on the speech signal: phonemes, syllables, words, speakers, overlaps. This resource is freely available at www.projet-rhapsodie.fr. The sound samples (wav/mp3), the acoustic analysis (original F0 curve manually corrected and automatic stylized F0, pitch format), the orthographic transcriptions (txt), the microsyntactic annotations (tabular format), the macrosyntactic annotations (txt, tabular format), the prosodic annotations (xml, textgrid, tabular format), and the metadata (xml and html) can be freely downloaded under the terms of the Creative Commons licence Attribution - Noncommercial - Share Alike 3.0 France. The metadata are encoded in the IMDI-CMFI format and can be parsed on line.

61 citations

Proceedings Article
26 May 2013
TL;DR: The proposed method outperforms conventional methods for the detection of syllable landmarks and boundaries on the TIMIT database of American English, and provides a promising paradigm for the segmentation of speech into syllables.
Abstract: This paper introduces novel paradigms for the segmentation of speech into syllables. The proposed method is based on a time-frequency representation of the speech signal and on the fusion of intensity and voicing measures across frequency regions, for the automatic selection of the information pertinent to the segmentation. The time-frequency representation is used to exploit the speech characteristics that depend on the frequency region. In this representation, intensity profiles are measured to provide information in the various frequency regions, and voicing profiles are measured to determine which frequency regions are pertinent for the segmentation. The proposed method outperforms conventional methods for the detection of syllable landmarks and boundaries on the TIMIT database of American English, and provides a promising paradigm for the segmentation of speech into syllables.

33 citations

Proceedings Article
05 May 2019
TL;DR: This research investigated the effectiveness of using a sequence-to-sequence (seq2seq) encoder-decoder based model to transform the intonation of a human voice from neutral to expressive speech, with some preliminary introduction of linguistic conditioning.
Abstract: Voice interfaces are becoming wildly popular and driving demand for more advanced speech synthesis and voice transformation systems. Current text-to-speech methods produce realistic sounding voices, but they lack the emotional expressivity that listeners expect, given the context of the interaction and the phrase being spoken. Emotional voice conversion is a research domain concerned with generating expressive speech from neutral synthesised speech or natural human voice. This research investigated the effectiveness of using a sequence-to-sequence (seq2seq) encoder-decoder based model to transform the intonation of a human voice from neutral to expressive speech, with some preliminary introduction of linguistic conditioning. A subjective experiment conducted on the task of speech emotion recognition by listeners successfully demonstrated the effectiveness of the proposed sequence-to-sequence models to produce convincing voice emotion transformations. In particular, conditioning the model on the position of the syllable in the phrase significantly improved recognition rates.

33 citations

Dissertation
23 Jun 2011
TL;DR: This thesis presents MeLos: a complete system for the analysis and modelling of speech prosody, "the music of speech", which is extended to model the speaking style of any arbitrary number of speakers using shared-context-dependent modelling and speaker normalization techniques.
Abstract: This thesis addresses the issue of modelling speech prosody for speech synthesis and presents MeLos: a complete system for the analysis and modelling of speech prosody, "the music of speech". The objective of this thesis is to model the strategies, alternatives, and speaking style of a speaker for natural, expressive, and varied speech synthesis. The study presents original contributions, with special attention paid to combining theoretical linguistics and statistical modelling to provide a complete speech prosody system. A unified discrete/continuous context-dependent HMM is presented to model the symbolic and the acoustic characteristics of speech prosody: 1) A rich description of the text characteristics, based on a linguistic processing chain that includes surface and deep syntactic parsing, is proposed to refine the modelling of speech prosody in context. 2) Segmental HMMs and Dempster-Shafer fusion are used to balance linguistic and metric constraints in the production of a pause. 3) A trajectory model is proposed, based on the stylization and the simultaneous modelling of short- and long-term F0 variations over various temporal domains. The proposed system is used to model the strategies, alternatives, and speaking style of a speaker, and is extended to model the speaking style of an arbitrary number of speakers using shared-context-dependent modelling and speaker normalization techniques.

31 citations

Journal Article
TL;DR: This article presents a tool for semi-automatically modelling the prosodic structure of French. Starting from a phoneme-level alignment, it detects prominent syllables using basic acoustic criteria such as f0, duration, and the presence of pauses.
Abstract: The aim of this article is to present a tool developed to model the prosodic structure of French semi-automatically. Starting from a phoneme-level alignment, our system detects prominent syllables by taking into account basic acoustic criteria such as f0, duration, and the presence of pauses. From the measurements thus obtained, the system assigns a degree of prominence to each syllable identified as salient. We then illustrate the results of analysing excerpts from the PROSO_FR corpus. More precisely, we compare the prosodic analysis of sentences that could be made with the traditional rules of prosodic phonology against the analysis produced by our software. We thus discuss three rules: the right-dominance rule, the stress-clash rule, and the seven-syllable rule.

31 citations


Cited by
Journal Article

893 citations

Proceedings Article
TL;DR: This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
Abstract: Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models --in all languages except English-- very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

490 citations

Proceedings Article
24 Mar 2018
TL;DR: An extension to the Tacotron speech synthesis architecture learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. Conditioning Tacotron on this embedding results in synthesized audio that matches the prosody of the reference signal with fine time detail.
Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

408 citations