Book Chapter DOI

A Unified Parser for Developing Indian Language Text to Speech Synthesizers

TL;DR: The design of a language-independent parser for text-to-speech synthesis in Indian languages is described; TTS results show that the phoneme sequences generated by the proposed parser are more accurate than those produced by language-specific parsers.
Abstract: This paper describes the design of a language-independent parser for text-to-speech synthesis in Indian languages. Indian languages come from 5–6 different language families of the world. Most Indian languages have their own scripts. This makes parsing for text-to-speech systems for Indian languages a difficult task. In spite of the number of different families, which leads to divergence, there is a convergence owing to borrowings across language families. Most importantly, Indian languages are more or less phonetic and can be considered to consist broadly of about 35–38 consonants and 15–18 vowels. In this paper, an attempt is made to unify the languages based on this broad list of phones. A common label set is defined to represent the various phones in Indian languages. A uniform parser is designed across all the languages, capitalising on the syllable structure of Indian languages. The proposed parser converts UTF-8 text to the common label set, applies letter-to-sound rules and generates the corresponding phoneme sequences. The parser is tested against custom-built parsers for multiple Indian languages. The TTS results show that the phoneme sequences generated by the proposed parser are more accurate than those generated by language-specific parsers.
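The pipeline described here (UTF-8 text → common label set → letter-to-sound rules → phoneme sequence) can be illustrated with a minimal sketch. The grapheme table, label names and the inherent-vowel rule below are simplified assumptions for one script, not the paper's actual common label set or rule inventory:

```python
# Minimal sketch of the UTF-8 -> common-label-set (CLS) parsing idea.
# The table and label names are illustrative assumptions only.

# Tiny illustrative grapheme table for Devanagari (label names assumed).
CONSONANTS = {"क": "k", "म": "m", "ल": "l"}
VOWEL_SIGNS = {"ा": "aa", "ि": "i"}
INDEPENDENT_VOWELS = {"अ": "a", "इ": "i"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant

def word_to_phones(word: str) -> list[str]:
    """Convert one UTF-8 word to a CLS phone sequence.

    Rule of thumb for Indic abugidas: a consonant letter carries an
    inherent /a/ unless followed by a vowel sign or a virama.
    """
    phones = []
    chars = list(word)
    for i, ch in enumerate(chars):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else None
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                phones.append("a")  # inherent vowel
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
        # the virama itself emits nothing
    return phones

print(word_to_phones("कमल"))  # ['k', 'a', 'm', 'a', 'l', 'a'] (schwa deletion not modelled)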
Citations
Proceedings Article DOI
29 Aug 2018
TL;DR: A low-resource automatic speech recognition challenge for Indian languages was organized as part of Interspeech 2018; 109 submissions from 18 research groups were received and evaluated in terms of Word Error Rate on a blind test set.
Abstract: India has more than 1500 languages, with 30 of them spoken by more than one million native speakers. Most of them are low-resource and could greatly benefit from speech and language technologies. Building speech recognition support for these low-resource languages requires innovation in handling constraints on data size, while also exploiting the unique properties of and similarities among Indian languages. With this goal, we organized a low-resource Automatic Speech Recognition challenge for Indian languages as part of Interspeech 2018. We released 50 hours of speech data with transcriptions each for Tamil, Telugu and Gujarati, amounting to a total of 150 hours. Participants were required to use only the data we released for the challenge, to preserve the low-resource setting; however, they were not restricted to working on any particular aspect of the speech recognizer. We received 109 submissions from 18 research groups and evaluated the systems in terms of Word Error Rate on a blind test set. In this paper, we summarize the data, approaches and results of the challenge.
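Word Error Rate, the ranking metric mentioned above, is the word-level Levenshtein distance (substitutions, insertions and deletions) normalised by the reference length. A minimal sketch of the standard definition, not code released by the challenge:

```python
# Word Error Rate via dynamic-programming edit distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a b c d", "a x c"))  # 0.5: one substitution + one deletion over 4 words
```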

38 citations


Cites methods from "A Unified Parser for Developing Ind..."

  • ...For each language, we released two pronunciation lexicons created using the Festvox Indic frontend [9], which used a phoneset similar to SAMPA and IIT Madras’ Common Label Set [10] which used an IPA-based phoneset....


Proceedings Article DOI
20 Sep 2019
TL;DR: Subjective evaluations indicate that reasonably good-quality Indic TTSes can be developed using both approaches, emphasising the need to incorporate multilingual text processing in the end-to-end framework.
Abstract: Building text-to-speech (TTS) synthesisers is a difficult task, especially for low-resource languages. Language-specific modules need to be developed for system building. End-to-end speech synthesis has become a popular paradigm, as a TTS can be trained using only ⟨text, audio⟩ pairs. However, end-to-end speech synthesis is not scalable in a multi-language scenario, as the vocabulary increases with the number of different scripts. In this paper, TTSes are trained for Indian languages using two text representations: character-based and phone-based. For the character-based approach, a multi-language character map (MLCM) is proposed to easily train Indic speech synthesisers. The phone-based approach uses the common label set (CLS) representation for Indian languages. Both approaches leverage the similarities that exist among the languages. The advantage is a compact representation across multiple languages. Experiments are conducted by building TTSes using monolingual data and by pooling data across two languages. The ability to synthesise code-mixed text using the phone-based approach is also assessed. Subjective evaluations indicate that reasonably good-quality Indic TTSes can be developed using both approaches. This emphasises the need to incorporate multilingual text processing in the end-to-end framework.
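The "compact representation across multiple languages" can be sketched by exploiting the parallel layout of the Indic Unicode blocks (inherited from ISCII): a character can be replaced by its offset within its script block, so one shared symbol table serves all scripts. The block ranges below are real Unicode values, but the mapping itself is an illustrative assumption, not the paper's MLCM:

```python
# Sketch: collapse characters from parallel Indic Unicode blocks onto
# one shared ID space (character offset within its script block).

BLOCK_STARTS = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "tamil":      0x0B80,
    "telugu":     0x0C00,
}

def to_shared_symbol(ch: str):
    """Map a character to its offset within its Indic block (shared ID)."""
    cp = ord(ch)
    for start in BLOCK_STARTS.values():
        if start <= cp < start + 0x80:  # each block spans 128 code points
            return cp - start
    return None  # not an Indic character (space, punctuation, ...)

# आ (Devanagari), ஆ (Tamil) and আ (Bengali) collapse to the same ID:
print([to_shared_symbol(c) for c in "आஆআ"])  # [6, 6, 6]
```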

25 citations


Cites methods from "A Unified Parser for Developing Ind..."

  • ...To obtain the phone-based representation from the input text, the unified parser for Indian languages is used [15]....


  • ...A unified parser is used to convert words in Indian languages to CLS representation [15]....


Proceedings Article DOI
06 Jun 2021
TL;DR: In this article, the authors explore the benefits of representing similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS).
Abstract: In many Indian languages, written characters are organized on sound phonetic principles, and the ordering of characters is the same across many of them. However, while training conventional end-to-end (E2E) multilingual speech recognition systems, we treat characters or target subword units from different languages as separate entities. Since the visual rendering of these characters is different, in this paper, we explore the benefits of representing such similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS). The CLS can be created very easily using automatic methods, since the ordering of characters is the same in many Indian languages. E2E models are trained using a transformer-based encoder-decoder architecture. During testing, given the Mel-filterbank features as input, the system outputs a sequence of BPE units in the CLS representation. Depending on the language, we then map the recognized CLS units back to the language-specific grapheme representation. Results show that models trained using CLS improve over the monolingual baseline and over a multilingual framework with separate symbols for each language. Similar experiments on a subset of the Voxforge dataset also confirm the benefits of CLS. An extension of this idea is to decode an unseen language (zero-resource) using a CLS-trained model.
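The final step described above, mapping recognised CLS units back to language-specific graphemes, might look as follows under an offset-based view of the parallel Indic Unicode blocks (an illustration; the paper's automatically created CLS may differ in detail):

```python
# Sketch: render decoded CLS units (block offsets) in a target script.

BLOCK_STARTS = {"devanagari": 0x0900, "tamil": 0x0B80, "telugu": 0x0C00}

def cls_to_grapheme(offset: int, language: str) -> str:
    """Render one CLS unit (block offset) in the requested script."""
    return chr(BLOCK_STARTS[language] + offset)

# The same decoded CLS sequence rendered in two scripts:
decoded = [0x15, 0x2E]  # hypothetical CLS offsets (here: ka, ma)
for lang in ("devanagari", "tamil"):
    print(lang, "".join(cls_to_grapheme(o, lang) for o in decoded))
# devanagari कम
# tamil கம
```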

16 citations

Proceedings Article DOI
02 Sep 2018
TL;DR: Sub-space Gaussian mixture models and recurrent neural networks trained with the connectionist temporal classification (CTC) objective function are explored for training joint acoustic models; the joint acoustic model trained with RNN-CTC performed better than monolingual models, owing to efficient data sharing across the languages.
Abstract: India being a multilingual society, a multilingual automatic speech recognition (ASR) system is widely appreciated. Despite different orthographies, Indian languages share the same phonetic space. To exploit this property, a joint acoustic model has been trained for developing a multilingual ASR system using a common phone-set. Three Indian languages, namely Telugu, Tamil and Gujarati, are considered for the study. This work studies the amenability of two different acoustic modeling approaches for training a joint acoustic model using the common phone-set. Sub-space Gaussian mixture models (SGMM) and recurrent neural networks (RNN) trained with the connectionist temporal classification (CTC) objective function are explored for training joint acoustic models. From the experimental results, it can be observed that the joint acoustic models trained with RNN-CTC performed better than the SGMM system, even on 120 hours of data (approximately 40 hours per language). The joint acoustic model trained with RNN-CTC also performed better than monolingual models, owing to efficient data sharing across the languages. Conditioning the joint model on language identity had a minimal advantage. Sub-sampling the features by a factor of 2 while training RNN-CTC models reduced training times and improved performance.
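The data sharing that a common phone-set enables can be sketched as lexicon pooling: identical (word, phone-sequence) entries from different languages collapse into one, and a single CTC output layer is sized from the union of phones. The lexicon entries below are invented placeholders, not the paper's data:

```python
# Sketch: pool per-language lexicons over a common phone-set and size
# a single CTC output layer from the union of phones.

lexicons = {
    "telugu":   {"amma": ["a", "m", "m", "a"]},
    "tamil":    {"amma": ["a", "m", "m", "a"]},
    "gujarati": {"ghar": ["gh", "a", "r"]},
}

# Identical (word, phone-sequence) entries collapse into one, which is
# exactly how data is shared across the languages.
joint_lexicon = {}
for lang, lex in lexicons.items():
    for word, phones in lex.items():
        joint_lexicon.setdefault((word, tuple(phones)), []).append(lang)

phone_set = sorted({p for (_, phones) in joint_lexicon for p in phones})
print(len(phone_set) + 1)  # CTC output dimension: phones + blank
print(joint_lexicon)       # 'amma' is shared by Telugu and Tamil
```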

15 citations


Cites background or methods from "A Unified Parser for Developing Ind..."

  • ...The transcriptions from training utterances in IT3-format have been used to train a trigram language model....


  • ...A parser to convert utf8 to IT3 [29] has been used to convert the text to the IT3-format [7]....


  • ...As Indian languages are syllabic in nature, the pronunciation models could be generated from a simple rule-based parser [6, 7, 8, 9]....


  • ...The pronunciation model contains unique words from all the three languages in IT3-format and the corresponding phone sequences....


  • ...IT3-format or any other language-independent mapping which could map the words in different languages with the same phone sequence as a single entity would be more beneficial in training a multilingual ASR....

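The excerpts above mention training a trigram language model on IT3-format transcriptions. A minimal count-based sketch of that step (real systems would add smoothing, e.g. Kneser-Ney via a toolkit such as SRILM or KenLM); the romanised transcriptions are hypothetical:

```python
# Sketch: collect trigram counts from transcriptions, the raw material
# for a count-based trigram language model.

from collections import Counter

def trigram_counts(transcriptions: list[str]) -> Counter:
    counts = Counter()
    for line in transcriptions:
        words = ["<s>", "<s>"] + line.split() + ["</s>"]
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

# Hypothetical IT3-like romanised transcriptions, for illustration only.
counts = trigram_counts(["naan viittukku pookireen", "naan viittukku vandheen"])
print(counts[("<s>", "naan", "viittukku")])  # 2
```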

Proceedings Article DOI
20 Aug 2017
TL;DR: This paper capitalises on the ability of robust acoustic modeling techniques such as deep neural networks (DNN) and convolutional deep neural networks (CNN) for acoustic modeling, and uses signal processing cues to correct the segment boundaries obtained using DNN-HMM/CNN-HMM segmentation.
Abstract: Automatic detection of phoneme boundaries is an important sub-task in building speech processing applications, especially text-to-speech synthesis (TTS) systems. The main drawback of the Gaussian mixture model hidden Markov model (GMM-HMM) based forced-alignment is that the phoneme boundaries are not explicitly modeled. In an earlier work, we had proposed the use of signal processing cues in tandem with GMM-HMM based forced alignment for boundary correction for building Indian language TTS systems. In this paper, we capitalise on the ability of robust acoustic modeling techniques such as deep neural networks (DNN) and convolutional deep neural networks (CNN) for acoustic modeling. The GMM-HMM based forced alignment is replaced by DNN-HMM/CNN-HMM based forced alignment. Signal processing cues are used to correct the segment boundaries obtained using DNN-HMM/CNN-HMM segmentation. TTS systems built using these boundaries show a relative improvement in synthesis quality.
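The boundary-correction idea, moving a forced-alignment boundary to the strongest spectral-change point in a small window around it, can be sketched as follows. Frame-to-frame spectral flux is used here as one plausible cue; the paper employs its own signal processing cues:

```python
# Sketch: snap a forced-alignment boundary to the nearest spectral-flux
# peak within a small search window.

import numpy as np

def correct_boundary(spectrogram: np.ndarray, frame: int, window: int = 5) -> int:
    """spectrogram: (num_frames, num_bins). Returns the corrected frame index."""
    # Half-wave-rectified frame-to-frame spectral flux.
    flux = np.sum(np.maximum(np.diff(spectrogram, axis=0), 0.0) ** 2, axis=1)
    lo = max(frame - window, 0)
    hi = min(frame + window, len(flux))
    return lo + int(np.argmax(flux[lo:hi]))

# Usage: refine each HMM boundary with the local spectral-flux peak.
spec = np.abs(np.random.randn(100, 40))  # stand-in for a real spectrogram
print(correct_boundary(spec, frame=50))
```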

14 citations


Cites methods from "A Unified Parser for Developing Ind..."

  • ...For grapheme to phoneme conversion of the native text, a unified parser for Indian languages is used [17]....


References
Journal Article DOI
TL;DR: This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems and describes one such system, ID3, in detail; a reported shortcoming of the basic algorithm is also discussed.
Abstract: The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
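The core of ID3 as summarised above is the choice, at each node, of the attribute with maximal information gain (entropy reduction). A compact, self-contained sketch of that criterion:

```python
# Sketch: entropy and information gain, the attribute-selection
# criterion at the heart of ID3.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """rows: list of dicts; attr: key to split on."""
    total = entropy(labels)
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
labels = ["no", "no", "yes"]
print(information_gain(rows, labels, "outlook"))  # ~0.918: a perfect split here
```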

17,177 citations

Proceedings Article DOI
05 Jun 2000
TL;DR: A speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a spectral parameter vector and its dynamic feature vectors, is derived.
Abstract: This paper derives a speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a spectral parameter vector and its dynamic feature vectors. In the algorithm, we assume that the state sequence (state and mixture sequence for the multi-mixture case) or a part of the state sequence is unobservable (i.e., hidden or latent). As a result, the algorithm iterates the forward-backward algorithm and the parameter generation algorithm for the case where the state sequence is given. Experimental results show that by using the algorithm, we can reproduce clear formant structure from multi-mixture HMMs as compared with that produced from single-mixture HMMs.
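For the case where the state sequence is given, the generated trajectory is the solution of a weighted least-squares problem: the static parameters c satisfy (WᵀU⁻¹W)c = WᵀU⁻¹μ, where W stacks the identity and the delta window and U is the diagonal covariance of the static and dynamic features. A one-dimensional numpy sketch of this step (the hidden-state case adds the forward-backward iteration described above); the delta window used here is one common choice:

```python
# Sketch: speech parameter generation given per-frame static/delta
# means and variances (1-D features for brevity).

import numpy as np

def mlpg(mu_static, mu_delta, var_static, var_delta):
    T = len(mu_static)
    I = np.eye(T)
    D = np.zeros((T, T))  # delta window: delta[t] = 0.5*(c[t+1] - c[t-1])
    for t in range(T):
        if t > 0:
            D[t, t - 1] = -0.5
        if t < T - 1:
            D[t, t + 1] = 0.5
    W = np.vstack([I, D])                                    # (2T, T)
    U_inv = np.diag(np.concatenate([1 / var_static, 1 / var_delta]))
    mu = np.concatenate([mu_static, mu_delta])
    # Solve (W' U^-1 W) c = W' U^-1 mu for the static trajectory c.
    return np.linalg.solve(W.T @ U_inv @ W, W.T @ U_inv @ mu)

# Two "states": the delta constraints pull the trajectory into a smooth transition.
c = mlpg(np.array([0., 0., 1., 1.]), np.zeros(4), np.full(4, 1.0), np.full(4, 0.1))
print(c.round(3))
```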

1,071 citations


"A Unified Parser for Developing Ind..." refers methods in this paper

  • ...Hidden Markov model (HMM), a statistical parametric based approach which is found effective in synthesizing speech, is employed here [12]....


Proceedings Article
01 May 2000
TL;DR: An outline of the LinGO English grammar and LKB system is given, and the ways in which they are currently being used are discussed; the technology supports collaborative development on many levels.
Abstract: The LinGO (Linguistic Grammars Online) project’s English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see http://lingo.stanford.edu). Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels.

307 citations


"A Unified Parser for Developing Ind..." refers background in this paper

  • ...Parsers that work for more than one language focus on structurally related languages such as English and French or English and German [1]....


Book
01 Oct 1992

285 citations


"A Unified Parser for Developing Ind..." refers methods in this paper

  • ...Lex and Yacc [4] stand in good stead for building rule-based language parsers, as these employ a rule-based method for token matching....


01 Jan 2013
TL;DR: A uniform HMM framework for building speech synthesisers is proposed; the common phoneset and common question set are used to build HTS-based systems for six Indian languages, namely Hindi, Marathi, Bengali, Tamil, Telugu and Malayalam.
Abstract: State-of-the-art approaches to speech synthesis are unit selection based concatenative speech synthesis (USS) and hidden Markov model based text-to-speech synthesis (HTS). The former is based on waveform concatenation of subword units, while the latter is based on generation of an optimal parameter sequence from subword HMMs. The quality of an HMM-based synthesiser in the HTS framework crucially depends on an accurate description of the phoneset and an accurate description of the question set for clustering of the phones. Given the number of Indian languages, building an HTS system for every language is time consuming. Exploiting the properties of Indian languages, a uniform HMM framework for building speech synthesisers is proposed. Apart from the speech and text data used, the tasks involved in building a synthesis system can be made language-independent. A language-independent common phone set is first derived; sounds that are similar share similar articulatory descriptions. The common phoneset and common question set are used to build HTS-based systems for six Indian languages, namely Hindi, Marathi, Bengali, Tamil, Telugu and Malayalam. Mean opinion score (MOS) is used to evaluate the systems. An average MOS of 3.0 for naturalness and 3.4 for intelligibility is obtained across all languages.
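The language-independent question set mentioned above can, in principle, be generated mechanically from articulatory classes of the common phone set. A sketch in the spirit of HTS question files; the class names, members and pattern templates are illustrative assumptions, not the paper's actual inventory:

```python
# Sketch: emit decision-tree questions from articulatory classes of a
# common phone set (HTS-style QS lines; patterns simplified).

CLASSES = {
    "nasal":      ["m", "n", "ng"],
    "retroflex":  ["tx", "dx", "nx"],
    "long_vowel": ["aa", "ii", "uu"],
}

def make_questions(classes: dict) -> list:
    """One question per (context position, articulatory class)."""
    questions = []
    for name, phones in classes.items():
        for pos, pattern in (("L", "%s-*"), ("C", "*-%s+*"), ("R", "*+%s")):
            alts = ",".join(pattern % p for p in phones)
            questions.append('QS "%s_%s" {%s}' % (pos, name, alts))
    return questions

for q in make_questions(CLASSES)[:3]:
    print(q)
```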

74 citations


"A Unified Parser for Developing Ind..." refers background in this paper

  • ...The acoustic similarity among the same set of phones of different languages suggests the possibility of a compact and common set of labels [10,11]....


  • ...The notations of labels and rules for mapping are detailed in [10]....
