Proceedings ArticleDOI

Using polysyllabic units for text to speech synthesis in Indian languages

TL;DR: This paper describes the design and development of Indian language Text-To-Speech (TTS) synthesis systems using polysyllabic units, which contain cluster units of more than one type (monosyllable, bisyllable and trisyllable).
Abstract: This paper describes the design and development of Indian language Text-To-Speech (TTS) synthesis systems using polysyllabic units. First, a phone-based TTS is built; later, a monosyllable cluster-unit TTS is built. It is observed that the quality of the synthesized sentences can improve if polysyllabic units are used (when the appropriate units are available), since the effects of co-articulation are preserved in that case. Hence, we built Hindi and Tamil TTS systems with polysyllabic units, which contain cluster units of more than one type (monosyllable, bisyllable and trisyllable). The system selects the best set of units during the unit selection process so as to minimize the join and concatenation costs. Preliminary listening tests indicated that the polysyllable TTS has better quality.
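The cost-minimizing selection the abstract describes can be sketched as a Viterbi-style dynamic program over candidate units. This is a minimal sketch with an invented toy inventory and cost functions; the paper's actual acoustic features and cost weights are not specified here.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Choose one candidate unit per target so that the summed target cost
    plus the summed join (concatenation) cost between neighbours is minimal."""
    n = len(targets)
    # best[i][c] = (cheapest cost of reaching candidate c at position i, backpointer)
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, n):
        layer = {}
        for c in candidates[i]:
            prev_c, prev_cost = min(
                ((p, pc + join_cost(p, c)) for p, (pc, _) in best[i - 1].items()),
                key=lambda x: x[1])
            layer[c] = (prev_cost + target_cost(targets[i], c), prev_c)
        best.append(layer)
    # Trace the cheapest path back from the final layer.
    cur = min(best[-1], key=lambda c: best[-1][c][0])
    path = [cur]
    for i in range(n - 1, 0, -1):
        cur = best[i][cur][1]
        path.append(cur)
    return list(reversed(path))

# Invented toy inventory: two recordings of "ka" and "la", one of "ma".
targets = ["ka", "ma", "la"]
cands = [["ka1", "ka2"], ["ma1"], ["la1", "la2"]]
tcost = lambda t, c: 0.0 if c.startswith(t) else 5.0
jcost = lambda a, b: 0.0 if a[-1] == b[-1] else 1.0  # prefer units from the same recording
sel = select_units(targets, cands, tcost, jcost)  # → ["ka1", "ma1", "la1"]
```

The same recursion scales to real inventories; only the two cost functions change.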
Citations
Proceedings ArticleDOI
17 Mar 2011
TL;DR: This paper describes ways to improve prosody modeling in syllable-based concatenative speech synthesis systems for two Indian languages, namely Hindi and Tamil, within the unit selection paradigm.
Abstract: This paper describes ways to improve prosody modeling in syllable-based concatenative speech synthesis systems for two Indian languages, namely Hindi and Tamil, within the unit selection paradigm. The syllable is a larger unit than the diphone and contains most of the coarticulation information. Although syllable-based synthesis is quite intelligible compared to diphone-based systems, naturalness, especially in terms of prosody, requires additional processing. Since the synthesizer is built using a cluster-unit framework, a hybrid approach is proposed in which a combination of rule-based and statistical models is used to better model the prosody of syllable-like units. It is further observed that prediction of phrase boundaries is crucial, particularly because Indian languages are replete with polysyllabic words. CART-based phrase modeling for Hindi and Tamil is discussed. Perceptual experiments show a significant improvement in the MOS for both the Hindi and Tamil synthesizers. Index Terms: speech synthesis, unit selection, cluster unit synthesis, phrase boundaries
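The CART-based phrase-boundary prediction mentioned above can be illustrated with a toy hand-built decision tree. Every feature and threshold below is invented for illustration; the trained tree from the paper is not published here.

```python
def predict_break(word, next_word):
    """Walk a tiny hand-built decision tree: does a phrase boundary follow `word`?
    (Toy features only: punctuation, utterance end, word length.)"""
    if word.endswith((",", ";", ".", "?", "!")):  # punctuation is the strongest cue
        return True
    if next_word is None:                         # end of the utterance
        return True
    # Invented rule: two long polysyllabic words in a row suggest a boundary.
    if len(word) >= 8 and len(next_word) >= 8:
        return True
    return False

def phrase_boundaries(words):
    """Mark, for each word, whether a phrase break follows it."""
    return [predict_break(w, words[i + 1] if i + 1 < len(words) else None)
            for i, w in enumerate(words)]
```

A real CART would learn such splits from labeled data; the walk through the tree at prediction time looks exactly like this chain of feature tests.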

31 citations

Journal ArticleDOI
TL;DR: The model uses a pronunciation-rule-based waveform concatenation approach to produce intelligible speech while minimizing the memory requirement; the results show that the technique outperforms the existing one.
Abstract: Speech synthesis deals with the artificial production of speech, and a text-to-speech (TTS) system in this respect converts natural language text into a spoken waveform. A number of TTS systems are available today for different languages, yet Indian languages still lag behind in providing high-quality synthesized speech. Even though almost all Indian languages share a common phonetic base, no generic model for all official Indian languages is available so far. Moreover, the existing speech synthesis techniques are found to be less effective for the scripts of Indian languages. Considering the intelligibility of speech production and the growing memory requirement of the concatenative speech synthesis technique, in this paper we propose an efficient technique for text-to-speech synthesis in Indian languages. The model uses a pronunciation-rule-based waveform concatenation approach to produce intelligible speech while minimizing the memory requirement. To show the effectiveness of the technique, the Odia (formerly Oriya), Bengali and Hindi languages are considered at an initial step of implementation. The model is compared with the existing technique, and our experimental results show that it outperforms the existing technique.

15 citations

Journal ArticleDOI
TL;DR: Emphasis is placed on exploring the usefulness of this technique in designing a TTS system for Indian languages, and some open research issues where further work may be done are highlighted.
Abstract: Speech synthesis deals with the artificial production of speech, and a Text-to-Speech (TTS) system in this respect converts natural language text into a corresponding spoken waveform. There have been sufficient successes in the area of speech and natural language processing to suggest that these technologies will continue to be a major area of research and development in creating intelligent systems. In this paper we provide an overview of TTS synthesis technology along with details of the phases involved. Emphasis is placed on exploring the usefulness of this technique in designing a TTS system for Indian languages. This paper also highlights some open research issues where further work may be done.

13 citations


Cites background from "Using polysyllabic units for text t..."

  • ...Mohanty (2011), Narendra et al. (2011), Narendra and Rao (2013), Rama et al. (2002), Talesara et al. (2013) and Vinodh et al. (2010) propose different models for speech synthesis in different Indian languages which may further be optimised to be used with mobile devices....


Proceedings ArticleDOI
01 Mar 2012
TL;DR: A detailed survey of issues in building a speech corpus for Indian languages, together with the techniques used in the database to improve the intelligibility of the synthesized speech in speech synthesis systems.
Abstract: Any spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of issues in building a speech corpus for Indian languages. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored; this speech file is the phonetic realization of the selected text file. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been carried out using each of these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis. Concatenative speech synthesis involves the concatenation of these basic units to synthesize natural-sounding speech. The speech units are augmented with further relevant information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their associated information is called the speech corpus. Techniques used in the database to improve the intelligibility of the synthesized speech in a speech synthesis system are also surveyed.

12 citations

Proceedings Article
01 Sep 2013
TL;DR: In this paper, syllables are used as the basic units in the parametric synthesiser; the quality of the resulting system is comparable to that of the phoneme-based system in terms of DMOS and WER.
Abstract: A statistical parametric speech synthesis system uses triphones, phones or full-context phones to address the problem of co-articulation. In this paper, syllables are used as the basic units in the parametric synthesiser. Conventionally, full-context phones in a Hidden Markov Model (HMM) based speech synthesis framework are modeled with a fixed number of states, because each phoneme corresponds to a single indivisible sound. A syllable, on the other hand, is made up of a sequence of one or more sounds; to accommodate this variation, a variable number of states is used to model a syllable. Although a variable number of states is required to model syllables, a syllable captures co-articulation well since it is the smallest production unit. A syllable-based speech synthesis system therefore does not require a well-designed question set. The total number of syllables in a language is quite high and not all of them can be modeled; to address this issue, a fallback unit is modeled instead. The quality of the proposed system is comparable to that of the phoneme-based system in terms of DMOS and WER.

11 citations


Cites background from "Using polysyllabic units for text t..."

  • ...The body of work presented in this paper exploits this....


References
Journal ArticleDOI
H. Sakoe1, S. Chiba1
TL;DR: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition, in which the warping function slope is restricted so as to improve discrimination between words in different categories.
Abstract: This paper reports on an optimum dynamic programming (DP) based time-normalization algorithm for spoken word recognition. First, a general principle of time-normalization is given using a time-warping function. Then, two time-normalized distance definitions, called symmetric and asymmetric forms, are derived from the principle. These two forms are compared with each other through theoretical discussions and experimental studies, and the superiority of the symmetric-form algorithm is established. A new technique, called slope constraint, is successfully introduced, in which the warping function slope is restricted so as to improve discrimination between words in different categories. The characteristics of effective slope constraints are qualitatively analyzed, and the optimum slope constraint condition is determined through experiments. The optimized algorithm is then extensively subjected to experimental comparison with various DP algorithms previously applied to spoken word recognition by different research groups. The experiments show that the present algorithm gives no more than about two-thirds of the errors of even the best conventional algorithm.
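The symmetric-form time-normalized distance can be written down in a few lines of dynamic programming. This is a minimal sketch without the slope constraint; the local distance and the toy sequences are illustrative, not the paper's speech feature vectors.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Symmetric-form DTW: D(i,j) = min over the three predecessors, with the
    diagonal step weighted twice, normalized by the total path weight n + m."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            D[i][j] = min(D[i - 1][j] + d,          # vertical step
                          D[i][j - 1] + d,          # horizontal step
                          D[i - 1][j - 1] + 2 * d)  # diagonal step, weight 2
    return D[n][m] / (n + m)  # time-normalized distance
```

Identical sequences give distance 0; a slope constraint would simply restrict which of the three predecessor moves are legal at each cell.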

5,906 citations

Book
01 Jan 1997
TL;DR: An Introduction to Text-to-Speech Synthesis is a comprehensive introduction to speech synthesis that focuses on digital signal processing, with an emphasis on the concatenative approach.
Abstract: From the Publisher: An Introduction to Text-to-Speech Synthesis is a comprehensive introduction. Dutoit treats two areas of speech synthesis: Part I of the book concerns natural language processing and the inherent problems it presents for speech synthesis; Part II focuses on digital signal processing, with an emphasis on the concatenative approach. Both parts of the text guide the reader through the material in a step-by-step, easy-to-follow way. The book treats the topic of speech synthesis from the perspective of these two different engineering approaches, and will be of interest to researchers and students in phonetics and speech communication, in both academia and industry.

524 citations

Proceedings Article
01 Jan 2003
TL;DR: Perceptual tests conducted to evaluate the quality of synthesizers with different unit sizes indicate that the syllable synthesizer performs better than the phone, diphone and half-phone synthesizers, and the half-phone synthesizer performs better than the diphone and phone synthesizers.
Abstract: In this paper, we address the issue of the choice of unit size in unit selection speech synthesis. We discuss the development of a Hindi speech synthesizer and our experiments with different choices of units: syllable, diphone, phone and half phone. Perceptual tests conducted to evaluate the quality of the synthesizers with different unit sizes indicate that the syllable synthesizer performs better than the phone, diphone and half-phone synthesizers, and the half-phone synthesizer performs better than the diphone and phone synthesizers.

106 citations

Journal ArticleDOI
TL;DR: A subband-based group delay approach to segmenting spontaneous speech into syllable-like units, using the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection.
Abstract: In the development of a syllable-centric automatic speech recognition (ASR) system, segmentation of the acoustic signal into syllabic units is an important stage. Although the short-term energy (STE) function contains useful information about syllable segment boundaries, it has to be processed before segment boundaries can be extracted. This paper presents a subband-based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection. By treating the STE function as a magnitude spectrum of an arbitrary signal, a minimum-phase group delay function is derived. This group delay function is found to be a better representative of the STE function for syllable boundary detection. Although the group delay function derived from the STE function of the speech signal contains segment boundaries, the boundaries are difficult to determine in the context of long silences, semivowels, and fricatives. In this paper, these issues are specifically addressed and algorithms are developed to improve the segmentation performance. The speech signal is first passed through a bank of three filters, corresponding to three different spectral bands. The STE functions of these signals are computed. Using these three STE functions, three minimum-phase group delay functions are derived. By combining the evidence derived from these group delay functions, the syllable boundaries are detected. Further, a multiresolution-based technique is presented to overcome the problem of shift in segment boundaries during smoothing. Experiments carried out on the Switchboard and OGI-MLTS corpora show that the error in segmentation is at most 25 milliseconds for 67% and 76.6% of the syllable segments, respectively.
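The core cepstral trick described above — treating the STE function as the magnitude spectrum of an arbitrary signal and reading off the group delay of its minimum-phase equivalent — can be sketched as follows. The function name is hypothetical, and the windowing, subband filtering, multiresolution smoothing and peak-picking steps of the paper are omitted.

```python
import numpy as np

def min_phase_group_delay(env):
    """Treat a positive envelope (e.g. a smoothed short-term energy function)
    as the magnitude spectrum of an arbitrary signal and return the group
    delay of its minimum-phase equivalent, computed via the cepstrum."""
    env = np.asarray(env, dtype=float)
    n = len(env)
    # Real cepstrum of the log "magnitude spectrum" (small floor avoids log 0).
    c = np.fft.ifft(np.log(env + 1e-12)).real
    # Causal folding turns the real cepstrum into a minimum-phase cepstrum.
    fold = np.zeros(n)
    fold[0] = c[0]
    fold[1:(n + 1) // 2] = 2.0 * c[1:(n + 1) // 2]
    if n % 2 == 0:
        fold[n // 2] = c[n // 2]
    # The FFT of the folded cepstrum is the complex log-spectrum
    # log|H| + j*phase of the minimum-phase signal.
    phase = np.fft.fft(fold).imag
    # Group delay = negative derivative of phase w.r.t. frequency.
    return -np.diff(phase)
```

For a flat envelope the group delay is identically zero; peaks and valleys of a real STE contour produce the smoothed landmarks that the boundary detector then picks.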

43 citations

Proceedings Article
01 Sep 2006
TL;DR: A new “syllable-like” speech unit that is suitable for concatenative speech synthesis is described, automatically generated using a group delay based segmentation algorithm and acoustically correspond to the form C*VC* (C: consonant, V: vowel).
Abstract: In this work we describe a new "syllable-like" speech unit that is suitable for concatenative speech synthesis. These units are automatically generated using a group delay based segmentation algorithm and acoustically correspond to the form C*VC* (C: consonant, V: vowel). The effectiveness of the unit is demonstrated by synthesizing natural-sounding speech in Tamil, a regional Indian language. Significant quality improvement is obtained if bisyllable units are also used rather than just monosyllables, with results far superior to the traditional diphone-based approach. An important advantage of this approach is the elimination of prosody rules: since F0 is part of the target cost, the unit selection procedure chooses the best unit from among the many candidates. The naturalness of the synthesized speech demonstrates the effectiveness of this approach.

32 citations