scispace - formally typeset
Author

Arup Saha

Bio: Arup Saha is an academic researcher from the Centre for Development of Advanced Computing. The author has contributed to research in topics: Speech synthesis & Assamese. The author has an h-index of 2 and has co-authored 6 publications receiving 48 citations.
Topics: Speech synthesis, Assamese, Syllable, Semivowel, Tamil

Papers
Proceedings ArticleDOI
01 Nov 2013
TL;DR: A consortium effort on building text-to-speech (TTS) systems for 13 Indian languages within the same common syllable-based framework; the TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER).
Abstract: In this paper, we discuss a consortium effort on building text to speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely, speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus of each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semi-automatic labeling tool. Text to speech synthesizers are built for all 13 languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia and Bodo, using the same common framework. The TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS score of ≈3.0 and an average WER of about 20% is observed across all the languages.

42 citations
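The WER figure reported in this paper is the standard word-level edit-distance metric. A minimal sketch of how such a score is computed (the function name is illustrative, not from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via a word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

In a TTS evaluation, the reference is the synthesized sentence's text and the hypothesis is what listeners transcribed; the ≈20% average WER above is this quantity averaged over test sentences.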

Proceedings Article
01 Jan 2010
TL;DR: In this article, a detailed analysis is conducted for sentence-medial pauses for readout speech of Bangla, and linear models with variables of syntactic unit length and distance to directly modifying word are constructed for pause occurrence and duration.
Abstract: Control of pause occurrence and duration is an important issue for text-to-speech synthesis systems. In text-readout speech, pauses occur unconditionally at sentence boundaries and with high probability at major syntactic boundaries such as clause boundaries, but more or less arbitrarily at minor syntactic boundaries. Pause duration tends to be longer at the end of a longer syntactic unit. A detailed analysis is conducted for sentence-medial pauses for readout speech of Bangla. Based on the results, linear models (with variables of syntactic unit length and distance to directly modifying word) are constructed for pause occurrence and duration. The models are evaluated using the test data not included in the analyzed data (open-test condition). The results show that the proposed models can predict occurrence probability for 87% of phrase boundaries correctly, and pause duration within ±100 ms for 80% of the cases.

8 citations
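The paper's pause-duration model is linear in syntactic features. A minimal sketch of fitting such a model by ordinary least squares; the feature values and durations below are made up for illustration and are not taken from the Bangla data:

```python
import numpy as np

# Hypothetical training rows: (syntactic unit length in syllables,
# distance to the directly modifying word) -> pause duration in ms.
X = np.array([[4, 1], [8, 2], [12, 3], [6, 1], [10, 4], [14, 2]], dtype=float)
y = np.array([90.0, 150.0, 220.0, 120.0, 200.0, 240.0])

# Ordinary least squares with an intercept column appended.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_pause_ms(unit_length: float, distance: float) -> float:
    """Predicted pause duration (ms) at one phrase boundary."""
    return coef[0] * unit_length + coef[1] * distance + coef[2]
```

A parallel model (e.g. logistic rather than linear) would handle pause *occurrence*, since that is a probability rather than a duration; the paper evaluates both under an open-test condition.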

Proceedings ArticleDOI
01 Nov 2013
TL;DR: Because it is very hard to evaluate a prosody model in an objective way, a perceptual comparison method is adopted in this work for prosody model evaluation.
Abstract: In speech synthesis the role of prosody is crucial. To make synthesized speech more natural and soothing to the human ear, various prosody and intonation models, together with emotional models, have been experimented with over the last few decades. Apart from segmental quality and voice characteristics, the naturalness of any TTS system depends mostly on the quality of its prosody model. But as it is very hard to evaluate a prosody model in an objective way, a perceptual comparison method is adopted in this work to evaluate the prosody model.

2 citations

Proceedings ArticleDOI
29 Aug 2018
TL;DR: The main aim was to develop and compare efficient text-independent Bengali speaker recognition systems that achieve good accuracy (greater than 90%) with no more than 10 minutes of speech data per speaker and that produce results without long delays.
Abstract: Speaker recognition is the collective name for the problems of identifying a person or a set of persons by his or her voice. Variation in speaking style across languages can make speaker recognition a difficult task. In this paper, the main aim was to develop and compare different efficient text-independent Bengali speaker recognition systems that can give good rates of accuracy (greater than 90%) with not more than 10 minutes of speech data available for each speaker and can produce results without long delays. The experiments were carried out using the SHRUTI Bengali speech database and validated using the TED-EX database. We have also analyzed different features of a Bengali speaker using the GMM-UBM framework, Joint Factor Analysis, i-vectors, CNNs and RNNs. Elaborate comparisons and classifications are carried out based on training durations and languages spoken by the speakers.

1 citation
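The GMM-UBM framework compared above scores a test utterance by the log-likelihood ratio between a speaker model and a universal background model (UBM). A toy sketch with scikit-learn, where random vectors stand in for real MFCC frames and, for brevity, the speaker model is trained directly on enrollment frames rather than MAP-adapted from the UBM as a full system would do:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for 13-dimensional MFCC frames: UBM data pooled over many
# speakers, plus enrollment data for one target speaker (shifted mean).
ubm_frames = rng.normal(0.0, 1.0, size=(2000, 13))
target_frames = rng.normal(1.5, 1.0, size=(400, 13))

ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_frames)
speaker = GaussianMixture(n_components=4, random_state=0).fit(target_frames)

def llr_score(frames: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return float(speaker.score(frames) - ubm.score(frames))

# A genuine trial (same distribution as the target speaker) should
# score higher than an impostor trial (background distribution).
genuine = llr_score(rng.normal(1.5, 1.0, size=(300, 13)))
impostor = llr_score(rng.normal(0.0, 1.0, size=(300, 13)))
```

Accepting a trial when the ratio exceeds a tuned threshold gives the verification decision; i-vector and neural approaches in the paper replace this generative scoring with learned embeddings.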

Book ChapterDOI
09 Mar 2011
TL;DR: An automated approach for providing naturalness in synthesized speech by deriving intonation rules through analysis of a large number of sentences spoken by common people is described.
Abstract: To provide naturalness in synthesized speech it is imperative to apply appropriate intonation to the synthesized sentences. The problem is not with the synthesis engines but with the fact that comprehensive rules of natural intonation are not available for any of the major spoken languages of India. The knowledge available in this area is primarily subjective, with the risk of unintentional personal bias. It also lacks plurality in the sense that it does not reflect the natural intonation of common people. It is therefore imperative to derive intonation rules through analysis of a large number of sentences spoken by common people. Manual processing is time-consuming and extremely cumbersome. The present paper briefly describes an automated approach for such a task. A pilot study on about 1000 complex and interrogative sentences spoken by five female and four male native speakers is presented. 93% accuracy is obtained for the desired objective.

Cited by
Journal ArticleDOI
20 Jul 2017
TL;DR: This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration, using a text to speech synthesizer to say the word in a generic voice, and then using voice conversion to convert it into a voice that matches the narration.
Abstract: Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.

61 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: From the subjective and objective evaluations, it is observed that Viterbi-based and STM with PLPCC-based segmentation algorithms work better than other algorithms.
Abstract: In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is attempted. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC) and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC), for the phonetic segmentation task. To evaluate the effectiveness of these segmentation algorithms, we require accurate manually labeled phoneme-level data, which is not available for low-resourced languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the Viterbi-based and STM with PLPCC-based segmentation algorithms work better than the other algorithms.

22 citations
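A spectral transition measure quantifies how quickly cepstral features are changing; frames where it peaks are candidate phone boundaries. A rough sketch of one common regression-based formulation, applied to a toy cepstral track (this is an illustrative variant, not necessarily the paper's exact definition):

```python
import numpy as np

def spectral_transition_measure(cepstra: np.ndarray, k: int = 2) -> np.ndarray:
    """STM per frame: mean squared slope of each cepstral coefficient
    over a (2k+1)-frame linear-regression window."""
    n_frames, _ = cepstra.shape
    taps = np.arange(-k, k + 1, dtype=float)
    denom = np.sum(taps ** 2)
    stm = np.zeros(n_frames)
    for t in range(k, n_frames - k):
        window = cepstra[t - k: t + k + 1]   # (2k+1, n_dims)
        slopes = taps @ window / denom       # per-dimension delta coefficient
        stm[t] = np.mean(slopes ** 2)
    return stm

# Toy track: constant cepstra with an abrupt change at frame 10 --
# the STM should peak at the transition.
track = np.vstack([np.zeros((10, 5)), np.ones((10, 5))])
stm = spectral_transition_measure(track)
```

Picking local maxima of this curve (optionally above a threshold) yields the segmentation hypotheses that the paper compares against Viterbi forced alignment.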

Journal ArticleDOI
TL;DR: A review of the contributions made by different researchers in the field of Indian language speech synthesis along with a study on the Indian language characteristics and the associated challenges in designing TTS systems are provided.
Abstract: Text-to-speech technology has achieved significant progress during the past decade and is an active area of research and development for human–computer interactive systems. Even though a number of speech synthesis models are available for different languages, focusing on domain requirements with many motivating applications, a source of information on current trends in Indian language speech synthesis has been unavailable to date, making it difficult for beginners to initiate research on the development of TTS systems for low-resourced languages. This paper provides a review of the contributions made by different researchers in the field of Indian language speech synthesis, along with a study of Indian language characteristics and the associated challenges in designing TTS systems. The applications and tools resulting from projects undertaken by different organizations, along with possible future developments, are also discussed to provide a single reference to an important strand of research in speech synthesis which may benefit anyone interested in initiating research in this area.

19 citations

16 Sep 2016
TL;DR: The IRISA unit selection-based TTS system implemented for the Blizzard Challenge 2016 searches with an A* algorithm using preselection filters to reduce the search space, and a fuzzy function relaxes a concatenation penalty based on the concatenation quality with respect to the cost distribution.
Abstract: This paper describes the implementation of the IRISA unit selection-based TTS system for our participation in the Blizzard Challenge 2016. We describe the process followed to build the voices from the given data and the architecture of our system. The search is based on an A* algorithm with preselection filters used to reduce the search space. A penalty is introduced in the concatenation cost to block some concatenations based on their phonological class. Moreover, a fuzzy function is used to relax this penalty based on the concatenation quality with respect to the cost distribution.

16 citations

Proceedings ArticleDOI
01 Aug 2015
TL;DR: A framework for story classification using keyword and part-of-speech (POS) based features is proposed; the main part of the story has the highest classification accuracy compared to the introduction and climax parts.
Abstract: The main objective of this work is to classify Hindi and Telugu stories based on their structure into three genres: fable, folk-tale and legend. In this work, each story is divided into three parts: (i) introduction, (ii) main and (iii) climax. The objective is to explore how story genre information is embedded in different parts of the story. We propose a framework for story classification using keyword and part-of-speech (POS) based features. Keyword-based features such as term frequency (TF) and term frequency–inverse document frequency (TF-IDF) are used. Classification performance is analyzed for different story parts using various combinations of features with three classifiers: (i) Naive Bayes (NB), (ii) k-nearest neighbour (KNN) and (iii) support vector machine (SVM). From the experimental studies, it has been observed that classification performance is not significantly improved by combining linguistic (POS) and keyword-based features. Among the classifiers, SVM outperformed the others. The main part of the story has the highest classification accuracy compared to the introduction and climax parts of the story.

12 citations
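The TF-IDF-plus-SVM pipeline that performed best in this paper is straightforward to sketch with scikit-learn. The tiny English corpus below is made up purely for illustration; the actual experiments used Hindi and Telugu stories split into introduction, main and climax parts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up snippets standing in for story parts, one label per genre.
texts = [
    "the clever fox tricked the crow and stole the cheese",
    "the greedy dog dropped his bone chasing a reflection",
    "the old king divided his kingdom among three daughters",
    "a poor farmer found a goose that laid golden eggs",
    "the saint blessed the village and the river never dried",
    "a temple rose overnight where the hero fell in battle",
]
genres = ["fable", "fable", "folk-tale", "folk-tale", "legend", "legend"]

# TF-IDF keyword features feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, genres)

prediction = model.predict(["the fox and the crow argued over cheese"])[0]
```

Training one such model per story part (introduction, main, climax) and comparing held-out accuracy is how the paper locates where genre information is concentrated.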