scispace - formally typeset
Author

Arup Saha

Bio: Arup Saha is an academic researcher from the Centre for Development of Advanced Computing. The author has contributed to research in topics: Speech synthesis & Assamese. The author has an h-index of 2 and has co-authored 6 publications receiving 48 citations.
Topics: Speech synthesis, Assamese, Syllable, Semivowel, Tamil

Papers
Proceedings ArticleDOI
01 Nov 2013
TL;DR: A consortium effort on building text-to-speech (TTS) systems for 13 Indian languages within the same common syllable-based framework; the TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER).
Abstract: In this paper, we discuss a consortium effort on building text to speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore required for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely, speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus of each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semi-automatic labeling tool. Text to speech synthesizers are built for all 13 languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia and Bodo, using the same common framework. The TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS score of ≈3.0 and an average WER of about 20% is observed across all the languages.

42 citations
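The WER figure reported in this paper is the standard word-level edit-distance metric. A minimal sketch of how such a score is computed (the function name is illustrative, not from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via a word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

In a TTS evaluation, the reference is the synthesized sentence's text and the hypothesis is what listeners transcribed; the ≈20% average WER above is this quantity averaged over test sentences.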

Proceedings Article
01 Jan 2010
TL;DR: In this article, a detailed analysis is conducted for sentence-medial pauses for readout speech of Bangla, and linear models with variables of syntactic unit length and distance to directly modifying word are constructed for pause occurrence and duration.
Abstract: Control of pause occurrence and duration is an important issue for text-to-speech synthesis systems. In text-readout speech, pauses occur unconditionally at sentence boundaries and with high probability at major syntactic boundaries such as clause boundaries, but more or less arbitrarily at minor syntactic boundaries. Pause duration tends to be longer at the end of a longer syntactic unit. A detailed analysis is conducted for sentence-medial pauses for readout speech of Bangla. Based on the results, linear models (with variables of syntactic unit length and distance to directly modifying word) are constructed for pause occurrence and duration. The models are evaluated using the test data not included in the analyzed data (open-test condition). The results show that the proposed models can predict occurrence probability for 87% of phrase boundaries correctly, and pause duration within ±100 ms for 80% of the cases.

8 citations
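The paper's pause-duration model is linear in syntactic features. A minimal sketch of fitting such a model by ordinary least squares; the feature values and durations below are made up for illustration and are not taken from the Bangla data:

```python
import numpy as np

# Hypothetical training rows: (syntactic unit length in syllables,
# distance to the directly modifying word) -> pause duration in ms.
X = np.array([[4, 1], [8, 2], [12, 3], [6, 1], [10, 4], [14, 2]], dtype=float)
y = np.array([90.0, 150.0, 220.0, 120.0, 200.0, 240.0])

# Ordinary least squares with an intercept column appended.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_pause_ms(unit_length: float, distance: float) -> float:
    """Predicted pause duration (ms) at one phrase boundary."""
    return coef[0] * unit_length + coef[1] * distance + coef[2]
```

A parallel model (e.g. logistic rather than linear) would handle pause *occurrence*, since that is a probability rather than a duration; the paper evaluates both under an open-test condition.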

Proceedings ArticleDOI
01 Nov 2013
TL;DR: Because it is very hard to evaluate a prosody model in an objective way, a perceptual comparison method is adopted in this work for prosody model evaluation.
Abstract: In speech synthesis the role of prosody is crucial. To make synthesized speech more natural and soothing to the human ear, various prosody and intonation models, together with emotional models, have been experimented with over the last few decades. Apart from segmental quality and voice characteristics, the naturalness of any TTS system depends mostly on the quality of its prosody model. But as it is very hard to evaluate a prosody model in an objective way, a perceptual comparison method is adopted in this work to evaluate the prosody model.

2 citations

Proceedings ArticleDOI
29 Aug 2018
TL;DR: The main aim was to develop and compare efficient text-independent Bengali speaker recognition systems that achieve good accuracy (greater than 90%) with no more than 10 minutes of speech data per speaker and that produce results without long delays.
Abstract: Speaker recognition is the collective name for the problems of identifying a person or a set of persons by his or her voice. Variation in speaking style across languages can make speaker recognition a difficult task. In this paper, the main aim was to develop and compare different efficient text-independent Bengali speaker recognition systems that can give good rates of accuracy (greater than 90%) with not more than 10 minutes of speech data available for each speaker and can produce results without long delays. The experiments were carried out using the SHRUTI Bengali speech database and validated using the TED-EX database. We have also analyzed different features of a Bengali speaker using the GMM-UBM framework, Joint Factor Analysis, i-vectors, CNNs and RNNs. Elaborate comparisons and classifications are carried out based on training durations and languages spoken by the speakers.

1 citation
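The GMM-UBM framework compared above scores a test utterance by the log-likelihood ratio between a speaker model and a universal background model (UBM). A toy sketch with scikit-learn, where random vectors stand in for real MFCC frames and, for brevity, the speaker model is trained directly on enrollment frames rather than MAP-adapted from the UBM as a full system would do:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for 13-dimensional MFCC frames: UBM data pooled over many
# speakers, plus enrollment data for one target speaker (shifted mean).
ubm_frames = rng.normal(0.0, 1.0, size=(2000, 13))
target_frames = rng.normal(1.5, 1.0, size=(400, 13))

ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_frames)
speaker = GaussianMixture(n_components=4, random_state=0).fit(target_frames)

def llr_score(frames: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return float(speaker.score(frames) - ubm.score(frames))

# A genuine trial (same distribution as the target speaker) should
# score higher than an impostor trial (background distribution).
genuine = llr_score(rng.normal(1.5, 1.0, size=(300, 13)))
impostor = llr_score(rng.normal(0.0, 1.0, size=(300, 13)))
```

Accepting a trial when the ratio exceeds a tuned threshold gives the verification decision; i-vector and neural approaches in the paper replace this generative scoring with learned embeddings.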

Book ChapterDOI
09 Mar 2011
TL;DR: An automated approach for providing naturalness in synthesized speech by deriving intonation rules through analysis of a large number of sentences spoken by common people is described.
Abstract: To provide naturalness in synthesized speech it is imperative to apply appropriate intonation to the synthesized sentences. The problem is not with the synthesis engines but with the fact that comprehensive rules of natural intonation are not available for any of the major spoken languages of India. The knowledge available in this area is primarily subjective, with the risk of unintentional personal bias. It also lacks plurality in the sense that it does not reflect the natural intonation of common people. It is therefore imperative to derive intonation rules through analysis of a large number of sentences spoken by common people. Manual processing is time-consuming and extremely cumbersome. The present paper briefly describes an automated approach for such a task. A pilot study on about 1000 complex and interrogative sentences spoken by five female and four male native speakers is presented. 93% accuracy is obtained for the desired objective.

Cited by
Journal ArticleDOI
20 Jul 2017
TL;DR: This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration, using a text to speech synthesizer to say the word in a generic voice, and then using voice conversion to convert it into a voice that matches the narration.
Abstract: Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.

61 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: From the subjective and objective evaluations, it is observed that Viterbi-based and STM with PLPCC-based segmentation algorithms work better than other algorithms.
Abstract: In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is attempted. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC) and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC), for the phonetic segmentation task. To evaluate the effectiveness of these segmentation algorithms, we require accurate manually labeled phoneme-level data, which is not available for low-resourced languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the Viterbi-based and STM with PLPCC-based segmentation algorithms work better than the other algorithms.

22 citations
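A spectral transition measure quantifies how quickly cepstral features are changing; frames where it peaks are candidate phone boundaries. A rough sketch of one common regression-based formulation, applied to a toy cepstral track (this is an illustrative variant, not necessarily the paper's exact definition):

```python
import numpy as np

def spectral_transition_measure(cepstra: np.ndarray, k: int = 2) -> np.ndarray:
    """STM per frame: mean squared slope of each cepstral coefficient
    over a (2k+1)-frame linear-regression window."""
    n_frames, _ = cepstra.shape
    taps = np.arange(-k, k + 1, dtype=float)
    denom = np.sum(taps ** 2)
    stm = np.zeros(n_frames)
    for t in range(k, n_frames - k):
        window = cepstra[t - k: t + k + 1]   # (2k+1, n_dims)
        slopes = taps @ window / denom       # per-dimension delta coefficient
        stm[t] = np.mean(slopes ** 2)
    return stm

# Toy track: constant cepstra with an abrupt change at frame 10 --
# the STM should peak at the transition.
track = np.vstack([np.zeros((10, 5)), np.ones((10, 5))])
stm = spectral_transition_measure(track)
```

Picking local maxima of this curve (optionally above a threshold) yields the segmentation hypotheses that the paper compares against Viterbi forced alignment.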

Journal ArticleDOI
TL;DR: A review of the contributions made by different researchers in the field of Indian language speech synthesis along with a study on the Indian language characteristics and the associated challenges in designing TTS systems are provided.
Abstract: Text-to-speech technology has achieved significant progress during the past decade and is an active area of research and development for human–computer interactive systems. Even though a number of speech synthesis models are available for different languages, focusing on domain requirements with many motivating applications, a source of information on current trends in Indian language speech synthesis has been unavailable to date, making it difficult for beginners to initiate research on the development of TTS systems for low-resourced languages. This paper provides a review of the contributions made by different researchers in the field of Indian language speech synthesis, along with a study of Indian language characteristics and the associated challenges in designing TTS systems. The applications and tools resulting from projects undertaken by different organizations, along with possible future developments, are also discussed to provide a single reference to an important strand of research in speech synthesis which may benefit anyone interested in initiating research in this area.

19 citations

16 Sep 2016
TL;DR: The IRISA unit selection-based TTS system implemented for the Blizzard Challenge 2016 searches with an A* algorithm using preselection filters to reduce the search space, and a fuzzy function relaxes a concatenation penalty based on the concatenation quality with respect to the cost distribution.
Abstract: This paper describes the implementation of the IRISA unit selection-based TTS system for our participation in the Blizzard Challenge 2016. We describe the process followed to build the voices from the given data and the architecture of our system. The search is based on an A* algorithm with preselection filters used to reduce the search space. A penalty is introduced in the concatenation cost to block some concatenations based on their phonological class. Moreover, a fuzzy function is used to relax this penalty based on the concatenation quality with respect to the cost distribution.

16 citations

Proceedings ArticleDOI
01 Aug 2015
TL;DR: A framework for story classification using keyword and part-of-speech (POS) based features is proposed; the main part of the story has the highest classification accuracy compared to the introduction and climax parts.
Abstract: The main objective of this work is to classify Hindi and Telugu stories based on their structure into three genres: fable, folk-tale and legend. In this work, each story is divided into three parts: (i) introduction, (ii) main and (iii) climax. The objective is to explore how story genre information is embedded in different parts of the story. We propose a framework for story classification using keyword and part-of-speech (POS) based features. Keyword-based features such as term frequency (TF) and term frequency–inverse document frequency (TF-IDF) are used. Classification performance is analyzed for different story parts using various combinations of features with three classifiers: (i) Naive Bayes (NB), (ii) k-nearest neighbour (KNN) and (iii) support vector machine (SVM). From the experimental studies, it has been observed that classification performance is not significantly improved by combining linguistic (POS) and keyword-based features. Among the classifiers, SVM outperformed the others. The main part of the story has the highest classification accuracy compared to the introduction and climax parts of the story.

12 citations
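The TF-IDF-plus-SVM pipeline that performed best in this paper is straightforward to sketch with scikit-learn. The tiny English corpus below is made up purely for illustration; the actual experiments used Hindi and Telugu stories split into introduction, main and climax parts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up snippets standing in for story parts, one label per genre.
texts = [
    "the clever fox tricked the crow and stole the cheese",
    "the greedy dog dropped his bone chasing a reflection",
    "the old king divided his kingdom among three daughters",
    "a poor farmer found a goose that laid golden eggs",
    "the saint blessed the village and the river never dried",
    "a temple rose overnight where the hero fell in battle",
]
genres = ["fable", "fable", "folk-tale", "folk-tale", "legend", "legend"]

# TF-IDF keyword features feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, genres)

prediction = model.predict(["the fox and the crow argued over cheese"])[0]
```

Training one such model per story part (introduction, main, climax) and comparing held-out accuracy is how the paper locates where genre information is concentrated.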