Author

S. Rajendran

Bio: S. Rajendran is an academic researcher from the Indian Institute of Technology Madras. The author has contributed to research in the topics of Speech synthesis and Speaker diarisation. The author has an h-index of 7 and has co-authored 12 publications receiving 309 citations.

Papers
Journal ArticleDOI
TL;DR: Proposes a scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker into a speech signal having the voice characteristics of the target speaker, using formants and a formant vocoder.

207 citations
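The formant-based conversion idea above can be illustrated with a minimal sketch: scale each source formant by the ratio of the target speaker's average formant frequency to the source speaker's. All frequency values below are invented placeholders, not the paper's data, and the per-formant scaling rule is an illustrative simplification rather than the paper's actual mapping.

```python
# Toy formant mapping for voice conversion: per-formant frequency
# scaling from source-speaker averages to target-speaker averages.
SRC_AVG = [500.0, 1500.0, 2500.0]   # hypothetical average F1-F3 of source (Hz)
TGT_AVG = [600.0, 1700.0, 2700.0]   # hypothetical average F1-F3 of target (Hz)

def convert_formants(frame):
    """Scale each formant in one analysis frame toward the target speaker."""
    return [f * t / s for f, s, t in zip(frame, SRC_AVG, TGT_AVG)]

# A source frame whose formants sit exactly at the source averages maps
# exactly onto the target averages.
print(convert_formants([500.0, 1500.0, 2500.0]))
```

A real system would apply such a mapping frame by frame and resynthesize with a formant vocoder; this sketch only shows the mapping step.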

Journal ArticleDOI
TL;DR: The significance of segmental and prosodic knowledge sources for developing a text-to-speech system for Indian languages is discussed.
Abstract: This paper discusses the significance of segmental and prosodic knowledge sources for developing a text-to-speech system for Indian languages. Acoustic parameters such as linear prediction coefficients, formants, pitch and gain are prestored for the basic speech sound units corresponding to the orthographic characters of Hindi. The parameters are concatenated based on the input text. These parameters are modified by stored knowledge sources corresponding to coarticulation, duration and intonation. The coarticulation rules specify the pattern of joining the basic units. The duration rules modify the inherent duration of the basic units based on the linguistic context in which the units occur. The intonation rules specify the overall pitch contour for the utterance (declination or rising contour), fall-rise patterns, resetting phenomena and inherent fundamental frequency of vowels. Appropriate pauses between syntactic units are specified to enhance intelligibility and naturalness.

27 citations
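The abstract's pipeline (prestored parameters per basic unit, then rule-based modification of duration and pitch) can be sketched as follows. The unit inventory, the phrase-final lengthening rule, and the linear declination rule are hypothetical placeholders chosen to mirror the kinds of rules the abstract names, not the paper's actual data or rule tables.

```python
# Sketch of rule-based concatenative TTS: look up prestored acoustic
# parameters for each basic unit, then apply duration and intonation rules.

# Hypothetical prestored parameters per basic unit.
UNIT_INVENTORY = {
    "ka": {"duration_ms": 120, "pitch_hz": 120.0},
    "ma": {"duration_ms": 110, "pitch_hz": 118.0},
    "la": {"duration_ms": 100, "pitch_hz": 122.0},
}

def apply_duration_rules(units):
    """Concatenate unit parameters; lengthen the phrase-final unit by 30%
    (a hypothetical context-dependent duration rule)."""
    out = []
    for i, name in enumerate(units):
        params = dict(UNIT_INVENTORY[name])
        if i == len(units) - 1:
            params["duration_ms"] = int(params["duration_ms"] * 1.3)
        out.append((name, params))
    return out

def apply_declination(sequence, drop_hz=15.0):
    """Hypothetical intonation rule: linear pitch declination across the phrase."""
    n = len(sequence)
    for i, (_, params) in enumerate(sequence):
        params["pitch_hz"] -= drop_hz * i / max(n - 1, 1)
    return sequence

seq = apply_declination(apply_duration_rules(["ka", "ma", "la"]))
for name, p in seq:
    print(name, p["duration_ms"], round(p["pitch_hz"], 1))
```

The paper's actual rules (coarticulation, fall-rise patterns, resetting, pauses at syntactic boundaries) are richer; this shows only the concatenate-then-modify structure.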

Journal ArticleDOI
TL;DR: Some features of the fundamental frequency (F0) contours of speech in Hindi are described and an approach to represent and activate this intonation knowledge for an unrestricted text-to-speech system for Hindi is proposed.

20 citations

Journal ArticleDOI
TL;DR: The results of word boundary hypothesization can be used to improve the performance of the acoustic-phonetic, lexical and syntactic modules in a speech-to-text conversion system; the behaviour of the algorithm under noisy speech input conditions and with telephone speech is also discussed.

19 citations

01 Jan 2011
TL;DR: An approach to improve the performance and usability of the Mandi Information System by using multiple decoders and contextual information is suggested.
Abstract: In this paper we describe the development of the Mandi Information System (MIS), a Telugu spoken dialogue system for obtaining price information of agricultural commodities such as vegetables, fruits, pulses and spices. The target users of MIS are primarily farmers in rural and semi-urban areas. Speech recognition is error prone, so the dialogue system must make as few errors as possible while acquiring information from a user, and must detect errors (when they cannot be corrected) and adopt appropriate recovery strategies. In this paper we suggest an approach to improve the performance and usability of the system by using multiple decoders and contextual information.

14 citations
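One simple way to combine multiple decoders with contextual information, in the spirit of the approach described above, is confidence-weighted voting with a context-dependent prior. The decoder outputs, confidence scores and commodity names below are invented for illustration; the paper's actual combination strategy is not specified in this snippet.

```python
# Sketch: combine word hypotheses from several speech decoders, then
# rescore with a prior derived from the dialogue context.
from collections import defaultdict

def combine_decoders(hypotheses, context_prior=None):
    """hypotheses: list of (word, confidence) pairs, one per decoder.
    context_prior: optional {word: weight} boost from dialogue context."""
    scores = defaultdict(float)
    for word, conf in hypotheses:
        scores[word] += conf
    if context_prior:
        for word in scores:
            scores[word] *= context_prior.get(word, 1.0)
    return max(scores, key=scores.get)

# Two decoders disagree; a context prior (the user is asking about
# vegetable prices) tips the decision toward the in-context word.
hyps = [("tomato", 0.55), ("potato", 0.60)]
prior = {"tomato": 1.4, "potato": 1.0}
print(combine_decoders(hyps, prior))   # context flips the raw-score winner
```

Without the prior the higher-confidence decoder wins; with it, contextual expectation overrides a small acoustic margin, which is the kind of error reduction the abstract targets.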


Cited by
Book
30 Aug 2004
TL;DR: artificial neural networks
Abstract: Artificial neural networks (catalogue record from the Center for Agricultural Information Technology and Information Services).

2,254 citations

Journal ArticleDOI
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

1,741 citations

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.

914 citations
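The "conventional method" that this paper improves on (frame-by-frame minimum mean square error conversion with a joint-density GMM) can be sketched on toy 1-D features: fit a GMM to stacked (source, target) features, then map each source frame to its conditional expectation E[y|x]. The synthetic data, component count and feature dimensionality below are placeholders; real systems use multi-dimensional spectral parameters, and the paper's proposed trajectory/global-variance method is not shown here.

```python
# Toy joint-density GMM spectral mapping: MMSE conversion E[y | x].
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 1))           # source features (toy)
y = 2.0 * x + 0.1 * rng.normal(size=(500, 1))     # target features (toy mapping)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(np.hstack([x, y]))                        # model the joint density p(x, y)

def convert(frame):
    """MMSE conversion of one source frame: sum_m p(m | x) * E[y | x, m]."""
    mu = gmm.means_          # (M, 2): [mu_x, mu_y] per mixture component
    cov = gmm.covariances_   # (M, 2, 2) joint covariances
    # Responsibilities p(m | x) from the 1-D marginal p(x | m).
    logp = np.array([
        -0.5 * (frame - mu[m, 0]) ** 2 / cov[m, 0, 0]
        - 0.5 * np.log(2 * np.pi * cov[m, 0, 0])
        for m in range(gmm.n_components)
    ]) + np.log(gmm.weights_)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    # Conditional means E[y | x, m] = mu_y + cov_yx / cov_xx * (x - mu_x).
    cond = mu[:, 1] + cov[:, 1, 0] / cov[:, 0, 0] * (frame - mu[:, 0])
    return float(w @ cond)

print(convert(1.0))   # should land near the true mapping y = 2x
```

This frame-independent mapping is exactly what causes the abstract's two problems (implausible spectral movement and oversmoothing); the paper's contribution is to replace it with maximum-likelihood trajectory estimation using dynamic features and a global variance term.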

Proceedings ArticleDOI
22 Sep 2008
Abstract: INTERSPEECH 2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations

Proceedings ArticleDOI
12 May 1998
TL;DR: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.
Abstract: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study the effects of the amount of training data on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved even with a small amount of training data. However, speech quality improved with increases in the training set size.

692 citations
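The residual pitch adjustment mentioned in the abstract (matching the source speaker's pitch to the target speaker's average) can be illustrated as a simple global scaling of the F0 contour. The contour values below are invented; the paper operates on LPC residuals rather than an explicit F0 track, so this is a simplified stand-in for the idea.

```python
# Sketch: scale a source F0 contour so its mean matches the target
# speaker's average pitch.
def match_average_pitch(source_f0, target_mean):
    """Multiply every F0 value by the ratio of target mean to source mean."""
    src_mean = sum(source_f0) / len(source_f0)
    ratio = target_mean / src_mean
    return [f * ratio for f in source_f0]

# Source contour averages 120 Hz; rescale toward a 180 Hz target speaker.
converted = match_average_pitch([110.0, 120.0, 130.0], target_mean=180.0)
print(converted)   # [165.0, 180.0, 195.0] -> mean is now 180.0
```

A multiplicative rescale preserves the relative shape of the contour (intonation movements) while shifting the speaker's average pitch, which is why it is a common baseline for prosody conversion.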