Author

Ramu Reddy Vempada

Bio: Ramu Reddy Vempada is an academic researcher from Indian Institute of Technology Kharagpur. The author has contributed to research in the topics of Speech corpus and Speaker recognition. The author has an h-index of 3 and has co-authored 5 publications receiving 210 citations.

Papers
Journal ArticleDOI
TL;DR: The results indicate that the recognition performance using local prosodic features is better than that using global prosodic features.
Abstract: In this paper, global and local prosodic features extracted from sentences, words, and syllables are proposed for speech emotion or affect recognition. In this work, duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours. Local prosodic features represent the temporal dynamics of the prosody. In this work, global and local prosodic features are analyzed separately and in combination at different levels for the recognition of emotions. In this study, we have also explored words and syllables at different positions (initial, middle, and final) separately, to analyze their contribution towards the recognition of emotions. All the studies are carried out using a simulated Telugu emotion speech corpus (IITKGP-SESC). The results are compared with those obtained on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that the recognition performance using local prosodic features is better than that using global prosodic features. Words in the final position of sentences and syllables in the final position of words exhibit more emotion-discriminative information than words and syllables in other positions.

149 citations
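A minimal sketch of the kind of global prosodic statistics the abstract above describes (mean, minimum, maximum, standard deviation, and slope of pitch and energy contours) fed to an SVM classifier. The feature layout, toy contours, and class labels are illustrative assumptions, not the paper's actual IITKGP-SESC setup:

```python
# Sketch: global prosodic statistics from pitch/energy contours, classified with an SVM.
# Toy data stands in for real F0/energy tracks; not the paper's exact configuration.
import numpy as np
from sklearn.svm import SVC

def global_prosodic_features(f0_contour, energy_contour, duration_s):
    """Gross statistics of the prosodic contours: mean, min, max, std, and slope."""
    feats = []
    for contour in (f0_contour, energy_contour):
        x = np.arange(len(contour))
        slope = np.polyfit(x, contour, 1)[0]          # linear trend of the contour
        feats += [contour.mean(), contour.min(), contour.max(),
                  contour.std(), slope]
    feats.append(duration_s)                          # sentence/word/syllable duration
    return np.array(feats)

# Toy example: random contours standing in for real pitch and energy tracks.
rng = np.random.default_rng(0)
X = np.vstack([global_prosodic_features(rng.normal(200, 30, 120),
                                         rng.normal(60, 5, 120),
                                         duration_s=1.2)
               for _ in range(40)])
y = rng.integers(0, 4, size=40)                       # 4 emotion classes (illustrative)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```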

Journal ArticleDOI
TL;DR: The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage-2 to stage-4.
Abstract: This paper presents the design and development of an unrestricted text-to-speech synthesis (TTS) system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech in different domains. In this work, syllables are used as the basic units for synthesis. The Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech from five speakers is collected and a prototype TTS system is built from each of the five speakers. The best speaker among the five is selected through subjective and objective evaluation of natural and synthesized waveforms. Development of the unrestricted TTS system is then carried out by addressing the issues involved at each stage to produce a good-quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. In the first stage, the TTS system is built with the basic Festival framework. In the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage-2 to stage-4.

65 citations

Proceedings ArticleDOI
03 Apr 2012
TL;DR: In this paper, spectral and prosodic features are explored for the recognition of infant cries, and support vector machines (SVMs) are used to capture the discriminative information for the above-mentioned cries from the spectral and prosodic features.
Abstract: In this paper, spectral and prosodic features are explored for the recognition of infant cries. The different types of infant cries considered in this work are wet-diaper, hunger, and pain. Mel-frequency cepstral coefficients (MFCCs) are used to represent the spectral information, and short-time frame energies (STE) and pause durations are used to represent the prosodic information. Support vector machines (SVMs) are used to capture the discriminative information with respect to the above-mentioned cries from the spectral and prosodic features. SVM models are developed separately using the spectral and prosodic features. For these studies, the infant cry database collected under the Telemedicine project at IIT-KGP has been used. The recognition performance of the developed SVM models using spectral and prosodic features is observed to be 61.11% and 57.41%, respectively. In this work, we also examine the recognition performance obtained by combining the spectral and prosodic information at the feature and score levels. The recognition performance using feature-level and score-level fusion is observed to be 74.07% and 80.56%, respectively.

32 citations
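A rough sketch of the feature-level and score-level fusion described above, using two SVMs over stand-in spectral and prosodic vectors. The fusion weight, feature dimensions, and synthetic data are assumptions, not details taken from the paper:

```python
# Sketch: feature-level vs. score-level fusion of spectral and prosodic features with SVMs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 60
X_spec = rng.normal(size=(n, 13))      # stand-in for utterance-level MFCC statistics
X_pros = rng.normal(size=(n, 4))       # stand-in for short-time energy / pause features
y = rng.integers(0, 3, size=n)         # 3 cry classes: wet-diaper, hunger, pain

# Feature-level fusion: concatenate the two feature vectors and train one SVM.
clf_feat = SVC(probability=True).fit(np.hstack([X_spec, X_pros]), y)

# Score-level fusion: train separate SVMs and combine their class posteriors.
clf_spec = SVC(probability=True).fit(X_spec, y)
clf_pros = SVC(probability=True).fit(X_pros, y)
w = 0.6                                # fusion weight (assumed, not from the paper)
scores = w * clf_spec.predict_proba(X_spec) + (1 - w) * clf_pros.predict_proba(X_pros)
pred_score_fusion = scores.argmax(axis=1)
print(pred_score_fusion[:10])
```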

Proceedings ArticleDOI
03 Apr 2012
TL;DR: The proposed two-stage segmentation method is evaluated on ten manually segmented broadcast TV news bulletins, and it is observed that about 93% of the news stories are correctly segmented, 7% are missed, and 11% are spurious.
Abstract: In this paper, we propose a two-stage segmentation approach for splitting TV broadcast news bulletins into a sequence of news stories. In the first stage, speaker (news reader) specific characteristics present in the initial headlines of the news bulletin are used for gross-level segmentation. During the second stage, errors in the gross-level segmentation (first stage) are corrected by exploiting the speaker-specific information captured from the individual news stories other than the headlines. During the headlines, the captured speaker-specific information is mixed with background music, and hence the segmentation at the first stage may not be accurate. In this work, speaker-specific information is represented using mel-frequency cepstral coefficients (MFCCs) and captured using Gaussian mixture models (GMMs). The proposed two-stage segmentation method is evaluated on ten manually segmented broadcast TV news bulletins. From the evaluation results, it is observed that about 93% of the news stories are correctly segmented, 7% are missed, and 11% are spurious.

3 citations
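A hedged sketch of the core idea: a GMM trained on the news reader's MFCC frames, then used to score sliding windows of a bulletin so that low-likelihood regions can be flagged as candidate story boundaries. The window length, threshold heuristic, and synthetic data are assumptions, not values from the paper:

```python
# Sketch: GMM speaker model over MFCC frames, scored on sliding windows of a bulletin.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
reader_mfcc = rng.normal(size=(2000, 13))          # stand-in for news-reader MFCC frames
bulletin_mfcc = rng.normal(size=(6000, 13))        # stand-in for full-bulletin MFCC frames

gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(reader_mfcc)

win = 100                                          # frames per analysis window (assumed)
scores = np.array([gmm.score(bulletin_mfcc[i:i + win])   # mean log-likelihood per window
                   for i in range(0, len(bulletin_mfcc) - win, win)])
threshold = scores.mean() - scores.std()           # assumed heuristic threshold
boundaries = np.where(scores < threshold)[0] * win # windows unlikely to be the news reader
print(boundaries[:5])
```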

Proceedings ArticleDOI
03 Apr 2012
TL;DR: Positional, contextual, and phonological features associated with syllables are proposed to model the intensities of syllables, in order to improve the quality of the synthesized speech of text-to-speech (TTS) synthesis systems.
Abstract: The quality of the synthesized speech of text-to-speech (TTS) synthesis systems can be improved by controlling the intensities of speech segments in addition to other prosodic features such as intonation and duration. In this paper, we propose a Classification and Regression Tree (CART) model for the intensities of syllables. Positional, contextual, and phonological features associated with syllables are proposed to model the intensities. The proposed CART model is evaluated by means of objective measures such as the average prediction error (μ), standard deviation (σ), correlation coefficient (γX,Y), and the percentage of syllables predicted within different deviations. From the studies, we find that 82% of the syllable intensities could be predicted by the models within a 7% deviation. The contribution of individual features in modeling the syllable intensities is also analysed. The proposed model is also evaluated by means of subjective listening tests on synthesized speech generated by incorporating the predicted syllable intensities.

2 citations
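A small sketch of CART-style intensity prediction with the objective measures named in the abstract (average prediction error, standard deviation, correlation coefficient, and the percentage of syllables within a given deviation), using scikit-learn's decision tree regressor as a stand-in for CART. The synthetic features are placeholders for the positional, contextual, and phonological features:

```python
# Sketch: decision-tree regression of syllable intensity plus the evaluation measures above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 8))                    # stand-in for positional/contextual/phonological features
y = 60 + 5 * X[:, 0] + rng.normal(0, 2, n)     # stand-in for syllable intensity (dB)

model = DecisionTreeRegressor(max_depth=6).fit(X, y)
pred = model.predict(X)

err = np.abs(pred - y)
mu, sigma = err.mean(), err.std()              # average prediction error and its spread
gamma = np.corrcoef(pred, y)[0, 1]             # correlation between predicted and actual
within_7pct = np.mean(err / np.abs(y) <= 0.07) * 100   # % of syllables within 7% deviation
print(f"mu={mu:.2f} dB, sigma={sigma:.2f}, gamma={gamma:.2f}, within 7%: {within_7pct:.1f}%")
```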


Cited by
Journal ArticleDOI
TL;DR: This paper proposes to learn affect-salient features for SER using convolutional neural networks (CNN), and shows that this approach leads to stable and robust recognition performance in complex scenes and outperforms several well-established SER features.
Abstract: As an essential way of human emotional behavior understanding, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, LIF is used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.

479 citations
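A minimal sketch of only the first stage described above: a sparse auto-encoder with a reconstruction loss and a sparsity penalty, learning local invariant features from unlabeled spectrogram patches. The layer sizes, sparsity weight, and synthetic patches are assumptions, and the SDFA stage is omitted:

```python
# Sketch: sparse auto-encoder (reconstruction loss + sparsity penalty) on unlabeled patches.
import torch
import torch.nn as nn

patch_dim, hidden_dim = 256, 64
encoder = nn.Sequential(nn.Linear(patch_dim, hidden_dim), nn.Sigmoid())
decoder = nn.Linear(hidden_dim, patch_dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

patches = torch.randn(512, patch_dim)          # stand-in for unlabeled spectrogram patches
sparsity_weight = 1e-3                         # assumed weight for the sparsity penalty

for _ in range(100):
    h = encoder(patches)
    recon = decoder(h)
    loss = nn.functional.mse_loss(recon, patches) + sparsity_weight * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

local_invariant_features = encoder(patches).detach()   # inputs to the next (SDFA) stage
print(local_invariant_features.shape)
```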

Journal ArticleDOI
TL;DR: This work defines speech emotion recognition systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions, and identifies and discusses distinct areas of SER.

378 citations

Journal ArticleDOI
TL;DR: In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.
Abstract: Speech is an effective medium for expressing emotions and attitude through language. Finding the emotional content of a speech signal and identifying the emotions from speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade. Many researchers have been attracted to it by the prospect of automated analysis of human affective behaviour. Therefore, a number of systems, algorithms, and classifiers have been developed for identifying the emotional content of a person's speech. In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.

228 citations

Journal ArticleDOI
TL;DR: It is evident from the results that MLP gives the best accuracy in recognizing human emotion in response to audio music tracks using hybrid features of brain signals.

156 citations

Journal ArticleDOI
TL;DR: This paper proposes a deep convolutional neural network (CNN) with rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features from speech spectrograms.
Abstract: Emotion recognition from speech signals is an interesting research area with several applications such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner: time is represented along the x-axis and frequency along the y-axis, while amplitude is indicated by the intensity value at a particular position in the spectrogram. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on Emo-DB and a Korean speech dataset.

118 citations
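A compact sketch of a CNN over spectrogram inputs using rectangular kernels and rectangular max-pooling neighborhoods, in the spirit of the abstract above. Kernel shapes, channel counts, and the number of emotion classes are assumptions, not the paper's architecture:

```python
# Sketch: small CNN with rectangular kernels and rectangular pooling over spectrograms.
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            # Wide-in-time, narrow-in-frequency kernel (freq x time = 3 x 9).
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 4)),          # rectangular pooling neighborhood
            # Tall-in-frequency kernel (9 x 3) to capture harmonic structure.
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),
        )
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, n_classes))

    def forward(self, spectrogram):                    # (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(spectrogram))

model = RectKernelCNN()
dummy = torch.randn(2, 1, 128, 256)                    # stand-in log-mel spectrograms
print(model(dummy).shape)                              # torch.Size([2, 7])
```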