Author

Ramu Reddy Vempada

Bio: Ramu Reddy Vempada is an academic researcher from Indian Institute of Technology Kharagpur. The author has contributed to research in the topics of Speech corpus and Speaker recognition. The author has an h-index of 3 and has co-authored 5 publications receiving 210 citations.

Papers
Journal ArticleDOI
TL;DR: The results indicate that the recognition performance using local prosodic features is better than that using global prosodic features.
Abstract: In this paper, global and local prosodic features extracted from sentences, words, and syllables are proposed for speech emotion or affect recognition. In this work, duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours. Local prosodic features represent the temporal dynamics of the prosody. In this work, global and local prosodic features are analyzed separately and in combination at different levels for the recognition of emotions. In this study, we have also explored words and syllables at different positions (initial, middle, and final) separately, to analyze their contribution towards the recognition of emotions. All the studies are carried out using a simulated Telugu emotion speech corpus (IITKGP-SESC). The results are compared with those obtained on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that the recognition performance using local prosodic features is better than that using global prosodic features. Words in the final position of sentences and syllables in the final position of words exhibit more emotion-discriminative information than words and syllables in other positions.

149 citations
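A minimal sketch of the kind of global prosodic statistics the abstract above describes (mean, minimum, maximum, standard deviation, and slope of pitch and energy contours) fed to an SVM classifier. The feature layout, toy contours, and class labels are illustrative assumptions, not the paper's actual IITKGP-SESC setup:

```python
# Sketch: global prosodic statistics from pitch/energy contours, classified with an SVM.
# Toy data stands in for real F0/energy tracks; not the paper's exact configuration.
import numpy as np
from sklearn.svm import SVC

def global_prosodic_features(f0_contour, energy_contour, duration_s):
    """Gross statistics of the prosodic contours: mean, min, max, std, and slope."""
    feats = []
    for contour in (f0_contour, energy_contour):
        x = np.arange(len(contour))
        slope = np.polyfit(x, contour, 1)[0]          # linear trend of the contour
        feats += [contour.mean(), contour.min(), contour.max(),
                  contour.std(), slope]
    feats.append(duration_s)                          # sentence/word/syllable duration
    return np.array(feats)

# Toy example: random contours standing in for real pitch and energy tracks.
rng = np.random.default_rng(0)
X = np.vstack([global_prosodic_features(rng.normal(200, 30, 120),
                                         rng.normal(60, 5, 120),
                                         duration_s=1.2)
               for _ in range(40)])
y = rng.integers(0, 4, size=40)                       # 4 emotion classes (illustrative)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```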

Journal ArticleDOI
TL;DR: The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage-2 to stage-4.
Abstract: This paper presents the design and development of an unrestricted text-to-speech synthesis (TTS) system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech in different domains. In this work, syllables are used as the basic units for synthesis. The Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech from five speakers is collected and a prototype TTS system is built from each of the five speakers. The best speaker among the five is selected through subjective and objective evaluation of natural and synthesized waveforms. Development of the unrestricted TTS system is then carried out by addressing the issues involved at each stage to produce a good-quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. In the first stage, the TTS system is built with the basic Festival framework. In the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage-2 to stage-4.

65 citations

Proceedings ArticleDOI
03 Apr 2012
TL;DR: In this paper, spectral and prosodic features are explored for the recognition of infant cries, and support vector machines (SVMs) are used to capture the discriminative information for the above-mentioned cries from the spectral and prosodic features.
Abstract: In this paper, spectral and prosodic features are explored for the recognition of infant cries. The different types of infant cries considered in this work are wet-diaper, hunger, and pain. Mel-frequency cepstral coefficients (MFCCs) are used to represent the spectral information, and short-time frame energies (STE) and pause durations are used to represent the prosodic information. Support vector machines (SVMs) are used to capture the discriminative information with respect to the above-mentioned cries from the spectral and prosodic features. SVM models are developed separately using the spectral and prosodic features. For these studies, the infant cry database collected under the Telemedicine project at IIT-KGP has been used. The recognition performance of the developed SVM models using spectral and prosodic features is observed to be 61.11% and 57.41%, respectively. In this work, we also examine the recognition performance obtained by combining the spectral and prosodic information at the feature and score levels. The recognition performance using feature-level and score-level fusion is observed to be 74.07% and 80.56%, respectively.

32 citations
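A rough sketch of the feature-level and score-level fusion described above, using two SVMs over stand-in spectral and prosodic vectors. The fusion weight, feature dimensions, and synthetic data are assumptions, not details taken from the paper:

```python
# Sketch: feature-level vs. score-level fusion of spectral and prosodic features with SVMs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 60
X_spec = rng.normal(size=(n, 13))      # stand-in for utterance-level MFCC statistics
X_pros = rng.normal(size=(n, 4))       # stand-in for short-time energy / pause features
y = rng.integers(0, 3, size=n)         # 3 cry classes: wet-diaper, hunger, pain

# Feature-level fusion: concatenate the two feature vectors and train one SVM.
clf_feat = SVC(probability=True).fit(np.hstack([X_spec, X_pros]), y)

# Score-level fusion: train separate SVMs and combine their class posteriors.
clf_spec = SVC(probability=True).fit(X_spec, y)
clf_pros = SVC(probability=True).fit(X_pros, y)
w = 0.6                                # fusion weight (assumed, not from the paper)
scores = w * clf_spec.predict_proba(X_spec) + (1 - w) * clf_pros.predict_proba(X_pros)
pred_score_fusion = scores.argmax(axis=1)
print(pred_score_fusion[:10])
```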

Proceedings ArticleDOI
03 Apr 2012
TL;DR: The proposed two-stage segmentation method is evaluated on ten manually segmented broadcast TV news bulletins, and it is observed that about 93% of the news stories are correctly segmented, 7% are missed, and 11% are spurious.
Abstract: In this paper, we propose a two-stage segmentation approach for splitting TV broadcast news bulletins into a sequence of news stories. In the first stage, speaker (news reader) specific characteristics present in the initial headlines of the news bulletin are used for gross-level segmentation. During the second stage, errors in the gross-level segmentation (first stage) are corrected by exploiting the speaker-specific information captured from the individual news stories other than the headlines. During the headlines, the captured speaker-specific information is mixed with background music, and hence the segmentation at the first stage may not be accurate. In this work, speaker-specific information is represented using mel-frequency cepstral coefficients (MFCCs) and captured using Gaussian mixture models (GMMs). The proposed two-stage segmentation method is evaluated on ten manually segmented broadcast TV news bulletins. From the evaluation results, it is observed that about 93% of the news stories are correctly segmented, 7% are missed, and 11% are spurious.

3 citations
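A hedged sketch of the core idea: a GMM trained on the news reader's MFCC frames, then used to score sliding windows of a bulletin so that low-likelihood regions can be flagged as candidate story boundaries. The window length, threshold heuristic, and synthetic data are assumptions, not values from the paper:

```python
# Sketch: GMM speaker model over MFCC frames, scored on sliding windows of a bulletin.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
reader_mfcc = rng.normal(size=(2000, 13))          # stand-in for news-reader MFCC frames
bulletin_mfcc = rng.normal(size=(6000, 13))        # stand-in for full-bulletin MFCC frames

gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm.fit(reader_mfcc)

win = 100                                          # frames per analysis window (assumed)
scores = np.array([gmm.score(bulletin_mfcc[i:i + win])   # mean log-likelihood per window
                   for i in range(0, len(bulletin_mfcc) - win, win)])
threshold = scores.mean() - scores.std()           # assumed heuristic threshold
boundaries = np.where(scores < threshold)[0] * win # windows unlikely to be the news reader
print(boundaries[:5])
```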

Proceedings ArticleDOI
03 Apr 2012
TL;DR: Positional, contextual, and phonological features associated with syllables are proposed to model the intensities of syllables, in order to improve the quality of the synthesized speech of text-to-speech (TTS) synthesis systems.
Abstract: The quality of the synthesized speech of text-to-speech (TTS) synthesis systems can be improved by controlling the intensities of speech segments in addition to other prosodic features such as intonation and duration. In this paper, we propose a Classification and Regression Tree (CART) model for the intensities of syllables. Positional, contextual, and phonological features associated with syllables are proposed to model the intensities. The proposed CART model is evaluated by means of objective measures such as the average prediction error (μ), standard deviation (σ), correlation coefficient (γX,Y), and the percentage of syllables predicted within different deviations. From the studies, we find that 82% of the syllable intensities could be predicted by the models within a 7% deviation. The contribution of individual features in modeling the syllable intensities is also analysed. The proposed model is also evaluated by means of subjective listening tests on synthesized speech generated by incorporating the predicted syllable intensities.

2 citations
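A small sketch of CART-style intensity prediction with the objective measures named in the abstract (average prediction error, standard deviation, correlation coefficient, and the percentage of syllables within a given deviation), using scikit-learn's decision tree regressor as a stand-in for CART. The synthetic features are placeholders for the positional, contextual, and phonological features:

```python
# Sketch: decision-tree regression of syllable intensity plus the evaluation measures above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 8))                    # stand-in for positional/contextual/phonological features
y = 60 + 5 * X[:, 0] + rng.normal(0, 2, n)     # stand-in for syllable intensity (dB)

model = DecisionTreeRegressor(max_depth=6).fit(X, y)
pred = model.predict(X)

err = np.abs(pred - y)
mu, sigma = err.mean(), err.std()              # average prediction error and its spread
gamma = np.corrcoef(pred, y)[0, 1]             # correlation between predicted and actual
within_7pct = np.mean(err / np.abs(y) <= 0.07) * 100   # % of syllables within 7% deviation
print(f"mu={mu:.2f} dB, sigma={sigma:.2f}, gamma={gamma:.2f}, within 7%: {within_7pct:.1f}%")
```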


Cited by
Journal ArticleDOI
TL;DR: This paper proposes to learn affect-salient features for SER using convolutional neural networks (CNN), and shows that this approach leads to stable and robust recognition performance in complex scenes and outperforms several well-established SER features.
Abstract: As an essential way of human emotional behavior understanding, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, LIF is used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.

479 citations
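A minimal sketch of only the first stage described above: a sparse auto-encoder with a reconstruction loss and a sparsity penalty, learning local invariant features from unlabeled spectrogram patches. The layer sizes, sparsity weight, and synthetic patches are assumptions, and the SDFA stage is omitted:

```python
# Sketch: sparse auto-encoder (reconstruction loss + sparsity penalty) on unlabeled patches.
import torch
import torch.nn as nn

patch_dim, hidden_dim = 256, 64
encoder = nn.Sequential(nn.Linear(patch_dim, hidden_dim), nn.Sigmoid())
decoder = nn.Linear(hidden_dim, patch_dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

patches = torch.randn(512, patch_dim)          # stand-in for unlabeled spectrogram patches
sparsity_weight = 1e-3                         # assumed weight for the sparsity penalty

for _ in range(100):
    h = encoder(patches)
    recon = decoder(h)
    loss = nn.functional.mse_loss(recon, patches) + sparsity_weight * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

local_invariant_features = encoder(patches).detach()   # inputs to the next (SDFA) stage
print(local_invariant_features.shape)
```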

Journal ArticleDOI
TL;DR: This work defines speech emotion recognition systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions, and identifies and discusses distinct areas of SER.

378 citations

Journal ArticleDOI
TL;DR: In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.
Abstract: Speech is an effective medium for expressing emotions and attitude through language. Finding the emotional content of a speech signal and identifying the emotions from speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade. Many researchers have been attracted to it by the prospect of automated analysis of human affective behaviour. Therefore, a number of systems, algorithms, and classifiers have been developed for identifying the emotional content of a person's speech. In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.

228 citations

Journal ArticleDOI
TL;DR: It is evident from the results that MLP gives the best accuracy in recognizing human emotion in response to audio music tracks using hybrid features of brain signals.

156 citations

Journal ArticleDOI
TL;DR: This paper proposes a deep convolutional neural network (CNN) with rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features from speech spectrograms.
Abstract: Emotion recognition from speech signals is an interesting research area with several applications such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner: time is represented along the x-axis and frequency along the y-axis, while amplitude is indicated by the intensity value at a particular position in the spectrogram. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on Emo-DB and a Korean speech dataset.

118 citations
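A compact sketch of a CNN over spectrogram inputs using rectangular kernels and rectangular max-pooling neighborhoods, in the spirit of the abstract above. Kernel shapes, channel counts, and the number of emotion classes are assumptions, not the paper's architecture:

```python
# Sketch: small CNN with rectangular kernels and rectangular pooling over spectrograms.
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            # Wide-in-time, narrow-in-frequency kernel (freq x time = 3 x 9).
            nn.Conv2d(1, 16, kernel_size=(3, 9), padding=(1, 4)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 4)),          # rectangular pooling neighborhood
            # Tall-in-frequency kernel (9 x 3) to capture harmonic structure.
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),
        )
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, n_classes))

    def forward(self, spectrogram):                    # (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(spectrogram))

model = RectKernelCNN()
dummy = torch.randn(2, 1, 128, 256)                    # stand-in log-mel spectrograms
print(model(dummy).shape)                              # torch.Size([2, 7])
```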