Author

D. Pravena

Bio: D. Pravena is an academic researcher from Amrita Vishwa Vidyapeetham. The author has contributed to research in topics: Mel-frequency cepstrum & Feature (machine learning). The author has an h-index of 5 and has co-authored 19 publications receiving 80 citations.

Papers
Journal ArticleDOI
TL;DR: The emotion recognition rates obtained for the proposed speech-EGG emotion database, using a conventional mel-frequency cepstral coefficient and Gaussian mixture model based recognition system, are comparable with those of the existing German and IITKGP-SESC Telugu speech emotion databases.
Abstract: The work presented in this paper is focused on the development of a simulated emotion database, particularly for excitation source analysis. The presence of simultaneous electroglottogram (EGG) recordings for each emotion utterance helps to accurately analyze the variations in the source parameters across different emotions. The paper describes the development of a comparatively large simulated emotion database for three emotions (Anger, Happy, and Sad), along with neutrally spoken utterances, in three languages (Tamil, Malayalam, and Indian English). Emotion utterances in each language are recorded from 10 speakers, in multiple sessions for Tamil and Malayalam. Unlike the existing simulated emotion databases, emotionally biased rather than emotionally neutral utterances are used for recording. Based on the emotion recognition experiments, the emotions elicited from emotionally biased utterances show greater emotion discrimination than those from emotionally neutral utterances. Comparative experimental analysis also shows that the speech and EGG utterances of the proposed database preserve the same general trend in the excitation source characteristics (instantaneous F0 and strength of excitation) across emotions as the classical German emotion speech-EGG database (EmoDb). Finally, the emotion recognition rates obtained for the proposed speech-EGG emotion database, using a conventional mel-frequency cepstral coefficient (MFCC) and Gaussian mixture model (GMM) based recognition system, are comparable with those of the existing German (EmoDb) and IITKGP-SESC Telugu speech emotion databases.
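The back-end referenced above is the conventional MFCC front-end with one Gaussian mixture model per emotion class. A minimal sketch of that pipeline follows, assuming librosa and scikit-learn; the 13-coefficient and 32-mixture settings and the file layout are illustrative assumptions, not values taken from the paper.

```python
# MFCC + GMM emotion recognition sketch: one GMM per emotion, trained on
# frame-level MFCCs; an utterance is assigned to the emotion whose GMM
# gives the highest average frame log-likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Load a wav file and return frame-level MFCCs, shape (frames, n_mfcc)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_emotion_gmms(train_files, n_components=32):
    """train_files: dict mapping emotion -> list of wav paths (hypothetical)."""
    models = {}
    for emotion, paths in train_files.items():
        X = np.vstack([mfcc_frames(p) for p in paths])
        models[emotion] = GaussianMixture(n_components=n_components,
                                          covariance_type="diag",
                                          max_iter=200, random_state=0).fit(X)
    return models

def classify(path, models):
    """score() returns the mean per-frame log-likelihood under each GMM."""
    X = mfcc_frames(path)
    return max(models, key=lambda emotion: models[emotion].score(X))
```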

17 citations

Journal ArticleDOI
TL;DR: This paper presents the detection of vocal fold pathology from speech signals recorded from patients, along with the design and implementation of the proposed system for distinguishing pathological from normal voices.
Abstract: Pathology is the study and diagnosis of disease. Due to the nature of their jobs, unhealthy habits, and voice abuse, people are at risk of voice problems. Vocal and voice disorders should be diagnosed at an early stage, before they cause changes to the normal voice signal. It is well known that most vocal fold pathologies cause changes in the acoustic voice signal, so the voice signal can be a useful diagnostic tool, and acoustic voice analysis can be used to characterize pathological voices. This paper presents the detection of vocal fold pathology from speech signals recorded from patients, focusing on the classification of pathological versus healthy voice based on acoustic features. The method includes two steps: first, the extraction of MFCC feature vectors from the voice signals; second, the classification of those feature vectors using a GMM. The main advantages of this method are low computation time and the possibility of real-time system development. The paper introduces the design and implementation of the proposed system for recognizing pathological and normal voice, describes the literature survey and the implementation of the different modules in the system, and discusses the results and the scope for improvement.
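Once a pathological and a normal GMM have been trained on MFCC frames, the detection step reduces to a two-class likelihood comparison. A small sketch, reusing the mfcc_frames() helper from the sketch above; the zero log-likelihood-ratio threshold is an assumption, not a value from the paper.

```python
# Pathological-vs-normal decision sketch: compare the average frame
# log-likelihood of an utterance under the two class GMMs.
def detect_pathology(path, gmm_pathological, gmm_normal, threshold=0.0):
    X = mfcc_frames(path)                # frame-level MFCCs (helper defined above)
    llr = gmm_pathological.score(X) - gmm_normal.score(X)  # avg log-likelihood ratio
    return ("pathological" if llr > threshold else "normal"), llr
```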

16 citations

Journal ArticleDOI
TL;DR: The effectiveness of SoE and instantaneous $$F_0$$ in characterizing different emotions is also confirmed by the improved emotion recognition performance on the Tamil speech-EGG emotion database.
Abstract: The work presented in this paper explores the effectiveness of incorporating the excitation source parameters, namely the strength of excitation and the instantaneous fundamental frequency ($$F_0$$), for emotion recognition from speech and electroglottographic (EGG) signals. The strength of excitation (SoE) is an important parameter indicating the pressure with which the glottis closes at the glottal closure instants (GCIs). The SoE is computed by the popular zero frequency filtering (ZFF) method, which accurately estimates the glottal signal characteristics by attenuating or removing the high-frequency vocal-tract interactions in speech. An arbitrary impulse sequence, obtained from the estimated GCIs, is used to derive the instantaneous $$F_0$$. The SoE and instantaneous $$F_0$$ parameters are combined with the conventional mel frequency cepstral coefficients (MFCC) to improve the recognition rates of distinct emotions (Anger, Happy, and Sad) using Gaussian mixture models as the classifier. The performance of the proposed combination of SoE, instantaneous $$F_0$$, and their dynamic features with MFCC coefficients is evaluated on emotion utterances from the classical German full-blown emotion speech database (EmoDb, 4 emotions and neutral), which has simultaneous speech and EGG signals, and the Surrey Audio-Visual Expressed Emotion database (3 emotions and neutral), for both speaker-dependent and speaker-independent emotion recognition scenarios. To reinforce the effectiveness of the proposed features and for better statistical consistency of the emotion analysis, a fairly large emotion speech database of 220 utterances per emotion in Tamil, with simultaneous EGG recordings, is used in addition to EmoDb. The effectiveness of SoE and instantaneous $$F_0$$ in characterizing different emotions is confirmed by the improved emotion recognition performance on the Tamil speech-EGG emotion database.
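ZFF amounts to passing the differenced speech through two ideal zero-frequency resonators (a double cumulative sum) and then removing the resulting polynomial trend with a local mean; GCIs fall at the negative-to-positive zero crossings, SoE is the slope there, and instantaneous $$F_0$$ is the inverse of the interval between successive GCIs. Below is a minimal NumPy sketch, assuming the trend-removal window roughly matches the average pitch period; the 10 ms default and the three smoothing passes are assumptions.

```python
# Zero frequency filtering (ZFF) sketch for GCI, SoE and instantaneous F0.
import numpy as np

def zff(x, sr, win_ms=10.0, passes=3):
    """Two cascaded 0-Hz resonators (double cumsum of the differenced signal)
    followed by repeated local-mean subtraction to remove the growing trend."""
    d = np.diff(x, prepend=x[0])
    y = np.cumsum(np.cumsum(d))
    n = int(sr * win_ms / 1000)
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y

def gci_soe_f0(y, sr):
    """GCIs: negative-to-positive zero crossings of the ZFF signal.
    SoE: slope of the ZFF signal at each GCI.
    Instantaneous F0: inverse of successive GCI intervals (Hz)."""
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    soe = np.abs(y[gci + 1] - y[gci])
    f0 = sr / np.diff(gci)
    return gci, soe, f0
```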

14 citations

Journal Article
TL;DR: This paper addresses the problem of reducing additive white Gaussian noise in speech signals while preserving their intelligibility and quality, using a denoising method based on the Savitzky-Golay smoothing filter.
Abstract: Denoising is the process of removing unwanted sounds from a speech signal. In the presence of noise, it is difficult for the listener to understand the message of the speech signal. Noise in a speech signal also degrades the performance of signal processing tasks such as speech recognition, speaker recognition, and speaker verification. Many methods have been widely used to eliminate noise from speech signals, such as linear and nonlinear filtering, total variation denoising, and wavelet-based denoising. This paper addresses the problem of reducing additive white Gaussian noise in a speech signal while preserving its intelligibility and quality. The method is based on the Savitzky-Golay smoothing filter, which is essentially a low-pass filter that performs a polynomial regression on the signal values. The results of the S-G filter based denoising method are compared against two widely used enhancement methods, spectral subtraction and total variation denoising. Objective and subjective quality evaluations are performed for the three speech enhancement schemes. The results show that the S-G based method is well suited to removing additive white Gaussian noise from speech signals.
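The Savitzky-Golay filter is available off the shelf in SciPy, so the core of such a scheme is short. A hedged sketch follows; the window length, polynomial order, noise level, and file name are illustrative assumptions rather than the paper's settings.

```python
# Savitzky-Golay denoising sketch: sliding-window least-squares polynomial
# fit, acting as a low-pass filter on the noisy speech samples.
import numpy as np
from scipy.io import wavfile
from scipy.signal import savgol_filter

sr, clean = wavfile.read("speech.wav")       # hypothetical clean reference
clean = clean.astype(np.float64)
rng = np.random.default_rng(0)
noisy = clean + rng.normal(0.0, 0.01 * np.max(np.abs(clean)), clean.shape)  # AWGN

denoised = savgol_filter(noisy, window_length=21, polyorder=3)

def snr_db(ref, est):
    """Simple objective score: SNR of the estimate against the reference."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

print(f"SNR before: {snr_db(clean, noisy):.1f} dB, "
      f"after: {snr_db(clean, denoised):.1f} dB")
```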

9 citations

Proceedings ArticleDOI
01 Oct 2015
TL;DR: The objective of the present work is to propose refinements to the existing ZFF based epoch estimation algorithm for improved epoch estimation in telephonic speech; the strength of the impulses in the zero frequency region is enhanced by computing the Hilbert envelope of the speech, which in turn improves the epoch estimation performance.
Abstract: Epochs are the locations corresponding to glottal closure instants in voiced speech segments and to the onset of bursts or frication in unvoiced segments. In recent years, zero frequency filtering (ZFF) based epoch estimation has received growing attention for clean or studio speech signals. ZFF based epoch estimation exploits the impulse-like excitation characteristics in the zero frequency (DC) region of speech. As the lower frequency regions of telephonic speech are significantly attenuated, the ZFF approach gives degraded epoch estimation performance there. The objective of the present work is therefore to propose refinements to the existing ZFF based epoch estimation algorithm for improved epoch estimation in telephonic speech. The strength of the impulses in the zero frequency region is enhanced by computing the Hilbert envelope (HE) of the speech, which in turn improves the epoch estimation performance. Resonators located at the approximate F0 values of short-term blocks of the conventional zero frequency filtered signal are also found to improve the epoch estimation performance in telephonic speech. The performance of the refined ZFF method is evaluated on 3 speaker voices (JMK, SLT, and BDL) of the CMU Arctic database, which has simultaneous speech and EGG recordings. The telephonic version of the CMU Arctic database is simulated using tools provided by the International Telecommunication Union (ITU).
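The HE refinement itself is a one-line transform in front of the ZFF pipeline: the envelope of the analytic signal restores impulse strength near DC that the telephone channel attenuates. A minimal sketch, assuming SciPy and the zff() helper from the earlier sketch; the 8 kHz narrowband rate is an assumption.

```python
# Hilbert-envelope front-end sketch for ZFF on telephonic speech.
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Magnitude of the analytic signal |x + j * H{x}|, where H is the
    Hilbert transform; emphasises the impulse-like excitation."""
    return np.abs(hilbert(x))

# he = hilbert_envelope(telephone_speech)   # hypothetical 8 kHz input
# y  = zff(he, sr=8000)                     # then estimate epochs as before
```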

8 citations


Cited by
Journal ArticleDOI
TL;DR: In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.
Abstract: Speech is an effective medium for expressing emotions and attitude through language. Finding the emotional content of a speech signal and identifying the emotions in speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade, and many researchers have been attracted to the automated analysis of human affective behaviour. As a result, a number of systems, algorithms, and classifiers have been developed for identifying the emotional content of a person's speech. In this study, the available literature on various databases, features, and classifiers is taken into consideration for speech emotion recognition across assorted languages.

228 citations

Journal ArticleDOI
TL;DR: In this article, the authors identify and synthesize recent relevant literature related to the speech emotion recognition systems' varied design components/methodologies, thereby providing readers with a state-of-the-art understanding of the hot research topic.
Abstract: During the last decade, Speech Emotion Recognition (SER) has emerged as an integral component within Human-Computer Interaction (HCI) and other high-end speech processing systems. Generally, an SER system detects the speaker's varied emotions by extracting and classifying prominent features from a preprocessed speech signal. However, the ways humans and machines recognize and correlate emotional aspects of speech signals contrast quantitatively and qualitatively, which presents enormous difficulties in blending knowledge from interdisciplinary fields, particularly speech emotion recognition, applied psychology, and human-computer interfaces. The paper carefully identifies and synthesizes recent relevant literature on the varied design components and methodologies of SER systems, thereby providing readers with a state-of-the-art understanding of this active research topic. Furthermore, while scrutinizing the current state of understanding of SER systems, prominent research gaps are sketched out for consideration and analysis by other related researchers, institutions, and regulatory bodies.

77 citations

Journal ArticleDOI
TL;DR: A review of the recent development in SER is provided and the impact of various attention mechanisms on SER performance is examined and overall comparison of the system accuracies is performed on a widely used IEMOCAP benchmark database.
Abstract: Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules also play an important role in the development of human-computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed over time in multimedia content, and the attention mechanism has recently been incorporated into DNN architectures to emphasise emotionally salient information. This paper provides a review of recent developments in SER and examines the impact of various attention mechanisms on SER performance. An overall comparison of system accuracies is performed on the widely used IEMOCAP benchmark database.

59 citations

Journal ArticleDOI
TL;DR: It is found that reasonably good classification accuracies can be achieved by selecting appropriate features; these results may assist feature development for automated detection systems that diagnose patients with symptoms of pathological voice.

41 citations

Journal ArticleDOI
01 Jun 2020 - IRBM
TL;DR: HOS features show promising results in automatic voice pathology detection and classification compared to DWT features, and can reliably be used as a noninvasive tool to assist clinical evaluation in identifying pathological voices.
Abstract: Background: The voice is a prominent tool allowing people to communicate and to exchange information in their daily activities. However, any slight alteration in the voice production system may affect voice quality. Over the last years, researchers in the biomedical engineering field have worked to develop robust automatic systems that may help clinicians perform preventive diagnosis and detect voice pathologies at an early stage. Method: In this context, a pathological voice detection and classification method based on EMD-DWT analysis and Higher Order Statistics (HOS) features is proposed; DWT coefficient features are also extracted and tested. To carry out the experiments, a wide subset of voice signals from normal subjects and from subjects suffering from the five most frequent pathologies in the Saarbrucken Voice Database (SVD) is selected. In the first step, the Empirical Mode Decomposition (EMD) is applied to the voice signal, and among the resulting Intrinsic Mode Functions (IMFs) the most robust one is chosen based on a temporal energy criterion. In the second step, the selected IMF is decomposed via the Discrete Wavelet Transform (DWT). As a result, one feature vector of six HOS parameters and one feature vector of six DWT features are formed from the approximation and detail coefficients. To classify the data, a support vector machine (SVM) is employed. After training on the SVD database, the system is evaluated using voice signals of volunteer subjects from the neurological department of the RABTA Hospital of Tunis. Results: The proposed method gives promising results in pathological voice detection. The accuracies reached 99.26% using HOS features and 93.1% using DWT features on the SVD database. In classification, an accuracy of 100% was reached for "Funktionelle Dysphonia vs. Rekurrensparese" based on HOS features, whereas with DWT features the accuracy achieved was 90.32% for "Hyperfunktionelle Dysphonia vs. Rekurrensparese". Furthermore, in the validation, the accuracies reached 94.82% and 91.37% for HOS and DWT features, respectively. In classification, the highest accuracies reached were 94.44% and 88.87% for "Parkinson versus Paralysis" based on HOS and DWT features, respectively. Conclusion: HOS features show promising results in automatic voice pathology detection and classification compared to DWT features, and can reliably be used as a noninvasive tool to assist clinical evaluation in identifying pathological voices.
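The pipeline above maps onto common Python libraries. A sketch under stated assumptions: PyEMD for the EMD, PyWavelets for the DWT, and variance/skewness/kurtosis per band as a stand-in for the six HOS parameters, since the paper's exact statistics are not listed here; the 'db4' wavelet and the RBF kernel are also assumptions.

```python
# EMD -> IMF selection (max temporal energy) -> one-level DWT -> six
# band statistics -> SVM, sketching the detection pipeline described above.
import numpy as np
import pywt
from PyEMD import EMD
from scipy.stats import kurtosis, skew
from sklearn.svm import SVC

def emd_dwt_features(x):
    """Six features: 3 statistics per DWT band of the highest-energy IMF."""
    imfs = EMD()(x)                                        # empirical mode decomposition
    imf = imfs[np.argmax([np.sum(m ** 2) for m in imfs])]  # temporal-energy criterion
    cA, cD = pywt.dwt(imf, "db4")                          # approximation / detail bands
    return np.array([np.var(cA), skew(cA), kurtosis(cA),
                     np.var(cD), skew(cD), kurtosis(cD)])

# Hypothetical training data: train_voices is a list of 1-D voice signals,
# train_labels the corresponding normal/pathological labels.
# clf = SVC(kernel="rbf").fit(
#     np.vstack([emd_dwt_features(v) for v in train_voices]), train_labels)
```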

37 citations