Author

Prashanth Gurunath Shivakumar

Bio: Prashanth Gurunath Shivakumar is an academic researcher from the University of Southern California. The author has contributed to research on topics including Language model and Recurrent neural network. The author has an h-index of 9 and has co-authored 19 publications receiving 304 citations.

Papers
Proceedings ArticleDOI
16 Oct 2016
TL;DR: A multimodal depression classification system is presented as a part of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016), and polynomial parameterization of facial landmark features achieves the best performance among all systems and outperforms the best baseline system.
Abstract: Automatic classification of depression using audiovisual cues can help towards its objective diagnosis. In this paper, we present a multimodal depression classification system as a part of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016). We investigate a number of audio and video features for classification with different fusion techniques and temporal contexts. In the audio modality, Teager energy cepstral coefficients (TECC) outperform standard baseline features, while the best accuracy is achieved with i-vector modelling based on MFCC features. On the other hand, polynomial parameterization of facial landmark features achieves the best performance among all systems and outperforms the best baseline system as well.
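The listing does not spell out the exact parameterization, but the general idea of summarizing facial landmark motion with polynomial coefficients can be sketched as below. The trajectory shape (100 frames, 68 landmarks), the cubic order, and the `poly_landmark_features` helper are illustrative assumptions, not the authors' code.

```python
import numpy as np

def poly_landmark_features(landmarks, order=3):
    """Fit a polynomial to each landmark coordinate's trajectory over time
    and use the coefficients as a fixed-length descriptor.

    landmarks: array of shape (n_frames, n_landmarks, 2) with (x, y) positions.
    Returns a 1-D feature vector of length n_landmarks * 2 * (order + 1).
    """
    n_frames, n_landmarks, _ = landmarks.shape
    t = np.linspace(-1.0, 1.0, n_frames)           # normalized time axis
    feats = []
    for k in range(n_landmarks):
        for dim in range(2):                       # x and y trajectories
            coeffs = np.polyfit(t, landmarks[:, k, dim], order)
            feats.append(coeffs)
    return np.concatenate(feats)

# Example: 100 video frames, 68 facial landmarks
traj = np.random.randn(100, 68, 2)
print(poly_landmark_features(traj, order=3).shape)  # (68 * 2 * 4,) = (544,)
```

Reducing each coordinate's trajectory to order + 1 coefficients gives a descriptor whose length is independent of clip duration, which is convenient for downstream classifiers.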

107 citations

01 Jan 2014
TL;DR: This paper presents a preliminary study towards better acoustic modeling, pronunciation modeling, and front-end processing for children's speech; the introduction of pronunciation modeling shows promising performance improvements.

Abstract: Developing a robust Automatic Speech Recognition (ASR) system for children is a challenging task because of increased variability in acoustic and linguistic correlates as a function of young age. The acoustic variability is mainly due to the developmental changes associated with vocal tract growth. On the linguistic side, the variability is associated with limited knowledge of vocabulary, pronunciations, and other linguistic constructs. This paper presents a preliminary study towards better acoustic modeling, pronunciation modeling, and front-end processing for children's speech. Results are presented as a function of age. Speaker adaptation significantly reduces mismatch and variability, improving recognition results across age groups. In addition, the introduction of pronunciation modeling shows promising performance improvements.
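The abstract does not detail the pronunciation modeling, but one common approach for child speech is rule-based expansion of the lexicon with developmentally typical substitutions (e.g., /r/-gliding). The sketch below is a generic illustration under that assumption; the substitution rules and the `expand_pronunciations` helper are hypothetical, not taken from the paper.

```python
# Hypothetical lexicon augmentation: allow common child-speech substitutions
# (e.g., /r/ -> /w/ gliding, /th/ -> /f/ fronting) as pronunciation variants.
SUBSTITUTIONS = {"R": ["R", "W"], "TH": ["TH", "F"]}  # illustrative rules

def expand_pronunciations(pron):
    """Expand one phone sequence into all variants allowed by the rules."""
    variants = [[]]
    for phone in pron:
        options = SUBSTITUTIONS.get(phone, [phone])
        variants = [v + [o] for v in variants for o in options]
    return [" ".join(v) for v in variants]

lexicon = {"RABBIT": ["R", "AE", "B", "AH", "T"]}
for word, pron in lexicon.items():
    for variant in expand_pronunciations(pron):
        print(word, variant)
# RABBIT R AE B AH T
# RABBIT W AE B AH T
```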

70 citations

Journal ArticleDOI
TL;DR: In this paper, transfer learning from adults' models to children's models is applied in a deep neural network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary.

69 citations

Proceedings ArticleDOI
08 Sep 2016
TL;DR: A novel objective loss function is proposed that takes into account the perceptual quality of speech and is used to train Perceptually Optimized Speech Denoising Auto-Encoders (POS-DAE); a two-stage DNN architecture for denoising and enhancement is also introduced.

Abstract: Speech enhancement is a challenging and important area of research due to the many applications that depend on improved signal quality. It is a pre-processing step of speech processing systems and is used to perceptually improve the quality of speech for humans. With recent advances in Deep Neural Networks (DNN), deep Denoising Auto-Encoders have proved to be very successful for speech enhancement. In this paper, we propose a novel objective loss function which takes into account the perceptual quality of speech. We use it to train Perceptually Optimized Speech Denoising Auto-Encoders (POS-DAE). We demonstrate the effectiveness of POS-DAE in a speech enhancement task. Further, we introduce a two-stage DNN architecture for denoising and enhancement. We show the effectiveness of the proposed methods on a high-noise subset of the QUT-NOISE-TIMIT database under mismatched noise conditions. Experiments are conducted comparing the POS-DAE against the mean square error loss function using speech distortion, noise reduction, and Perceptual Evaluation of Speech Quality. We find that the proposed loss function and the new two-stage architecture give significant improvements in perceptual speech quality measures, and the improvements become more significant for higher noise conditions.
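The exact POS-DAE loss is not given in this listing. As a rough illustration of the underlying idea of replacing plain MSE with a perceptually weighted spectral error, a minimal PyTorch sketch might look like the following; the per-bin weights, tensor shapes, and the `perceptually_weighted_mse` function are assumptions, not the paper's formulation.

```python
import torch

def perceptually_weighted_mse(est, ref, weights):
    """MSE over spectral features, weighted per frequency bin so that errors
    in perceptually important bands cost more than under plain MSE.

    est, ref: (batch, frames, bins) log-magnitude spectra.
    weights:  (bins,) non-negative perceptual weights (illustrative).
    """
    err = (est - ref) ** 2                 # per-bin squared error
    return (err * weights).mean()          # weighted average over all elements

# Toy usage: emphasize low/mid frequency bins with a simple linear taper
bins = 257
weights = torch.linspace(1.5, 0.5, bins)   # assumed weighting, for illustration
est = torch.randn(8, 100, bins, requires_grad=True)
ref = torch.randn(8, 100, bins)
loss = perceptually_weighted_mse(est, ref, weights)
loss.backward()                            # differentiable, so trainable end to end
```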

55 citations

Posted Content
TL;DR: This work attempts to address the key challenges using transfer learning from adults' models to children's models in a Deep Neural Network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary.

Abstract: Children's speech recognition is challenging mainly due to the inherent high variability in children's physical and articulatory characteristics and expressions. This variability manifests in both acoustic constructs and linguistic usage due to the rapidly changing developmental stages in children's lives. Part of the challenge is due to the lack of large amounts of available children's speech data for efficient modeling. This work attempts to address the key challenges using transfer learning from adults' models to children's models in a Deep Neural Network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary. The paper presents a systematic and extensive analysis of the proposed transfer learning technique, considering the key factors affecting children's speech recognition from prior literature. Evaluations are presented on (i) comparisons of earlier GMM-HMM and newer DNN models, (ii) the effectiveness of standard adaptation techniques versus transfer learning, and (iii) various adaptation configurations for tackling the variabilities present in children's speech, in terms of (a) acoustic spectral variability and (b) pronunciation variability and linguistic constraints. Our analysis spans (i) the number of DNN model parameters (for adaptation), (ii) the amount of adaptation data, (iii) the ages of the children, and (iv) age-dependent versus age-independent adaptation. Finally, we provide recommendations on (i) favorable strategies across the analyzed parameters and (ii) potential future research directions and the relevant challenges/problems persisting in DNN-based ASR for children's speech.
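A minimal sketch of the kind of adult-to-child transfer learning described, assuming a PyTorch feed-forward acoustic model: the lower layers trained on adult speech are frozen and only the upper layers are fine-tuned on children's data. Layer sizes, the `adult_am.pt` checkpoint name, and the choice of which layers to freeze are all illustrative; the paper analyzes many such configurations.

```python
import torch
import torch.nn as nn

# Hypothetical adult-trained DNN acoustic model: stacked affine+ReLU layers
# mapping acoustic frames to senone posteriors (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 4000),                # senone output layer
)
model.load_state_dict(torch.load("adult_am.pt"))  # assumed checkpoint name

# Transfer learning: freeze the lower layers (generic feature extraction)
# and fine-tune only the upper layers on children's speech.
for param in model[:4].parameters():      # first two Linear+ReLU blocks
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def adapt_step(frames, senone_targets):
    """One fine-tuning step on a batch of children's speech frames."""
    optimizer.zero_grad()
    loss = criterion(model(frames), senone_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

How many layers to freeze, and how much adaptation data is needed per age group, are exactly the trade-offs the paper's analysis covers.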

30 citations


Cited by
Journal ArticleDOI
01 Apr 1956-Nature
TL;DR: A review of The Foundations of Statistics by Prof. Leonard J. Savage (Wiley Publications in Statistics, 1954).
Abstract: The Foundations of Statistics. By Prof. Leonard J. Savage. (Wiley Publications in Statistics.) Pp. xv + 294. (New York: John Wiley and Sons, Inc.; London: Chapman and Hall, Ltd., 1954.) 48s. net.

844 citations

Journal ArticleDOI
TL;DR: In this paper, an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) was proposed to reduce the gap between the model optimization and the evaluation criterion.
Abstract: A speech enhancement model maps noisy speech to clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in the existing literature there is an inconsistency between the model optimization criterion and the evaluation criterion for the enhanced speech. For example, speech intelligibility is mostly evaluated with the short-time objective intelligibility (STOI) measure, while the frame-based mean square error (MSE) between estimated and clean speech is widely used to optimize the model. Due to this inconsistency, there is no guarantee that the trained model provides optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and the evaluation criterion. Because of the utterance-based optimization, temporal correlation information of long speech segments, or even the entire utterance, can be considered to directly optimize perception-based objective functions. As an example, we implemented the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that the STOI of test speech processed by the proposed approach is better than that of conventional MSE-optimized speech, due to the consistency between the training and evaluation targets. Moreover, by integrating STOI into model optimization, the intelligibility of the enhanced speech for both human subjects and an automatic speech recognition system is substantially improved compared to speech generated under the minimum-MSE criterion.
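For evaluation, STOI can be computed with an off-the-shelf package such as pystoi (assuming it is installed); note that to optimize STOI during training, as the paper does, the measure has to be re-implemented with differentiable operations so gradients can reach the FCN. The signals below are placeholders.

```python
import numpy as np
from pystoi import stoi   # assumes the pystoi package is installed

fs = 16000
clean = np.random.randn(fs * 3)               # placeholder 3-second signals
enhanced = clean + 0.1 * np.random.randn(fs * 3)

# STOI between the clean reference and the enhanced output (higher is better).
score = stoi(clean, enhanced, fs, extended=False)
print(f"STOI: {score:.3f}")
```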

275 citations

Proceedings ArticleDOI
02 Sep 2018
TL;DR: An automated depression-detection algorithm is demonstrated that models interviews between an individual and agent and learns from sequences of questions and answers without the need to perform explicit topic modeling of the content.
Abstract: Medical professionals diagnose depression by interpreting the responses of individuals to a variety of questions, probing lifestyle changes and ongoing thoughts. Like professionals, an effective automated agent must understand that responses to queries have varying prognostic value. In this study we demonstrate an automated depression-detection algorithm that models interviews between an individual and an agent and learns from sequences of questions and answers without the need to perform explicit topic modeling of the content. We utilized data from 142 individuals undergoing depression screening, and modeled the interactions with audio and text features in a Long Short-Term Memory (LSTM) neural network model to detect depression. Our results were comparable to methods that explicitly modeled the topics of the questions and answers, which suggests that depression can be detected through sequential modeling of an interaction, with minimal information on the structure of the interview.
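A minimal sketch of this kind of sequential model, assuming each interview turn has already been reduced to a fused audio+text feature vector; the dimensions, the single-layer LSTM, and the `InterviewLSTM` class are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class InterviewLSTM(nn.Module):
    """Binary depression classifier over a sequence of question/answer
    feature vectors (audio + text), one vector per interview turn."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, turns):               # turns: (batch, n_turns, feat_dim)
        _, (h_n, _) = self.lstm(turns)      # h_n: (1, batch, hidden)
        return self.head(h_n[-1])           # one logit per interview

model = InterviewLSTM()
logits = model(torch.randn(4, 20, 128))     # 4 interviews, 20 turns each
probs = torch.sigmoid(logits)               # depression probability per interview
```

Classifying from the final hidden state lets the model accumulate evidence across the whole question/answer sequence without ever modeling the topics explicitly.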

176 citations

Journal ArticleDOI
TL;DR: This paper proposes a combination of hand-crafted and deep-learned features that can effectively measure the severity of depression from speech, and proposes joint fine-tuning layers to combine the raw-waveform and spectrogram DCNNs to boost depression recognition performance.
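A minimal sketch of the fusion idea, assuming the hand-crafted descriptors and the embeddings from the raw-waveform and spectrogram DCNN branches are already computed; the `JointFusion` module below stands in for the paper's joint fine-tuning layers, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Concatenate hand-crafted descriptors with embeddings from two DCNN
    branches (raw waveform and spectrogram) and regress depression severity
    through shared, jointly fine-tuned layers."""
    def __init__(self, hand_dim=88, raw_dim=256, spec_dim=256):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(hand_dim + raw_dim + spec_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),               # scalar severity score
        )

    def forward(self, hand, raw_emb, spec_emb):
        return self.joint(torch.cat([hand, raw_emb, spec_emb], dim=-1))

model = JointFusion()
score = model(torch.randn(2, 88), torch.randn(2, 256), torch.randn(2, 256))
```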

133 citations