Author

Prashanth Gurunath Shivakumar

Bio: Prashanth Gurunath Shivakumar is an academic researcher from the University of Southern California. The author has contributed to research on topics including Language model and Recurrent neural network. The author has an h-index of 9 and has co-authored 19 publications receiving 304 citations.

Papers
Proceedings ArticleDOI
16 Oct 2016
TL;DR: A multimodal depression classification system is presented as a part of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016), and polynomial parameterization of facial landmark features achieves the best performance among all systems and outperforms the best baseline system.
Abstract: Automatic classification of depression using audiovisual cues can help towards its objective diagnosis. In this paper, we present a multimodal depression classification system as a part of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016). We investigate a number of audio and video features for classification with different fusion techniques and temporal contexts. In the audio modality, Teager energy cepstral coefficients (TECC) outperform standard baseline features, while the best accuracy is achieved with i-vector modelling based on MFCC features. On the other hand, polynomial parameterization of facial landmark features achieves the best performance among all systems and outperforms the best baseline system as well.
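The listing does not spell out the exact parameterization, but the general idea of summarizing facial landmark motion with polynomial coefficients can be sketched as below. The trajectory shape (100 frames, 68 landmarks), the cubic order, and the `poly_landmark_features` helper are illustrative assumptions, not the authors' code.

```python
import numpy as np

def poly_landmark_features(landmarks, order=3):
    """Fit a polynomial to each landmark coordinate's trajectory over time
    and use the coefficients as a fixed-length descriptor.

    landmarks: array of shape (n_frames, n_landmarks, 2) with (x, y) positions.
    Returns a 1-D feature vector of length n_landmarks * 2 * (order + 1).
    """
    n_frames, n_landmarks, _ = landmarks.shape
    t = np.linspace(-1.0, 1.0, n_frames)           # normalized time axis
    feats = []
    for k in range(n_landmarks):
        for dim in range(2):                       # x and y trajectories
            coeffs = np.polyfit(t, landmarks[:, k, dim], order)
            feats.append(coeffs)
    return np.concatenate(feats)

# Example: 100 video frames, 68 facial landmarks
traj = np.random.randn(100, 68, 2)
print(poly_landmark_features(traj, order=3).shape)  # (68 * 2 * 4,) = (544,)
```

Reducing each coordinate's trajectory to order + 1 coefficients gives a descriptor whose length is independent of clip duration, which is convenient for downstream classifiers.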

107 citations

01 Jan 2014
TL;DR: This paper presents a preliminary study towards better acoustic modeling, pronunciation modeling, and front-end processing for children's speech; the introduction of pronunciation modeling shows promising performance improvements.

Abstract: Developing a robust Automatic Speech Recognition (ASR) system for children is a challenging task because of increased variability in acoustic and linguistic correlates as a function of young age. The acoustic variability is mainly due to the developmental changes associated with vocal tract growth. On the linguistic side, the variability is associated with limited knowledge of vocabulary, pronunciations, and other linguistic constructs. This paper presents a preliminary study towards better acoustic modeling, pronunciation modeling, and front-end processing for children's speech. Results are presented as a function of age. Speaker adaptation significantly reduces mismatch and variability, improving recognition results across age groups. In addition, the introduction of pronunciation modeling shows promising performance improvements.
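The abstract does not detail the pronunciation modeling, but one common approach for child speech is rule-based expansion of the lexicon with developmentally typical substitutions (e.g., /r/-gliding). The sketch below is a generic illustration under that assumption; the substitution rules and the `expand_pronunciations` helper are hypothetical, not taken from the paper.

```python
# Hypothetical lexicon augmentation: allow common child-speech substitutions
# (e.g., /r/ -> /w/ gliding, /th/ -> /f/ fronting) as pronunciation variants.
SUBSTITUTIONS = {"R": ["R", "W"], "TH": ["TH", "F"]}  # illustrative rules

def expand_pronunciations(pron):
    """Expand one phone sequence into all variants allowed by the rules."""
    variants = [[]]
    for phone in pron:
        options = SUBSTITUTIONS.get(phone, [phone])
        variants = [v + [o] for v in variants for o in options]
    return [" ".join(v) for v in variants]

lexicon = {"RABBIT": ["R", "AE", "B", "AH", "T"]}
for word, pron in lexicon.items():
    for variant in expand_pronunciations(pron):
        print(word, variant)
# RABBIT R AE B AH T
# RABBIT W AE B AH T
```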

70 citations

Journal ArticleDOI
TL;DR: In this paper, transfer learning from adults' models to children's models is applied in a deep neural network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary.

69 citations

Proceedings ArticleDOI
08 Sep 2016
TL;DR: A novel objective loss function is proposed that takes into account the perceptual quality of speech and is used to train Perceptually Optimized Speech Denoising Auto-Encoders (POS-DAE); a two-stage DNN architecture for denoising and enhancement is also introduced.

Abstract: Speech enhancement is a challenging and important area of research due to the many applications that depend on improved signal quality. It is a pre-processing step of speech processing systems and is used to perceptually improve the quality of speech for humans. With recent advances in Deep Neural Networks (DNN), deep Denoising Auto-Encoders have proved to be very successful for speech enhancement. In this paper, we propose a novel objective loss function which takes into account the perceptual quality of speech. We use it to train Perceptually Optimized Speech Denoising Auto-Encoders (POS-DAE). We demonstrate the effectiveness of POS-DAE in a speech enhancement task. Further, we introduce a two-stage DNN architecture for denoising and enhancement. We show the effectiveness of the proposed methods on a high-noise subset of the QUT-NOISE-TIMIT database under mismatched noise conditions. Experiments are conducted comparing the POS-DAE against the mean square error loss function using speech distortion, noise reduction, and Perceptual Evaluation of Speech Quality. We find that the proposed loss function and the new two-stage architecture give significant improvements in perceptual speech quality measures, and the improvements become more significant for higher noise conditions.
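The exact POS-DAE loss is not given in this listing. As a rough illustration of the underlying idea of replacing plain MSE with a perceptually weighted spectral error, a minimal PyTorch sketch might look like the following; the per-bin weights, tensor shapes, and the `perceptually_weighted_mse` function are assumptions, not the paper's formulation.

```python
import torch

def perceptually_weighted_mse(est, ref, weights):
    """MSE over spectral features, weighted per frequency bin so that errors
    in perceptually important bands cost more than under plain MSE.

    est, ref: (batch, frames, bins) log-magnitude spectra.
    weights:  (bins,) non-negative perceptual weights (illustrative).
    """
    err = (est - ref) ** 2                 # per-bin squared error
    return (err * weights).mean()          # weighted average over all elements

# Toy usage: emphasize low/mid frequency bins with a simple linear taper
bins = 257
weights = torch.linspace(1.5, 0.5, bins)   # assumed weighting, for illustration
est = torch.randn(8, 100, bins, requires_grad=True)
ref = torch.randn(8, 100, bins)
loss = perceptually_weighted_mse(est, ref, weights)
loss.backward()                            # differentiable, so trainable end to end
```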

55 citations

Posted Content
TL;DR: This work attempts to address the key challenges using transfer learning from adults' models to children's models in a Deep Neural Network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary.

Abstract: Children's speech recognition is challenging mainly due to the inherent high variability in children's physical and articulatory characteristics and expressions. This variability manifests in both acoustic constructs and linguistic usage due to the rapidly changing developmental stages in children's lives. Part of the challenge is due to the lack of large amounts of available children's speech data for efficient modeling. This work attempts to address the key challenges using transfer learning from adults' models to children's models in a Deep Neural Network (DNN) framework for a children's Automatic Speech Recognition (ASR) task, evaluated on multiple children's speech corpora with a large vocabulary. The paper presents a systematic and extensive analysis of the proposed transfer learning technique, considering the key factors affecting children's speech recognition from prior literature. Evaluations are presented on (i) comparisons of earlier GMM-HMM and newer DNN models, (ii) the effectiveness of standard adaptation techniques versus transfer learning, and (iii) various adaptation configurations for tackling the variabilities present in children's speech, in terms of (a) acoustic spectral variability and (b) pronunciation variability and linguistic constraints. Our analysis spans (i) the number of DNN model parameters (for adaptation), (ii) the amount of adaptation data, (iii) the ages of the children, and (iv) age-dependent versus age-independent adaptation. Finally, we provide recommendations on (i) favorable strategies across the analyzed parameters and (ii) potential future research directions and the relevant challenges/problems persisting in DNN-based ASR for children's speech.
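A minimal sketch of the kind of adult-to-child transfer learning described, assuming a PyTorch feed-forward acoustic model: the lower layers trained on adult speech are frozen and only the upper layers are fine-tuned on children's data. Layer sizes, the `adult_am.pt` checkpoint name, and the choice of which layers to freeze are all illustrative; the paper analyzes many such configurations.

```python
import torch
import torch.nn as nn

# Hypothetical adult-trained DNN acoustic model: stacked affine+ReLU layers
# mapping acoustic frames to senone posteriors (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(440, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 4000),                # senone output layer
)
model.load_state_dict(torch.load("adult_am.pt"))  # assumed checkpoint name

# Transfer learning: freeze the lower layers (generic feature extraction)
# and fine-tune only the upper layers on children's speech.
for param in model[:4].parameters():      # first two Linear+ReLU blocks
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def adapt_step(frames, senone_targets):
    """One fine-tuning step on a batch of children's speech frames."""
    optimizer.zero_grad()
    loss = criterion(model(frames), senone_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

How many layers to freeze, and how much adaptation data is needed per age group, are exactly the trade-offs the paper's analysis covers.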

30 citations


Cited by
Journal ArticleDOI
01 Apr 1956-Nature
TL;DR: A review of The Foundations of Statistics by Prof. Leonard J. Savage (Wiley Publications in Statistics, 1954).
Abstract: The Foundations of Statistics. By Prof. Leonard J. Savage. (Wiley Publications in Statistics.) Pp. xv + 294. (New York: John Wiley and Sons, Inc.; London: Chapman and Hall, Ltd., 1954.) 48s. net.

844 citations

Journal ArticleDOI
TL;DR: In this paper, an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) was proposed to reduce the gap between the model optimization and the evaluation criterion.
Abstract: A speech enhancement model maps noisy speech to clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in the existing literature there is an inconsistency between the model optimization criterion and the evaluation criterion for the enhanced speech. For example, speech intelligibility is mostly evaluated with the short-time objective intelligibility (STOI) measure, while the frame-based mean square error (MSE) between estimated and clean speech is widely used to optimize the model. Due to this inconsistency, there is no guarantee that the trained model provides optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and the evaluation criterion. Because of the utterance-based optimization, temporal correlation information of long speech segments, or even the entire utterance, can be considered to directly optimize perception-based objective functions. As an example, we implemented the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that the STOI of test speech processed by the proposed approach is better than that of conventional MSE-optimized speech, due to the consistency between the training and evaluation targets. Moreover, by integrating STOI into model optimization, the intelligibility of the enhanced speech for both human subjects and an automatic speech recognition system is substantially improved compared to speech generated under the minimum-MSE criterion.
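For evaluation, STOI can be computed with an off-the-shelf package such as pystoi (assuming it is installed); note that to optimize STOI during training, as the paper does, the measure has to be re-implemented with differentiable operations so gradients can reach the FCN. The signals below are placeholders.

```python
import numpy as np
from pystoi import stoi   # assumes the pystoi package is installed

fs = 16000
clean = np.random.randn(fs * 3)               # placeholder 3-second signals
enhanced = clean + 0.1 * np.random.randn(fs * 3)

# STOI between the clean reference and the enhanced output (higher is better).
score = stoi(clean, enhanced, fs, extended=False)
print(f"STOI: {score:.3f}")
```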

275 citations

Proceedings ArticleDOI
02 Sep 2018
TL;DR: An automated depression-detection algorithm is demonstrated that models interviews between an individual and agent and learns from sequences of questions and answers without the need to perform explicit topic modeling of the content.
Abstract: Medical professionals diagnose depression by interpreting the responses of individuals to a variety of questions, probing lifestyle changes and ongoing thoughts. Like professionals, an effective automated agent must understand that responses to queries have varying prognostic value. In this study we demonstrate an automated depression-detection algorithm that models interviews between an individual and an agent and learns from sequences of questions and answers without the need to perform explicit topic modeling of the content. We utilized data from 142 individuals undergoing depression screening, and modeled the interactions with audio and text features in a Long Short-Term Memory (LSTM) neural network model to detect depression. Our results were comparable to methods that explicitly modeled the topics of the questions and answers, which suggests that depression can be detected through sequential modeling of an interaction, with minimal information on the structure of the interview.
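A minimal sketch of this kind of sequential model, assuming each interview turn has already been reduced to a fused audio+text feature vector; the dimensions, the single-layer LSTM, and the `InterviewLSTM` class are illustrative, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class InterviewLSTM(nn.Module):
    """Binary depression classifier over a sequence of question/answer
    feature vectors (audio + text), one vector per interview turn."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, turns):               # turns: (batch, n_turns, feat_dim)
        _, (h_n, _) = self.lstm(turns)      # h_n: (1, batch, hidden)
        return self.head(h_n[-1])           # one logit per interview

model = InterviewLSTM()
logits = model(torch.randn(4, 20, 128))     # 4 interviews, 20 turns each
probs = torch.sigmoid(logits)               # depression probability per interview
```

Classifying from the final hidden state lets the model accumulate evidence across the whole question/answer sequence without ever modeling the topics explicitly.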

176 citations

Journal ArticleDOI
TL;DR: This paper proposes a combination of hand-crafted and deep-learned features that can effectively measure the severity of depression from speech, and proposes joint fine-tuning layers to combine the raw-waveform and spectrogram DCNNs to boost depression recognition performance.
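A minimal sketch of the fusion idea, assuming the hand-crafted descriptors and the embeddings from the raw-waveform and spectrogram DCNN branches are already computed; the `JointFusion` module below stands in for the paper's joint fine-tuning layers, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Concatenate hand-crafted descriptors with embeddings from two DCNN
    branches (raw waveform and spectrogram) and regress depression severity
    through shared, jointly fine-tuned layers."""
    def __init__(self, hand_dim=88, raw_dim=256, spec_dim=256):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(hand_dim + raw_dim + spec_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),               # scalar severity score
        )

    def forward(self, hand, raw_emb, spec_emb):
        return self.joint(torch.cat([hand, raw_emb, spec_emb], dim=-1))

model = JointFusion()
score = model(torch.randn(2, 88), torch.randn(2, 256), torch.randn(2, 256))
```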

133 citations