
Showing papers by "Goutam Saha" published in 2021


Journal ArticleDOI
TL;DR: The study indicates that liking influences both classification performance and the temporal dynamics of emotional experience across the affective scales, and observes an inverted-U relationship between the level of liking and the classification performance for arousal and dominance.

38 citations


Proceedings ArticleDOI
27 Jan 2021
TL;DR: In this paper, the authors explore the constant-Q transform (CQT) for speech emotion recognition (SER); the CQT provides variable spectro-temporal resolution, with higher frequency resolution at lower frequencies.
Abstract: In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). CQT-based time-frequency analysis provides variable spectro-temporal resolution, with higher frequency resolution at lower frequencies. Since the lower-frequency regions of a speech signal carry more emotion-related information than the higher-frequency regions, the increased low-frequency resolution of the CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on the STFT and the CQT for SER, with a deep neural network (DNN) as the back-end classifier, and we optimize the relevant parameters for both feature types. The CQT-based features outperform the STFT-based spectral features in our SER experiments. Further cross-corpora experiments demonstrate that the CQT-based systems generalize better with out-of-domain training data.
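
As a rough illustration of the feature comparison described above, the following sketch extracts log-magnitude STFT and CQT features with librosa. The file name and all analysis parameters are illustrative assumptions, not the paper's optimized settings.

```python
import numpy as np
import librosa

# Hypothetical input file; 16 kHz is a common SER sampling rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# STFT: fixed time-frequency resolution across all frequencies.
stft_feat = librosa.amplitude_to_db(
    np.abs(librosa.stft(y, n_fft=512, hop_length=512)))

# CQT: geometrically spaced bins, so frequency resolution is finer at
# low frequencies, where emotion-related cues are concentrated.
# hop_length must be divisible by 2**(n_octaves - 1); 512 works for
# 7 octaves (84 bins at 12 bins/octave starting from ~32.7 Hz).
cqt_feat = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=512, fmin=32.7,
                       n_bins=84, bins_per_octave=12)))

# Either (bins x frames) matrix would then be fed to the DNN classifier.
print(stft_feat.shape, cqt_feat.shape)
```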

12 citations


Journal ArticleDOI
TL;DR: A frame selection strategy is introduced for improved detection of spoofed speech: rather than scoring all frames, only selected frames of the test utterance, chosen by their individual log-likelihood ratios, are used for scoring.
Abstract: In this paper, we introduce a frame selection strategy for improved detection of spoofed speech. A countermeasure (CM) system typically uses a Gaussian mixture model (GMM) based classifier to compute log-likelihood scores, and the average log-likelihood ratio over all speech frames of a test utterance serves as the decision score. In contrast to this standard approach, we propose to score only selected speech frames of the test utterance. We present two simple and computationally efficient frame selection strategies based on the log-likelihood ratios of the individual frames. Performance is evaluated with constant-Q cepstral coefficients as the front-end features and a two-class GMM as the back-end classifier. We conduct experiments on the speech corpora from the ASVspoof 2015, 2017, and 2019 challenges. The results show that the proposed scoring techniques substantially outperform conventional scoring on both the development and evaluation sets of the ASVspoof 2015 corpus, while no noticeable gain is observed on the ASVspoof 2017 and ASVspoof 2019 corpora. We further conduct experiments with partially spoofed data, created by combining natural and spoofed speech; in this scenario, the proposed methods demonstrate considerable improvement over the baseline.
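
A minimal sketch of the scoring idea, using scikit-learn GMMs: per-frame log-likelihood ratios are computed against natural and spoofed models, and the utterance score averages either all frames (conventional) or a selected subset. The selection rule shown here (keep the frames with the largest absolute LLR) is one plausible instance only; the paper's two strategies are not spelled out in this abstract.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_cm(natural_feats, spoofed_feats, n_components=512):
    """Fit one GMM per class on frame-level features (n_frames x n_dims)."""
    gmm_nat = GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(natural_feats)
    gmm_spf = GaussianMixture(n_components=n_components,
                              covariance_type="diag").fit(spoofed_feats)
    return gmm_nat, gmm_spf

def utterance_score(test_feats, gmm_nat, gmm_spf, keep_frac=None):
    # Per-frame log-likelihood ratio of natural vs. spoofed models.
    llr = gmm_nat.score_samples(test_feats) - gmm_spf.score_samples(test_feats)
    if keep_frac is None:          # conventional scoring: average all frames
        return llr.mean()
    k = max(1, int(keep_frac * len(llr)))
    selected = np.abs(llr).argsort()[-k:]   # most confident frames
    return llr[selected].mean()    # proposed: score only the selected frames
```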

7 citations


Journal ArticleDOI
TL;DR: A novel signal-processing method is proposed for automatic extraction of LSCs through automated segmentation of the LSS, without using any additional sensor, and is found to outperform a recently proposed method.

2 citations


Posted Content
TL;DR: In this report, the ABSP Laboratory team develops a simple and efficient solution for acoustic-domain-dependent speech diarization for the third edition of the DIHARD Speech Diarization Challenge.
Abstract: This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our main contribution is a simple and efficient solution for acoustic-domain-dependent speech diarization. We explore speaker embeddings for the acoustic domain identification (ADI) task, and our study reveals that the i-vector based method achieves considerably better performance than the x-vector based approach on the third DIHARD challenge dataset. Next, we integrate the ADI module with the diarization framework. Performance improves substantially over the baseline when the thresholds for agglomerative hierarchical clustering and the parameters for dimensionality reduction during scoring are optimized per acoustic domain. We achieve relative DER improvements of 9.63% and 10.64% for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
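
The per-domain tuning idea can be sketched as follows: the ADI module predicts an acoustic domain, and the clustering threshold and dimensionality-reduction setting are then chosen for that domain. The domain names and parameter values below are hypothetical placeholders, not the paper's tuned values.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

# Hypothetical per-domain settings: (AHC stopping threshold, PCA dimension).
DOMAIN_PARAMS = {
    "broadcast_interview": (0.55, 30),
    "restaurant": (0.70, 20),
}

def diarize(embeddings, domain):
    """Cluster per-segment speaker embeddings (n_segments x dim) into speakers."""
    threshold, pca_dim = DOMAIN_PARAMS[domain]
    reduced = PCA(n_components=pca_dim).fit_transform(embeddings)
    dists = pdist(reduced, metric="cosine")       # pairwise segment distances
    tree = linkage(dists, method="average")       # agglomerative clustering
    return fcluster(tree, t=threshold, criterion="distance")  # speaker labels
```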

1 citation

