Author

M Mano Ranjith Kumar

Bio: M Mano Ranjith Kumar is an academic researcher from the Indian Institute of Technology Madras. The author has contributed to research in the topics of Speech synthesis and Spectrogram, has an h-index of 2, and has co-authored 4 publications receiving 4 citations.

Papers
Proceedings ArticleDOI
05 Dec 2020
TL;DR: In this paper, a neural network system is proposed that uses a time-delay neural network (TDNN) to model temporal information and a long short-term memory (LSTM) layer to model spatial information.
Abstract: Automatic detection of seizures from EEG signals is an important problem of interest for clinical institutions. EEG is a temporal signal collected from multiple spatial sources around the scalp. Efficient modeling of both temporal and spatial information is important for identifying seizures from EEG. In this paper, we propose a neural network system that uses a time-delay neural network (TDNN) to model temporal information and a long short-term memory (LSTM) layer to model spatial information. On the development subset of the Temple University seizure dataset, the proposed system achieved a sensitivity of 23.32% with 11.13 false alarms per 24 hours.
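Below is a minimal sketch of how such a TDNN-plus-LSTM arrangement could look, assuming the TDNN is realized as dilated 1-D convolutions over time and the LSTM runs across the electrode (spatial) axis; the layer sizes, feature dimensions, and electrode count are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class TDNNLSTMSeizureDetector(nn.Module):
    """Hypothetical sketch: TDNN (dilated 1-D convs) over time per electrode,
    then an LSTM across electrodes to aggregate spatial information."""
    def __init__(self, feat_dim=8, tdnn_dim=64, lstm_dim=128, n_classes=2):
        super().__init__()
        self.tdnn = nn.Sequential(  # temporal modelling
            nn.Conv1d(feat_dim, tdnn_dim, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(tdnn_dim, tdnn_dim, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.lstm = nn.LSTM(tdnn_dim, lstm_dim, batch_first=True)  # spatial modelling
        self.classifier = nn.Linear(lstm_dim, n_classes)

    def forward(self, x):
        # x: (batch, n_electrodes, time, feat_dim), e.g. per-electrode frame features
        b, c, t, f = x.shape
        x = x.reshape(b * c, t, f).transpose(1, 2)   # (b*c, feat_dim, time)
        h = self.tdnn(x).mean(dim=-1)                # temporal pooling -> (b*c, tdnn_dim)
        out, _ = self.lstm(h.reshape(b, c, -1))      # LSTM over the electrode axis
        return self.classifier(out[:, -1])           # seizure / background logits

# Example: batch of 4 recordings, 21 electrodes, 100 frames of 8-dim features each
logits = TDNNLSTMSeizureDetector()(torch.randn(4, 21, 100, 8))
```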

3 citations

Posted Content
TL;DR: In this article, a zero-resource speech synthesis system is proposed, where speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training.
Abstract: A spoken dialogue system for an unseen language is referred to as zero-resource speech. It is especially beneficial for developing applications for languages that have few digital resources. Zero-resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. Using the acoustic unit sequence, TTS models are trained. The main goal of this work is to improve the synthesis quality of zero-resource TTS systems. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by unit-sequence-to-spectrogram mapping, and finally spectrogram-to-speech inversion. Modifications are proposed to the spectrogram mapping stage. These modifications include training the mapping on voice data, using x-vectors to improve the mapping, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the ZeroSpeech 2020 challenge shows that good-quality synthesis can be achieved.
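A minimal data-flow sketch of the three-stage pipeline described above, with k-means standing in for acoustic unit discovery and Griffin-Lim for spectrogram-to-speech inversion; the paper's actual components (iterative unit training and a learned neural mapping) are considerably more involved, and all function names here are hypothetical.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def discover_units(mfcc_frames, n_units=64):
    """Stage 1 (stand-in): cluster frame-level features into a discrete unit inventory."""
    km = KMeans(n_clusters=n_units, n_init=10).fit(mfcc_frames)
    return km, km.predict(mfcc_frames)              # unit label per frame

def units_to_spectrogram(unit_seq, unit_to_mel):
    """Stage 2 (stand-in): map each discovered unit to a mel-spectrogram frame.
    A lookup table is used here; the paper trains a neural mapping instead."""
    return np.stack([unit_to_mel[u] for u in unit_seq], axis=1)   # (n_mels, T)

def invert_spectrogram(mel, sr=16000):
    """Stage 3 (stand-in): spectrogram-to-speech inversion via Griffin-Lim."""
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)
```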

3 citations

Proceedings ArticleDOI
25 Oct 2020
TL;DR: The main goal of this work is to improve the synthesis quality of zero-resource TTS systems; modifications are proposed to the spectrogram mapping stage.
Abstract: A spoken dialogue system for an unseen language is referred to as zero-resource speech. It is especially beneficial for developing applications for languages that have few digital resources. Zero-resource speech synthesis is the task of building text-to-speech (TTS) models in the absence of transcriptions. In this work, speech is modelled as a sequence of transient and steady-state acoustic units, and a unique set of acoustic units is discovered by iterative training. Using the acoustic unit sequence, TTS models are trained. The main goal of this work is to improve the synthesis quality of zero-resource TTS systems. Four different systems are proposed. All the systems consist of three stages: unit discovery, followed by unit-sequence-to-spectrogram mapping, and finally spectrogram-to-speech inversion. Modifications are proposed to the spectrogram mapping stage. These modifications include training the mapping on voice data, using x-vectors to improve the mapping, two-stage learning, and gender-specific modelling. Evaluation of the proposed systems in the ZeroSpeech 2020 challenge shows that good-quality synthesis can be achieved.
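One of the listed modifications is conditioning the unit-to-spectrogram mapping on x-vectors. A minimal sketch of how such conditioning could be wired is shown below, assuming a simple recurrent mapper that concatenates a speaker x-vector to every unit embedding; the dimensions and architecture are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class XVectorConditionedMapper(nn.Module):
    """Hypothetical unit-sequence-to-mel mapper conditioned on an x-vector."""
    def __init__(self, n_units=64, unit_dim=128, xvec_dim=512, n_mels=80):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)
        self.rnn = nn.GRU(unit_dim + xvec_dim, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, unit_ids, xvec):
        # unit_ids: (batch, T) discovered unit indices; xvec: (batch, xvec_dim)
        u = self.unit_emb(unit_ids)                         # (batch, T, unit_dim)
        x = xvec.unsqueeze(1).expand(-1, u.size(1), -1)     # broadcast speaker vector
        h, _ = self.rnn(torch.cat([u, x], dim=-1))          # (batch, T, 512)
        return self.proj(h)                                 # predicted mel frames

mel = XVectorConditionedMapper()(torch.randint(0, 64, (2, 50)), torch.randn(2, 512))
```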

2 citations

Proceedings ArticleDOI
25 Oct 2020
TL;DR: This paper proposes a technique that combines the classical parametric HMM-based TTS framework (HTS) with the neural-network-based Waveglow vocoder using histogram equalization (HEQ) in a low-resource environment; results indicate that the synthesis quality of the hybrid system is better than that of the conventional HTS system.
Abstract: Conventional text-to-speech (TTS) synthesis requires extensive linguistic processing to produce quality output. The advent of end-to-end (E2E) systems has shifted the paradigm, yielding better synthesized voices. However, hidden Markov model (HMM) based systems are still popular due to their fast synthesis time, robustness to limited training data, and flexible adaptation of voice characteristics, speaking styles, and emotions. This paper proposes a technique that combines the classical parametric HMM-based TTS framework (HTS) with the neural-network-based Waveglow vocoder using histogram equalization (HEQ) in a low-resource environment. The two paradigms are combined by performing HEQ across mel-spectrograms extracted from HTS-generated audio and the source spectra of the training data. During testing, the synthesized mel-spectrograms are mapped to the source spectrograms using the learned HEQ. Experiments are carried out on the Hindi male and female datasets of the Indic TTS database. Systems are evaluated based on degradation mean opinion scores (DMOS). Results indicate that the synthesis quality of the hybrid system is better than that of the conventional HTS system. These results are quite promising, as they pave the way to good-quality TTS systems that need less data than E2E systems.
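A minimal sketch of the histogram-equalization idea, assuming a simple per-mel-bin quantile-matching implementation learned from HTS-generated versus natural spectra and applied to synthesized spectrograms before vocoding; the paper's exact binning and mapping recipe may differ.

```python
import numpy as np

def learn_heq(hts_mels, natural_mels, n_points=100):
    """Learn a per-mel-bin mapping from HTS statistics to natural statistics.
    hts_mels, natural_mels: arrays of shape (n_mels, n_frames)."""
    qs = np.linspace(0, 100, n_points)
    hts_q = np.percentile(hts_mels, qs, axis=1)      # (n_points, n_mels)
    nat_q = np.percentile(natural_mels, qs, axis=1)
    return hts_q, nat_q

def apply_heq(mel, hts_q, nat_q):
    """Map each synthesized mel value onto the natural distribution, bin by bin."""
    out = np.empty_like(mel)
    for b in range(mel.shape[0]):
        out[b] = np.interp(mel[b], hts_q[:, b], nat_q[:, b])
    return out
```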

2 citations

Proceedings ArticleDOI
04 Mar 2022
TL;DR: This paper describes the system architectures and models submitted by the team “IISERB Brains” to the SemEval 2022 Task 6 competition, and reports additional models and results obtained through experiments after the organizers published the gold labels of the evaluation data.
Abstract: This paper describes the system architectures and the models submitted by our team “IISERB Brains” to the SemEval 2022 Task 6 competition. We participated in all three sub-tasks for the English dataset. On the leaderboard, we ranked 19th out of 43 teams for sub-task A, 8th out of 22 teams for sub-task B, and 13th out of 16 teams for sub-task C. Apart from the submitted results and models, we also report the other models and results that we obtained through our experiments after the organizers published the gold labels of the evaluation data. All of our code and links to additional resources are available on GitHub for reproducibility.

1 citation


Cited by
Journal ArticleDOI
TL;DR: This article proposes a novel approach that combines textual and audio features to detect sarcasm in conversational data, taking as input a combined vector of audio and text features extracted from their respective models.
Abstract: In the modern era, posting sarcastic comments on social media has become a common trend. Sarcasm is often used by people to taunt or pester others. It is frequently expressed through inflexion and tonal stress in speech, or through lexical, pragmatic, and hyperbolic features present in text. Most existing work has focused on detecting sarcasm either in textual data using text features or in audio data using audio features. This article proposes a novel approach that combines textual and audio features to detect sarcasm in conversational data. The hybrid method takes as input a combined vector of audio and text features extracted from their respective models. The combined features compensate for the shortcomings of text-only features and vice versa. The hybrid model significantly outperforms both individual models.
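A minimal sketch of the late-fusion idea described above, assuming the text and audio models each produce a fixed-size feature vector that is concatenated and passed to a small classifier; the dimensions and classifier layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridSarcasmClassifier(nn.Module):
    """Hypothetical fusion head: concatenate text and audio feature vectors."""
    def __init__(self, text_dim=768, audio_dim=128, hidden=256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),                 # sarcastic vs. not sarcastic
        )

    def forward(self, text_feats, audio_feats):
        return self.fusion(torch.cat([text_feats, audio_feats], dim=-1))

# Example: a batch of 8 utterances with pre-extracted text and audio features
logits = HybridSarcasmClassifier()(torch.randn(8, 768), torch.randn(8, 128))
```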

9 citations

Posted Content
TL;DR: The Zero Resource Speech Challenge 2020 aims at learning speech representations from raw audio signals without any labels; it combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks that tap into two levels of speech representation.
Abstract: We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
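The first task rewards low bit-rate unit inventories. A rough sketch of how the bitrate of a discretized unit sequence can be estimated (symbols per second times the entropy of the symbol distribution) is shown below; the official challenge evaluation script may differ in details such as duration accounting.

```python
import math
from collections import Counter

def unit_bitrate(unit_seq, duration_seconds):
    """Estimate the bitrate of a discrete unit sequence in bits per second."""
    counts = Counter(unit_seq)
    total = len(unit_seq)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return (total / duration_seconds) * entropy

# Example: 200 units drawn from a 50-symbol inventory over 2 seconds of speech
print(unit_bitrate([i % 50 for i in range(200)], 2.0))
```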

7 citations

Journal ArticleDOI
TL;DR: A novel recurrence plot (RP)-based time-distributed convolutional neural network and long short-term memory (CNN-LSTM) algorithm is proposed for the integrated classification of fNIRS-EEG for hybrid BCI applications; the results confirm the viability of the RP-based deep-learning algorithm for successful BCI systems.
Abstract: The constantly evolving human–machine interaction and advancement in sociotechnical systems have made it essential to analyze vital human factors such as mental workload, vigilance, fatigue, and stress by monitoring brain states for optimum performance and human safety. Similarly, brain signals have become paramount for rehabilitation and assistive purposes in fields such as brain–computer interface (BCI) and closed-loop neuromodulation for neurological disorders and motor disabilities. The complexity, non-stationary nature, and low signal-to-noise ratio of brain signals pose significant challenges for researchers designing robust and reliable BCI systems that accurately detect meaningful changes in brain states outside the laboratory environment. Different neuroimaging modalities are used in hybrid settings to enhance accuracy, increase control commands, and decrease the time required for brain activity detection. Functional near-infrared spectroscopy (fNIRS) and electroencephalography (EEG) measure the hemodynamic and electrical activity of the brain with good spatial and temporal resolution, respectively. However, in hybrid settings, where both modalities enhance the output performance of BCI, their data compatibility remains a challenge for real-time BCI applications due to the large discrepancy between their sampling rates and numbers of channels. Traditional methods, such as downsampling and channel selection, result in important information loss while making both modalities compatible. In this study, we present a novel recurrence plot (RP)-based time-distributed convolutional neural network and long short-term memory (CNN-LSTM) algorithm for the integrated classification of fNIRS-EEG for hybrid BCI applications. The acquired brain signals are first projected into a non-linear dimension with RPs and fed into the CNN to extract essential features without performing any downsampling. Then, LSTM is used to learn the chronological features and time-dependence relation to detect brain activity. The average accuracies achieved with the proposed model were 78.44% for fNIRS, 86.24% for EEG, and 88.41% for hybrid EEG-fNIRS BCI. Moreover, the maximum accuracies achieved were 85.9%, 88.1%, and 92.4%, respectively. The results confirm the viability of the RP-based deep-learning algorithm for successful BCI systems.
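A minimal sketch of the recurrence-plot-plus-time-distributed-CNN-LSTM idea, assuming thresholded pairwise distances as the RP and illustrative layer sizes and window lengths; this is not the authors' exact architecture.

```python
import numpy as np
import torch
import torch.nn as nn

def recurrence_plot(window, eps=0.1):
    """Thresholded pairwise-distance matrix of a 1-D signal window."""
    d = np.abs(window[:, None] - window[None, :])
    return (d < eps).astype(np.float32)

class RPCNNLSTM(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(           # applied to every RP in the sequence
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 * 4 * 4, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, rps):
        # rps: (batch, steps, 1, size, size) -- one recurrence plot per time step
        b, s = rps.shape[:2]
        feats = self.cnn(rps.flatten(0, 1)).reshape(b, s, -1)  # time-distributed CNN
        out, _ = self.lstm(feats)                              # chronological modelling
        return self.head(out[:, -1])

# Example: 2 trials, each split into 10 windows of 64 samples
windows = np.random.randn(2, 10, 64)
rps = torch.tensor(np.stack([[recurrence_plot(w) for w in seq] for seq in windows]))
logits = RPCNNLSTM()(rps.unsqueeze(2))                         # add channel dimension
```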

5 citations

Proceedings Article
21 Jan 2022
TL;DR: This work extensively compares multiple state-of-the-art models and signal feature extractors in a real-time seizure detection framework suitable for real-world application, using various evaluation metrics, including a new one that evaluates more practical aspects of seizure detection models.
Abstract: Electroencephalogram (EEG) is an important diagnostic test that physicians use to record brain activity and detect seizures by monitoring the signals. There have been several attempts to detect seizures and abnormalities in EEG signals with modern deep learning models to reduce the clinical burden. However, they cannot be fairly compared against each other as they were tested in distinct experimental settings. Also, some of them are not trained on real-time seizure detection tasks, which makes them hard to use in on-device applications. In this work, for the first time, we extensively compare multiple state-of-the-art models and signal feature extractors in a real-time seizure detection framework suitable for real-world application, using various evaluation metrics, including a new one we propose to evaluate more practical aspects of seizure detection models.
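For context, a minimal sketch of the streaming, window-by-window scoring that a real-time detection setting implies; the window length, hop, threshold, and stand-in model below are all assumptions, not the benchmark's actual framework.

```python
import numpy as np

def stream_detect(eeg, model, fs=256, win_s=4.0, hop_s=1.0, threshold=0.8):
    """eeg: (n_channels, n_samples); model(window) -> seizure probability.
    Scores the signal window by window, as samples would arrive in real time."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    alarms = []
    for start in range(0, eeg.shape[1] - win + 1, hop):
        prob = model(eeg[:, start:start + win])
        if prob > threshold:
            alarms.append(start / fs)            # alarm time in seconds
    return alarms

# Example with a toy stand-in "model" that scores mean absolute amplitude
alarms = stream_detect(np.random.randn(21, 256 * 60),
                       model=lambda w: float(np.abs(w).mean() > 0.9))
```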

4 citations

Journal ArticleDOI
TL;DR: An overview of the six editions of the Zero Resource Speech Challenge series since 2015 is presented, the lessons learned are discussed, and the areas which need more work or give puzzling results are outlined.
Abstract: Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks—Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling—and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.

4 citations