Author

Nirmesh J. Shah

Bio: Nirmesh J. Shah is an academic researcher from the Indian Institute of Chemical Technology. The author has contributed to research in the topics of speech synthesis and speech production, has an h-index of 8, and has co-authored 28 publications receiving 196 citations. Previous affiliations of Nirmesh J. Shah include the Dhirubhai Ambani Institute of Information and Communication Technology.

Papers
Proceedings ArticleDOI
01 Nov 2013
TL;DR: A consortium effort on building text-to-speech (TTS) systems for 13 Indian languages using the same common framework; the TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER).
Abstract: In this paper, we discuss a consortium effort on building text-to-speech (TTS) systems for 13 Indian languages. There are about 1652 Indian languages. A unified framework is therefore attempted for building TTSes for Indian languages. As Indian languages are syllable-timed, a syllable-based framework is developed. As quality of speech synthesis is of paramount interest, unit-selection synthesizers are built. Building TTS systems for low-resource languages requires that the data be carefully collected and annotated, as the database has to be built from scratch. Various criteria have to be addressed while building the database, namely, speaker selection, pronunciation variation, optimal text selection, handling of out-of-vocabulary words, and so on. The various characteristics of the voice that affect speech synthesis quality are first analysed. Next, the design of the corpus of each of the Indian languages is tabulated. The collected data is labeled at the syllable level using a semi-automatic labeling tool. Text-to-speech synthesizers are built for all 13 languages, namely, Hindi, Tamil, Marathi, Bengali, Malayalam, Telugu, Kannada, Gujarati, Rajasthani, Assamese, Manipuri, Odia, and Bodo, using the same common framework. The TTS systems are evaluated using degradation Mean Opinion Score (DMOS) and Word Error Rate (WER). An average DMOS score of ≈3.0 and an average WER of about 20% are observed across all the languages.
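As a reading aid, here is a minimal sketch (not from the paper) of the syllable-level unit-selection search the abstract describes: each target syllable is matched against candidate database units via a target cost plus a join cost to the previously chosen unit. The feature vectors, the weight w_join, and the greedy left-to-right search are illustrative simplifications; production systems search the full candidate lattice with Viterbi.

```python
# Minimal sketch of syllable-level unit selection: for each target
# syllable, pick the candidate whose target cost plus join cost to the
# previous pick is smallest. Features and weights are illustrative.
import numpy as np

def select_units(targets, candidates, w_join=0.5):
    """targets: list of target feature vectors (numpy, one per syllable).
    candidates: per-position lists of candidate feature vectors.
    Greedy left-to-right search; real systems run Viterbi over the lattice."""
    path, prev = [], None
    for tgt, cands in zip(targets, candidates):
        costs = []
        for c in cands:
            target_cost = np.linalg.norm(tgt - c)  # spectral match to the target
            # concatenation smoothness with the previously selected unit
            join_cost = 0.0 if prev is None else np.linalg.norm(prev - c)
            costs.append(target_cost + w_join * join_cost)
        best = int(np.argmin(costs))
        path.append(best)
        prev = cands[best]
    return path
```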

42 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: From the subjective and objective evaluations, it is observed that the Viterbi-based and the PLPCC-based STM segmentation algorithms work better than the other algorithms.
Abstract: In this paper, the use of a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for the task of speech data labeling is attempted. In the STM framework, we propose the use of several spectral features, such as the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC), and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC), for the phonetic segmentation task. Evaluating the effectiveness of these segmentation algorithms requires accurate, manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). In order to measure the effectiveness of the various segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has been built. From the subjective and objective evaluations, it is observed that the Viterbi-based and the PLPCC-based STM segmentation algorithms work better than the other algorithms.
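The spectral transition measure at the heart of this approach has a simple form: at each frame, fit a regression slope to every cepstral trajectory over a small window, and take the mean squared slope; large values mark spectral transitions (candidate phone boundaries). Below is a hedged sketch of that computation; the window size M and the peak-picking threshold are illustrative choices, not the paper's settings.

```python
# Sketch of the spectral transition measure (STM) for phonetic
# segmentation: per frame, regress each cepstral coefficient's
# trajectory over a (2M+1)-frame window; STM is the mean squared slope.
import numpy as np

def spectral_transition_measure(ceps, M=3):
    """ceps: (n_frames, n_coeffs) array of cepstral features (MFCC/PLPCC/...).
    Returns an (n_frames,) STM contour."""
    n, d = ceps.shape
    m = np.arange(-M, M + 1)
    denom = np.sum(m ** 2)
    stm = np.zeros(n)
    for t in range(M, n - M):
        window = ceps[t - M:t + M + 1]   # (2M+1, d)
        slopes = m @ window / denom      # regression slope per coefficient
        stm[t] = np.mean(slopes ** 2)
    return stm

def pick_peaks(stm, thresh):
    # Candidate boundaries: local maxima of the contour above a threshold.
    return [t for t in range(1, len(stm) - 1)
            if stm[t] > thresh and stm[t] >= stm[t - 1] and stm[t] >= stm[t + 1]]
```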

22 citations

Proceedings ArticleDOI
02 Sep 2018
TL;DR: The objective and subjective evaluations performed on the proposed system justify the ability of adversarial optimization over Maximum Likelihood (ML)-based optimization networks, such as a Deep Neural Network (DNN), in preserving and improving speech quality and intelligibility.
Abstract: The murmur produced by the speaker and captured by the Non-Audible Murmur (NAM) microphone, one of the Silent Speech Interface (SSI) techniques, suffers from speech quality degradation. This is due to the lack of radiation effect at the lips and the low-pass nature of the soft tissue, which attenuates the high-frequency information. In this work, a novel method for NAM-to-Whisper (NAM2WHSP) speech conversion incorporating a Generative Adversarial Network (GAN) is proposed. The GAN minimizes the distributional divergence between the whispered speech and the generated speech parameters (through adversarial optimization). The objective and subjective evaluations performed on the proposed system justify the ability of adversarial optimization over Maximum Likelihood (ML)-based optimization networks, such as a Deep Neural Network (DNN), in preserving and improving speech quality and intelligibility. The adversarial optimization learns the mapping function with a 54.2% relative improvement in MOS and a 29.83% absolute reduction in WER w.r.t. the state-of-the-art mapping techniques. Furthermore, we evaluated the proposed framework by analyzing the level of contextual information and the number of training utterances required for optimizing the network parameters for the given task and database.
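To make the ML-vs-adversarial contrast concrete, here is a schematic PyTorch training step (not the paper's code): a generator maps NAM spectral features to whispered-speech features, and a discriminator pushes the generated feature distribution toward the real whisper distribution. The architectures, feature dimensions, and loss weighting are placeholders.

```python
# Schematic GAN training step for a NAM-to-whisper feature mapping.
# Dimensions and architectures are illustrative placeholders.
import torch
import torch.nn as nn

D_IN, D_OUT = 40, 40  # hypothetical spectral feature dimensions
G = nn.Sequential(nn.Linear(D_IN, 256), nn.ReLU(), nn.Linear(256, D_OUT))
D = nn.Sequential(nn.Linear(D_OUT, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(nam_feats, whisper_feats):
    # Discriminator: classify real whisper frames vs generated ones.
    fake = G(nam_feats).detach()
    loss_d = bce(D(whisper_feats), torch.ones(len(whisper_feats), 1)) + \
             bce(D(fake), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: adversarial term (fool the discriminator) plus an MSE
    # term, the ML-style objective a plain DNN mapper would train with.
    gen = G(nam_feats)
    loss_g = bce(D(gen), torch.ones(len(gen), 1)) + \
             nn.functional.mse_loss(gen, whisper_feats)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```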

19 citations

Journal ArticleDOI
01 Jun 2022
TL;DR: A Multi-modal Fusion Network (M2FNet) is proposed that extracts emotion-relevant features from the visual, audio, and text modalities and employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
Abstract: Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modalities. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.
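For intuition about multi-head attention-based fusion, here is a small sketch in the spirit of what the abstract describes, not the paper's exact architecture: per-utterance text, audio, and visual embeddings are stacked as a length-3 sequence and allowed to attend to one another. Layer sizes and the mean pooling are illustrative assumptions.

```python
# Sketch of multi-head attention fusion over three modality embeddings.
# Sizes and pooling are illustrative, not the M2FNet architecture.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio, visual):
        # Stack modalities as a length-3 "sequence" so each modality can
        # attend to the others; residual + norm as in a transformer layer.
        x = torch.stack([text, audio, visual], dim=1)  # (batch, 3, dim)
        fused, _ = self.attn(x, x, x)
        fused = self.norm(x + fused)
        return fused.mean(dim=1)                       # (batch, dim) fused feature

# Usage: emotion_logits = classifier(FusionBlock()(t, a, v))
```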

18 citations

Proceedings ArticleDOI
17 Aug 2013
TL;DR: A Gaussian-based segmentation method is proposed for automatic segmentation of speech at the syllable level, and it is observed that the percentage correctness of labeled data is around 80% for both male and female voices, as compared to 70% for group delay-based labeling.
Abstract: A text-to-speech (TTS) synthesizer has proved to be an aiding tool for many visually challenged people, enabling reading through auditory feedback. TTS synthesizers are available in English; however, it has been observed that people feel more comfortable hearing their own native language. Keeping this point in mind, a Gujarati TTS synthesizer has been built. This TTS system has been built in the Festival speech synthesis framework. The syllable is taken as the basic unit in building the Gujarati TTS synthesizer, as Indian languages are syllabic in nature. Building the unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus. The task of labeling is time-consuming, tedious, and requires large manual effort. Therefore, in this work, an attempt has been made to reduce these efforts by automatically generating a labeled corpus at the syllable level. To that end, a Gaussian-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of labeled data is around 80% for both male and female voices, as compared to 70% for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5% for the Gaussian filter-based TTS system compared to the group delay-based TTS system. Also, a 5% increase is observed in correctly synthesized words. The main focus of this work is to reduce the manual efforts required in building a TTS system for Gujarati, which are primarily the efforts required in labeling speech data.
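A common way such filter-based syllable segmentation works is to smooth the short-term energy contour with a Gaussian kernel and take its valleys (low-energy dips between syllable nuclei) as boundaries. The sketch below illustrates that idea under stated assumptions; the frame length, kernel width, and valley criterion are illustrative, and the paper's exact filter design may differ.

```python
# Sketch of Gaussian-filter-based syllable segmentation: smooth the
# short-term log-energy contour with a Gaussian kernel and treat local
# minima as syllable boundaries. Parameters are illustrative.
import numpy as np

def syllable_boundaries(signal, sr, frame_ms=20, sigma_frames=10):
    hop = int(sr * frame_ms / 1000)
    frames = signal[:len(signal) // hop * hop].reshape(-1, hop)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # short-term log energy
    # Gaussian kernel for smoothing the energy contour
    t = np.arange(-3 * sigma_frames, 3 * sigma_frames + 1)
    kernel = np.exp(-t ** 2 / (2 * sigma_frames ** 2))
    kernel /= kernel.sum()
    smooth = np.convolve(energy, kernel, mode="same")
    # Valleys (local minima) of the smoothed contour mark boundaries.
    valleys = [i for i in range(1, len(smooth) - 1)
               if smooth[i] < smooth[i - 1] and smooth[i] < smooth[i + 1]]
    return [v * hop / sr for v in valleys]  # boundary times in seconds
```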

15 citations


Cited by
Journal ArticleDOI
TL;DR: An overview of real-world applications of voice conversion (VC) systems is provided, existing systems proposed in the literature are extensively studied, and remaining challenges are discussed.

232 citations

Posted Content
TL;DR: From the results of crowd-sourced listening tests, it is observed that VC methods have progressed rapidly thanks to advanced deep learning methods, and that for the cross-lingual task the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task.
Abstract: The voice conversion challenge is a biennial scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods. In particular, the speaker similarity scores of several systems turned out to be as high as those of the target speakers in the intra-lingual semi-parallel VC task. However, we confirmed that none of them have achieved human-level naturalness yet for the same task. The cross-lingual conversion task is, as expected, more difficult, and the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task. However, we observed encouraging results, and the MOS scores of the best systems were higher than 4.0. We also show a few additional analysis results to aid in understanding cross-lingual VC better.

124 citations