Proceedings ArticleDOI

Investigation of Methods to Improve the Recognition Performance of Tamil-English Code-Switched Data in Transformer Framework

TL;DR: Two methods for Tamil-English CS speech recognition are investigated, namely, (i) well-trained encoders of monolingual Transformers as feature extractors to provide language discrimination, and (ii) language information as tokens at the targets. Results show that the second method handles CS efficiently, while the first is effective at discriminating between the languages.
Abstract: Code-switching (CS) refers to (inter/intra-word) switching between multiple languages in a single conversation. In multilingual countries like India, CS occurs very often in everyday speech, resulting in a new breed of languages in urban regions like Hinglish (Hindi-English), Tanglish (Tamil-English), etc. Research in Indic CS speech recognition is primarily affected by insufficient data. In this paper, we investigate methods to deal with such very low resource scenarios. Recently, Transformers have shown promising results on automatic speech recognition (ASR) tasks. In a Transformer based framework, we investigate two methods for Tamil-English CS speech recognition, namely, (i) well-trained encoders of Monolingual Transformers as feature extractors to provide language discrimination, (ii) language information as tokens at the targets. Our results show that CS is efficiently handled by the second method, while the first method was efficient in discriminating languages.
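As a concrete illustration of method (ii), the sketch below shows how language-ID tokens could be inserted into the target transcription before training. This is a plain-Python illustration only, not the authors' code; the <ta>/<en> tag names and the toy language lookup are assumptions.

# Minimal sketch (not the authors' code): augmenting target token sequences
# with language-ID tokens, as in method (ii). The <ta>/<en> tags and the
# lexicon-based language lookup below are illustrative assumptions.
def add_lid_tokens(target_tokens, lang_of):
    """Insert an LID token before every run of same-language tokens."""
    augmented, prev_lang = [], None
    for tok in target_tokens:
        lang = lang_of(tok)                # e.g. a script- or lexicon-based lookup
        if lang != prev_lang:
            augmented.append(f"<{lang}>")  # the language tag becomes a target token
            prev_lang = lang
        augmented.append(tok)
    return augmented

# Example on a romanized Tanglish utterance
tokens = ["naan", "office", "ku", "late", "aa", "vanthen"]
lang_of = lambda t: "en" if t in {"office", "late"} else "ta"
print(add_lid_tokens(tokens, lang_of))
# ['<ta>', 'naan', '<en>', 'office', '<ta>', 'ku', '<en>', 'late', '<ta>', 'aa', 'vanthen']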
Citations
Proceedings ArticleDOI
25 Oct 2020
TL;DR: This paper studies end-to-end models for Mandarin-English code-switching automatic speech recognition and proposes a bi-encoder transformer network based Mixture of Experts (MoE) architecture to better leverage external monolingual data.
Abstract: Code-switching speech recognition is a challenging task that has been studied in much previous work, and one main challenge for this task is the lack of code-switching data. In this paper, we study end-to-end models for Mandarin-English code-switching automatic speech recognition. External monolingual data are utilized to alleviate the data sparsity problem. More importantly, we propose a bi-encoder transformer network based Mixture of Experts (MoE) architecture to better leverage these data. We decouple Mandarin and English modeling with two separate encoders to better capture language-specific information, and a gating network is employed to explicitly handle the language identification task. For the gating network, different models and training modes are explored to learn better MoE interpolation coefficients. Experimental results show that compared with the baseline transformer model, the proposed MoE architecture can obtain up to 10.4% relative error reduction on the code-switching test set.
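The gist of the bi-encoder MoE can be sketched as follows (a PyTorch illustration with assumed model sizes, not the paper's implementation): two language-specific encoders process the same features, and a gating network predicts frame-level interpolation coefficients that mix the two experts.

import torch
import torch.nn as nn

class BiEncoderMoE(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4, layers=6):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.Sequential(nn.Linear(feat_dim, d_model),
                                 nn.TransformerEncoder(layer, layers))
        self.enc_zh = make_encoder()           # Mandarin-specific encoder
        self.enc_en = make_encoder()           # English-specific encoder
        self.gate = nn.Linear(2 * d_model, 2)  # gating network -> 2 coefficients

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h_zh, h_en = self.enc_zh(x), self.enc_en(x)
        g = torch.softmax(self.gate(torch.cat([h_zh, h_en], dim=-1)), dim=-1)
        # Interpolate the two experts per frame with the predicted gating weights.
        return g[..., 0:1] * h_zh + g[..., 1:2] * h_en

x = torch.randn(2, 100, 80)                    # dummy filterbank features
print(BiEncoderMoE()(x).shape)                 # torch.Size([2, 100, 256])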

44 citations


Cites background from "Investigation of Methods to Improve..."

  • ...Besides, there is also research on augmenting the output token set with LID tokens [28, 29]....

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, the authors explore the benefits of representing similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS).
Abstract: In many Indian languages, written characters are organized on sound phonetic principles, and the ordering of characters is the same across many of them. However, while training conventional end-to-end (E2E) multilingual speech recognition systems, we treat characters or target subword units from different languages as separate entities. Since the visual rendering of these characters is different, in this paper, we explore the benefits of representing such similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS). The CLS can be created very easily using automatic methods, since the ordering of characters is the same in many Indian languages. E2E models are trained using a transformer-based encoder-decoder architecture. During testing, given the Mel-filterbank features as input, the system outputs a sequence of BPE units in the CLS representation. Depending on the language, we then map the recognized CLS units back to the language-specific grapheme representation. Results show that models trained using the CLS improve over the monolingual baseline and a multilingual framework with separate symbols for each language. Similar experiments on a subset of the Voxforge dataset also confirm the benefits of CLS. An extension of this idea is to decode an unseen language (zero-resource) using the CLS-trained model.
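The CLS idea can be illustrated with a small hypothetical sketch (the paper's exact mapping may differ): because the Unicode blocks of many Indic scripts share the same character ordering, a grapheme can be mapped automatically to a language-neutral offset within its block and mapped back to any language's script.

# Hypothetical CLS mapping via Unicode block offsets (illustration only).
SCRIPT_BASE = {"ta": 0x0B80, "hi": 0x0900, "te": 0x0C00}  # Tamil, Devanagari, Telugu

def to_cls(text, lang):
    base = SCRIPT_BASE[lang]
    return [ord(ch) - base if 0 <= ord(ch) - base < 0x80 else ch for ch in text]

def from_cls(cls_units, lang):
    base = SCRIPT_BASE[lang]
    return "".join(chr(base + u) if isinstance(u, int) else u for u in cls_units)

word_ta = "\u0b85\u0b95\u0bae\u0bcd"   # the Tamil graphemes of "akam"
cls_units = to_cls(word_ta, "ta")      # language-neutral CLS indices
print(cls_units)                       # [5, 21, 46, 77]
print(from_cls(cls_units, "hi"))       # the same CLS units rendered in Devanagari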

16 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose s-vectors, speaker embeddings derived from the self-attention-based encoder of a Transformer trained for speaker classification, together with the Transformer encoder speaker authenticator (TESA) for speaker verification.
Abstract: One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from Transformer’s encoder trained for speaker classification. Self-attention, on which Transformer’s encoder is built, attends to all the features over the entire utterance and might be more suitable in capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they are obtained from an architecture that heavily relies on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to the s-vectors, we also propose a new architecture based on Transformer’s encoder for speaker verification as a replacement for speaker verification based on conventional probabilistic linear discriminant analysis (PLDA). This architecture is inspired by the next sentence prediction task of bidirectional encoder representations from Transformers (BERT), and we feed the s-vectors of two utterances to verify whether they belong to the same speaker. We name this architecture the Transformer encoder speaker authenticator (TESA). Our experiments show that the performance of s-vectors with TESA is better than s-vectors with conventional PLDA-based speaker verification.
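A rough sketch of the TESA verification step (assumed dimensions; a PyTorch illustration, not the authors' code): the s-vectors of two utterances are encoded jointly with a [CLS]-like token, in the spirit of BERT's next-sentence prediction, and a binary head decides whether they belong to the same speaker.

import torch
import torch.nn as nn

class TESA(nn.Module):
    def __init__(self, emb_dim=256, nhead=4, layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, emb_dim))  # [CLS]-like token
        layer = nn.TransformerEncoderLayer(emb_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(emb_dim, 2)                     # same / different speaker

    def forward(self, s_vec_a, s_vec_b):      # each: (batch, emb_dim)
        batch = s_vec_a.size(0)
        seq = torch.cat([self.cls.expand(batch, -1, -1),
                         s_vec_a.unsqueeze(1), s_vec_b.unsqueeze(1)], dim=1)
        out = self.encoder(seq)               # (batch, 3, emb_dim)
        return self.head(out[:, 0])           # decision from the [CLS] position

logits = TESA()(torch.randn(8, 256), torch.randn(8, 256))
print(logits.shape)                           # torch.Size([8, 2])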

10 citations

Proceedings ArticleDOI
29 Jun 2022
TL;DR: A language-specific characteristic assistance (LSCA) method is proposed to mitigate these problems; with it, code-switching speech recognition can be handled well using two pre-trained LSMs, without extra shared parameters or even retraining.
Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method.
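The decoding side of LSCA can be sketched as a simple interpolation of output distributions (the weights and shapes below are assumptions, not the paper's values):

import torch

def lsca_combine(p_mix, p_lsm_zh, p_lsm_en, w_mix=0.6, w_zh=0.2, w_en=0.2):
    """Each input: (batch, vocab) posterior from one model at the current decoding step."""
    return w_mix * p_mix + w_zh * p_lsm_zh + w_en * p_lsm_en

# Dummy posteriors over a shared vocabulary
vocab = 5000
p_mix, p_zh, p_en = (torch.softmax(torch.randn(1, vocab), dim=-1) for _ in range(3))
p_final = lsca_combine(p_mix, p_zh, p_en)     # still a valid distribution (weights sum to 1)
print(p_final.argmax(dim=-1))                 # token chosen from the combined prediction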

9 citations

Posted Content
TL;DR: This paper proposes deriving speaker embeddings, named s-vectors, from the output of a trained Transformer encoder after statistics pooling to obtain utterance-level features.
Abstract: X-vectors have become the standard for speaker embeddings in automatic speaker verification. X-vectors are obtained using a Time-Delay Neural Network (TDNN) with context over several frames. We have explored the use of an architecture built on self-attention, which attends to all the features over the entire utterance and hence better captures speaker-level characteristics. We have used the encoder structure of Transformers, which is built on self-attention, as the base architecture and trained it to do a speaker classification task. In this paper, we have proposed to derive speaker embeddings from the output of the trained Transformer encoder structure after appropriate statistics pooling to obtain utterance-level features. We have named the speaker embeddings from this structure s-vectors. s-vectors outperform x-vectors with relative improvements of 10% and 15% in EER when trained on the Voxceleb-1 and Voxceleb-1+2 datasets, respectively. We have also investigated the effect of deriving s-vectors from different layers of the model.
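A minimal sketch of this pipeline (assumed layer sizes; a PyTorch illustration, not the authors' code): frame-level Transformer-encoder outputs are reduced by statistics pooling (mean and standard deviation over time) and projected to an utterance-level embedding, the s-vector.

import torch
import torch.nn as nn

class SVectorExtractor(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, emb_dim=256, nhead=4, layers=4):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.proj_out = nn.Linear(2 * d_model, emb_dim)  # after mean+std pooling

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        h = self.encoder(self.proj_in(x))     # frame-level representations
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        return self.proj_out(stats)           # utterance-level s-vector

s_vec = SVectorExtractor()(torch.randn(2, 300, 80))
print(s_vec.shape)                            # torch.Size([2, 256])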

9 citations


Cites methods from "Investigation of Methods to Improve..."

  • ...Transformers have also been successfully used in both Automatic Speech Recognition (ASR) and Text To Speech (TTS) tasks [12, 13, 14, 15]....

References
Proceedings Article
12 Jun 2017
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
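The core operation of this architecture, scaled dot-product attention, can be sketched in a few lines (PyTorch; shapes are illustrative):

import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k); returns the attended values."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 4, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 10, 64])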

52,856 citations

Proceedings Article
01 Jan 2011
TL;DR: This paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Abstract: We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and a comprehensive set of scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

5,857 citations


"Investigation of Methods to Improve..." refers methods in this paper

  • ...All experiments were conducted using Kaldi [15] and Espnet [16] tool-kits in a joint CTC/attention framework [17]....

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby removing the need for pre-segmented training data and for post-processing of the outputs.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
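A brief usage sketch of a CTC loss on unsegmented targets (using PyTorch's nn.CTCLoss as a stand-in; all sizes are arbitrary):

import torch
import torch.nn as nn

T, B, C = 50, 4, 30                         # frames, batch, classes (class 0 = blank)
logits = torch.randn(T, B, C, requires_grad=True)  # stand-in for per-frame RNN outputs
log_probs = logits.log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 12))      # unsegmented label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                             # gradients flow back to the sequence model
print(float(loss))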

5,188 citations


"Investigation of Methods to Improve..." refers background in this paper

  • ...Unlike the Encoder-Decoder, where the output at each time instant depends on past outputs, this is not the case in Connectionist Temporal Classification (CTC) [8]....

  • ...CTC was shown to be helpful in the case of CS scenario in [9]....

  • ...All experiments were conducted using Kaldi [15] and Espnet [16] tool-kits in a joint CTC/attention framework [17]....

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Listen, Attend and Spell (LAS) is presented, a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognizers.
Abstract: We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters, and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
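One pyramidal step of the listener can be sketched as follows (assumed dimensions; an illustration, not the original implementation): consecutive BLSTM output frames are concatenated in pairs, halving the time resolution before the next layer.

import torch
import torch.nn as nn

class PyramidalBLSTMStep(nn.Module):
    def __init__(self, input_dim=256, hidden_dim=128):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, frames, input_dim)
        h, _ = self.blstm(x)                 # (batch, frames, 2 * hidden_dim)
        b, t, d = h.shape
        h = h[:, : t - t % 2]                # drop an odd trailing frame if present
        return h.reshape(b, t // 2, 2 * d)   # pair adjacent frames -> half the time steps

x = torch.randn(2, 100, 256)                 # dummy projected filterbank features
print(PyramidalBLSTMStep()(x).shape)         # torch.Size([2, 50, 512])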

2,279 citations


"Investigation of Methods to Improve..." refers methods in this paper

  • ...The model was trained based on Listen, Attend and Spell attention mechanism [2]....

Proceedings ArticleDOI
30 Mar 2018
TL;DR: This article introduces ESPnet, a new open-source platform for end-to-end speech processing that mainly focuses on end-to-end automatic speech recognition (ASR) and adopts the widely used dynamic neural network toolkits Chainer and PyTorch as its main deep learning engine.
Abstract: This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.

806 citations