Proceedings ArticleDOI

Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages

Abstract: 
The recent success of the Transformer-based sequence-to-sequence framework on various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers to low-resource Indian languages in a multilingual framework. We explore several methods of incorporating language information into a multilingual Transformer: (i) at the decoder and (ii) at the encoder. These methods include using language-identity tokens or providing language information to the acoustic vectors, either as a one-hot vector or as a learned language embedding. In our experiments, providing language identity always improved performance. The language embedding learned from our proposed approach, when added to the acoustic feature vector, gave the best result. The proposed approach with retraining gave 6%-11% relative improvements in character error rate over the monolingual baseline.
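The two conditioning schemes the abstract mentions can be illustrated with a minimal sketch. This is not the authors' code: the dimensions, the random embedding table, and the frame-wise addition are assumptions chosen only to show the shapes involved in (a) appending a one-hot language vector to each acoustic frame versus (b) adding a learned language embedding of the same dimension as the features.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LANGS = 3    # e.g. three Indian languages in the multilingual pool (assumed)
FEAT_DIM = 80    # e.g. 80-dim filterbank features per frame (assumed)
NUM_FRAMES = 5

feats = rng.standard_normal((NUM_FRAMES, FEAT_DIM))  # one utterance's frames
lang_id = 1                                          # language of this utterance

# (a) One-hot conditioning: concatenate a one-hot language vector to every
# frame, growing the feature dimension from FEAT_DIM to FEAT_DIM + NUM_LANGS.
one_hot = np.eye(NUM_LANGS)[lang_id]
feats_one_hot = np.concatenate(
    [feats, np.tile(one_hot, (NUM_FRAMES, 1))], axis=1)

# (b) Learned embedding: a trainable lookup table of per-language vectors with
# the same dimension as the acoustic features, added to every frame. (Here the
# table is random; in training it would be learned jointly with the model.)
lang_embed_table = rng.standard_normal((NUM_LANGS, FEAT_DIM)) * 0.1
feats_embedded = feats + lang_embed_table[lang_id]

print(feats_one_hot.shape)   # (5, 83)
print(feats_embedded.shape)  # (5, 80)
```

Scheme (b) keeps the input dimension unchanged, which is why an embedding can be added to the acoustic vector rather than concatenated; the paper reports that this learned-embedding variant gave the best result.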


Citations
Journal ArticleDOI

S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

TL;DR: In this article, the Transformer Encoder Speaker Authenticator (TESA) is proposed to generate speaker embeddings from self-attention in the encoder of a Transformer.
Proceedings ArticleDOI

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

TL;DR: Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level and shows superior performance on both monolingual and multilingual ASR tasks.
Posted Content

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification.

TL;DR: This paper proposes deriving speaker embeddings, termed s-vectors, by applying statistics pooling to the output of a trained Transformer encoder to obtain utterance-level features.
Proceedings ArticleDOI

Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages.

TL;DR: This paper proposes a multilingual acoustic modeling approach for Indian languages using a Multitask Learning (MTL) framework, exploring language-specific phoneme recognition as an auxiliary task alongside the primary task of multilingual senone classification.
Proceedings ArticleDOI

Using Large Self-Supervised Models for Low-Resource Speech Recognition

TL;DR: This work investigates the effectiveness of many self-supervised pre-trained models for the low-resource speech recognition task on three Indian languages: Telugu, Tamil, and Gujarati. It carefully analyzes the generalization capability of multilingual pre-trained models for both seen and unseen languages.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

TL;DR: Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers is presented.
Proceedings ArticleDOI

ESPnet: End-to-End Speech Processing Toolkit

TL;DR: In this article, a new open source platform for end-to-end speech processing named ESPnet is introduced, which mainly focuses on automatic speech recognition (ASR), and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine.
Journal ArticleDOI

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, benefiting from both multi-objective learning and joint decoding without linguistic resources.