Proceedings ArticleDOI

Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages

Abstract: 
The recent success of the Transformer-based sequence-to-sequence framework on various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers to low-resource Indian languages in a multilingual framework. We explore several methods of incorporating language information into a multilingual Transformer: (i) at the decoder and (ii) at the encoder. These methods include using language-identity tokens or providing language information to the acoustic vectors, either as a one-hot vector or as a learned language embedding. In our experiments, providing language identity always improved performance. The language embedding learned from our proposed approach, when added to the acoustic feature vector, gave the best result. The proposed approach with retraining gave 6%-11% relative improvements in character error rate over the monolingual baseline.
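The two conditioning schemes the abstract mentions can be illustrated with a minimal sketch. This is not the authors' code: the dimensions, the random embedding table, and the frame-wise addition are assumptions chosen only to show the shapes involved in (a) appending a one-hot language vector to each acoustic frame versus (b) adding a learned language embedding of the same dimension as the features.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LANGS = 3    # e.g. three Indian languages in the multilingual pool (assumed)
FEAT_DIM = 80    # e.g. 80-dim filterbank features per frame (assumed)
NUM_FRAMES = 5

feats = rng.standard_normal((NUM_FRAMES, FEAT_DIM))  # one utterance's frames
lang_id = 1                                          # language of this utterance

# (a) One-hot conditioning: concatenate a one-hot language vector to every
# frame, growing the feature dimension from FEAT_DIM to FEAT_DIM + NUM_LANGS.
one_hot = np.eye(NUM_LANGS)[lang_id]
feats_one_hot = np.concatenate(
    [feats, np.tile(one_hot, (NUM_FRAMES, 1))], axis=1)

# (b) Learned embedding: a trainable lookup table of per-language vectors with
# the same dimension as the acoustic features, added to every frame. (Here the
# table is random; in training it would be learned jointly with the model.)
lang_embed_table = rng.standard_normal((NUM_LANGS, FEAT_DIM)) * 0.1
feats_embedded = feats + lang_embed_table[lang_id]

print(feats_one_hot.shape)   # (5, 83)
print(feats_embedded.shape)  # (5, 80)
```

Scheme (b) keeps the input dimension unchanged, which is why an embedding can be added to the acoustic vector rather than concatenated; the paper reports that this learned-embedding variant gave the best result.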


Citations
Journal ArticleDOI

S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

TL;DR: In this article, the Transformer Encoder Speaker Authenticator (TESA) is proposed to generate speaker embeddings from self-attention in the encoder of a Transformer.
Proceedings ArticleDOI

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

TL;DR: Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level and shows superior performance on both monolingual and multilingual ASR tasks.
Posted Content

S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification.

TL;DR: This paper proposes deriving speaker embeddings, termed s-vectors, by applying statistics pooling to the output of a trained Transformer encoder to obtain utterance-level features.
Proceedings ArticleDOI

Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages.

TL;DR: This paper proposes a multilingual acoustic modeling approach for Indian languages using a Multitask Learning (MTL) framework, exploring language-specific phoneme recognition as an auxiliary task alongside the primary task of multilingual senone classification.
Proceedings ArticleDOI

Using Large Self-Supervised Models for Low-Resource Speech Recognition

TL;DR: This work investigates the effectiveness of many self-supervised pre-trained models for the low-resource speech recognition task on three Indian languages: Telugu, Tamil, and Gujarati. It carefully analyzes the generalization capability of multilingual pre-trained models for both seen and unseen languages.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

TL;DR: Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers is presented.
Proceedings ArticleDOI

ESPnet: End-to-End Speech Processing Toolkit

TL;DR: In this article, a new open source platform for end-to-end speech processing named ESPnet is introduced, which mainly focuses on automatic speech recognition (ASR), and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine.
Journal ArticleDOI

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, benefiting from both multi-objective learning and joint decoding without linguistic resources.