Proceedings ArticleDOI
Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages
Vishwas M. Shetty, Metilda Sagaya Mary N. J, Srinivasan Umesh, et al.
pp. 8279-8283
TL;DR: The proposed approach with retraining gave 6%-11% relative improvements in character error rate over the monolingual baseline, and the language embedding learned from the proposed approach, when added to the acoustic feature vectors, gave the best result.
Abstract: The recent success of the Transformer-based sequence-to-sequence framework on various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers to low-resource Indian languages in a multilingual framework. We explore various methods of incorporating language information into a multilingual Transformer, i.e., (i) at the decoder and (ii) at the encoder. These methods include using language identity tokens or providing language information along with the acoustic vectors. Language information can be added to the acoustic vectors either as a one-hot vector or as a learned language embedding. From our experiments, we observed that providing language identity always improved performance. The language embedding learned from our proposed approach, when added to the acoustic feature vectors, gave the best result. The proposed approach with retraining gave 6%-11% relative improvements in character error rate over the monolingual baseline.
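The best-performing method in the abstract, adding a learned language embedding to the acoustic feature vectors before the encoder, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the names (`lang_embedding`, `add_language_embedding`), the tiny feature dimension, and the random initialisation are all hypothetical, and in practice the embeddings would be learned jointly with the Transformer.

```python
import random

FEAT_DIM = 4                      # toy acoustic feature dimension (e.g. 80 fbank dims in practice)
LANGUAGES = ["ta", "te", "gu"]    # illustrative language IDs (Tamil, Telugu, Gujarati)

# One embedding vector per language; randomly initialised here for illustration.
# During training these would be parameters updated by backpropagation.
random.seed(0)
lang_embedding = {
    lang: [random.gauss(0.0, 0.1) for _ in range(FEAT_DIM)]
    for lang in LANGUAGES
}

def add_language_embedding(frames, lang):
    """Add the language's embedding element-wise to every acoustic frame."""
    emb = lang_embedding[lang]
    return [[f + e for f, e in zip(frame, emb)] for frame in frames]

# A dummy 2-frame utterance; the conditioned frames are what the encoder would see.
utterance = [[0.1, 0.2, 0.3, 0.4],
             [0.5, 0.6, 0.7, 0.8]]
conditioned = add_language_embedding(utterance, "ta")
```

The one-hot variant mentioned in the abstract would instead concatenate a fixed indicator vector to each frame; the learned embedding lets the model place related languages near each other in the embedding space.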
Citations
Journal ArticleDOI
S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder
TL;DR: In this article, the Transformer encoder speaker authenticator (TESA) is proposed to generate speaker embeddings from self-attention in the encoder of a Transformer.
Proceedings ArticleDOI
LAE: Language-Aware Encoder for Monolingual and Multilingual ASR
TL;DR: Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks.
Posted Content
S-vectors: Speaker Embeddings based on Transformer's Encoder for Text-Independent Speaker Verification.
TL;DR: This paper proposes deriving speaker embeddings, termed s-vectors, from the output of a trained Transformer encoder, with statistics pooling applied to obtain utterance-level features.
Proceedings ArticleDOI
Multilingual Speech Recognition Using Language-Specific Phoneme Recognition as Auxiliary Task for Indian Languages.
Hardik B. Sailor, Thomas Hain, et al.
TL;DR: This paper proposes a multilingual acoustic modeling approach for Indian languages using a Multitask Learning (MTL) framework, exploring language-specific phoneme recognition as an auxiliary task alongside the primary task of multilingual senone classification.
Proceedings ArticleDOI
Using Large Self-Supervised Models for Low-Resource Speech Recognition
TL;DR: This work investigates the effectiveness of several self-supervised pre-trained models for low-resource speech recognition in three Indian languages: Telugu, Tamil, and Gujarati, and carefully analyzes the generalization capability of multilingual pre-trained models for both seen and unseen languages.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, et al.
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Proceedings ArticleDOI
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
TL;DR: Listen, Attend and Spell (LAS) is presented: a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognizers.
Proceedings ArticleDOI
ESPnet: End-to-End Speech Processing Toolkit
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Yalta, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai, et al.
TL;DR: In this article, a new open-source platform for end-to-end speech processing named ESPnet is introduced, which mainly focuses on automatic speech recognition (ASR) and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines.
Journal ArticleDOI
Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
TL;DR: The proposed hybrid CTC/attention end-to-end ASR is applied to two large-scale ASR benchmarks and exhibits performance comparable to conventional DNN/HMM ASR systems, owing to the advantages of both multiobjective learning and joint decoding, without requiring linguistic resources.
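The multiobjective learning that the hybrid CTC/attention reference describes is commonly formulated as a weighted interpolation of the two losses. The sketch below is an illustrative formulation under that standard assumption, not code from the cited paper; `lambda_ctc` and `hybrid_loss` are hypothetical names.

```python
def hybrid_loss(loss_ctc, loss_att, lambda_ctc=0.3):
    """Multiobjective training loss: lambda * L_CTC + (1 - lambda) * L_attention.

    lambda_ctc in [0, 1] trades off the CTC branch (monotonic alignment)
    against the attention decoder branch.
    """
    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_att

combined = hybrid_loss(2.0, 1.0, lambda_ctc=0.5)
```

Joint decoding applies the same idea at inference time, interpolating the two branches' scores when ranking beam-search hypotheses.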