Proceedings ArticleDOI
Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages
Vishwas M. Shetty, Metilda Sagaya Mary N. J, Srinivasan Umesh +2 more
- pp 8279-8283
TL;DR: The proposed approach with retraining gave 6%-11% relative improvements in character error rates over the monolingual baseline, and the language embedding learned from the proposed approach, when added to the acoustic feature vector, gave the best result.
Abstract:
The recent success of the Transformer based sequence-to-sequence framework for various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers on low resource Indian languages in a multilingual framework. We explore various methods to incorporate language information into a multilingual Transformer, i.e., (i) at the decoder, (ii) at the encoder. These methods include using language identity tokens or providing language information to the acoustic vectors. Language information to the acoustic vectors can be given in the form of one hot vector or by learning a language embedding. From our experiments, we observed that providing language identity always improved performance. The language embedding learned from our proposed approach, when added to the acoustic feature vector, gave the best result. The proposed approach with retraining gave 6% - 11% relative improvements in character error rates over the monolingual baseline.
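The two encoder-side conditioning variants described in the abstract can be sketched in a few lines. The NumPy toy below is an illustrative assumption, not the paper's implementation: dimensions, names, and the randomly initialized embedding table are made up. It shows (1) concatenating a one-hot language vector to each acoustic frame and (2) adding a learned language embedding to it.

```python
import numpy as np

rng = np.random.default_rng(0)

num_langs = 4    # hypothetical number of languages in the multilingual pool
feat_dim = 80    # per-frame acoustic feature dimension (e.g. filterbanks)
frames = 120     # frames in one utterance

feats = rng.standard_normal((frames, feat_dim))

# Variant 1: concatenate a one-hot language vector to every frame.
def with_one_hot(features, lang_id, num_langs):
    one_hot = np.zeros((features.shape[0], num_langs))
    one_hot[:, lang_id] = 1.0
    return np.concatenate([features, one_hot], axis=1)

# Variant 2: add a learned language embedding to every frame (here
# randomly initialized; in training it would be updated jointly with
# the Transformer).
lang_table = rng.standard_normal((num_langs, feat_dim)) * 0.01

def with_embedding(features, lang_id, table):
    return features + table[lang_id]

x1 = with_one_hot(feats, lang_id=2, num_langs=num_langs)   # (frames, 84)
x2 = with_embedding(feats, lang_id=2, table=lang_table)    # (frames, 80)
```

Note the design difference: the one-hot variant grows the input dimension by the number of languages, while the embedding variant keeps the input dimension fixed and lets the model learn how languages relate.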
Citations
Proceedings ArticleDOI
Exploring the use of Common Label Set to Improve Speech Recognition of Low Resource Indian Languages
TL;DR: In this article, the authors explore the benefits of representing similar target subword units (e.g., Byte Pair Encoded (BPE) units) through a Common Label Set (CLS).
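The Common Label Set idea can be illustrated with a toy mapping (the characters and labels below are a made-up two-script example, not the paper's actual mapping): acoustically similar units from different Indic scripts collapse onto one shared label, so multilingual training pools over a single target inventory.

```python
# Toy common label set: Devanagari and Telugu characters representing
# the same phoneme map onto one shared label.
common_label_set = {
    "क": "ka", "క": "ka",   # /ka/ in Devanagari and Telugu
    "म": "ma", "మ": "ma",   # /ma/ in Devanagari and Telugu
}

def to_common_labels(text):
    """Map each character to its shared label; pass unknowns through."""
    return [common_label_set.get(ch, ch) for ch in text]

hindi = to_common_labels("कम")    # Devanagari input
telugu = to_common_labels("కమ")   # Telugu input
# Both scripts now yield the identical target sequence ['ka', 'ma'],
# so utterances from both languages train the same output labels.
```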
Journal ArticleDOI
Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models
Jing Zhao,Wei-Qiang Zhang +1 more
TL;DR: This paper exploits and analyzes a series of wav2vec pre-trained models for speech recognition in 15 low-resource languages in the OpenASR21 Challenge, and investigates data utilization, multilingual learning, and the use of a phoneme-level recognition task in fine-tuning.
Posted Content
Transfer Learning Approaches for Streaming End-to-End Speech Recognition System
TL;DR: This paper presents a comparative study of four different TL methods for the RNN-T framework, showing 17% relative word error rate reduction with different TL methods over a randomly initialized RNN-T model and demonstrating the efficacy of TL for languages with small amounts of training data.
Proceedings ArticleDOI
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems
Proceedings Article
A Survey of Multilingual Models for Automatic Speech Recognition
Hemant Yadav, Soundarya Sitaram +1 more
TL;DR: The state of the art in multilingual ASR models built with cross-lingual transfer in mind is surveyed, best practices for building multilingual models are distilled from research across diverse languages and techniques, and recommendations for future work are provided.
References
Proceedings ArticleDOI
Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers
TL;DR: It is shown that hidden layers shared across languages can be transferred to improve recognition accuracy for new languages, with relative error reductions ranging from 6% to 28% against DNNs trained without the transferred hidden layers.
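The shared-hidden-layer setup can be sketched as a network whose hidden layer is reused across languages while each language keeps its own output layer. The NumPy fragment below is a minimal illustration; layer sizes, language codes, and senone counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, hid_dim = 40, 64

# Hidden layer shared by all languages (trained multilingually, then
# transferable to a new language as initialization).
W_shared = rng.standard_normal((in_dim, hid_dim)) * 0.1

# Language-specific output layers (one set of softmax weights per language).
out_dims = {"hi": 300, "ta": 280}   # hypothetical senone counts
W_out = {lang: rng.standard_normal((hid_dim, n)) * 0.1
         for lang, n in out_dims.items()}

def forward(x, lang):
    h = np.tanh(x @ W_shared)       # shared multilingual representation
    logits = h @ W_out[lang]        # language-specific head
    return logits

x = rng.standard_normal((5, in_dim))
y_hi = forward(x, "hi")
y_ta = forward(x, "ta")
```

Transferring to a new language would amount to keeping `W_shared` and training only a fresh output head (plus optional fine-tuning of the shared layers).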
Proceedings ArticleDOI
A Comparative Study on Transformer vs RNN in Speech Applications
Shigeki Karita, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Yalta, Ryuichi Yamamoto +12 more
TL;DR: Transformer is an emergent sequence-to-sequence model that achieves state-of-the-art performance in neural machine translation and other natural language processing applications; this paper compares it against RNN across speech applications, including automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS), finding Transformer superior in 13 of 15 ASR benchmarks.
Proceedings ArticleDOI
Multilingual training of deep neural networks
TL;DR: This work investigates multilingual modeling in the context of a DNN-hidden Markov model (HMM) hybrid, where the DNN outputs are used as the HMM state likelihoods, and proposes that training the hidden layers on multiple languages makes them more suitable for cross-lingual transfer.
Proceedings ArticleDOI
Multilingual Speech Recognition with a Single End-to-End Model
Shubham Toshniwal, Tara N. Sainath, Ron Weiss, Bo Li, Pedro J. Moreno, Eugene Weinstein, Kanishka Rao +6 more
TL;DR: This model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually; augmenting the inputs with language identity improves performance by an additional 7% relative and eliminates confusion between different languages.