Open Access Posted Content
AISPEECH-SJTU accent identification system for the Accented English Speech Recognition Challenge
TL;DR: The AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge achieved the best accuracy of 83.63% on the challenge evaluation data.
Abstract:
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR-based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS-based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test time augmentation and embedding fusion schemes to further improve system performance. Our final system ranked first in the challenge and outperformed all other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.
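The abstract names test time augmentation and embedding fusion but does not say how either is implemented. The sketch below is only one plausible reading, not the authors' method: it assumes augmentation by speed perturbation, averaging of class posteriors across the perturbed copies, and fusion by concatenating length-normalized embeddings. The names `speed_perturb`, `tta_average`, `fuse_embeddings`, and the `score_fn` callback are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def speed_perturb(signal: Sequence[float], factor: float) -> List[float]:
    """Resample a waveform by linear interpolation (assumed augmentation)."""
    if factor <= 0 or len(signal) == 0:
        raise ValueError("need a positive factor and a non-empty signal")
    n_out = max(1, int(round(len(signal) / factor)))
    out = []
    for i in range(n_out):
        # Position in the original signal, clamped to the last sample.
        pos = min(i * factor, len(signal) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


def tta_average(
    score_fn: Callable[[List[float]], List[float]],
    signal: Sequence[float],
    factors: Sequence[float] = (0.9, 1.0, 1.1),
) -> List[float]:
    """Average the class posteriors that score_fn returns for each perturbed copy."""
    posteriors = [score_fn(speed_perturb(signal, f)) for f in factors]
    n_classes = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / len(posteriors) for c in range(n_classes)]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate length-normalized embeddings from several models (assumed fusion)."""
    fused: List[float] = []
    for emb in embeddings:
        fused.extend(l2_normalize(emb))
    return fused
```

In this reading, `score_fn` stands in for any trained accent classifier that maps a waveform to one posterior per accent class, and the fused vector would be passed to a final classifier. Averaging posteriors and concatenating embeddings are common default choices; the paper may use different ones.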
Citations
Proceedings Article
AISpeech-SJTU ASR System for the Accented English Speech Recognition Challenge
TL;DR: The AISpeech-SJTU ASR system for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC) placed second in the challenge, with a word error rate of 4.00% on the dev set and 4.47% on the test set.
Posted Content
E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition
TL;DR: This paper proposes a single multi-task learning framework that performs end-to-end speech recognition (ASR) and accent recognition (AR) simultaneously; the framework is more compact than standalone systems and yields comparable or even better results.
Posted Content
Accent Recognition with Hybrid Phonetic Features
TL;DR: The authors propose a hybrid structure that incorporates the embeddings of both a fixed acoustic model and a trainable acoustic model, making the language-related acoustic feature more robust. The results demonstrate that the approach obtains a 6.57% relative improvement on the validation set.
Posted Content
Deep Discriminative Feature Learning for Accent Recognition
Wei Wang, Chao Zhang, Xiaopei Wu +2 more
TL;DR: In this paper, the authors adopt a Convolutional Recurrent Neural Network (CRNN) as the front-end encoder and integrate local features using an RNN to form an utterance-level accent representation.
References
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: A deep convolutional neural network achieved state-of-the-art performance on ImageNet classification; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Journal Article
Visualizing Data using t-SNE
TL;DR: t-SNE is a technique that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Journal Article
Squeeze-and-Excitation Networks
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. It finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.