Open Access · Posted Content

AISPEECH-SJTU accent identification system for the Accented English Speech Recognition Challenge

TLDR
The AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge achieved the best accuracy on the challenge evaluation data, 83.63%.
Abstract
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR-based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS-based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test time augmentation and embedding fusion schemes to further improve system performance. Our final system is ranked first in the challenge and outperforms all other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.


Citations
Proceedings ArticleDOI

AISpeech-SJTU ASR System for the Accented English Speech Recognition Challenge

TL;DR: The AISpeech-SJTU ASR system for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC) achieved second place in the challenge, with a word error rate of 4.00% on the dev set and 4.47% on the test set.
Posted Content

E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

TL;DR: A single multi-task learning framework performs end-to-end speech recognition (ASR) and accent recognition (AR) simultaneously; the proposed framework is not only more compact but also yields comparable or even better results than standalone systems.
Posted Content

Accent Recognition with Hybrid Phonetic Features

TL;DR: The authors propose a hybrid structure that incorporates the embeddings of both a fixed acoustic model and a trainable acoustic model, making the language-related acoustic features more robust. The results demonstrate that their approach obtains a 6.57% relative improvement on the validation set.
Posted Content

Deep Discriminative Feature Learning for Accent Recognition.

Wei Wang, +2 more
25 Nov 2020
TL;DR: The authors adopt a Convolutional Recurrent Neural Network (CRNN) as the front-end encoder and integrate local features using an RNN to form an utterance-level accent representation.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE, a variation of Stochastic Neighbor Embedding, visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
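As a usage illustration of the technique summarized above, the following sketch assumes scikit-learn's `TSNE` implementation and uses synthetic data (the cluster construction and all parameter values are illustrative choices, not from the paper):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic high-dimensional data: two well-separated clusters of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.standard_normal((10, 50)),
    rng.standard_normal((10, 50)) + 8.0,
])

# Map the 50-dimensional points into a 2-D layout for visualization.
# Perplexity must be smaller than the number of samples.
Y = TSNE(n_components=2, perplexity=5, init="random", random_state=0).fit_transform(X)
print(Y.shape)  # (20, 2)
```

In an accent identification context, `X` would typically hold utterance embeddings and the 2-D map would reveal whether accents form separable clusters.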
Journal ArticleDOI

Squeeze-and-Excitation Networks

TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
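The squeeze-excite-recalibrate mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the weights are random placeholders rather than trained parameters, and the reduction ratio is an arbitrary choice:

```python
import numpy as np

def se_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation applied to a (C, H, W) feature map."""
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating.
    s = np.maximum(0.0, w1 @ z)                  # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # shape (C,)
    # Recalibrate: scale each channel by its gating weight.
    return x * s[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2                                      # channels and reduction ratio
x = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1      # placeholder bottleneck weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, w1, w2)
print(y.shape)  # (8, 16, 16)
```

The bottleneck (reduction by `r`) keeps the added parameter count small, which is why SE blocks can be dropped into existing architectures at minimal cost.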