Open Access · Posted Content

AISPEECH-SJTU accent identification system for the Accented English Speech Recognition Challenge

TLDR
The AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge achieved the best accuracy on the challenge evaluation data, 83.63%.
Abstract
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech 2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR-based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS-based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test time augmentation and embedding fusion schemes to further improve system performance. Our final system is ranked first in the challenge and outperforms all other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.


Citations
Proceedings ArticleDOI

AISpeech-SJTU ASR System for the Accented English Speech Recognition Challenge

TL;DR: The AISpeech-SJTU ASR system for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC) achieved second place in the challenge, with a word error rate of 4.00% on the dev set and 4.47% on the test set.
Posted Content

E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

TL;DR: A single multi-task learning framework performs end-to-end speech recognition (ASR) and accent recognition (AR) simultaneously; the proposed framework is not only more compact but also yields comparable or even better results than standalone systems.
Posted Content

Accent Recognition with Hybrid Phonetic Features

TL;DR: The authors propose a hybrid structure that incorporates the embeddings of both a fixed acoustic model and a trainable acoustic model, making the language-related acoustic features more robust. The results demonstrate that their approach obtains a 6.57% relative improvement on the validation set.
Posted Content

Deep Discriminative Feature Learning for Accent Recognition.

Wei Wang, +2 more
25 Nov 2020
TL;DR: The authors adopt a Convolutional Recurrent Neural Network (CRNN) as the front-end encoder and integrate local features using an RNN to form an utterance-level accent representation.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE, a variation of Stochastic Neighbor Embedding, visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
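As a usage illustration of the technique summarized above, the following sketch assumes scikit-learn's `TSNE` implementation and uses synthetic data (the cluster construction and all parameter values are illustrative choices, not from the paper):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic high-dimensional data: two well-separated clusters of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.standard_normal((10, 50)),
    rng.standard_normal((10, 50)) + 8.0,
])

# Map the 50-dimensional points into a 2-D layout for visualization.
# Perplexity must be smaller than the number of samples.
Y = TSNE(n_components=2, perplexity=5, init="random", random_state=0).fit_transform(X)
print(Y.shape)  # (20, 2)
```

In an accent identification context, `X` would typically hold utterance embeddings and the 2-D map would reveal whether accents form separable clusters.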
Journal ArticleDOI

Squeeze-and-Excitation Networks

TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
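The squeeze-excite-recalibrate mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the weights are random placeholders rather than trained parameters, and the reduction ratio is an arbitrary choice:

```python
import numpy as np

def se_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation applied to a (C, H, W) feature map."""
    # Squeeze: global average pooling collapses each channel to one scalar.
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating.
    s = np.maximum(0.0, w1 @ z)                  # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # shape (C,)
    # Recalibrate: scale each channel by its gating weight.
    return x * s[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2                                      # channels and reduction ratio
x = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1      # placeholder bottleneck weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_block(x, w1, w2)
print(y.shape)  # (8, 16, 16)
```

The bottleneck (reduction by `r`) keeps the added parameter count small, which is why SE blocks can be dropped into existing architectures at minimal cost.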