
Showing papers by "Pedro J. Moreno published in 2016"


Journal ArticleDOI
TL;DR: This work presents a comprehensive study of deep neural networks for automatic language identification, including a detailed performance analysis across data selection strategies and DNN architectures, and introduces a novel approach that combines DNN and i-vector systems by using bottleneck features.
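As an illustration of the bottleneck idea mentioned in the TL;DR, the sketch below runs a toy feedforward DNN and returns the activations of a narrow hidden layer as features for a downstream i-vector backend. The layer sizes, the 64-dimensional bottleneck, and the NumPy formulation are assumptions for illustration only, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 40-dim acoustic input -> wide hidden -> narrow
# 64-dim bottleneck -> wide hidden -> 8 language targets.
sizes = [40, 512, 64, 512, 8]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, return_bottleneck=False):
    """Run the toy DNN; optionally return the bottleneck activations."""
    h = x
    for i, w in enumerate(weights):
        h = h @ w
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)        # ReLU on hidden layers
        if return_bottleneck and i == 1:  # activations of the 64-unit bottleneck layer
            return h
    return h

frames = rng.normal(size=(100, 40))            # 100 frames of 40-dim features
bottleneck_feats = forward(frames, return_bottleneck=True)
print(bottleneck_feats.shape)                  # (100, 64) features for an i-vector system
```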

67 citations


Proceedings ArticleDOI
Mohamed G. Elfeky, Meysam Bastani, Xavier Velez, Pedro J. Moreno, Austin Waters
01 Dec 2016
TL;DR: Two techniques are presented, Distillation and MultiTask Learning (MTL); both are shown to be superior to the jointly-trained model trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively.
Abstract: Acoustic model performance typically decreases when evaluated on a dialectal variation of the same language that was not used during training. Similarly, models simultaneously trained on a group of dialects tend to underperform dialect-specific models. In this paper, we report on our efforts towards building a unified acoustic model that can serve a multi-dialectal language. Two techniques are presented: Distillation and MultiTask Learning (MTL). In Distillation, we use an ensemble of dialect-specific acoustic models and distill its knowledge into a single model. In MTL, we utilize multitask learning to train a unified acoustic model that learns to distinguish dialects as a side task. We show that both techniques are superior to the jointly-trained model that is trained on all dialectal data, reducing word error rates by 4.2% and 0.6%, respectively. While achieving this improvement, neither technique degrades the performance of the dialect-specific models by more than 3.4%.
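A minimal sketch of the distillation idea described above: the dialect-specific teachers' posteriors are averaged into soft targets, and the unified student model is scored against them with a cross-entropy loss. The temperature value, the layer-free NumPy formulation, and the toy dimensions are assumptions for illustration; the paper's actual acoustic-model training setup is not reproduced here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_targets(teacher_logits, temperature=2.0):
    """Average the softened posteriors of the dialect-specific teachers."""
    # teacher_logits: (num_teachers, num_frames, num_states)
    return softmax(teacher_logits, temperature).mean(axis=0)

def distillation_loss(student_logits, soft_targets, temperature=2.0):
    """Cross-entropy of the student against the ensemble's soft targets."""
    log_probs = np.log(softmax(student_logits, temperature) + 1e-12)
    return -(soft_targets * log_probs).sum(axis=-1).mean()

# Toy example: 3 dialect-specific teachers, 5 frames, 4 context-dependent states.
rng = np.random.default_rng(0)
teachers = rng.normal(size=(3, 5, 4))
student = rng.normal(size=(5, 4))
targets = distillation_targets(teachers)
print("distillation loss:", distillation_loss(student, targets))
```

In the MTL variant, the same unified model would instead carry an auxiliary output head trained to predict the dialect label alongside the main acoustic targets.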

30 citations


Proceedings ArticleDOI
21 Mar 2016
TL;DR: This paper presents two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers, following a Machine Learning approach: features are extracted from the Speech Recognition output along with Word Embeddings, and Shallow Neural Networks are used for classification.
Abstract: While research has often shown that building dialect-specific Automatic Speech Recognizers is the optimal approach to dealing with dialectal variations of the same language, we have observed that dialect-specific recognizers do not always output the best recognitions. Often enough, another dialectal recognizer outputs a better recognition than the dialect-specific one. In this paper, we present two methods to select and combine the best decoded hypothesis from a pool of dialectal recognizers. We follow a Machine Learning approach and extract features from the Speech Recognition output along with Word Embeddings and use Shallow Neural Networks for classification. Our experiments using Dictation and Voice Search data from the main four Arabic dialects show good WER improvements for the hypothesis selection scheme, reducing the WER by 2.1 to 12.1% depending on the test set, and promising results for the hypotheses combination scheme.
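A minimal sketch of the hypothesis-selection classifier described above, assuming a small, hand-picked feature set (recognizer costs, hypothesis length, mean word embedding) and scikit-learn's MLPClassifier as the shallow neural network; the actual features, embeddings, and network used in the paper may differ.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
EMB_DIM = 16
vocab_embeddings = {}  # hypothetical word-embedding lookup

def embed(word):
    """Toy word-embedding lookup; a real system would use trained embeddings."""
    if word not in vocab_embeddings:
        vocab_embeddings[word] = rng.normal(size=EMB_DIM)
    return vocab_embeddings[word]

def hypothesis_features(text, am_cost, lm_cost):
    """Features from the ASR output plus a mean word embedding."""
    words = text.split()
    mean_emb = np.mean([embed(w) for w in words], axis=0)
    return np.concatenate(([am_cost, lm_cost, len(words)], mean_emb))

# Toy training data: one row per (utterance, dialectal recognizer) pair,
# labeled 1 if that recognizer produced the lowest-WER hypothesis.
X = np.stack([
    hypothesis_features("ahlan bikum", am_cost=4.2, lm_cost=1.1),
    hypothesis_features("ahlan bikum fi al barnamaj", am_cost=3.1, lm_cost=0.9),
])
y = np.array([0, 1])

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X, y)
# At test time, score each recognizer's hypothesis and keep the highest-scoring one.
print(clf.predict_proba(X))
```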

15 citations


Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper describes a new technique for automatically obtaining large, high-quality training speech corpora whose transcripts are more accurate than even those manually transcribed by humans; acoustic models trained on the resulting data outperform those trained solely on previously available data sets.
Abstract: This paper describes a new technique to automatically obtain large high-quality training speech corpora for acoustic modeling. Traditional approaches select utterances based on confidence thresholds and other heuristics. We propose instead to use an ensemble approach: we transcribe each utterance using several recognizers, and only keep those on which they agree. The recognizers we use are trained on data from different dialects of the same language, and this diversity leads them to make different mistakes in transcribing speech utterances. In this work we show, however, that when they agree, this is an extremely strong signal that the transcript is correct. This allows us to produce automatically transcribed speech corpora that are superior in transcript correctness even to those manually transcribed by humans. Furthermore, we show that using the produced semi-supervised data sets, we can train new acoustic models which outperform those trained solely on previously available data sets.
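A minimal sketch of the agreement-based selection described above: each utterance is transcribed by several dialect-specific recognizers, and only utterances on which all transcripts agree (after simple normalization) are kept. The recognizer stand-ins and the normalization rule are hypothetical placeholders for illustration.

```python
def normalize(text):
    """Simple text normalization before comparing transcripts."""
    return " ".join(text.lower().split())

def select_by_agreement(utterances, recognizers):
    """Keep utterances whose transcripts are identical across all recognizers."""
    selected = []
    for utt in utterances:
        transcripts = {normalize(rec(utt)) for rec in recognizers}
        if len(transcripts) == 1:
            selected.append((utt, transcripts.pop()))
    return selected

# Toy stand-ins for dialect-specific recognizers (hypothetical).
rec_egyptian = lambda utt: {"utt1": "sabah el kheir", "utt2": "ezayak"}[utt]
rec_gulf     = lambda utt: {"utt1": "sabah el kheir", "utt2": "shlonak"}[utt]

corpus = select_by_agreement(["utt1", "utt2"], [rec_egyptian, rec_gulf])
print(corpus)  # only "utt1" survives, paired with its agreed transcript
```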

14 citations