Open Access Posted Content
Recognizing Multi-talker Speech with Permutation Invariant Training
TL;DR
A novel technique for direct recognition of multiple speech streams given a single channel of mixed speech, without first separating them, based on permutation invariant training (PIT) for automatic speech recognition (ASR).

Abstract
In this paper, we propose a novel technique for direct recognition of multiple speech streams given a single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and the speaker tracing problem in one shot. Our experiments on artificially mixed AMI data showed that the proposed approach is very promising.
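The utterance-level assignment selection described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name `pit_cross_entropy`, the tensor shapes, and the toy dimensions are assumptions made for the sketch.

```python
import itertools
import numpy as np

def pit_cross_entropy(log_probs, targets):
    """Permutation-invariant cross-entropy over a whole utterance.

    log_probs: shape (S, T, C) -- log-posteriors from S output layers
               over T frames and C classes (e.g. senones).
    targets:   shape (S, T) -- reference label sequence per speaker.

    Returns the minimum utterance-level average CE over all possible
    output-target assignments, together with the best permutation.
    """
    S, T, _ = log_probs.shape
    best_ce, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        # Score output layer s against speaker perm[s]'s labels,
        # averaging the per-frame CE over all frames and outputs.
        ce = -np.mean([
            log_probs[s, t, targets[perm[s], t]]
            for s in range(S) for t in range(T)
        ])
        if ce < best_ce:
            best_ce, best_perm = ce, perm
    return best_ce, best_perm
```

Because the winning assignment is fixed for the whole utterance, every frame of a given speaker is scored against the same output layer, which is what ties PIT to speaker tracing here.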
Citations
Journal Article
Recent progresses in deep learning based acoustic models
TL;DR: In this paper, the authors summarize recent progress in deep-learning-based acoustic models and the motivation and insights behind the surveyed techniques, illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement, and separation.
Proceedings Article
End-to-End Multi-Speaker Speech Recognition
TL;DR: This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The system enables training on more realistic data containing only mixed signals and their transcriptions, and is thus suited to large-scale training on existing transcribed data.
Journal Article
Progressive Joint Modeling in Unsupervised Single-Channel Overlapped Speech Recognition
TL;DR: This work proposes a modular neural-network structure, applies a progressive pretraining regimen, and improves the objective function with transfer learning and a discriminative training criterion, achieving over a 30% relative reduction in word error rate.
Proceedings Article
End-to-end Monaural Multi-speaker ASR System without Pretraining
TL;DR: In this article, an end-to-end monaural multi-speaker speech recognition model was proposed to recognize multiple label sequences completely from scratch, without any indeterminate supervision obtained from non-mixture speech or corresponding labels/alignments.
Journal Article
Past review, current progress, and challenges ahead on the cocktail party problem
TL;DR: This overview paper focuses on the speech separation problem given its central role in the cocktail party environment. It describes conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models, as well as the newly developed deep-learning-based techniques.
References
Journal Article
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury +10 more
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Proceedings Article
The Kaldi Speech Recognition Toolkit
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Kumar Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely +12 more
TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Journal Article
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture is presented that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and is shown to significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.
End-to-end speech recognition in English and Mandarin
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony X. Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu +33 more
TL;DR: It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Journal Article
Convolutional neural networks for speech recognition
TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.