Open Access · Posted Content

Recognizing Multi-talker Speech with Permutation Invariant Training

TL;DR
A novel technique is proposed for directly recognizing multiple speech streams from a single channel of mixed speech, without first separating them, based on permutation invariant training (PIT) for automatic speech recognition (ASR).
Abstract
In this paper, we propose a novel technique for direct recognition of multiple speech streams given a single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the assignment with the minimum CE, and optimize for it. PIT-ASR forces all frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and the speaker tracing problem in one shot. Our experiments on artificially mixed AMI data show that the proposed approach is very promising.
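The utterance-level PIT objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one frame-level log-posterior matrix per output layer and one label sequence per speaker, enumerates every output-to-speaker assignment, and keeps the assignment with the lowest utterance-averaged cross entropy.

```python
import itertools
import numpy as np

def pit_cross_entropy(log_probs, targets):
    """Utterance-level permutation-invariant cross entropy (sketch).

    log_probs: list of S arrays, each (T, C) with frame-level class
               log-posteriors, one array per model output layer.
    targets:   list of S integer arrays of length T, one per speaker.
    Returns the minimum utterance-averaged CE over all S! assignments
    and the winning permutation (output i is paired with speaker perm[i]).
    """
    S = len(targets)
    best_ce, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        # Average CE over all frames and all streams for this assignment;
        # the whole utterance is scored under one fixed pairing.
        ce = np.mean([
            -log_probs[out][np.arange(len(targets[spk])), targets[spk]].mean()
            for out, spk in enumerate(perm)
        ])
        if ce < best_ce:
            best_ce, best_perm = ce, perm
    return best_ce, best_perm
```

Because the assignment is chosen once per utterance rather than per frame, every frame of a given speaker is tied to the same output layer, which is what resolves speaker tracing jointly with the permutation problem.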


Citations
Journal ArticleDOI

Recent progresses in deep learning based acoustic models

TL;DR: In this paper, the authors summarize recent progress in deep-learning-based acoustic models and the motivation and insights behind the surveyed techniques; they further illustrate robustness issues in speech recognition systems and discuss acoustic model adaptation, speech enhancement, and separation.
Proceedings ArticleDOI

End-to-End Multi-Speaker Speech Recognition

TL;DR: This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The system enables training on more realistic data containing only mixed signals and their transcriptions, and is thus suited to large-scale training on existing transcribed data.
Journal ArticleDOI

Progressive Joint Modeling in Unsupervised Single-Channel Overlapped Speech Recognition

TL;DR: This work proposes a modular structure for the neural network, applies a progressive pretraining regimen, and improves the objective function with transfer learning and a discriminative training criterion, achieving over 30% relative improvement in word error rate.
Proceedings ArticleDOI

End-to-end Monaural Multi-speaker ASR System without Pretraining

TL;DR: In this article, an end-to-end monaural multi-speaker speech recognition model was proposed to recognize multiple label sequences completely from scratch, without any indeterminate supervisions obtained from non-mixture speech or corresponding labels/alignments.
Journal ArticleDOI

Past review, current progress, and challenges ahead on the cocktail party problem

TL;DR: This overview paper focuses on the speech separation problem given its central role in the cocktail party environment, and describes the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, and the newly developed deep learning-based techniques.
References
Journal ArticleDOI

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Journal ArticleDOI

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output; the resulting system can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Journal ArticleDOI

Convolutional neural networks for speech recognition

TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.