Open Access · Posted Content

Recognizing Multi-talker Speech with Permutation Invariant Training

TL;DR
A novel technique is proposed for directly recognizing multiple speech streams from a single channel of mixed speech, without first separating them, based on permutation invariant training (PIT) for automatic speech recognition (ASR).
Abstract
In this paper, we propose a novel technique for direct recognition of multiple speech streams given a single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the assignment with the minimum CE, and optimize for it. PIT-ASR forces all frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and the speaker tracing problem in one shot. Our experiments on artificially mixed AMI data show that the proposed approach is very promising.
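The utterance-level PIT objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one frame-level log-posterior matrix per output layer and one label sequence per speaker, enumerates every output-to-speaker assignment, and keeps the assignment with the lowest utterance-averaged cross entropy.

```python
import itertools
import numpy as np

def pit_cross_entropy(log_probs, targets):
    """Utterance-level permutation-invariant cross entropy (sketch).

    log_probs: list of S arrays, each (T, C) with frame-level class
               log-posteriors, one array per model output layer.
    targets:   list of S integer arrays of length T, one per speaker.
    Returns the minimum utterance-averaged CE over all S! assignments
    and the winning permutation (output i is paired with speaker perm[i]).
    """
    S = len(targets)
    best_ce, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        # Average CE over all frames and all streams for this assignment;
        # the whole utterance is scored under one fixed pairing.
        ce = np.mean([
            -log_probs[out][np.arange(len(targets[spk])), targets[spk]].mean()
            for out, spk in enumerate(perm)
        ])
        if ce < best_ce:
            best_ce, best_perm = ce, perm
    return best_ce, best_perm
```

Because the assignment is chosen once per utterance rather than per frame, every frame of a given speaker is tied to the same output layer, which is what resolves speaker tracing jointly with the permutation problem.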


Citations
Journal ArticleDOI

Recent progresses in deep learning based acoustic models

TL;DR: In this paper, the authors summarize recent progress in deep-learning-based acoustic models and the motivation and insights behind the surveyed techniques; they further illustrate robustness issues in speech recognition systems and discuss acoustic model adaptation, speech enhancement, and separation.
Proceedings ArticleDOI

End-to-End Multi-Speaker Speech Recognition

TL;DR: This work develops the first fully end-to-end, jointly trained deep learning system for separation and recognition of overlapping speech signals. The system enables training on more realistic data containing only mixed signals and their transcriptions, and is thus suited to large-scale training on existing transcribed data.
Journal ArticleDOI

Progressive Joint Modeling in Unsupervised Single-Channel Overlapped Speech Recognition

TL;DR: This work proposes a modular structure for the neural network, applies a progressive pretraining regimen, and improves the objective function with transfer learning and a discriminative training criterion, achieving over 30% relative improvement in word error rate.
Proceedings ArticleDOI

End-to-end Monaural Multi-speaker ASR System without Pretraining

TL;DR: In this article, an end-to-end monaural multi-speaker speech recognition model was proposed to recognize multiple label sequences completely from scratch, without any indeterminate supervisions obtained from non-mixture speech or corresponding labels/alignments.
Journal ArticleDOI

Past review, current progress, and challenges ahead on the cocktail party problem

TL;DR: This overview paper focuses on the speech separation problem given its central role in the cocktail party environment, and describes the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, and the newly developed deep learning-based techniques.
References
Journal ArticleDOI

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described: a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Journal ArticleDOI

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output; the resulting system can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Journal ArticleDOI

Convolutional neural networks for speech recognition

TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.