Open Access Proceedings ArticleDOI

All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis

TLDR
In this paper, an all-neural approach to simultaneous speaker counting, diarization and source separation is presented, where the neural network is recurrent over time as well as over the number of sources.
Abstract
Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation. The NN-based estimator operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources. The neural network is recurrent over time as well as over the number of sources. The simulation experiments show that state-of-the-art separation performance is achieved, while at the same time delivering good diarization and source counting results. It even generalizes well to an unseen large number of blocks.
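The paper itself is not reproduced on this page, but as a rough illustration of what "recurrent over time as well as over the number of sources" can look like, below is a minimal PyTorch-style sketch under our own assumptions: one LSTM unrolled over the frames of a block (time recurrence) and applied once per source slot (source recurrence), with per-slot states carried across blocks so the output order can stay stable. The class name, feature dimensions, and conditioning scheme (RecurrentOverSourcesSeparator, n_freq, the "claimed energy" input) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact model): a block-online mask estimator
# that is recurrent over time (LSTM within each block) and over the number of
# sources (the same LSTM is unrolled once per source slot, conditioned on what
# previous slots have already explained).
import torch
import torch.nn as nn


class RecurrentOverSourcesSeparator(nn.Module):
    def __init__(self, n_freq=257, hidden=300, max_sources=4):
        super().__init__()
        self.max_sources = max_sources
        # Input per source step: mixture spectrum + energy claimed so far.
        self.lstm = nn.LSTM(input_size=2 * n_freq, hidden_size=hidden,
                            batch_first=True)
        self.mask_head = nn.Linear(hidden, n_freq)   # per-source T-F mask
        self.active_head = nn.Linear(hidden, 1)      # "is this slot a speaker?"

    def forward(self, block, state=None):
        """block: (batch, time, n_freq) magnitude features of one signal block.
        state: per-slot LSTM states carried over from the previous block,
        which is what lets the model track silent speakers across blocks."""
        states = state or [None] * self.max_sources
        claimed = torch.zeros_like(block)            # energy already assigned
        masks, activities, new_states = [], [], []
        for s in range(self.max_sources):            # recurrence over sources
            inp = torch.cat([block, claimed], dim=-1)
            out, state_s = self.lstm(inp, states[s]) # recurrence over time
            m = torch.sigmoid(self.mask_head(out))
            masks.append(m)
            activities.append(torch.sigmoid(self.active_head(out.mean(dim=1))))
            claimed = claimed + m * block
            new_states.append(state_s)
        return torch.stack(masks, dim=1), torch.cat(activities, dim=1), new_states
```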



Citations
Proceedings ArticleDOI

Continuous Speech Separation: Dataset and Analysis

TL;DR: A new real recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, which helps researchers develop systems that can be readily applied to real scenarios.
Posted Content

Speaker Recognition Based on Deep Learning: An Overview

TL;DR: Several major subtasks of speaker recognition are reviewed, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods.
Proceedings ArticleDOI

End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors

TL;DR: In this article, an encoder-decoder based attractor calculation (EDA) method was proposed to generate a flexible number of attractors from a speech embedding sequence to produce the same number of speaker activities.
Posted Content

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

TL;DR: This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence; the generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities.
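As a small illustration of the attractor mechanism the two entries above describe, here is a hedged numpy sketch: frame embeddings multiplied by attractor vectors and passed through a sigmoid yield one activity track per attractor. The function name and array shapes are assumptions for illustration, not the papers' code.

```python
# Sketch of the attractor idea (illustrative only): frame-wise embeddings are
# multiplied by a set of attractor vectors and passed through a sigmoid,
# giving one speaker-activity posterior per attractor and frame.
import numpy as np

def speaker_activities(embeddings, attractors):
    """embeddings: (T, D) frame embeddings from the diarization encoder.
    attractors:  (S, D) attractor vectors, one per detected speaker.
    Returns (T, S) posteriors of each speaker being active in each frame."""
    logits = embeddings @ attractors.T           # (T, S) inner products
    return 1.0 / (1.0 + np.exp(-logits))         # element-wise sigmoid

# Toy usage with random numbers, just to show the shapes involved.
rng = np.random.default_rng(0)
E = rng.standard_normal((100, 256))    # 100 frames, 256-dim embeddings
A = rng.standard_normal((3, 256))      # EDA decided on 3 attractors
print(speaker_activities(E, A).shape)  # -> (100, 3)
```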
Journal ArticleDOI

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation problem of separation, and then estimates each source signal given the inferred representations.
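A hedged sketch of the first stage this summary describes, clustering frame-wise speaker vectors into per-speaker centroids (here with scikit-learn's KMeans as a stand-in; the separation network that would be conditioned on those centroids is not shown, and all names and sizes are illustrative).

```python
# Illustrative first stage only: cluster frame-wise speaker vectors so that
# each cluster centroid serves as one source representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_speaker_vectors = rng.standard_normal((500, 128))  # (frames, dim), stand-in data
n_speakers = 2
centroids = KMeans(n_clusters=n_speakers, n_init=10).fit(
    frame_speaker_vectors).cluster_centers_               # (n_speakers, 128)
print(centroids.shape)
```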
References
Journal ArticleDOI

Front-End Factor Analysis for Speaker Verification

TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis and named the total variability space because it models both speaker and channel variabilities.
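For context, the total variability model referred to here is usually written as the following standard factor-analysis equation (stated from common knowledge of i-vector systems, not extracted from this page):

```latex
% Total variability model: a speaker- and channel-dependent GMM supervector M
% is modeled around a speaker-independent UBM supervector m.
M = m + T\,w
% T : low-rank total variability matrix, learned by factor analysis
% w : standard-normal latent vector; its point estimate is the i-vector
```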
Proceedings ArticleDOI

Deep clustering: Discriminative embeddings for segmentation and separation

TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.
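A hedged numpy sketch of the deep clustering objective this summary alludes to: with per-bin embeddings V and one-hot source assignments Y, the training loss ||VVᵀ − YYᵀ||_F² can be evaluated without forming the large bin-by-bin affinity matrices. Variable names and the expanded form are illustrative, taken from the standard formulation rather than from this page.

```python
# Sketch of the deep-clustering training objective (illustrative, numpy):
# ||V V^T - Y Y^T||_F^2 over all time-frequency bins, expanded so that the
# (N x N) affinity matrices are never formed explicitly.
import numpy as np

def deep_clustering_loss(V, Y):
    """V: (N, D) embedding per time-frequency bin, N = T*F bins.
    Y: (N, S) one-hot assignment of each bin to its dominant source."""
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)
```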
Proceedings ArticleDOI

Front-End Factor Analysis For Speaker Verification

TL;DR: This paper investigates which configuration and which parameters lead to the best performance of an i-vector/PLDA based speaker verification system, and presents some preliminary experiments in which the utterances contained in the CSTR VCTK corpus were used, in addition to utterances from MIT-MDSVC, for training the total variability covariance matrix and the underlying PLDA matrices.
Proceedings ArticleDOI

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

TL;DR: In this paper, permutation invariant training (PIT) was proposed for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem, which minimizes the separation error directly.
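A hedged numpy sketch of the core PIT idea: the separation error is evaluated under every assignment of estimated signals to reference signals and the smallest value is used for training, which removes the label-permutation ambiguity. The function name and the MSE criterion are illustrative choices, not the paper's exact setup.

```python
# Sketch of (utterance-level) permutation invariant training: evaluate the
# loss under every permutation of estimates vs. references, keep the minimum.
import itertools
import numpy as np

def pit_mse(estimates, references):
    """estimates, references: (S, T) arrays of S separated / clean signals.
    Returns the minimum mean-squared error over all S! permutations and the
    permutation that achieves it."""
    S = estimates.shape[0]
    best = None
    for perm in itertools.permutations(range(S)):
        err = np.mean((estimates[list(perm)] - references) ** 2)
        if best is None or err < best[0]:
            best = (err, perm)
    return best
```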
Book ChapterDOI

The AMI meeting corpus: a pre-announcement

TL;DR: The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings, which is being created in the context of a project that is developing meeting browsing technology and will eventually be released publicly.