Open Access Proceedings ArticleDOI

All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis

TLDR
In this paper, an all-neural approach to simultaneous speaker counting, diarization and source separation is presented, where the neural network is recurrent over time as well as over the number of sources.
Abstract
Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation. The NN-based estimator operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources. The neural network is recurrent over time as well as over the number of sources. The simulation experiments show that state-of-the-art separation performance is achieved, while at the same time delivering good diarization and source counting results. It even generalizes well to an unseen large number of blocks.
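The paper itself is not reproduced on this page, but as a rough illustration of what "recurrent over time as well as over the number of sources" can look like, below is a minimal PyTorch-style sketch under our own assumptions: one LSTM unrolled over the frames of a block (time recurrence) and applied once per source slot (source recurrence), with per-slot states carried across blocks so the output order can stay stable. The class name, feature dimensions, and conditioning scheme (RecurrentOverSourcesSeparator, n_freq, the "claimed energy" input) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact model): a block-online mask estimator
# that is recurrent over time (LSTM within each block) and over the number of
# sources (the same LSTM is unrolled once per source slot, conditioned on what
# previous slots have already explained).
import torch
import torch.nn as nn


class RecurrentOverSourcesSeparator(nn.Module):
    def __init__(self, n_freq=257, hidden=300, max_sources=4):
        super().__init__()
        self.max_sources = max_sources
        # Input per source step: mixture spectrum + energy claimed so far.
        self.lstm = nn.LSTM(input_size=2 * n_freq, hidden_size=hidden,
                            batch_first=True)
        self.mask_head = nn.Linear(hidden, n_freq)   # per-source T-F mask
        self.active_head = nn.Linear(hidden, 1)      # "is this slot a speaker?"

    def forward(self, block, state=None):
        """block: (batch, time, n_freq) magnitude features of one signal block.
        state: per-slot LSTM states carried over from the previous block,
        which is what lets the model track silent speakers across blocks."""
        states = state or [None] * self.max_sources
        claimed = torch.zeros_like(block)            # energy already assigned
        masks, activities, new_states = [], [], []
        for s in range(self.max_sources):            # recurrence over sources
            inp = torch.cat([block, claimed], dim=-1)
            out, state_s = self.lstm(inp, states[s]) # recurrence over time
            m = torch.sigmoid(self.mask_head(out))
            masks.append(m)
            activities.append(torch.sigmoid(self.active_head(out.mean(dim=1))))
            claimed = claimed + m * block
            new_states.append(state_s)
        return torch.stack(masks, dim=1), torch.cat(activities, dim=1), new_states
```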



Citations
Proceedings ArticleDOI

Continuous Speech Separation: Dataset and Analysis

TL;DR: A new real recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, which helps researchers develop systems that can be readily applied to real scenarios.
Posted Content

Speaker Recognition Based on Deep Learning: An Overview

TL;DR: Several major subtasks of speaker recognition are reviewed, including speaker verification, identification, diarization, and robust speaker recognition, with a focus on deep-learning-based methods.
Proceedings ArticleDOI

End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors

TL;DR: In this article, an encoder-decoder based attractor calculation (EDA) method was proposed to generate a flexible number of attractors from a speech embedding sequence to produce the same number of speaker activities.
Posted Content

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

TL;DR: This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence; the generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities.
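As a small illustration of the attractor mechanism the two entries above describe, here is a hedged numpy sketch: frame embeddings multiplied by attractor vectors and passed through a sigmoid yield one activity track per attractor. The function name and array shapes are assumptions for illustration, not the papers' code.

```python
# Sketch of the attractor idea (illustrative only): frame-wise embeddings are
# multiplied by a set of attractor vectors and passed through a sigmoid,
# giving one speaker-activity posterior per attractor and frame.
import numpy as np

def speaker_activities(embeddings, attractors):
    """embeddings: (T, D) frame embeddings from the diarization encoder.
    attractors:  (S, D) attractor vectors, one per detected speaker.
    Returns (T, S) posteriors of each speaker being active in each frame."""
    logits = embeddings @ attractors.T           # (T, S) inner products
    return 1.0 / (1.0 + np.exp(-logits))         # element-wise sigmoid

# Toy usage with random numbers, just to show the shapes involved.
rng = np.random.default_rng(0)
E = rng.standard_normal((100, 256))    # 100 frames, 256-dim embeddings
A = rng.standard_normal((3, 256))      # EDA decided on 3 attractors
print(speaker_activities(E, A).shape)  # -> (100, 3)
```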
Journal ArticleDOI

Wavesplit: End-to-End Speech Separation by Speaker Clustering

TL;DR: Wavesplit infers a set of source representations via clustering, which addresses the fundamental permutation problem of separation, and then estimates each source signal given the inferred representations.
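A hedged sketch of the first stage this summary describes, clustering frame-wise speaker vectors into per-speaker centroids (here with scikit-learn's KMeans as a stand-in; the separation network that would be conditioned on those centroids is not shown, and all names and sizes are illustrative).

```python
# Illustrative first stage only: cluster frame-wise speaker vectors so that
# each cluster centroid serves as one source representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_speaker_vectors = rng.standard_normal((500, 128))  # (frames, dim), stand-in data
n_speakers = 2
centroids = KMeans(n_clusters=n_speakers, n_init=10).fit(
    frame_speaker_vectors).cluster_centers_               # (n_speakers, 128)
print(centroids.shape)
```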
References
Journal ArticleDOI

Front-End Factor Analysis for Speaker Verification

TL;DR: An extension of previous work that proposes a new speaker representation for speaker verification: a low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis and named the total variability space because it models both speaker and channel variabilities.
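For context, the total variability model referred to here is usually written as the following standard factor-analysis equation (stated from common knowledge of i-vector systems, not extracted from this page):

```latex
% Total variability model: a speaker- and channel-dependent GMM supervector M
% is modeled around a speaker-independent UBM supervector m.
M = m + T\,w
% T : low-rank total variability matrix, learned by factor analysis
% w : standard-normal latent vector; its point estimate is the i-vector
```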
Proceedings ArticleDOI

Deep clustering: Discriminative embeddings for segmentation and separation

TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.
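A hedged numpy sketch of the deep clustering objective this summary alludes to: with per-bin embeddings V and one-hot source assignments Y, the training loss ||VVᵀ − YYᵀ||_F² can be evaluated without forming the large bin-by-bin affinity matrices. Variable names and the expanded form are illustrative, taken from the standard formulation rather than from this page.

```python
# Sketch of the deep-clustering training objective (illustrative, numpy):
# ||V V^T - Y Y^T||_F^2 over all time-frequency bins, expanded so that the
# (N x N) affinity matrices are never formed explicitly.
import numpy as np

def deep_clustering_loss(V, Y):
    """V: (N, D) embedding per time-frequency bin, N = T*F bins.
    Y: (N, S) one-hot assignment of each bin to its dominant source."""
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)
```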
Proceedings ArticleDOI

Front-End Factor Analysis For Speaker Verification

TL;DR: This paper investigates which configuration and which parameters lead to the best performance of an i-vector/PLDA based speaker verification system, and presents some preliminary experiments in which the utterances contained in the CSTR VCTK corpus were used, in addition to utterances from MIT-MDSVC, for training the total variability covariance matrix and the underlying PLDA matrices.
Proceedings ArticleDOI

Permutation invariant training of deep models for speaker-independent multi-talker speech separation

TL;DR: In this paper, permutation invariant training (PIT) was proposed for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem, which minimizes the separation error directly.
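A hedged numpy sketch of the core PIT idea: the separation error is evaluated under every assignment of estimated signals to reference signals and the smallest value is used for training, which removes the label-permutation ambiguity. The function name and the MSE criterion are illustrative choices, not the paper's exact setup.

```python
# Sketch of (utterance-level) permutation invariant training: evaluate the
# loss under every permutation of estimates vs. references, keep the minimum.
import itertools
import numpy as np

def pit_mse(estimates, references):
    """estimates, references: (S, T) arrays of S separated / clean signals.
    Returns the minimum mean-squared error over all S! permutations and the
    permutation that achieves it."""
    S = estimates.shape[0]
    best = None
    for perm in itertools.permutations(range(S)):
        err = np.mean((estimates[list(perm)] - references) ** 2)
        if best is None or err < best[0]:
            best = (err, perm)
    return best
```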
Book ChapterDOI

The AMI meeting corpus: a pre-announcement

TL;DR: The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings, which is being created in the context of a project that is developing meeting browsing technology and will eventually be released publicly.