Single Channel Target Speaker Extraction and Recognition with Speaker Beam

doi:10.1109/ICASSP.2018.8462661

Citations

PDF

Open Access

More filters

Posted Content•

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

[...]

Quan Wang, Hannah Muckenhirn, Kevin W. Wilson, Prashant Sridhar, Zelin Wu, John R. Hershey, Rif A. Saurous, Ron Weiss, Ye Jia, Ignacio Lopez Moreno - Show less +6 more

11 Oct 2018-arXiv: Audio and Speech Processing

TL;DR: In this paper, a speaker recognition network that produces speaker-discriminative embeddings and a spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask.

...read moreread less

Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.

...read moreread less

197 citations

Journal Article•DOI•

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

[...]

Katerina Zmolikova¹, Marc Delcroix², Keisuke Kinoshita², Tsubasa Ochiai², Tomohiro Nakatani², Lukas Burget¹, Jan Cernocky¹ - Show less +3 more•Institutions (2)

Brno University of Technology¹, NTT Communications Corp²

13 Jun 2019-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This paper introduces SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker and shows the benefit of including speaker information in the processing and the effectiveness of the proposed method.

...read moreread less

Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.

...read moreread less

158 citations

Cites background or methods from "Single Channel Target Speaker Extra..."

...We gradually built and refined the SpeakerBeam approach over several studies [28]–[31]....
[...]
...While these studies [28]–[30] focused on a multichannel case, in [31], we investigated the ASR performance in a single-channel setting....
[...]

Proceedings Article•DOI•

Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario.

[...]

Ivan Medennikov¹, Maxim Korenevsky¹, Tatiana Prisyach, Yuri Y. Khokhlov, Mariya Korenevskaya, Ivan Sorokin¹, Tatiana Timofeeva, Anton Mitrofanov¹, Andrei Andrusenko¹, Ivan Podluzhny¹, Aleksandr Laptev¹, Aleksei Romanenko² - Show less +8 more•Institutions (2)

Saint Petersburg State University of Information Technologies, Mechanics and Optics¹, Lappeenranta University of Technology²

25 Oct 2020

TL;DR: A novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame, outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.

...read moreread less

Abstract: Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) abs.

...read moreread less

141 citations

Cites methods from "Single Channel Target Speaker Extra..."

...These approaches include TSASR [21] for target speech recognition, Speaker Beam [22, 23] and Voice Filter [24] for target speech extraction, and Personal VAD [26] for target speech detection....
[...]
...This direction is represented by such approaches as Target-Speaker ASR [21], Speaker Beam [22, 23] and Voice Filter [24] aimed at the target-speaker speech extraction, etc....
[...]

Proceedings Article•DOI•

Multi-Channel Overlapped Speech Recognition with Location Guided Speech Extraction Network

[...]

Zhuo Chen¹, Xiong Xiao¹, Takuya Yoshioka¹, Hakan Erdogan¹, Jinyu Li¹, Yifan Gong¹ - Show less +2 more•Institutions (1)

Microsoft¹

18 Dec 2018

TL;DR: This work proposes a simple yet effective method for multi-channel far-field overlapped speech recognition that achieves more than 24% relative word error rate (WER) reduction than fixed beamforming with oracle selection.

...read moreread less

Abstract: Although advances in close-talk speech recognition have resulted in relatively low error rates, the recognition performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers which is especially more difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel far-field overlapped speech recognition. In the proposed system, three different features are formed for each target speaker, namely, spectral, spatial, and angle features. Then a neural network is trained using all features with a target of the clean speech of the required speaker. An iterative update procedure is proposed in which the mask-based beamforming and mask estimation are performed alternatively. The proposed system were evaluated with real recorded meetings with different levels of overlapping ratios. The results show that the proposed system achieves more than 24% relative word error rate (WER) reduction than fixed beamforming with oracle selection. Moreover, as overlap ratio rises from 20% to 70+%, only 3.8% WER increase is observed for the proposed system.

...read moreread less

129 citations

Cites background or methods from "Single Channel Target Speaker Extra..."

...To handle this problem two families of algorithm were proposed in recent years, namely the blind speech separation [5, 10, 3, 4] and informed speech extraction [13, 14, 15]....
[...]
...In [13, 14, 18], speaker identity features extracted from an additional enrollment utterance has been shown useful for separation....
[...]

Proceedings Article•DOI•

Speech Enhancement Using Self-Adaptation and Multi-Head Self-Attention

[...]

Yuma Koizumi, Kohei Yatabe¹, Marc Delcroix², Yoshiki Masuyama¹, Daiki Takeuchi¹ - Show less +1 more•Institutions (2)

Waseda University¹, Nippon Telegraph and Telephone²

04 May 2020

TL;DR: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; it extracts a speaker representation used for adaptation directly from the test utterance and uses multi-task learning of speech enhancement and speaker identification, and uses the output of the final hidden layer of speaker identification branch as an auxiliary feature.

...read moreread less

Abstract: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and synthesis, it is known that model adaptation to the target speaker improves the accuracy. Our research question is whether a DNN for speech enhancement can be adopted to unknown speakers without any auxiliary guidance signal in test-phase. To achieve this, we adopt multi-task learning of speech enhancement and speaker identification, and use the output of the final hidden layer of speaker identification branch as an auxiliary feature. In addition, we use multi-head self-attention for capturing long-term dependencies in the speech and noise. Experimental results on a public dataset show that our strategy achieves the state-of-the-art performance and also outperform conventional methods in terms of subjective quality.

...read moreread less

100 citations

Cites methods from "Single Channel Target Speaker Extra..."

...In the SpeakerBeam method [20, 21], the guidance signal in the T-F-domain A ∈ CF×Ka is converted to the sequence-summarized feature λ ∈ R using an auxiliary neural network G : CF×Ka → RP×Ka as...
[...]

Collapse

Single Channel Target Speaker Extraction and Recognition with Speaker Beam

Citations

Cites background or methods from "Single Channel Target Speaker Extra..."

Cites methods from "Single Channel Target Speaker Extra..."

Cites background or methods from "Single Channel Target Speaker Extra..."

Cites methods from "Single Channel Target Speaker Extra..."

References

"Single Channel Target Speaker Extra..." refers methods in this paper

"Single Channel Target Speaker Extra..." refers background or methods in this paper

"Single Channel Target Speaker Extra..." refers background in this paper

Related Papers (5)