Proceedings ArticleDOI

Multichannel Speaker Activity Detection for Meetings

TL;DR: This work investigates single- and multi-talk across a wide range of crosstalk levels and improves the overall detection accuracy by 12.89% absolute compared to a standardized voice activity detection.
Abstract: Multichannel recordings of meetings with a (wireless) headset for each person commonly deliver the best audio quality for subsequent analyses. However, speech portions of other participants can still couple into the microphone channel of the associated target speaker. Due to this crosstalk, a speaker activity detection (SAD) is required to identify only the speech portions of the target speaker in the related microphone channel. While most solutions are either complex and need a training process, or achieve insufficient results in multi-talk situations, we propose a low-complexity method which can handle both crosstalk and multi-talk situations. We investigate single- and multi-talk across a wide range of crosstalk levels and improve the overall detection accuracy by 12.89% absolute compared to a standardized voice activity detection, and even by 13.76% absolute compared to a state-of-the-art multichannel SAD.
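The abstract does not spell out the low-complexity method itself. As an illustration of the general idea behind crosstalk-robust SAD with close-talk microphones (the target speaker's speech arrives loudest at their own headset), here is a hypothetical power-comparison sketch; all names and thresholds are illustrative, not the authors' algorithm:

```python
import numpy as np

def channel_powers(frames):
    # frames: (num_channels, frame_len) block of time-aligned samples
    return np.mean(frames ** 2, axis=1)

def speaker_activity(frames, abs_thresh=1e-4, ratio_thresh=2.0):
    """Flag channel m as active if its frame power exceeds an absolute
    floor AND is within ratio_thresh of the loudest channel. Several
    channels may be flagged at once, which covers multi-talk."""
    p = channel_powers(frames)
    loudest = p.max()
    return (p > abs_thresh) & (p * ratio_thresh >= loudest)
```

A pure per-channel energy threshold would fire on crosstalk as well; comparing each channel against the loudest one exploits the level difference between a speaker's own headset and the leaked copies in the other channels.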
Citations
Proceedings ArticleDOI
04 May 2020
TL;DR: This work extends an existing approach by integrating methods from acoustic echo cancellation to improve the estimation of the interferer (noise) components of the filter, which leads to an improved signal-to-interferer ratio by up to 2.1 dB absolute at constant speech component quality.
Abstract: Recording a meeting and obtaining clean speech signals of each speaker is a challenging task. Even with a multichannel recording in which every speaker is equipped with a close-talk microphone, speech of an active speaker still couples not only into the dedicated microphone, but also, at a certain level, into all other microphone channels. This is known as crosstalk and requires a multichannel speaker interference reduction to enhance the microphone channels for further processing. To solve this issue, we use a Wiener filter which is based on all individual microphone channels. We extend an existing approach by integrating methods from acoustic echo cancellation to improve the estimation of the interferer (noise) components of the filter, which improves the signal-to-interferer ratio by up to 2.1 dB absolute at constant speech component quality.
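The multichannel Wiener filter referenced above follows the classic per-bin gain rule G = Φ_SS / (Φ_SS + Φ_NN); a minimal sketch, where the PSD estimates (the part this paper actually improves) are assumed given:

```python
import numpy as np

def wiener_gain(target_psd, interferer_psd, eps=1e-12):
    # Classic Wiener gain per time-frequency bin:
    # G = Phi_SS / (Phi_SS + Phi_NN)
    return target_psd / (target_psd + interferer_psd + eps)

def enhance(stft_target, target_psd, interferer_psd):
    # Attenuate the target microphone's STFT where interference dominates
    return wiener_gain(target_psd, interferer_psd) * stft_target
```

The gain approaches 1 where the target speaker dominates and 0 where crosstalk dominates, so the quality of the interferer PSD estimate directly controls how much leakage survives.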

7 citations


Cites background from "Multichannel Speaker Activity Detec..."

  • ...This effect is known as crosstalk or microphone leakage [3–6] and requires both a multichannel speaker activity detection [7] and a multichannel speaker interference reduction (MSIR) to facilitate further signal processing....


Journal ArticleDOI
TL;DR: An adaptive filter method originally proposed for acoustic echo cancellation (AEC) is integrated to obtain a well-performing interferer (noise) component estimate, improving the speech-to-interferer ratio by up to 2.7 dB at constant or even better speech component quality.
Abstract: Microphone leakage or crosstalk is a common problem in multichannel close-talk audio recordings (e.g., meetings or live music performances), which occurs when a target signal couples not only into its dedicated microphone, but also into all other microphone channels. For further signal processing such as automatic transcription of a meeting, a multichannel speaker interference reduction is required in order to eliminate the interfering speech signals in the microphone channels. The contribution of this paper is twofold: First, we consider multichannel close-talk recordings of a three-person meeting scenario with various crosstalk levels. In order to eliminate the crosstalk in the target microphone channel, we extend a multichannel Wiener filter approach which considers all individual microphone channels. To this end, we integrate an adaptive filter method originally proposed for acoustic echo cancellation (AEC) in order to obtain a well-performing interferer (noise) component estimate. This results in an improved speech-to-interferer ratio by up to 2.7 dB at constant or even better speech component quality. Second, since an AEC method typically requires clean reference channels, we investigate why the AEC algorithm is able to successfully estimate the interfering signals and the room impulse responses between the microphones of the interferer and the target speakers even though the reference signals are themselves disturbed by crosstalk in the considered meeting scenario.
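The AEC-style adaptive filtering described here can be illustrated with a textbook NLMS update, where the interferer's microphone serves as the reference and the filter models the coupling path into the target channel. This is a generic sketch, not the paper's exact algorithm:

```python
import numpy as np

def nlms_cancel(target, reference, num_taps=64, mu=0.5, eps=1e-8):
    """Estimate the crosstalk component of `target` from `reference`
    with an NLMS adaptive filter and subtract it. Returns the error
    (enhanced) signal and the final filter taps."""
    w = np.zeros(num_taps)
    out = np.zeros_like(target)
    for n in range(len(target)):
        # most recent num_taps reference samples, newest first
        x = reference[max(0, n - num_taps + 1):n + 1][::-1]
        x = np.pad(x, (0, num_taps - len(x)))
        y = w @ x                        # crosstalk estimate
        e = target[n] - y                # enhanced sample
        w += mu * e * x / (x @ x + eps)  # normalized LMS update
        out[n] = e
    return out, w
```

With the interferer's close-talk channel as reference, the taps converge toward the cross-coupling impulse response, and the subtraction removes the leaked speech from the target channel.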

1 citation

01 Jul 2020
TL;DR: The results show that the proposed CNN-based speech activity detection for the entertainment media domain achieves better performance than previous work in a more complicated noise environment.
Abstract: Speech activity detection (SAD) is a critical preparation step for speech-based applications; it identifies the speech portions of an audio recording. This paper proposes a CNN-based speech activity detection for the entertainment media domain. A fusion of two Dense Convolutional Networks (DenseNet) with different feature extraction, combined via Dempster-Shafer theory (DS theory), is used to classify speech segments. We combine acoustic features, namely the log-mel spectrogram (LM), mel-frequency cepstral coefficients (MFCC), chroma, spectral contrast, and tonnetz, as the input. The combined features are computed from the raw speech signal and fed into a convolutional neural network for speech classification. The results show that the proposed speech activity detection achieves better performance (+1% accuracy, +8% precision, and +5% F1 score) than previous work in a more complicated noise environment.
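Dempster's rule of combination, the core of the DS-theory fusion mentioned above, can be sketched as follows for two mass functions over the same frame of discernment (an illustrative implementation, not the paper's):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass)
    with Dempster's rule: multiply masses of intersecting focal
    elements and renormalize by the non-conflicting mass."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to the empty set
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}
```

For SAD the frame of discernment would be {speech, nonspeech}, with each DenseNet's softmax output converted into a mass function before combination.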

Cites background from "Multichannel Speaker Activity Detec..."

  • ...Many works have been studying the SAD problem in telephone conversation records [5] and meeting domain [6], which is the mixture of speech and natural background noises....


Proceedings ArticleDOI
24 Jan 2021
TL;DR: The purpose of this work is not to improve the MAEC method, but instead to show that it can be successfully applied to microphone leakage reduction, such as in meetings with headset-equipped participants.
Abstract: Microphone leakage occurs in multichannel close-talk audio recordings of a meeting when speech of an active speaker couples into both the dedicated target microphone and all other microphone channels. For an automatic transcription or analysis of a meeting, the interferer signals in the target microphone channels have to be eliminated. Therefore, we apply a frequency domain adaptive filtering-based multichannel acoustic echo cancellation (MAEC) method, which typically requires clean reference channels. We consider a wide range of different speech-to-interferer ratios and evaluate two cascading schemes for the MAEC, which lead to an improved speech component quality and interferer reduction by up to 0.1 MOS points and 0.5 dB, respectively. However, the purpose of this work is not to improve the MAEC method, but instead to show that it can be successfully applied to microphone leakage reduction, such as in meetings with headset-equipped participants. Therefore, we analyze why the MAEC method is able to cancel the interferer signals in this scenario even though the reference signals are themselves disturbed by interfering speech portions.
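Frequency-domain adaptive filtering is often approximated by an independent single-tap NLMS update per STFT bin; the following sketch uses that common simplification and is not the paper's MAEC implementation:

```python
import numpy as np

def stft_bin_nlms(target_stft, ref_stft, mu=0.5, eps=1e-8):
    """Single-tap per-bin NLMS. `target_stft` and `ref_stft` are
    (frames, bins) complex STFT matrices; the reference is the
    interferer's microphone. Returns the enhanced (error) STFT."""
    num_frames, num_bins = target_stft.shape
    w = np.zeros(num_bins, dtype=complex)
    err = np.zeros_like(target_stft)
    for t in range(num_frames):
        x = ref_stft[t]
        y = w * x                      # per-bin crosstalk estimate
        e = target_stft[t] - y
        w += mu * np.conj(x) * e / (np.abs(x) ** 2 + eps)
        err[t] = e
    return err
```

Each bin adapts independently, which is what makes frequency-domain schemes cheap; a full MAEC would additionally constrain the per-bin filters to a consistent time-domain impulse response.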

Cites background from "Multichannel Speaker Activity Detec..."

  • ...A more detailed description can be found in [21, 26]....


References
Book
01 May 2017
TL;DR: It is argued that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement - in order to become more effective and more efficient.
Abstract: The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement - in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, the design and development of automated systems for social signal processing (SSP) are rather difficult. This paper surveys past efforts to solve these problems computationally, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially aware computing.

988 citations

01 Jan 2002
TL;DR: It is shown that in non-stationary noise environments and under low SNR conditions the IMCRA approach is very effective: compared to a competitive method it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper, we present an Improved Minima Controlled Recursive Averaging (IMCRA) approach for noise estimation in adverse environments involving non-stationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in non-stationary noise environments and under low SNR conditions, the IMCRA approach is very effective: compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.
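The core of minima-controlled recursive averaging is a first-order recursion whose smoothing factor tends to 1 when speech is likely present, so the noise estimate freezes during speech. A stripped-down sketch, where the default `prob_fn` is a crude stand-in for IMCRA's minima-based speech presence probability:

```python
import numpy as np

def recursive_noise_psd(power_spectra, alpha=0.85, prob_fn=None):
    """power_spectra: (frames, bins) noisy power spectrum.
    The effective smoothing factor alpha_t = alpha + (1 - alpha) * p
    equals 1 when the speech presence probability p is 1, so the
    noise estimate is held during speech activity."""
    if prob_fn is None:
        # crude presence proxy: power far above the current noise floor
        prob_fn = lambda p, n: (p > 4.0 * n).astype(float)
    noise = power_spectra[0].copy()
    for p in power_spectra[1:]:
        spp = prob_fn(p, noise)
        alpha_t = alpha + (1.0 - alpha) * spp   # time-varying smoothing
        noise = alpha_t * noise + (1.0 - alpha_t) * p
    return noise
```

IMCRA replaces the crude threshold with a probability derived from tracked minima of a smoothed periodogram, applied in two smoothing/minimum-tracking iterations.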

834 citations


"Multichannel Speaker Activity Detec..." refers methods in this paper

  • ...We computed the noise signal PSD estimate Φ̂_NN,m(ℓ, k) following a simple, but in this context sufficient and effective 3-state approach [20], instead of the improved minimum recursive averaging approach [21] with higher complexity as applied in the baseline [17]....


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A corpus of data from natural meetings that occurred at the International Computer Science Institute in Berkeley, California over the last three years has been collected; it supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more.
Abstract: We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).

793 citations


"Multichannel Speaker Activity Detec..." refers background in this paper

  • ...1), or lapel microphones for each person [6, 7]....


Proceedings ArticleDOI
16 Jun 1992
TL;DR: This paper describes the scoring algorithms used to arrive at the metrics, as well as the improvements made to the MUC-3 methods, showing that the MUC-4 systems' scores represent a larger improvement over MUC-3 performance than the numbers themselves suggest.
Abstract: The MUC-4 evaluation metrics measure the performance of the message understanding systems. This paper describes the scoring algorithms used to arrive at the metrics as well as the improvements that were made to the MUC-3 methods. MUC-4 evaluation metrics were stricter than those used in MUC-3. Given the differences in scoring between MUC-3 and MUC-4, the MUC-4 systems' scores represent a larger improvement over MUC-3 performance than the numbers themselves suggest.

468 citations


"Multichannel Speaker Activity Detec..." refers methods in this paper

  • ...In order to provide a good impression of the MSAD performance, we applied the Fβ-measure [28, 29], which allows to weight the importance of detecting or not detecting speech frames....

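The Fβ-measure referenced in the excerpt follows the standard definition, a weighted harmonic mean of precision and recall in which β > 1 emphasizes recall (detecting speech frames) and β < 1 emphasizes precision:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R).
    beta=1 gives the familiar F1 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```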

Journal ArticleDOI
TL;DR: This is the first survey of the domain that jointly considers its three major aspects, namely, modeling, analysis, and synthesis of social behavior, which investigates laws and principles underlying social interaction, and explores approaches for automatic understanding of social exchanges recorded with different sensors.
Abstract: Social Signal Processing is the research domain aimed at bridging the social intelligence gap between humans and machines. This paper is the first survey of the domain that jointly considers its three major aspects, namely, modeling, analysis, and synthesis of social behavior. Modeling investigates laws and principles underlying social interaction, analysis explores approaches for automatic understanding of social exchanges recorded with different sensors, and synthesis studies techniques for the generation of social behavior via various forms of embodiment. For each of the above aspects, the paper includes an extensive survey of the literature, points to the most important publicly available resources, and outlines the most fundamental challenges ahead.

398 citations