Proceedings ArticleDOI

Multichannel Speaker Activity Detection for Meetings

TL;DR: This work investigates single- and multi-talk across a wide range of crosstalk levels and improves the overall detection accuracy by 12.89% absolute compared to a standardized voice activity detection.
Abstract: Multichannel recordings of meetings with a (wireless) headset for each person commonly deliver the best audio quality for subsequent analyses. However, speech portions of other participants can still couple into the microphone channel of the associated target speaker. Due to this crosstalk, a speaker activity detection (SAD) is required to identify only the speech portions of the target speaker in the related microphone channel. While most solutions are either complex and need a training process, or achieve insufficient results in multi-talk situations, we propose a low-complexity method which can handle both crosstalk and multi-talk situations. We investigate single- and multi-talk across a wide range of crosstalk levels and improve the overall detection accuracy by 12.89% absolute compared to a standardized voice activity detection, and even by 13.76% absolute compared to a state-of-the-art multichannel SAD.
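The abstract does not spell out the low-complexity method itself. As an illustration of the general idea behind crosstalk-robust SAD with close-talk microphones (the target speaker's speech arrives loudest at their own headset), here is a hypothetical power-comparison sketch; all names and thresholds are illustrative, not the authors' algorithm:

```python
import numpy as np

def channel_powers(frames):
    # frames: (num_channels, frame_len) block of time-aligned samples
    return np.mean(frames ** 2, axis=1)

def speaker_activity(frames, abs_thresh=1e-4, ratio_thresh=2.0):
    """Flag channel m as active if its frame power exceeds an absolute
    floor AND is within ratio_thresh of the loudest channel. Several
    channels may be flagged at once, which covers multi-talk."""
    p = channel_powers(frames)
    loudest = p.max()
    return (p > abs_thresh) & (p * ratio_thresh >= loudest)
```

A pure per-channel energy threshold would fire on crosstalk as well; comparing each channel against the loudest one exploits the level difference between a speaker's own headset and the leaked copies in the other channels.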
Citations
Proceedings ArticleDOI
04 May 2020
TL;DR: This work extends an existing approach by integrating methods from acoustic echo cancellation to improve the estimation of the interferer (noise) components of the filter, which leads to an improved signal-to-interferer ratio by up to 2.1 dB absolute at constant speech component quality.
Abstract: Recording a meeting and obtaining clean speech signals of each speaker is a challenging task. Even with a multichannel recording in which every speaker is equipped with a close-talk microphone, speech of an active speaker still couples not only into the dedicated microphone, but also, at a certain level, into all other microphone channels. This is known as crosstalk and requires a multichannel speaker interference reduction to enhance the microphone channels for further processing. To solve this issue, we use a Wiener filter which is based on all individual microphone channels. We extend an existing approach by integrating methods from acoustic echo cancellation to improve the estimation of the interferer (noise) components of the filter, which improves the signal-to-interferer ratio by up to 2.1 dB absolute at constant speech component quality.
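The multichannel Wiener filter referenced above follows the classic per-bin gain rule G = Φ_SS / (Φ_SS + Φ_NN); a minimal sketch, where the PSD estimates (the part this paper actually improves) are assumed given:

```python
import numpy as np

def wiener_gain(target_psd, interferer_psd, eps=1e-12):
    # Classic Wiener gain per time-frequency bin:
    # G = Phi_SS / (Phi_SS + Phi_NN)
    return target_psd / (target_psd + interferer_psd + eps)

def enhance(stft_target, target_psd, interferer_psd):
    # Attenuate the target microphone's STFT where interference dominates
    return wiener_gain(target_psd, interferer_psd) * stft_target
```

The gain approaches 1 where the target speaker dominates and 0 where crosstalk dominates, so the quality of the interferer PSD estimate directly controls how much leakage survives.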

7 citations


Cites background from "Multichannel Speaker Activity Detec..."

  • ...This effect is known as crosstalk or microphone leakage [3–6] and requires both a multichannel speaker activity detection [7] and a multichannel speaker interference reduction (MSIR) to facilitate further signal processing....


Journal ArticleDOI
TL;DR: An adaptive filter method originally proposed for acoustic echo cancellation (AEC) is integrated to obtain a well-performing interferer (noise) component estimate, improving the speech-to-interferer ratio by up to 2.7 dB at constant or even better speech component quality.
Abstract: Microphone leakage or crosstalk is a common problem in multichannel close-talk audio recordings (e.g., meetings or live music performances), which occurs when a target signal couples not only into its dedicated microphone, but also into all other microphone channels. For further signal processing such as automatic transcription of a meeting, a multichannel speaker interference reduction is required in order to eliminate the interfering speech signals in the microphone channels. The contribution of this paper is twofold: First, we consider multichannel close-talk recordings of a three-person meeting scenario with various crosstalk levels. In order to eliminate the crosstalk in the target microphone channel, we extend a multichannel Wiener filter approach which considers all individual microphone channels. To this end, we integrate an adaptive filter method originally proposed for acoustic echo cancellation (AEC) in order to obtain a well-performing interferer (noise) component estimate. This results in an improved speech-to-interferer ratio by up to 2.7 dB at constant or even better speech component quality. Second, since an AEC method typically requires clean reference channels, we investigate why the AEC algorithm is able to successfully estimate the interfering signals and the room impulse responses between the microphones of the interferer and the target speakers even though the reference signals are themselves disturbed by crosstalk in the considered meeting scenario.
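The AEC-style adaptive filtering described here can be illustrated with a textbook NLMS update, where the interferer's microphone serves as the reference and the filter models the coupling path into the target channel. This is a generic sketch, not the paper's exact algorithm:

```python
import numpy as np

def nlms_cancel(target, reference, num_taps=64, mu=0.5, eps=1e-8):
    """Estimate the crosstalk component of `target` from `reference`
    with an NLMS adaptive filter and subtract it. Returns the error
    (enhanced) signal and the final filter taps."""
    w = np.zeros(num_taps)
    out = np.zeros_like(target)
    for n in range(len(target)):
        # most recent num_taps reference samples, newest first
        x = reference[max(0, n - num_taps + 1):n + 1][::-1]
        x = np.pad(x, (0, num_taps - len(x)))
        y = w @ x                        # crosstalk estimate
        e = target[n] - y                # enhanced sample
        w += mu * e * x / (x @ x + eps)  # normalized LMS update
        out[n] = e
    return out, w
```

With the interferer's close-talk channel as reference, the taps converge toward the cross-coupling impulse response, and the subtraction removes the leaked speech from the target channel.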

1 citation

01 Jul 2020
TL;DR: The results show that the proposed CNN-based speech activity detection for the entertainment media domain achieves better performance than previous work in a more complicated noise environment.
Abstract: Speech activity detection (SAD) is a critical preparation step for speech-based applications; it identifies the speech portions of an audio recording. This paper proposes a CNN-based speech activity detection for the entertainment media domain. A fusion of two Dense Convolutional Networks (DenseNet) with different feature extraction, combined via Dempster-Shafer theory (DS theory), is used to classify speech segments. We combine acoustic features, namely the log-mel spectrogram (LM), mel-frequency cepstral coefficients (MFCC), chroma, spectral contrast, and tonnetz, as the input. The combined features are computed from the raw speech signal and fed into a convolutional neural network for speech classification. The results show that the proposed speech activity detection achieves better performance (+1% accuracy, +8% precision, and +5% F1 score) than previous work in a more complicated noise environment.
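Dempster's rule of combination, the core of the DS-theory fusion mentioned above, can be sketched as follows for two mass functions over the same frame of discernment (an illustrative implementation, not the paper's):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozenset -> mass)
    with Dempster's rule: multiply masses of intersecting focal
    elements and renormalize by the non-conflicting mass."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to the empty set
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}
```

For SAD the frame of discernment would be {speech, nonspeech}, with each DenseNet's softmax output converted into a mass function before combination.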

Cites background from "Multichannel Speaker Activity Detec..."

  • ...Many works have been studying the SAD problem in telephone conversation records [5] and meeting domain [6], which is the mixture of speech and natural background noises....


Proceedings ArticleDOI
24 Jan 2021
TL;DR: The purpose of this work is not to improve the MAEC method, but instead to show that it can be successfully applied to microphone leakage reduction, such as in meetings with headset-equipped participants.
Abstract: Microphone leakage occurs in multichannel close-talk audio recordings of a meeting when speech of an active speaker couples into both the dedicated target microphone and all other microphone channels. For an automatic transcription or analysis of a meeting, the interferer signals in the target microphone channels have to be eliminated. Therefore, we apply a frequency domain adaptive filtering-based multichannel acoustic echo cancellation (MAEC) method, which typically requires clean reference channels. We consider a wide range of different speech-to-interferer ratios and evaluate two cascading schemes for the MAEC, which lead to an improved speech component quality and interferer reduction by up to 0.1 MOS points and 0.5 dB, respectively. However, the purpose of this work is not to improve the MAEC method, but instead to show that it can be successfully applied to microphone leakage reduction, such as in meetings with headset-equipped participants. Therefore, we analyze why the MAEC method is able to cancel the interferer signals in this scenario even though the reference signals are themselves disturbed by interfering speech portions.
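Frequency-domain adaptive filtering is often approximated by an independent single-tap NLMS update per STFT bin; the following sketch uses that common simplification and is not the paper's MAEC implementation:

```python
import numpy as np

def stft_bin_nlms(target_stft, ref_stft, mu=0.5, eps=1e-8):
    """Single-tap per-bin NLMS. `target_stft` and `ref_stft` are
    (frames, bins) complex STFT matrices; the reference is the
    interferer's microphone. Returns the enhanced (error) STFT."""
    num_frames, num_bins = target_stft.shape
    w = np.zeros(num_bins, dtype=complex)
    err = np.zeros_like(target_stft)
    for t in range(num_frames):
        x = ref_stft[t]
        y = w * x                      # per-bin crosstalk estimate
        e = target_stft[t] - y
        w += mu * np.conj(x) * e / (np.abs(x) ** 2 + eps)
        err[t] = e
    return err
```

Each bin adapts independently, which is what makes frequency-domain schemes cheap; a full MAEC would additionally constrain the per-bin filters to a consistent time-domain impulse response.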

Cites background from "Multichannel Speaker Activity Detec..."

  • ...A more detailed description can be found in [21, 26]....


References
Book
01 May 2017
TL;DR: It is argued that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement - in order to become more effective and more efficient.
Abstract: The ability to understand and manage social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn taking, politeness, and disagreement - in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, the design and development of automated systems for social signal processing (SSP) are rather difficult. This paper surveys past efforts to solve these problems computationally, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially aware computing.

988 citations

01 Jan 2002
TL;DR: It is shown that in non-stationary noise environments and under low SNR conditions the IMCRA approach is very effective: compared to a competitive method it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.
Abstract: Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper, we present an Improved Minima Controlled Recursive Averaging (IMCRA) approach for noise estimation in adverse environments involving non-stationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima values of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Then, smoothing in the second iteration excludes relatively strong speech components, which makes the minimum tracking during speech activity robust. We show that in non-stationary noise environments and under low SNR conditions, the IMCRA approach is very effective: compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system achieves improved speech quality and lower residual noise.
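The core of minima-controlled recursive averaging is a first-order recursion whose smoothing factor tends to 1 when speech is likely present, so the noise estimate freezes during speech. A stripped-down sketch, where the default `prob_fn` is a crude stand-in for IMCRA's minima-based speech presence probability:

```python
import numpy as np

def recursive_noise_psd(power_spectra, alpha=0.85, prob_fn=None):
    """power_spectra: (frames, bins) noisy power spectrum.
    The effective smoothing factor alpha_t = alpha + (1 - alpha) * p
    equals 1 when the speech presence probability p is 1, so the
    noise estimate is held during speech activity."""
    if prob_fn is None:
        # crude presence proxy: power far above the current noise floor
        prob_fn = lambda p, n: (p > 4.0 * n).astype(float)
    noise = power_spectra[0].copy()
    for p in power_spectra[1:]:
        spp = prob_fn(p, noise)
        alpha_t = alpha + (1.0 - alpha) * spp   # time-varying smoothing
        noise = alpha_t * noise + (1.0 - alpha_t) * p
    return noise
```

IMCRA replaces the crude threshold with a probability derived from tracked minima of a smoothed periodogram, applied in two smoothing/minimum-tracking iterations.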

834 citations


"Multichannel Speaker Activity Detec..." refers methods in this paper

  • ...We computed the noise signal PSD estimate Φ̂_NN,m(ℓ, k) following a simple, but in this context sufficient and effective 3-state approach [20], instead of the improved minimum recursive averaging approach [21] with higher complexity as applied in the baseline [17]....


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A corpus of data from natural meetings that occurred at the International Computer Science Institute in Berkeley, California over the last three years has been collected; it supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more.
Abstract: We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).

793 citations


"Multichannel Speaker Activity Detec..." refers background in this paper

  • ...1), or lapel microphones for each person [6, 7]....


Proceedings ArticleDOI
16 Jun 1992
TL;DR: This paper describes the scoring algorithms used to arrive at the metrics, as well as the improvements made to the MUC-3 methods, showing that the MUC-4 systems' scores represent a larger improvement over MUC-3 performance than the numbers themselves suggest.
Abstract: The MUC-4 evaluation metrics measure the performance of the message understanding systems. This paper describes the scoring algorithms used to arrive at the metrics as well as the improvements that were made to the MUC-3 methods. MUC-4 evaluation metrics were stricter than those used in MUC-3. Given the differences in scoring between MUC-3 and MUC-4, the MUC-4 systems' scores represent a larger improvement over MUC-3 performance than the numbers themselves suggest.

468 citations


"Multichannel Speaker Activity Detec..." refers methods in this paper

  • ...In order to provide a good impression of the MSAD performance, we applied the Fβ-measure [28, 29], which allows to weight the importance of detecting or not detecting speech frames....

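The Fβ-measure referenced in the excerpt follows the standard definition, a weighted harmonic mean of precision and recall in which β > 1 emphasizes recall (detecting speech frames) and β < 1 emphasizes precision:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + b^2) * P * R / (b^2 * P + R).
    beta=1 gives the familiar F1 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```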

Journal ArticleDOI
TL;DR: This is the first survey of the domain that jointly considers its three major aspects, namely, modeling, analysis, and synthesis of social behavior, which investigates laws and principles underlying social interaction, and explores approaches for automatic understanding of social exchanges recorded with different sensors.
Abstract: Social Signal Processing is the research domain aimed at bridging the social intelligence gap between humans and machines. This paper is the first survey of the domain that jointly considers its three major aspects, namely, modeling, analysis, and synthesis of social behavior. Modeling investigates laws and principles underlying social interaction, analysis explores approaches for automatic understanding of social exchanges recorded with different sensors, and synthesis studies techniques for the generation of social behavior via various forms of embodiment. For each of the above aspects, the paper includes an extensive survey of the literature, points to the most important publicly available resources, and outlines the most fundamental challenges ahead.

398 citations