Classic acoustic beamforming techniques are proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain; the same frontend also yields improvements in a speech recognition task.
Abstract:
When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
TL;DR: An analysis of speaker diarization performance, as reported through the NIST Rich Transcription evaluations on meeting data, is presented, and important areas for future research are identified.
TL;DR: A neural network based approach to acoustic beamforming is presented in which the network estimates spectral masks; the cross-power spectral density matrices of speech and noise are estimated from these masks and are then used to compute the beamformer coefficients.
TL;DR: It is found that training on different noise environments and different microphones barely affects the ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact.
TL;DR: It is shown that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
TL;DR: The article consists of background material and the basic problem formulation; it introduces spectral-based algorithmic solutions to the signal parameter estimation problem and contrasts these suboptimal solutions with parametric methods.
TL;DR: In this paper, a maximum likelihood estimator is developed for determining time delay between signals received at two spatially separated sensors in the presence of uncorrelated noise, where the role of the prefilters is to accentuate the signal passed to the correlator at frequencies for which the signal-to-noise (S/N) ratio is highest and suppress the noise power.
TL;DR: The author explains the development of the Wiener Solution and some of the techniques used in its implementation, including Optimum Processing: Steady State Performance and the Wiener Solution, which simplifies the implementation of the Covariance Matrix.
Q1. What are the contributions mentioned in the paper "Acoustic beamforming for speaker diarization of meetings"?
In this work the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. The new techniques the authors present include blind reference-channel selection, two-step Time Delay of Arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information.
Q2. What was used for the speaker diarization task?
The beamforming system developed for the speaker diarization task was also used to obtain an enhanced signal for the ASR systems that ICSI and SRI presented at the RT NIST evaluations.
Q3. What is the significance test of the RT06s system compared to the baseline?
The significance test of this system compared to the baseline is passed on both development and test cases (Z = 1.98 in development, Z = 2.31 in test).
Q4. What is the purpose of the acoustic beamforming system presented in this work?
The acoustic beamforming system presented in this work was created for use in the speaker diarization task for the meetings environment.
Q5. How is the noise cancellation achieved in standard beamforming systems?
In standard beamforming systems, noise cancellation is achieved through the use of identical microphones placed only a few inches apart from each other.
Q6. What is the TDOA filtering step?
The authors implemented two filtering steps: a noisy TDOA detection and elimination step (TDOA continuity enhancement), and a 1-best TDOA selection from the N-best vector.
Q7. What is the way to estimate the delay between channels?
To estimate the bias, an average cross-correlation metric was used to obtain the average (across time) delay between each channel and the reference channel over a set of long acoustic windows (around 20 seconds), evenly distributed along the meeting.
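The averaging idea in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses plain (unweighted) FFT-based cross-correlation, averages the correlation curves of the windows before picking the peak, and the function name `average_delay` and its parameters are hypothetical.

```python
import numpy as np

def average_delay(channel, reference, window_len, n_windows, max_lag):
    """Estimate the average lag (in samples) of `channel` relative to `reference`.

    Hypothetical helper: averages plain cross-correlation curves computed on
    `n_windows` windows of `window_len` samples spread evenly over the signals,
    then returns the lag of the peak. A positive result means `channel` is a
    delayed copy of `reference`.
    """
    channel = np.asarray(channel, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n = min(len(channel), len(reference))
    if not 1 <= max_lag < window_len <= n:
        raise ValueError("need 1 <= max_lag < window_len <= signal length")

    # Window start positions, evenly distributed along the recording.
    starts = np.linspace(0, n - window_len, n_windows).astype(int)
    nfft = 2 * window_len  # zero-padding avoids circular wrap-around
    acc = np.zeros(2 * max_lag + 1)

    for s in starts:
        x = channel[s:s + window_len]
        r = reference[s:s + window_len]
        # cc[k] = sum_n x[n + k] * r[n]; negative lags sit at the end of cc.
        cc = np.fft.irfft(np.fft.rfft(x, nfft) * np.conj(np.fft.rfft(r, nfft)), nfft)
        acc += np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

    acc /= len(starts)
    return int(np.argmax(acc)) - max_lag
```

At a 16 kHz sampling rate (an assumed value), a 20-second window would correspond to `window_len = 320000`.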
Q8. What is the TDOA value for each channel?
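Assuming $\mathrm{TDOA}_m^{g(l,m)}[c]$ is the TDOA value for the $g(l,m)$-best element in channel $m$ for segment $c$, the transition weights between two TDOA combinations for all microphones are determined by

$$Tr_2[i,j;c] = \sum_{m=1}^{M} \frac{\Delta_{\mathrm{diff}}[i,j;c] - \left|\mathrm{TDOA}_m^{g(i,m)}[c] - \mathrm{TDOA}_m^{g(j,m)}[c-1]\right|}{\Delta_{\mathrm{diff}}[i,j;c]} \qquad (8)$$

where now $\Delta_{\mathrm{diff}}[i,j;c] = \max\left(\left|\mathrm{TDOA}_m^{g(i,m)}[c] - \mathrm{TDOA}_m^{g(j,m)}[c-1]\right|,\ \forall i, j, m\right)$.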
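As a rough illustration of the transition weights in Eq. (8), the following Python sketch computes them for one pair of consecutive segments. The function name `transition_weights`, the array layout, and the handling of the degenerate case where every difference is zero are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def transition_weights(tdoa_cur, tdoa_prev, combos):
    """Transition weights between TDOA combinations, following Eq. (8).

    tdoa_cur, tdoa_prev: arrays of shape (M, N) with the N-best TDOA values
        of each of the M microphones at segments c and c-1.
    combos: integer array of shape (K, M); combos[i, m] plays the role of
        g(i, m), the N-best index chosen for microphone m in combination i.
    Returns an array of shape (K, K) whose entry [i, j] is Tr2[i, j; c].
    """
    tdoa_cur = np.asarray(tdoa_cur, dtype=float)
    tdoa_prev = np.asarray(tdoa_prev, dtype=float)
    combos = np.asarray(combos, dtype=int)
    mics = np.arange(tdoa_cur.shape[0])

    cur = tdoa_cur[mics, combos]    # (K, M): TDOA of mic m under combination i
    prev = tdoa_prev[mics, combos]  # (K, M): same for the previous segment

    # |TDOA(i, m)[c] - TDOA(j, m)[c-1]| for every pair (i, j) and microphone m.
    diffs = np.abs(cur[:, None, :] - prev[None, :, :])
    delta = diffs.max()  # maximum over all i, j, m

    if delta == 0:
        # Assumption: with no variation at all, every term counts as 1.
        return np.full(diffs.shape[:2], float(len(mics)))
    return ((delta - diffs) / delta).sum(axis=2)
```

Each term lies between 0 and 1, so a weight close to M means the two combinations are nearly identical across all microphones, which favours smooth TDOA trajectories during Viterbi decoding.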