Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array

doi:10.1109/TASL.2007.906197



H.K. Maganti, +2 more

- 01 Nov 2007 -

IEEE Transactions on Audio, Speech, and ...

- Vol. 15, Iss: 8, pp 2257-2269

TLDR

The accurate speaker tracking provided by the audio-visual sensor array improved the performance of a microphone array-based speech recognition system, in terms of both enhancement and recognition.

Abstract:

This paper addresses the problem of distant speech acquisition in multiparty meetings using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and is comparable to that of a lapel microphone for some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system, in terms of both enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array improves recognition performance in a microphone array-based speech recognition system.
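To illustrate why beamforming depends on the speaker location, the following is a minimal sketch of a delay-and-sum beamformer. It is not the paper's method: the abstract does not specify the beamformer or the postfilter, so delay-and-sum is used here only as the simplest form of spatial filtering. The function names, the 343 m/s speed of sound, and the use of whole-sample delays are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed room-temperature value


def steering_delays(mic_positions, source_position, sample_rate):
    """Return per-microphone delays in whole samples (hypothetical helper).

    Delays are measured relative to the microphone closest to the source,
    so the nearest microphone always has a delay of 0.
    """
    distances = [math.dist(mic, source_position) for mic in mic_positions]
    nearest = min(distances)
    return [
        round((d - nearest) / SPEED_OF_SOUND * sample_rate)
        for d in distances
    ]


def delay_and_sum(channels, delays):
    """Time-align the channels using the given delays and average them.

    Each channel is advanced by its delay so that sound arriving from the
    steered location adds coherently, while sound from other directions
    is misaligned and partially cancels.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]
```

In this sketch the source position passed to `steering_delays` stands in for the location estimate that a speaker tracker would supply. If that estimate is wrong, the channels are misaligned and the target speech is attenuated along with the interference, which is the dependence on speaker location that the abstract describes.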



