Open Access Journal Article

Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array

TLDR
Accurate speaker tracking from the audio-visual sensor array improves both speech enhancement and recognition performance in a microphone array-based speech recognition system.
Abstract
This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and is comparable to that of a lapel microphone for some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system, both in terms of enhancement and recognition. This shows that the accurate speaker tracking provided by the audio-visual sensor array is beneficial for improving recognition performance in a microphone array-based speech recognition system.
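To illustrate why beamforming depends on knowing the speaker location, here is a minimal delay-and-sum sketch. It is not the beamformer or postfilter used in the paper; it assumes a known source position (such as a tracker might supply), whole-sample delays, and a fixed speed of sound. The function names `steering_delays` and `delay_and_sum` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions, source_position, sample_rate):
    """Per-microphone delay in whole samples, relative to the closest microphone."""
    distances = [math.dist(m, source_position) for m in mic_positions]
    nearest = min(distances)
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay and average the aligned samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# Two microphones on the x-axis, with the source to the left of both.
mics = [(0.0, 0.0), (3.43, 0.0)]
source = (-1.0, 0.0)
delays = steering_delays(mics, source, sample_rate=1000)

# The same impulse reaches the second microphone 10 samples later.
near = [0.0, 1.0, 0.0] + [0.0] * 17
far = [0.0] * 10 + [0.0, 1.0, 0.0] + [0.0] * 7
enhanced = delay_and_sum([near, far], delays)
```

If the assumed source position is wrong, the delays no longer align the channels and the averaged signal is attenuated, which is why tracking accuracy matters for the enhancement stage.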


Citations
Journal Article

Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey

TL;DR: The fusion strategies and the corresponding models used in audiovisual tasks such as speech recognition, tracking, biometrics, affective state recognition, and meeting scene analysis are described.
Journal Article

A Multimodal Approach to Blind Source Separation of Moving Sources

TL;DR: Experimental results confirm that, by utilizing the visual modality, the proposed algorithm not only improves the performance of the BSS algorithm and mitigates the permutation problem for stationary sources, but also provides good BSS performance for moving sources in a low-reverberation environment.
Journal Article

Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera

TL;DR: The paper describes techniques for low-latency monitoring of meetings and presents experimental results for real-time meeting transcription; the goal is to automatically recognize “who is speaking what” in an online manner for meeting assistance.
Posted Content

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
Journal Article

An unsupervised acoustic fall detection system using source separation for sound interference suppression

TL;DR: A novel unsupervised fall detection system that employs the collected acoustic signals from an elderly person's normal activities to construct a data description model to distinguish falls from non-falls as compared with existing single microphone based methods.
References
Book

Multiple view geometry in computer vision

TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Book

Modern Information Retrieval

TL;DR: In this article, the authors present a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, which provides an up-to-date student oriented treatment of the subject.
Book

Sequential Monte Carlo methods in practice

TL;DR: This book presents the first comprehensive treatment of Monte Carlo techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection.
Journal Article

CONDENSATION—Conditional Density Propagation for Visual Tracking

TL;DR: The Condensation algorithm uses “factored sampling”, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set.
Journal Article

Two decades of array signal processing research: the parametric approach

TL;DR: The article presents background material and the basic problem formulation, introduces spectral-based algorithmic solutions to the signal parameter estimation problem, and contrasts these suboptimal solutions with parametric methods.