Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array
TLDR
The accurate speaker tracking provided by the audio-visual sensor array proved beneficial to improve the recognition performance in a microphone array-based speech recognition system, both in terms of enhancement and recognition.
Abstract:
This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach, in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach are significantly better than a single table-top microphone and are comparable to a lapel microphone for some of the scenarios. The results also indicate that the audio-visual-based system performs significantly better than the audio-only system, both in terms of enhancement and recognition. This reveals that the accurate speaker tracking provided by the audio-visual sensor array proved beneficial to improve the recognition performance in a microphone array-based speech recognition system.
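To illustrate why beamforming depends on knowing the speaker location, the sketch below shows a basic delay-and-sum beamformer in Python. It is not the method used in the paper, which adds audio-visual tracking and a postfiltering stage. The function names, the whole-sample delays, and the 343 m/s speed of sound are assumptions made for this example only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions, source_position, sample_rate):
    """Return per-microphone delays in whole samples, relative to the nearest mic."""
    distances = [math.dist(m, source_position) for m in mic_positions]
    nearest = min(distances)
    return [round((d - nearest) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average the aligned samples."""
    # Only keep samples for which every shifted channel still has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

If the assumed source position is wrong, the computed delays no longer align the channels, and the speech adds incoherently. This is the dependency on speaker location that the abstract refers to.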
Citations
Journal Article
Audiovisual Information Fusion in Human–Computer Interfaces and Intelligent Environments: A Survey
TL;DR: The fusion strategies and the corresponding models used in audiovisual tasks such as speech recognition, tracking, biometrics, affective state recognition, and meeting scene analysis are described.
Journal Article
A Multimodal Approach to Blind Source Separation of Moving Sources
TL;DR: Experimental results confirm that by utilizing the visual modality, the proposed algorithm not only improves the performance of the BSS algorithm and mitigates the permutation problem for stationary sources, but also provides a good BSS performance for moving sources in a low reverberant environment.
Journal Article
Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera
Takaaki Hori, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, Shinji Watanabe, Takanobu Oba, Atsunori Ogawa, Kazuhiro Otsuka, Dan Mikami, Keisuke Kinoshita, Tomohiro Nakatani, Atsushi Nakamura, Junji Yamato
TL;DR: The techniques and the attempt to achieve the low-latency monitoring of meetings are described, the experimental results for real-time meeting transcription are shown, and the goal is to recognize automatically “who is speaking what” in an online manner for meeting assistance.
Posted Content
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
Journal Article
An unsupervised acoustic fall detection system using source separation for sound interference suppression
TL;DR: A novel unsupervised fall detection system that employs the collected acoustic signals from an elderly person's normal activities to construct a data description model to distinguish falls from non-falls as compared with existing single microphone based methods.
References
Book
Multiple view geometry in computer vision
Richard Hartley, Andrew Zisserman
TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.
Book
Modern Information Retrieval
TL;DR: In this article, the authors present a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, which provides an up-to-date student oriented treatment of the subject.
Book
Sequential Monte Carlo methods in practice
TL;DR: This book presents the first comprehensive treatment of Monte Carlo techniques, including convergence results and applications to tracking, guidance, automated target recognition, aircraft navigation, robot navigation, econometrics, financial modeling, neural networks, optimal control, optimal filtering, communications, reinforcement learning, signal enhancement, model averaging and selection.
Journal Article
CONDENSATION - Conditional Density Propagation for Visual Tracking
Michael Isard, Andrew Blake
TL;DR: The Condensation algorithm uses “factored sampling”, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set.
Journal Article
Two decades of array signal processing research: the parametric approach
Hamid Krim, Mats Viberg
TL;DR: The article consists of background material and of the basic problem formulation, and introduces spectral-based algorithmic solutions to the signal parameter estimation problem and contrasts these suboptimal solutions with parametric methods.