Open Access Proceedings Article
Analysis of Correlation between Audio and Visual Speech Features for Clean Audio Feature Prediction in Noise
TL;DR: Experiments reveal that features representing broad spectral information have higher correlation to visual features than those representing finer spectral detail.

Abstract:
The aim of this work is to examine the correlation between audio and visual speech features. The motivation is to find visual features that can provide clean audio feature estimates which can be used for speech enhancement when the original audio signal is corrupted by noise. Two audio features (MFCCs and formants) and three visual features (active appearance model, 2-D DCT and cross-DCT) are considered with correlation measured using multiple linear regression. The correlation is then exploited through the development of a maximum a posteriori (MAP) prediction of audio features solely from the visual features. Experiments reveal that features representing broad spectral information have higher correlation to visual features than those representing finer spectral detail. The accuracy of prediction follows the results found in the correlation measurements.
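The two analysis steps named in the abstract — measuring audio-visual correlation with multiple linear regression, and predicting an audio feature from visual features via MAP estimation — can be sketched roughly as follows. This is a minimal illustration on synthetic stand-in data (real inputs would be per-frame MFCC and AAM/DCT feature vectors), and it assumes a single joint-Gaussian model, under which the MAP estimate reduces to the conditional mean; the paper's actual predictor may use a richer model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: per-frame visual features (e.g. 2-D DCT
# coefficients of the mouth region) and one audio feature (e.g. an MFCC).
n, d = 600, 10
V = rng.normal(size=(n, d))                          # visual feature matrix
a = V @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)  # audio feature

# --- Multiple linear regression: audio-visual correlation ---
# Regress the audio feature on all visual features jointly, then take the
# correlation between the regression prediction and the true values
# (the multiple correlation coefficient R).
X = np.column_stack([np.ones(n), V])                 # design matrix + intercept
w, *_ = np.linalg.lstsq(X, a, rcond=None)
R = np.corrcoef(a, X @ w)[0, 1]

# --- MAP prediction of audio from visual (single-Gaussian simplification) ---
# For jointly Gaussian [a; v], the MAP estimate of a given v is the
# conditional mean: mu_a + C_av C_vv^{-1} (v - mu_v).
Z = np.column_stack([a, V])
mu = Z.mean(axis=0)
C = np.cov(Z, rowvar=False)
C_av, C_vv = C[0, 1:], C[1:, 1:]
gain = np.linalg.solve(C_vv, C_av)
a_map = mu[0] + (V - mu[1:]) @ gain

rmse = np.sqrt(np.mean((a - a_map) ** 2))
print(f"multiple correlation R = {R:.3f}, MAP prediction RMSE = {rmse:.3f}")
```

On this mostly linear synthetic data, R is close to 1 and the MAP residual is close to the injected noise level; on real audio-visual data, R would vary per feature, which is the quantity the paper compares across MFCCs and formants.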
Citations
Posted Content
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
The Challenge of Multispeaker Lip-Reading
TL;DR: This paper shows the danger of not using different speakers in the training and test sets and demonstrates that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken.
Journal ArticleDOI
Visually-Derived Wiener Filters for Speech Enhancement
TL;DR: In this paper, a visually derived Wiener filter was proposed to extract clean speech and noise power spectrum statistics from visual speech features, which is used to enhance audio speech that has been contaminated by noise.
Journal ArticleDOI
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
TL;DR: In this paper, the authors provide a comprehensive survey of audio-visual speech enhancement and speech separation based on deep learning, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions.
References
Book
Regression Analysis by Example
Samprit Chatterjee, B. Price +1 more
TL;DR: Covers simple linear regression; multiple linear regression; regression diagnostics (detection of model violations); qualitative variables as predictors; transformation of variables; weighted least squares; the problem of correlated errors; analysis of collinear data; biased estimation of regression coefficients; variable selection procedures; and logistic regression.
Journal ArticleDOI
Visual contribution to speech intelligibility in noise
W. H. Sumby, Irwin Pollack +1 more
TL;DR: In this article, the visual contribution to oral speech intelligibility was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test.
Journal ArticleDOI
Regression Analysis by Example
TL;DR: This book serves well as an introduction to the specific area of methods for detecting and correcting model violations in the standard linear regression model; it also provides a general overview of transformations of variables, focusing on three traditional situations where transformations can be applied.
Proceedings ArticleDOI
Statistical models of appearance for medical image analysis and computer vision
TL;DR: The Active Shape Model essentially matches a model to boundaries in an image, while the Active Appearance Model finds model parameters that synthesize a complete image as similar as possible to the target image.
Journal ArticleDOI
Quantitative association of vocal-tract and facial behavior
TL;DR: Multilinear techniques are applied to support the claims that facial motion during speech is largely a by-product of producing the speech acoustics and is better estimated by the 3-D motion of the face than by the midsagittal motion of the anterior vocal tract.