Open Access Proceedings Article

Analysis of Correlation between Audio and Visual Speech Features for Clean Audio Feature Prediction in Noise

TL;DR
Experiments reveal that features representing broad spectral information have higher correlation to visual features than those representing finer spectral detail.
Abstract
The aim of this work is to examine the correlation between audio and visual speech features. The motivation is to find visual features that can provide clean audio feature estimates which can be used for speech enhancement when the original audio signal is corrupted by noise. Two audio features (MFCCs and formants) and three visual features (active appearance model, 2-D DCT and cross-DCT) are considered with correlation measured using multiple linear regression. The correlation is then exploited through the development of a maximum a posteriori (MAP) prediction of audio features solely from the visual features. Experiments reveal that features representing broad spectral information have higher correlation to visual features than those representing finer spectral detail. The accuracy of prediction follows the results found in the correlation measurements.
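To make the two measures concrete, here is a minimal sketch that fits a multiple linear regression from visual to audio features, reports the multiple correlation coefficient R per audio dimension, and then forms a MAP prediction of the audio features from the visual ones under a single joint Gaussian. Everything in it is an illustrative assumption: the synthetic feature matrices, the dimensionalities, and the single-Gaussian density stand in for the paper's actual AAM/DCT visual features, MFCC/formant audio features, and whatever density its MAP estimator uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-frame features (assumed shapes, not the
# paper's data): V holds visual features (e.g. AAM or 2-D DCT
# coefficients), A holds audio features (e.g. MFCCs). Rows are frames.
n_frames, n_vis, n_aud = 5000, 20, 13
V = rng.standard_normal((n_frames, n_vis))
A = (V @ rng.standard_normal((n_vis, n_aud))
     + 0.5 * rng.standard_normal((n_frames, n_aud)))

# Multiple linear regression: predict each audio dimension from all
# visual dimensions, then take the multiple correlation coefficient R
# between prediction and truth as the audio-visual correlation measure.
V1 = np.hstack([V, np.ones((n_frames, 1))])      # add an intercept column
W, *_ = np.linalg.lstsq(V1, A, rcond=None)       # least-squares fit
A_hat = V1 @ W
R = np.array([np.corrcoef(A[:, i], A_hat[:, i])[0, 1]
              for i in range(n_aud)])
print("multiple correlation R per audio feature:", np.round(R, 3))

# MAP prediction of audio from visual under a single joint Gaussian over
# [audio; visual]: the MAP estimate is the conditional mean
#   a_hat = mu_a + S_av S_vv^{-1} (v - mu_v).
# (A mixture density would combine one such term per component; a single
# Gaussian keeps the sketch short.)
X = np.hstack([A, V])
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)
S_av, S_vv = S[:n_aud, n_aud:], S[n_aud:, n_aud:]

def map_audio(v):
    """Gaussian MAP (= conditional mean) estimate of audio given visual."""
    return mu[:n_aud] + S_av @ np.linalg.solve(S_vv, v - mu[n_aud:])

print("predicted audio features for frame 0:", np.round(map_audio(V[0]), 3))
```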


Citations
Posted Content

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.

The Challenge of Multispeaker Lip-Reading

TL;DR: This paper shows the danger of not using different speakers in the training and test sets and demonstrates that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken.
Journal ArticleDOI

Visually-Derived Wiener Filters for Speech Enhancement

TL;DR: In this paper, clean speech and noise power spectrum statistics are estimated from visual speech features and used to build a visually derived Wiener filter, which enhances audio speech that has been contaminated by noise.
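For context on how such a filter is applied, below is a minimal sketch of the standard Wiener gain H = S / (S + N). The function name and array shapes are illustrative placeholders, and it assumes the clean-speech and noise power spectra have already been estimated; obtaining those estimates from visual features is the cited paper's contribution and is not reproduced here.

```python
import numpy as np

def wiener_enhance(noisy_stft, clean_psd_est, noise_psd_est, eps=1e-10):
    """Apply the standard Wiener gain H = S / (S + N) per time-frequency bin.

    noisy_stft    : complex STFT of the noisy speech, shape (freq, frames)
    clean_psd_est : estimated clean-speech power spectrum, same shape
                    (in the cited work this estimate is visually derived)
    noise_psd_est : estimated noise power spectrum, same shape
    """
    gain = clean_psd_est / (clean_psd_est + noise_psd_est + eps)
    return gain * noisy_stft  # enhanced STFT; invert with an ISTFT for audio
```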
References
Book

Regression Analysis by Example

TL;DR: The book covers simple and multiple linear regression, regression diagnostics (detection of model violations), qualitative variables as predictors, transformation of variables, weighted least squares, the problem of correlated errors, analysis of collinear data, biased estimation of regression coefficients, variable selection procedures, and logistic regression.
Journal ArticleDOI

Visual contribution to speech intelligibility in noise

TL;DR: In this article, the visual contribution to oral speech intelligibility was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test.
Journal ArticleDOI

Regression Analysis by Example

Terri L. Moore
01 May 2001
TL;DR: This book serves well as an introduction to the specific area of methods for detecting and correcting model violations in the standard linear regression model, and provides a general overview of transformations of variables, focusing on three traditional situations where transformations can be applied.
Proceedings ArticleDOI

Statistical models of appearance for medical image analysis and computer vision

TL;DR: The Active Shape Model matches a model to boundaries in an image, while the Active Appearance Model finds model parameters that synthesize a complete image as similar as possible to the target image.
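As a rough sketch of the fitting objective this summary describes, the snippet below writes out the sum-of-squares image residual an Active Appearance Model search drives down. The `synthesize` callable and the image representation are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def aam_cost(params, target_image, synthesize):
    """Sum-of-squares residual minimised by an AAM search: distance
    between the model-synthesised image and the target image.

    params       : model (shape + appearance) parameter vector
    target_image : the image being matched, as a float array
    synthesize   : hypothetical callable mapping params -> model image
    """
    residual = target_image - synthesize(params)
    return float(np.sum(residual ** 2))
```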
Journal ArticleDOI

Quantitative association of vocal-tract and facial behavior

TL;DR: Multilinear techniques are applied to support the claims that facial motion during speech is largely a by-product of producing the speech acoustics, and that the acoustics are better estimated by the 3D motion of the face than by the midsagittal motion of the anterior vocal tract.