Open Access Proceedings Article
Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization).
TLDR
The experiments show that the use of visual information in AVCDCN allows significant performance gains over CDCN, and the technique is considered for use with both audio-only and audio-visual speech recognition.
Abstract:
We introduce a non-linear enhancement technique called audio-visual codebook dependent cepstral normalization (AVCDCN) and we consider its use with both audio-only and audio-visual speech recognition. AVCDCN is inspired by CDCN, an audio-only enhancement technique that approximates the nonlinear effect of noise on speech with a piecewise constant function. Our experiments show that the use of visual information in AVCDCN allows significant performance gains over CDCN.
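The abstract describes the core of CDCN-style enhancement: the nonlinear effect of noise on speech is approximated by a piecewise constant function over a cepstral codebook, i.e. each noisy frame receives the correction of its codebook region. A minimal illustrative sketch of that piecewise-constant idea (the function and variable names are hypothetical, and the paper's actual EM-based estimation of the correction vectors is not shown):

```python
import numpy as np

def codebook_dependent_normalization(noisy_cepstra, codebook, corrections):
    """Piecewise-constant enhancement: each noisy cepstral frame is
    compensated with the correction vector of its nearest codeword.

    noisy_cepstra : (T, D) array of noisy cepstral frames
    codebook      : (K, D) array of codeword means
    corrections   : (K, D) array of per-codeword compensation vectors
    """
    # Squared distance from every frame to every codeword, shape (T, K).
    d2 = ((noisy_cepstra[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # Hard codeword assignment per frame (AVCDCN would additionally
    # condition this choice on visual features).
    nearest = d2.argmin(axis=1)
    return noisy_cepstra + corrections[nearest]
```

Because the correction is constant within each codebook region, the overall mapping from noisy to enhanced cepstra is piecewise constant, which is the approximation the abstract refers to.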
Citations
Journal Article (DOI)
Recent advances in the automatic recognition of audiovisual speech
TL;DR: The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.
Audio-Visual Automatic Speech Recognition: An Overview
TL;DR: Novel, non-traditional approaches that use sources of information orthogonal to the acoustic input are needed to achieve ASR performance closer to the human speech perception level and robust enough to be deployable in field applications.
Proceedings Article (DOI)
Pixels that sound
TL;DR: This work presents a stable and robust algorithm that captures dynamic audio-visual events with high spatial resolution, and derives a unique solution based on canonical correlation analysis (CCA) that effectively detects pixels associated with the sound while filtering out other dynamic pixels.
Journal Article (DOI)
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
TL;DR: The proposed AVDCNN model is structured as an audio–visual encoder–decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech and reconstructed images at the output layer.
Journal Article (DOI)
Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures
TL;DR: A novel algorithm that plugs the audiovisual coherence of speech signals, estimated with statistical tools, into audio blind source separation (BSS) techniques is presented and applied to the difficult and realistic case of convolutive mixtures.
References
Book (DOI)
Acoustical and environmental robustness in automatic speech recognition
TL;DR: This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment, including SNR-Dependent Cepstral Normalization (SDCN) and Codeword-Dependent Cepstral Normalization (CDCN).
Proceedings Article (DOI)
Environmental robustness in automatic speech recognition
Alejandro Acero, Richard M. Stern +1 more
TL;DR: Initial efforts to make Sphinx, a continuous-speech speaker-independent recognition system, robust to changes in the environment are reported, and two novel methods based on additive corrections in the cepstral domain are proposed.
Audio-visual speech recognition
Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, Iain Matthews, Hervé Glotin, D. Vergyri, J. Sison, A. Mashari +7 more
TL;DR: Speech Reference EPFL-CONF-82637 is presented, which describes the development of a framework for future generations of interpreters to understand and respond to audible language barriers.
Proceedings Article (DOI)
High-performance robust speech recognition using stereo training data
TL;DR: A novel technique, SPLICE (Stereo-based Piecewise Linear Compensation for Environments), for high-performance robust speech recognition is described: an efficient noise-reduction and channel-distortion compensation technique that makes effective use of stereo training data.
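The stereo-training idea summarized above can be sketched in simplified form: with paired clean and noisy recordings of the same utterances, one can estimate a per-region correction as the mean clean-minus-noisy difference over the noisy frames assigned to each codebook region. This is a bias-only illustrative sketch with hypothetical names, not the paper's full piecewise-linear formulation:

```python
import numpy as np

def train_stereo_biases(noisy, clean, codebook):
    """Estimate one bias vector per codeword from stereo frame pairs.

    noisy    : (T, D) noisy cepstral frames
    clean    : (T, D) time-aligned clean cepstral frames
    codebook : (K, D) codeword means partitioning the noisy space
    """
    # Assign each noisy frame to its nearest codeword, shape (T,).
    d2 = ((noisy[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    K, D = codebook.shape
    biases = np.zeros((K, D))
    for k in range(K):
        mask = assign == k
        if mask.any():
            # Mean clean-minus-noisy difference within this region.
            biases[k] = (clean[mask] - noisy[mask]).mean(axis=0)
    return biases
```

At recognition time the learned bias of a frame's region would be added to the noisy frame, mirroring the piecewise compensation the stereo data makes possible.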
Journal Article (DOI)
Audio-visual enhancement of speech in noise.
TL;DR: An audio-visual approach to the problem of speech enhancement in noise is considered, since it has been demonstrated in several studies that viewing the speaker's face improves message intelligibility, especially in noisy environments.