scispace - formally typeset

Xu Shao

Researcher at Nuance Communications

Publications: 23
Citations: 1404

Xu Shao is an academic researcher from Nuance Communications. The author has contributed to research in the topics of Mel-frequency cepstrum and speech processing. The author has an h-index of 9, and has co-authored 23 publications receiving 1225 citations. Previous affiliations of Xu Shao include the University of East Anglia and the University of Sheffield.

Papers
Journal ArticleDOI

An audio-visual corpus for speech perception and automatic speech recognition

TL;DR: An audio-visual corpus is presented that consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers, supporting the use of common material in speech perception and automatic speech recognition studies.
Journal ArticleDOI

Prediction of Fundamental Frequency and Voicing From Mel-Frequency Cepstral Coefficients for Unconstrained Speech Reconstruction

TL;DR: Spectrogram analysis of reconstructed speech shows that highly intelligible speech is produced with the quality of the speaker-dependent speech being slightly higher owing to the more accurate fundamental frequency and voicing predictions.
Proceedings Article

Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model

Ben Milner, +1 more
TL;DR: This work presents a method of reconstructing a speech signal from a stream of MFCC vectors using a source-filter model of speech production, and listening tests reveal that the reconstructed speech is intelligible and of similar quality to a system based on LPC analysis of the original speech.
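The source-filter idea behind this reconstruction can be sketched as follows: a periodic excitation at the predicted fundamental frequency is passed through a vocal-tract filter whose envelope would, in the paper's pipeline, be derived from the MFCC vectors. The sketch below is a minimal illustration of that model, not the authors' actual method; the filter coefficients are placeholders standing in for an MFCC-derived envelope.

```python
import numpy as np
from scipy.signal import lfilter

def source_filter_synthesis(f0_hz, lpc_coeffs, duration_s=0.5, sr=16000):
    """Synthesize a voiced segment: an impulse-train excitation at f0
    driven through an all-pole (LPC-style) vocal-tract filter.
    The coefficients here are placeholders; in the paper's system the
    spectral envelope would come from the MFCC stream."""
    n = int(duration_s * sr)
    period = int(sr / f0_hz)              # samples per pitch period
    excitation = np.zeros(n)
    excitation[::period] = 1.0            # impulse train = voiced source
    # all-pole filter: y[t] = x[t] - sum_k a_k * y[t-k]
    return lfilter([1.0], np.concatenate(([1.0], lpc_coeffs)), excitation)

# Illustrative stable 2nd-order resonator near 500 Hz (placeholder envelope)
r, theta = 0.97, 2 * np.pi * 500 / 16000
a = np.array([-2 * r * np.cos(theta), r * r])
y = source_filter_synthesis(120.0, a)
```

Changing `f0_hz` shifts the pitch of the excitation independently of the filter, which is the property that makes separate fundamental-frequency prediction (as in the papers above) usable for reconstruction.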
Journal ArticleDOI

Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment

TL;DR: The paper presents a novel solution that combines audio and visual information to estimate acoustic SNR. It also relates the use of visual information in the current system to its role in recent simultaneous-speaker intelligibility studies, where, as well as providing phonetic content, it triggers "informational masking release", helping the listener attend selectively to the target speech stream.
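In multistream recognition of this kind, a stream weight scales the audio log-likelihood against the visual one before decoding. The sketch below illustrates that combination only; the sigmoid SNR-to-weight mapping is an assumption for demonstration, not the estimator proposed in the paper.

```python
import numpy as np

def combine_stream_loglik(loglik_audio, loglik_visual, snr_db):
    """Combine audio and visual stream log-likelihoods with a stream
    weight lam in [0, 1]. The sigmoid mapping from SNR to lam below is
    a placeholder, not the paper's actual SNR-based estimator."""
    lam = 1.0 / (1.0 + np.exp(-0.5 * snr_db))   # high SNR -> trust audio
    return lam * loglik_audio + (1.0 - lam) * loglik_visual

# At very low SNR the combined score tracks the visual stream;
# at high SNR it tracks the audio stream.
low_snr = combine_stream_loglik(-10.0, -2.0, snr_db=-20.0)
high_snr = combine_stream_loglik(-10.0, -2.0, snr_db=20.0)
```

The weighted-sum form is the standard multistream combination; what distinguishes systems like the one summarized above is how the weight itself is estimated from the acoustic and visual evidence.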
Journal ArticleDOI

Clean speech reconstruction from MFCC vectors and fundamental frequency using an integrated front-end

TL;DR: Speech reconstruction tests reveal that the combination of robust fundamental frequency and voicing estimation with spectral subtraction in the integrated front-end leads to intelligible and relatively noise-free speech.