
Showing papers by "Kevin G. Munhall published in 2022"


Proceedings ArticleDOI
01 Jun 2022
TL;DR: A large-scale speech and mocap dataset that focuses on capturing tongue, jaw, and lip motion is introduced, and a deep-learning-based method for accurate and generalizable speech-to-tongue-and-jaw animation is proposed.
Abstract: Advances in speech driven animation techniques allow the creation of convincing animations for virtual characters solely from audio data. Many existing approaches focus on facial and lip motion and they often do not provide realistic animation of the inner mouth. This paper addresses the problem of speech-driven inner mouth animation. Obtaining performance capture data of the tongue and jaw from video alone is difficult because the inner mouth is only partially observable during speech. In this work, we introduce a large-scale speech and mocap dataset that focuses on capturing tongue, jaw, and lip motion. This dataset enables research using data-driven techniques to generate realistic inner mouth animation from speech. We then propose a deep-learning based method for accurate and generalizable speech to tongue and jaw animation, and evaluate several encoder-decoder network architectures and audio feature encoders. We find that recent self-supervised deep learning based audio feature encoders are robust, generalize well to unseen speakers and content, and work best for our task. To demonstrate the practical application of our approach, we show animations on high-quality parametric 3D face models driven by the landmarks generated from our speech-to-tongue animation method.

5 citations
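
To make the general shape of such a pipeline concrete, here is a minimal sketch (not the authors' code): pre-computed self-supervised audio features (e.g. wav2vec 2.0 frames) are fed to a recurrent encoder, and a per-frame decoder regresses tongue/jaw landmark coordinates. The feature dimension, landmark count, and GRU/MLP choices are illustrative assumptions, not details taken from the paper or its dataset.

```python
# Minimal sketch, assuming pre-computed self-supervised audio features
# (e.g. wav2vec 2.0 frames) as input; dimensions are illustrative only.
import torch
import torch.nn as nn

class SpeechToTongueDecoder(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=256, n_landmarks=10):
        super().__init__()
        # Temporal encoder over the audio feature sequence.
        self.encoder = nn.GRU(feat_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Per-frame regression to 3D landmark coordinates.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_landmarks * 3),
        )

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        h, _ = self.encoder(feats)
        out = self.decoder(h)            # (batch, frames, n_landmarks * 3)
        return out.view(feats.size(0), feats.size(1), -1, 3)

# Example: a 300 ms audio window yields roughly 15 feature frames at a 20 ms hop.
model = SpeechToTongueDecoder()
dummy_feats = torch.randn(1, 15, 768)    # stand-in for real audio features
landmarks = model(dummy_feats)           # (1, 15, 10, 3)
print(landmarks.shape)
```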


Journal ArticleDOI
TL;DR: In this paper, the authors used a real-time formant manipulation system to explore how reliant speech articulation is on the accuracy or predictability of auditory feedback, and found that speakers' responses to feedback manipulations varied with the relevance and degree of the error introduced in the various feedback conditions.
Abstract: Sensory information, including auditory feedback, is used by talkers to maintain fluent speech articulation. Current models of speech motor control posit that speakers continually adjust their motor commands based on discrepancies between the sensory predictions made by a forward model and the sensory consequences of their speech movements. Here, in two within-subject design experiments, we used a real-time formant manipulation system to explore how reliant speech articulation is on the accuracy or predictability of auditory feedback information. This involved introducing random formant perturbations during vowel production that varied systematically in their spatial location in formant space (Experiment 1) and temporal consistency (Experiment 2). Our results indicate that, on average, speakers’ responses to auditory feedback manipulations varied based on the relevance and degree of the error that was introduced in the various feedback conditions. In Experiment 1, speakers’ average production was not reliably influenced by random perturbations that were introduced every utterance to the first (F1) and second (F2) formants in various locations of formant space that had an overall average of 0 Hz. However, when perturbations were applied that had a mean of +100 Hz in F1 and −125 Hz in F2, speakers demonstrated reliable compensatory responses that reflected the average magnitude of the applied perturbations. In Experiment 2, speakers did not significantly compensate for perturbations of varying magnitudes that were held constant for one and three trials at a time. Speakers’ average productions did, however, significantly deviate from a control condition when perturbations were held constant for six trials. Within the context of these conditions, our findings provide evidence that the control of speech movements is, at least in part, dependent upon the reliability and stability of the sensory information that it receives over time.
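
To illustrate the two manipulation designs, the sketch below generates example perturbation schedules: an Experiment-1-style schedule that draws a fresh random (F1, F2) offset on every utterance (zero-mean versus biased toward +100 Hz in F1 and -125 Hz in F2), and an Experiment-2-style schedule in which each offset is held constant for a fixed number of trials. The sampling distribution and spread are assumptions for illustration, not the authors' exact protocol.

```python
# Illustrative sketch only: per-trial formant perturbation schedules of the
# kind described above. Offset distributions and spreads are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def random_schedule(n_trials, f1_mean=0.0, f2_mean=0.0, spread=125.0):
    """Experiment-1-style: a new (F1, F2) offset in Hz on every utterance,
    centred on (f1_mean, f2_mean)."""
    f1 = f1_mean + rng.uniform(-spread, spread, n_trials)
    f2 = f2_mean + rng.uniform(-spread, spread, n_trials)
    return np.stack([f1, f2], axis=1)

def blocked_schedule(n_trials, hold=6, spread=125.0):
    """Experiment-2-style: each offset is held constant for `hold`
    consecutive trials before a new one is drawn."""
    n_blocks = int(np.ceil(n_trials / hold))
    blocks = rng.uniform(-spread, spread, (n_blocks, 2))
    return np.repeat(blocks, hold, axis=0)[:n_trials]

zero_mean = random_schedule(100)                                # mean near 0 Hz
biased    = random_schedule(100, f1_mean=100.0, f2_mean=-125.0) # biased condition
held_six  = blocked_schedule(100, hold=6)                       # held for 6 trials
print(zero_mean.mean(axis=0), biased.mean(axis=0), held_six.shape)
```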

TL;DR: It was found in both experimental setups (models trained with 300 ms and 1000 ms input windows) that traditional audio feature representations did not generalize as well as deep-learning-based representations on out-of-domain speech audio.
Abstract: We repeated the same set of experiments as described in Section 5, but with a larger input window of 1000 ms instead of 300 ms. As we can see in Table 1, the results across all the models and features follow the same pattern, with an overall improvement at the cost of an increase in the number of parameters and inference time per model. The inference time of these models makes them impractical for real-time applications, such as interactive avatars in video games or telecommunications. In both experimental setups (300 ms and 1000 ms input windows), we also found that traditional audio feature representations such as phonemes and MFCCs did not generalize as well as deep-learning-based representations on out-of-domain speech audio. A comparison of the resulting animations is shown in the supplementary video.
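
As a rough illustration of the window-size trade-off discussed above, the sketch below extracts MFCC features (the traditional baseline mentioned in the abstract) from 300 ms and 1000 ms windows of dummy audio and reports the number of frames and extraction time. The sample rate, MFCC settings, and timing method are assumptions; the paper's inference-time figures come from the full models, not from feature extraction alone.

```python
# Rough sketch under assumed settings: compare the audio context and
# per-window feature-extraction cost for 300 ms vs 1000 ms inputs.
import time
import torch
import torchaudio

sample_rate = 16000
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)

for window_ms in (300, 1000):
    n_samples = int(sample_rate * window_ms / 1000)
    audio = torch.randn(1, n_samples)        # stand-in for a speech window
    start = time.perf_counter()
    feats = mfcc(audio)                      # (1, n_mfcc, n_frames)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{window_ms} ms window -> {feats.shape[-1]} MFCC frames, "
          f"extraction ~{elapsed:.1f} ms")
```

Longer windows give the models more context (hence the overall improvement reported in Table 1), but every stage downstream of feature extraction also scales with the larger input, which is what drives the parameter and inference-time increase noted above.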