Open Access · Posted Content

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Simon Alexanderson, +3 more
- 17 Nov 2022 - 
- arXiv: abs/2211.09707
TL;DR
In this article, the authors adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power, and demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression.
Abstract
Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.
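Both forms of guidance mentioned in the abstract act at sampling time on the denoiser's output. Below is a minimal sketch of that pattern, assuming a noise-predicting network `eps_model(x_t, t, audio, style)`; the function and argument names are placeholders rather than the authors' released code. `gamma` plays the role of the classifier-free guidance scale that strengthens or weakens the stylistic expression, and the second function shows the product-of-experts generalisation in which several weighted style conditions are combined, e.g. for style interpolation.

```python
import torch

def guided_eps(eps_model, x_t: torch.Tensor, t, audio, style, gamma: float = 1.0) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditioned prediction
    towards the style-conditioned one; gamma > 1 exaggerates the style,
    gamma < 1 tones it down."""
    eps_uncond = eps_model(x_t, t, audio=audio, style=None)   # style condition dropped
    eps_style = eps_model(x_t, t, audio=audio, style=style)   # full conditioning
    return eps_uncond + gamma * (eps_style - eps_uncond)

def product_of_experts_eps(eps_model, x_t: torch.Tensor, t, audio, styles, weights) -> torch.Tensor:
    """Guidance with several style 'experts': each conditioned prediction is
    weighted and combined relative to the shared unconditioned prediction."""
    eps_uncond = eps_model(x_t, t, audio=audio, style=None)
    eps = eps_uncond.clone()
    for style, w in zip(styles, weights):
        eps = eps + w * (eps_model(x_t, t, audio=audio, style=style) - eps_uncond)
    return eps
```

With two styles and weights that sum to the overall guidance strength, the combined prediction interpolates between the two stylistic expressions; unequal weights bias the result towards one style.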


Citations
Journal Article

GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

Tenglong Ao, +1 more
- 26 Mar 2023 - 
TL;DR: The authors propose GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control using a contrastive language-image pre-training (CLIP) model.
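As a rough, hedged illustration of the CLIP-latent idea (not the GestureDiffuCLIP implementation), a natural-language style prompt can be embedded with a pretrained CLIP text encoder and the resulting vector used as a style conditioning signal for a gesture generator:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    tokens = tokenizer(["an angry, emphatic speaking style"], return_tensors="pt")
    style_latent = clip.get_text_features(**tokens)  # shape (1, 512)

# `style_latent` would then be injected into the gesture model, e.g. via
# cross-attention or feature-wise modulation (an assumption, not the paper's exact design).
```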
Journal Article

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

TL;DR: In this article, a quantization-based and phase-guided motion-matching framework is proposed to address the random jitter of human motion and the asynchrony between speech and gestures in speech-driven gesture generation.

NAP: Neural 3D Articulation Prior (Supplementary Material)

TL;DR: Neural 3D Articulation Prior (NAP) uses a graph-attention denoising network to learn the reverse diffusion process for generating articulated objects.
Journal Article

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

TL;DR: Zhang et al. propose a scene-conditioned diffusion method to estimate plausible 3D human pose and shape of social partners from the egocentric view, addressing the severe body truncation caused by close social distances.
Journal Article

HumanMAC: Masked Motion Completion for Human Motion Prediction

TL;DR: Wang et al. propose a masked completion framework for human motion prediction that needs only one loss in optimization and is trained in an end-to-end manner, in contrast to previous encoding-decoding approaches that are harder to apply to real-world tasks.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
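For reference, the update rule the TL;DR alludes to is compact enough to sketch directly; the defaults below are the ones recommended in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (first
    moment) and squared gradient (second moment), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```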
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
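The core operation behind that architecture, scaled dot-product attention, can be sketched in a few lines (a minimal illustration, not the full multi-head Transformer layer):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for batched query/key/value tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ v                              # weighted sum of value vectors
```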
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Book

Information Theory, Inference and Learning Algorithms

TL;DR: A fun and exciting textbook on the mathematics underpinning the most dynamic areas of modern science and engineering.
Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.