Open Access · Posted Content

Listen, denoise, action! Audio-driven motion synthesis with diffusion models

Simon Alexanderson, +3 more
- 17 Nov 2022 - 
- arXiv: abs/2211.09707
TL;DR
In this article, the authors adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power, and demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression.
Abstract
Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.
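Both forms of guidance mentioned in the abstract act at sampling time on the denoiser's output. Below is a minimal sketch of that pattern, assuming a noise-predicting network `eps_model(x_t, t, audio, style)`; the function and argument names are placeholders rather than the authors' released code. `gamma` plays the role of the classifier-free guidance scale that strengthens or weakens the stylistic expression, and the second function shows the product-of-experts generalisation in which several weighted style conditions are combined, e.g. for style interpolation.

```python
import torch

def guided_eps(eps_model, x_t: torch.Tensor, t, audio, style, gamma: float = 1.0) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditioned prediction
    towards the style-conditioned one; gamma > 1 exaggerates the style,
    gamma < 1 tones it down."""
    eps_uncond = eps_model(x_t, t, audio=audio, style=None)   # style condition dropped
    eps_style = eps_model(x_t, t, audio=audio, style=style)   # full conditioning
    return eps_uncond + gamma * (eps_style - eps_uncond)

def product_of_experts_eps(eps_model, x_t: torch.Tensor, t, audio, styles, weights) -> torch.Tensor:
    """Guidance with several style 'experts': each conditioned prediction is
    weighted and combined relative to the shared unconditioned prediction."""
    eps_uncond = eps_model(x_t, t, audio=audio, style=None)
    eps = eps_uncond.clone()
    for style, w in zip(styles, weights):
        eps = eps + w * (eps_model(x_t, t, audio=audio, style=style) - eps_uncond)
    return eps
```

With two styles and weights that sum to the overall guidance strength, the combined prediction interpolates between the two stylistic expressions; unequal weights bias the result towards one style.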


Citations
Journal Article

GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

Tenglong Ao, +1 more
- 26 Mar 2023 - 
TL;DR: The authors propose GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control using a contrastive language-image pre-training (CLIP) model.
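As a rough, hedged illustration of the CLIP-latent idea (not the GestureDiffuCLIP implementation), a natural-language style prompt can be embedded with a pretrained CLIP text encoder and the resulting vector used as a style conditioning signal for a gesture generator:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    tokens = tokenizer(["an angry, emphatic speaking style"], return_tensors="pt")
    style_latent = clip.get_text_features(**tokens)  # shape (1, 512)

# `style_latent` would then be injected into the gesture model, e.g. via
# cross-attention or feature-wise modulation (an assumption, not the paper's exact design).
```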
Journal Article

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

TL;DR: In this article, a quantization-based and phase-guided motion-matching framework is proposed to address the random jitter of human motion and the asynchrony between speech and gestures in speech-driven gesture generation.

NAP: Neural 3D Articulation Prior (Supplementary Material)

TL;DR: Neural 3D Articulation Prior (NAP) uses a graph-attention denoising network to learn the reverse diffusion process for generating articulated objects.
Journal Article

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

TL;DR: Zhang et al. propose a scene-conditioned diffusion method to estimate plausible 3D human pose and shape of social partners from the egocentric view, addressing the severe body truncation caused by close social distances.
Journal Article

HumanMAC: Masked Motion Completion for Human Motion Prediction

TL;DR: Wang et al. propose a masked completion framework for human motion prediction that needs only one loss in optimization and is trained in an end-to-end manner, in contrast to previous encoding-decoding approaches that are harder to apply to real-world tasks.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
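For reference, the update rule the TL;DR alludes to is compact enough to sketch directly; the defaults below are the ones recommended in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (first
    moment) and squared gradient (second moment), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```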
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
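The core operation behind that architecture, scaled dot-product attention, can be sketched in a few lines (a minimal illustration, not the full multi-head Transformer layer):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for batched query/key/value tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ v                              # weighted sum of value vectors
```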
Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Book

Information Theory, Inference and Learning Algorithms

TL;DR: A fun and exciting textbook on the mathematics underpinning the most dynamic areas of modern science and engineering.
Posted Content

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.