Listen, denoise, action!
Audio-driven motion synthesis with diffusion models
KTH Royal Institute of Technology
Stockholm, Sweden
Summary
Automatically generating human motion for a given speech or music recording is a challenging task in computer animation, with applications in film and video games. We present denoising diffusion probabilistic models for audio-driven motion synthesis, trained on motion-capture datasets of dance and gesticulation.
Through extensive user studies and objective evaluation, we demonstrate that our models are able to generate full-body gestures and dancing with top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. For details, please see our SIGGRAPH 2023 paper.
In addition to strong results in audio-driven motion synthesis, we find that our proposed approach also generalises to the task of generating stylised walking motion that follows a given path. When trained on the 100STYLE dataset, our model outputs natural walking motion along an arbitrary trajectory in numerous styles, whilst adapting to the requested turns and velocity changes.
We have also released the Motorica Dance Dataset used to train our dance models. It comprises more than 6 hours of music and high-quality motion-capture recordings of dances in 8 styles:
- ≈3 hours of Casual, Hip-Hop, Krumping, and Popping dances, representing the highest-quality subset of the PSMD dataset from Transflower.
- ≈3.5 hours of newly recorded Jazz, Charleston, Locking, and Tap dancing performed by three accomplished professional dancers.
Technical details
Our diffusion-model architecture consists of a stack of Conformers, which augment Transformer layers with convolutional modules. Unlike recent diffusion models for motion, we use a translation-invariant method for encoding positional information, which generalises better to long sequences. We trained models for gesture generation, dance synthesis, and path-driven locomotion, all using the same hyperparameters except for the number of training iterations.
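To make the architecture more concrete, below is a minimal sketch (a simplified illustration, not our released implementation; the layer sizes, module layout, and conditioning interface are assumptions) of one Conformer-style block of the kind such a denoising network stacks, written in PyTorch. Positional information, the diffusion-step embedding, and the audio conditioning are assumed to be fused into the input features upstream of the block.

# Minimal sketch of one Conformer-style block: self-attention plus a depthwise
# 1-D convolution over time. Hyperparameters and layout are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, kernel_size: int = 7):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Depthwise temporal convolution: the module that distinguishes
        # Conformer blocks from plain Transformer blocks.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.GELU(),
            nn.Conv1d(dim, dim, 1),
        )
        self.ff_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) pose features, assumed to already carry the
        # audio features and diffusion-step embedding from earlier layers.
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)      # (batch, dim, frames)
        x = x + self.conv(h).transpose(1, 2)
        x = x + self.ff(self.ff_norm(x))
        return x

In the full model, positional information enters through a translation-invariant scheme (for instance, a relative bias on the attention logits rather than absolute positional encodings), which is what allows generalisation to sequences longer than those seen during training; the sketch above omits this.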
Dances generated by our model
When trained on the new Motorica Dance Dataset, our approach generates dances in several styles, with dynamic movements that match the music.
Locking
Gestures generated by our model
When trained on the Trinity Speech-Gesture Dataset, our approach generates realistic and distinctive gestures that are well synchronised with the speech.
Sequence 1
Gesture style control
When trained on the ZEGGS gesture dataset, which contains several different labelled styles, our approach can control the expression of the gesture style independently of the speech. Below we show gestures synthesised from two different speech segments, using four different styles for each.
Segment 1 - Public Speaking
Blending and transitioning between styles
In our paper, we also propose a new way to build product-of-experts ensembles from several diffusion models. This can, for example, be used to blend different styles (more precisely, probability distributions over motion), which we call guided interpolation, and to transition seamlessly between styles, as shown in the videos below. All side-by-side motion clips here are generated from the same random seed, but at different points along the interpolation spectrum.
Guided interpolation: "high knees" to "aeroplane"
Guided interpolation: "relaxed" to "angry"
Style transitions
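For intuition, here is a minimal sampling-time sketch of the idea (a deliberately simplified stand-in, not the paper's exact product-of-experts formulation): at every denoising step, the noise predictions of two style-specific models are combined with interpolation weights before the usual reverse-diffusion update. model_a, model_b, and ddpm_step are hypothetical callables standing in for trained denoisers and the standard DDPM update rule.

# Sketch only: blending two audio-conditioned diffusion models at sampling time.
import torch

def blended_sample(model_a, model_b, ddpm_step, audio, num_steps, shape, w=0.5):
    """Sample motion whose style interpolates between model_a (w=0) and model_b (w=1)."""
    x = torch.randn(shape)                       # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps_a = model_a(x, t, audio)             # noise prediction from style-A expert
        eps_b = model_b(x, t, audio)             # noise prediction from style-B expert
        eps = (1.0 - w) * eps_a + w * eps_b      # weighted combination of the experts
        x = ddpm_step(x, eps, t)                 # standard reverse-diffusion update
    return x

Sweeping w from 0 to 1 with a fixed random seed produces side-by-side comparisons like those above, while varying w over the course of a clip yields transitions between styles.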
Synthesised fighting
Finally, we demonstrate a model trained to generate martial-arts motion based on user-defined combinations of combat moves with different body parts. The following video shows the model performing a sequence of punches and kicks triggered by the music beats.
Fighting choreography synchronised to music
Citation information
@article{alexanderson2023listen,
title={Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models},
author={Alexanderson, Simon and Nagy, Rajmund and Beskow, Jonas and Henter, Gustav Eje},
journal={ACM Trans. Graph.},
volume={42},
number={4},
articleno={44},
numpages={20},
pages={44:1--44:20},
doi={10.1145/3592458},
year={2023}
}