
Listen, denoise, action!
Audio-driven motion synthesis with diffusion models

KTH Royal Institute of Technology
Stockholm, Sweden

Summary

Automatically generating human motion for a given speech or music recording is a challenging task in computer animation, with applications in film and video games. We present denoising diffusion probabilistic models for audio-driven motion synthesis, trained on motion-capture datasets of dance and gesticulation.
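
Below is a minimal sketch of the underlying idea: a conditional denoising diffusion model starts from Gaussian noise and iteratively refines it into a pose sequence, guided by audio features at every step. The denoiser network, the linear noise schedule, and the pose dimensionality are illustrative placeholders rather than the exact components of our system.

import torch

def sample_motion(denoiser, audio_features, n_steps=1000, pose_dim=63):
    """Reverse diffusion: start from noise and iteratively denoise into motion."""
    n_frames = audio_features.shape[0]            # one pose per audio frame
    betas = torch.linspace(1e-4, 0.02, n_steps)   # illustrative linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(n_frames, pose_dim)           # pure Gaussian noise
    for t in reversed(range(n_steps)):
        # The (hypothetical) denoiser predicts the noise in x, conditioned on
        # the diffusion step and the audio features.
        eps_hat = denoiser(x, t, audio_features)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # standard DDPM update
    return x                                      # pose sequence synchronised with the audio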

Through extensive user studies and objective evaluation, we demonstrate that our models generate full-body gestures and dancing with state-of-the-art motion quality, with distinctive styles whose expression can be made more or less pronounced. For details, please see our SIGGRAPH 2023 paper.

In addition to strong results in audio-driven motion synthesis, we find that our proposed approach also generalises to the task of generating stylised walking motion that follows a given path. When trained on the 100STYLE dataset, our model outputs natural walking motion along an arbitrary trajectory in numerous styles, whilst adapting to the requested turns and velocity changes.

We have also released the Motorica Dance Dataset used to train our dance models. It comprises more than 6 hours of music paired with high-quality motion-capture recordings of dancing in 8 styles:

  • ≈3 hours of Casual, Hip-Hop, Krumping, and Popping dances representing the highest-quality subset of the PSMD dataset from Transflower.

  • ≈3.5 hours of newly recorded Jazz, Charleston, Locking and Tap dancing performed by three accomplished professional dancers. 

Technical details

Our diffusion-model architecture consists of a stack of Conformers, which replace the feedforward networks in Transformers with convolutional layers. Unlike recent diffusion models for motion, we use a translation-invariant method for encoding positional information, which generalises better to long sequences. We trained models for gesture generation, dance synthesis, and path-driven locomotion, all using the same hyperparameters, except for the number of training iterations.
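
The sketch below illustrates one block of such a stack, assuming the design described above: self-attention whose positional information comes from a learned, translation-invariant relative-position bias, followed by a convolution module in place of the Transformer feedforward network. The layer sizes and the precise positional-encoding scheme shown here are simplified stand-ins for those in the paper.

import torch
import torch.nn as nn

class ConformerStyleBlock(nn.Module):
    """Simplified Conformer-style block: relative-bias attention + depthwise convolution."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=5, max_rel_dist=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One learned bias per relative offset, shared across absolute positions,
        # so the positional encoding is translation-invariant.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_dist + 1))
        self.max_rel_dist = max_rel_dist
        # Depthwise temporal convolution followed by a pointwise projection,
        # taking the place of the Transformer's feedforward network.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2, groups=d_model),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, 1),
        )

    def forward(self, x):  # x: (batch, time, d_model)
        n = x.size(1)
        pos = torch.arange(n, device=x.device)
        offsets = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        bias = self.rel_bias[offsets + self.max_rel_dist]  # (time, time) additive attention bias
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=bias)
        x = x + h
        h = self.norm2(x).transpose(1, 2)                  # (batch, d_model, time) for Conv1d
        x = x + self.conv(h).transpose(1, 2)
        return x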

Dances generated by our model

When trained on the new Motorica Dance Dataset, our approach generates dances in several styles, with dynamic movements that match the music.

[Videos: Locking, Jazz, Hiphop, Krumping]

Gestures generated by our model

When trained on the Trinity Speech-Gesture Dataset, our approach generates realistic and distinctive gestures that are well synchronised with the speech.

[Videos: Sequence 1, Sequence 2, Sequence 3, Sequence 4]

Gesture style control

When trained on the ZEGGS gesture dataset, which contains several different labelled styles, our approach can control the expression of gesture style independently of the speech. Below we show gestures synthesised from two different speech segments, using four different styles for each.
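
One common way to dial style expression up or down at synthesis time is classifier-free guidance: the denoiser is queried both with and without the style label, and the difference between the two predictions is scaled. The sketch below assumes a hypothetical denoiser(x, t, audio, style) trained with the style label randomly dropped; the mechanism and scale values are illustrative, not the exact recipe from the paper.

def style_guided_noise(denoiser, x, t, audio, style, guidance_scale=2.0):
    """Classifier-free guidance over the style label.

    guidance_scale = 1.0 reproduces the ordinary style-conditioned prediction;
    larger values make the style more pronounced, values below 1.0 tone it down.
    """
    eps_style = denoiser(x, t, audio, style=style)   # style-conditioned prediction
    eps_plain = denoiser(x, t, audio, style=None)    # prediction without the style label
    return eps_plain + guidance_scale * (eps_style - eps_plain)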

[Videos: Segment 1 in Public Speaking, Happy, Angry, and Old styles]

Blending and transitioning between styles

In our paper, we also propose a new way to build product-of-experts ensembles from several diffusion models. This can, for example, be used to blend different styles (more precisely, their probability distributions), which we call guided interpolation, as well as to transition seamlessly between styles, as shown in the videos below. All side-by-side motion clips here are generated from the same random seed, but at different points along the interpolation spectrum.
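
The sketch below conveys the product-of-experts intuition in its simplest form: at every denoising step, the noise predictions of two style-specific diffusion models are mixed with an interpolation weight before the usual DDPM update. The function names are placeholders, and the plain convex combination shown here is a simplification of the ensembling scheme in the paper. Sweeping the weight from 0 to 1 with a fixed random seed produces an interpolation spectrum of the kind shown below, and varying the weight over time is one way to realise style transitions.

def blended_noise(denoiser_a, denoiser_b, x, t, audio, w=0.5):
    """Mix the noise predictions of two style-specific diffusion models.

    w = 0.0 gives pure style A, w = 1.0 pure style B; intermediate values blend
    the two distributions during sampling.
    """
    eps_a = denoiser_a(x, t, audio)   # prediction from the style-A model
    eps_b = denoiser_b(x, t, audio)   # prediction from the style-B model
    return (1.0 - w) * eps_a + w * eps_b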

[Videos: guided interpolation from "high knees" to "aeroplane", guided interpolation from "relaxed" to "angry", and style transitions]

Synthesised fighting

Finally, we demonstrate a model trained to generate martial-arts motion from user-defined combinations of combat moves involving different body parts. The video below shows a generated sequence of punches and kicks triggered by the beats of the music.

[Video: fighting choreography synchronised to music]

Citation information

@article{alexanderson2023listen,
    title={Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models},
    author={Alexanderson, Simon and Nagy, Rajmund and Beskow, Jonas and Henter, Gustav Eje},
    journal={ACM Trans. Graph.},
    volume={42},
    number={4},
    articleno={44},
    numpages={20},
    pages={44:1--44:20},
    doi={10.1145/3592458},
    year={2023}
}
