
Listen, denoise, action!
Audio-driven motion synthesis with diffusion models

KTH Royal Institute of Technology​
Stockholm, Sweden

Summary

Automatically generating human motion for a given speech or music recording is a challenging task in computer animation, with applications in film and video games. We present denoising diffusion probabilistic models for audio-driven motion synthesis, trained on motion-capture datasets of dance and gesticulation.

Through extensive user studies and objective evaluation, we demonstrate that our models generate full-body gestures and dancing with state-of-the-art motion quality, with distinctive styles whose expression can be made more or less pronounced. For details, please see our SIGGRAPH 2023 paper.


In addition to strong results in audio-driven motion synthesis, we find that our proposed approach also generalises to the task of generating stylised walking motion that follows a given path. When trained on the 100STYLE dataset, our model outputs natural walking motion along an arbitrary trajectory in numerous styles, whilst adapting to the requested turns and velocity changes.

We have also released the Motorica Dance Dataset used to train our dance models. It comprises more than 6 hours of music and high-quality motion-capture recordings of dances in 8 styles:

  • ≈3 hours of Casual, Hip-Hop, Krumping, and Popping dances representing the highest-quality subset of the PSMD dataset from Transflower.

  • ≈3.5 hours of newly recorded Jazz, Charleston, Locking and Tap dancing performed by three accomplished professional dancers. 

Technical details

Our diffusion-model architecture consists of a stack of Conformers, which augment Transformer layers with convolution modules. Unlike recent diffusion models for motion, we use a translation-invariant method for encoding positional information, which generalises better to long sequences. We trained models for gesture generation, dance synthesis, and path-driven locomotion, all using the same hyperparameters except for the number of training iterations.
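To make the setup concrete, the sketch below shows what an audio-conditioned noise predictor built from Conformer-style blocks could look like in PyTorch. It is not our released code: the layer sizes, the way audio features and the diffusion step are injected, and the omission of the translation-invariant positional encoding are all simplifications, and names such as MotionDenoiser, pose_dim, and audio_dim are illustrative.

# Minimal sketch, not the authors' code: an audio-conditioned noise predictor
# built from Conformer-style blocks. Hyperparameters, the conditioning scheme,
# and the omission of the paper's translation-invariant positional encoding
# are simplifying assumptions.
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of the diffusion step t (shape [batch])."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # [batch, dim]

class ConformerBlock(nn.Module):
    """Self-attention followed by a depthwise-convolution module
    (the macaron feedforward modules are omitted here for brevity)."""
    def __init__(self, dim, heads=4, kernel_size=7):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1), nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.SiLU(), nn.Conv1d(dim, dim, 1),
        )

    def forward(self, x):                       # x: [batch, frames, dim]
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)   # [batch, dim, frames] for Conv1d
        return x + self.conv(h).transpose(1, 2)

class MotionDenoiser(nn.Module):
    """Predicts the noise in a noised pose sequence, given per-frame audio features."""
    def __init__(self, pose_dim, audio_dim, dim=256, depth=4):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + audio_dim, dim)
        self.step_proj = nn.Linear(dim, dim)
        self.blocks = nn.ModuleList(ConformerBlock(dim) for _ in range(depth))
        self.out_proj = nn.Linear(dim, pose_dim)

    def forward(self, noisy_poses, audio_feats, t):
        # noisy_poses: [batch, frames, pose_dim]; audio_feats: [batch, frames, audio_dim]
        x = self.in_proj(torch.cat([noisy_poses, audio_feats], dim=-1))
        x = x + self.step_proj(timestep_embedding(t, x.shape[-1]))[:, None, :]
        for block in self.blocks:
            x = block(x)
        return self.out_proj(x)                 # predicted noise, same shape as the input poses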

Dances generated by our model

When trained on the new Motorica Dance Dataset, our approach generates dances in several styles, with dynamic movements that match the music.

Videos: generated dances in the Locking, Jazz, Hiphop, and Krumping styles.

Gestures generated by our model

When trained on the Trinity Speech-Gesture Dataset, our approach generates realistic and distinctive gestures that are well synchronised with the speech.

Videos: Sequence 1, Sequence 2, Sequence 3, and Sequence 4.

Gesture style control

When trained on the ZEGGS gesture dataset, which has several different labelled styles, our approach can control gesture style expression independently of the speech. Below we show gestures synthesised from two different speech segments, using four different styles for each.

Videos: Segment 1 rendered in the Public Speaking, Happy, Angry, and Old styles.

Blending and transitioning between styles

In our paper, we also propose a new way to build product-of-experts ensembles from several diffusion models. This can, for example, be used to blend different styles (more precisely, their probability distributions), which we call guided interpolation, as well as to seamlessly transition between styles, as shown in the videos below. All side-by-side motion clips here are generated from the same random seed, but at different points along the interpolation spectrum.

Video: Guided interpolation from "high knees" to "aeroplane"

Video: Guided interpolation from "relaxed" to "angry"

Video: Style transitions
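Loosely speaking, the product-of-experts construction combines the noise (score) predictions of several diffusion models at every denoising step. The sketch below shows the simplest version of that idea for two style-specific models sharing the interface from the earlier sketch; it is a simplified reading rather than our implementation, and denoiser_a, denoiser_b, and the fixed weight w are illustrative. A weight that varies over the frames of the sequence would produce style transitions rather than a fixed blend.

# Minimal sketch of guided interpolation between two style-specific models;
# a simplified reading, not the authors' implementation.
def interpolated_noise_estimate(denoiser_a, denoiser_b, noisy_poses, audio_feats, t, w):
    """w = 0 gives pure style A, w = 1 pure style B; values in between blend the two."""
    # Convex combination of the two models' noise predictions at this denoising step.
    eps_a = denoiser_a(noisy_poses, audio_feats, t)
    eps_b = denoiser_b(noisy_poses, audio_feats, t)
    return (1.0 - w) * eps_a + w * eps_b

Sampling with the same initial noise but different values of w is what yields matched side-by-side clips like those in the interpolation videos above.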

Synthesised fighting

Finally, we demonstrate a model trained to generate martial-arts motion based on user-defined combinations of combat moves with different body parts. The following video shows the model performing a sequence of punches and kicks triggered by the beats of the music.

Video: Fighting choreography synchronised to music

Citation information

@article{alexanderson2023listen,
    title={Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models},
    author={Alexanderson, Simon and Nagy, Rajmund and Beskow, Jonas and Henter, Gustav Eje},
    journal={ACM Trans. Graph.},
    volume={42},
    number={4},
    articleno={44},
    numpages={20},
    pages={44:1--44:20},
    doi={10.1145/3592458},
    year={2023}
}
