Music-conditioned dance generation

Julien

Hai

Emile

Louis

Dance Is A Hard Sequence Problem

Improvising to unheard music is future-motion prediction in disguise.

01

Temporal coherence

A dancer must keep style in memory, react to new musical cues, and anticipate the next motion.

02

Body dynamics

Steps must remain balanced, grounded, joint-consistent, and physically believable.

03

Coordination and skill

Even trained humans struggle to align rhythm, intention, and full-body control in real time.

Can AI models learn to do it?

Music-Aware Motion Representations

Music as Action

\(a_t\): the music segment that drives the next movement

+

Position as State

\(s_t\): the sequence of 3D body poses

Embed

Predict Motion Embeddings

abstract pose into a smoother, predictive target

Coordinates are a fragile target

3D coordinates are pose-dependent, local, and sensitive to parameterization choices.

We need a motion representation in which the effect of each music action is predictable.
How do we learn a good motion representation for music actions?

Dataset: Paired Music and Motion Windows

[Diagram: current and future motion windows, \(s_t \rightarrow s_{t+1}\). Context: \(s_t\). Music Action: \(a_t\). Future Target: \(s_{t+1}\).]

AIST++ / SMPL motion

paired choreography, music audio, and 3D body motion aligned on a shared timeline

Training examples

sample temporal windows: current state chunk, next music chunk, future state chunk

Protocol

Hyperparameters: fps 30, horizon 60, segment length 60, no filtering rules
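The windowing described above can be sketched as follows. This is a minimal illustration, not the project's data loader: it assumes motion and music are already resampled to the same frame rate, that `horizon` is the frame offset between the current and future windows, and that windows are taken with a non-overlapping stride. The function name `sample_windows` and the feature dimensions (72 and 35) are hypothetical.

```python
import numpy as np

def sample_windows(motion, music, segment_len=60, horizon=60):
    """Slice aligned per-frame arrays into (s_t, a_t, s_next) triples.

    motion: (T, D_pose) pose features, one row per frame
    music:  (T, D_audio) audio features at the same frame rate
    Assumption: `horizon` is the offset, in frames, from the start of the
    current window to the start of the future window.
    """
    T = min(len(motion), len(music))
    triples = []
    # Non-overlapping stride of one segment (an assumption for this sketch).
    for start in range(0, T - segment_len - horizon + 1, segment_len):
        future = start + horizon
        s_t = motion[start:start + segment_len]       # current state chunk
        a_t = music[future:future + segment_len]      # next music chunk
        s_next = motion[future:future + segment_len]  # future state chunk
        triples.append((s_t, a_t, s_next))
    return triples

# Toy data: 300 frames (10 s at 30 fps) with placeholder feature sizes.
rng = np.random.default_rng(0)
motion = rng.normal(size=(300, 72))
music = rng.normal(size=(300, 35))
triples = sample_windows(motion, music)
```

With `horizon` equal to `segment_len`, the future chunk of one example is exactly the current chunk of the next, so consecutive examples chain together.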

Li et al., AIST++: Dance Motion Dataset for Music Conditioned 3D Dance Generation, ICCV 2021. Loper et al., SMPL: A Skinned Multi-Person Linear Model, SIGGRAPH Asia 2015.

JEPA to the Rescue! Predicting Motion in Latent Space

[Diagram: training]
Current SMPL motion chunk \(s_t\) → \(f_{\theta}\) → \(z_t\)
\(z_t\), \(a_t\) → \(g_{\phi}\) → \(\hat{z}_{t+1}\)
Future SMPL motion chunk \(s_{t+1}\) → \(f_{\theta}\) → \(z_{t+1}\)
\(C\) between \(\hat{z}_{t+1}\) and \(z_{t+1}\)
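A forward pass through this training setup might look like the sketch below. It is only a stand-in under stated assumptions: \(f_{\theta}\) and \(g_{\phi}\) are replaced by mean pooling plus a linear map, \(C\) is assumed to be a mean squared error in latent space, and the dimensions are illustrative. The actual encoder, predictor, and cost are not specified in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D_POSE, D_MUSIC, D_LATENT = 72, 35, 16  # illustrative sizes, not from the slides

# Randomly initialised linear stand-ins for f_theta and g_phi.
W_enc = rng.normal(scale=0.1, size=(D_POSE, D_LATENT))
W_pred = rng.normal(scale=0.1, size=(D_LATENT + D_MUSIC, D_LATENT))

def encode(s):
    """f_theta stand-in: pool a (B, L, D_POSE) chunk over time, then project."""
    return s.mean(axis=1) @ W_enc

def predict(z, a):
    """g_phi stand-in: next latent from current latent and pooled music."""
    return np.concatenate([z, a.mean(axis=1)], axis=-1) @ W_pred

def jepa_cost(s_t, a_t, s_next):
    z_t = encode(s_t)
    z_next = encode(s_next)    # the same encoder is applied to the future chunk
    z_hat = predict(z_t, a_t)  # prediction conditioned on the music action
    return np.mean((z_hat - z_next) ** 2)  # assumed form of C

s_t = rng.normal(size=(8, 60, D_POSE))
a_t = rng.normal(size=(8, 60, D_MUSIC))
s_next = rng.normal(size=(8, 60, D_POSE))
cost = jepa_cost(s_t, a_t, s_next)
```

Note that the cost is computed between embeddings, never between raw poses; minimising it alone admits a constant-embedding solution, which is why a collapse-preventing regulariser is needed alongside it.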

Inference: Autoregressive Music Rollout

[Diagram: inference]
Current motion context \(s_t\) → \(f_{\theta}\) → \(z_t\)
\(z_t\), \(a_t\) → \(g_{\phi}\) → \(\hat{z}_{t+1}\)
\(\hat{z}_{t+1}\) → \(d_{\psi}\) → Decoded motion rollout \(\hat{s}_{t+1}\)
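The rollout loop can be sketched as below. One assumption is made explicit: the predicted latent is fed straight back into the predictor at the next step, rather than re-encoding the decoded motion. The slides do not say which variant is used. The `encode`, `predict`, and `decode` callables are toy placeholders for \(f_{\theta}\), \(g_{\phi}\), and \(d_{\psi}\).

```python
import numpy as np

def rollout(s_0, music_chunks, encode, predict, decode):
    """Autoregressive rollout: encode once, then step in latent space."""
    z = encode(s_0)
    decoded = []
    for a_t in music_chunks:
        z = predict(z, a_t)        # z_hat_{t+1} from z_t and the music action
        decoded.append(decode(z))  # s_hat_{t+1}
    return decoded

# Toy stand-ins so the loop can be run end to end.
D = 4
encode = lambda s: s.mean(axis=0)       # (L, D) -> (D,)
predict = lambda z, a: z + a.mean()     # shift the latent by the mean audio level
decode = lambda z: np.tile(z, (60, 1))  # (D,) -> (60, D)

s_0 = np.zeros((60, D))
music_chunks = [np.full((60, 1), 1.0), np.full((60, 1), 2.0)]
motion = rollout(s_0, music_chunks, encode, predict, decode)
```

Because each step consumes the previous prediction, errors can accumulate over long rollouts; re-encoding the decoded motion at each step is the obvious alternative to compare against.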

Architectural Choices

Variance Collapse
VICReg instead of SIGReg
RNN replaced with transformer
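For reference, the variance and covariance terms that VICReg uses against collapse can be sketched as follows. This follows the published definition (a hinge on the per-dimension standard deviation, plus a penalty on off-diagonal covariance), but the function name, the defaults, and the way these terms would be weighted against the prediction cost are assumptions, not the project's settings.

```python
import numpy as np

def vicreg_regularizer(z, gamma=1.0, eps=1e-4):
    """Variance and covariance terms for a batch of embeddings z: (N, D)."""
    n, d = z.shape
    z = z - z.mean(axis=0)
    # Variance term: hinge that pushes each dimension's std above gamma.
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    # Covariance term: penalise off-diagonal entries to decorrelate dimensions.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d
    return var_loss, cov_loss

# A collapsed batch (all embeddings identical) triggers the variance hinge.
collapsed_var, collapsed_cov = vicreg_regularizer(np.ones((8, 4)))

# A well-spread batch does not.
rng = np.random.default_rng(0)
spread_var, spread_cov = vicreg_regularizer(rng.normal(scale=2.0, size=(512, 4)))
```

The collapsed batch yields a variance penalty close to `gamma`, while the spread batch yields none, which is the signal that keeps the encoder from mapping every motion chunk to the same embedding.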

Thank You