LeJustDance - Music-conditioned dance generation

Music-conditioned dance generation

Julien

Hai

Emile

Louis

Dance Is A Hard Sequence Problem

Improvising to unheard music is future-motion prediction in disguise.

Temporal coherence

A dancer must keep style in memory, react to new musical cues, and anticipate the next motion.

Body dynamics

Steps must remain balanced, grounded, joint-consistent, and physically believable.

Coordination and skill

Even trained humans struggle to align rhythm, intention, and full-body control in real time.

Can AI models learn to do it?

Music-Aware Motion Representations

Music as Action

\(a_t\): the music segment that drives the next movement

Position as state

\(s_t\): sequence of 3D body poses

Embed

Predict Motion Embeddings

abstract pose into a smoother, predictive target

Coordinates are a fragile target

3D coordinates are pose-dependent, local, and sensitive to parameterization choices.

We need a motion representation that makes music actions predictable.

How do we learn a good motion representation for music actions?

Dataset: Paired Music and Motion Windows

Window layout: context \(s_t\), music action \(a_t\), future target \(s_{t+1}\); the action drives the transition \(s_t \rightarrow s_{t+1}\).

AIST++ / SMPL motion

paired choreography, music audio, and 3D body motion aligned on a shared timeline

Training examples

sample temporal windows: current state chunk, next music chunk, future state chunk

Protocol

Hyperparameters: fps 30, horizon 60, segment length 60, no filtering rules
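The window sampling described above can be sketched as follows. This is a minimal illustration, not the project's loader: the function name `sample_window`, the feature sizes, and the reading of "segment length" and "horizon" as frame counts are all assumptions, and random arrays stand in for AIST++ data.

```python
import numpy as np

def sample_window(motion, music, start, segment_len=60, horizon=60):
    """Cut one (context, music action, future target) triple.

    motion: (T, D_pose) per-frame pose features.
    music:  (T, D_audio) per-frame audio features on the same timeline.
    segment_len and horizon are assumed to be counted in frames.
    """
    mid = start + segment_len
    end = mid + horizon
    if start < 0 or end > len(motion) or end > len(music):
        raise ValueError("window does not fit inside the sequence")
    s_t = motion[start:mid]    # current state chunk
    a_t = music[mid:end]       # next music chunk, drives the transition
    s_next = motion[mid:end]   # future state chunk
    return s_t, a_t, s_next

rng = np.random.default_rng(0)
T = 300                              # 10 s of frames at 30 fps
motion = rng.normal(size=(T, 72))    # 24 SMPL joints x 3 axis-angle values
music = rng.normal(size=(T, 35))     # placeholder audio feature size
start = int(rng.integers(0, T - 120 + 1))
s_t, a_t, s_next = sample_window(motion, music, start)
```

Because the music chunk and the future state chunk cover the same frames, the model always sees the music it is supposed to dance to, never the motion it has to predict.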

Li et al., AI Choreographer: Music Conditioned 3D Dance Generation with AIST++, ICCV 2021.
Loper et al., SMPL: A Skinned Multi-Person Linear Model, SIGGRAPH Asia 2015.

JEPA to the Rescue! Predicting Motion in Latent Space

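Training diagram: the encoder \(f_{\theta}\) maps \(s_t\) to \(z_t\) and \(s_{t+1}\) to \(z_{t+1}\); the predictor \(g_{\phi}\) maps \(z_t\) and \(a_t\) to \(\hat{z}_{t+1}\); the cost \(C\) compares \(\hat{z}_{t+1}\) with \(z_{t+1}\).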
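A forward pass through the training diagram can be sketched as follows. Everything concrete here is an assumption made for illustration: linear maps stand in for \(f_{\theta}\) and \(g_{\phi}\), each window is treated as a single feature vector, and \(C\) is taken to be a mean squared error in latent space. Gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STATE, D_ACTION, D_LATENT = 72, 35, 16   # placeholder sizes

# Hypothetical parameters: theta for the encoder, phi for the predictor.
W_enc = rng.normal(scale=0.1, size=(D_STATE, D_LATENT))
W_pred = rng.normal(scale=0.1, size=(D_LATENT + D_ACTION, D_LATENT))

def encode(s):
    """f_theta: state -> latent."""
    return np.tanh(s @ W_enc)

def predict(z, a):
    """g_phi: (latent, music action) -> predicted next latent."""
    return np.concatenate([z, a], axis=-1) @ W_pred

def jepa_cost(s_t, a_t, s_next):
    """C: distance between predicted and encoded next latent (MSE assumed)."""
    z_t = encode(s_t)
    z_next = encode(s_next)        # target comes from the same encoder
    z_hat = predict(z_t, a_t)
    return float(np.mean((z_hat - z_next) ** 2))

batch = 8
s_t = rng.normal(size=(batch, D_STATE))
a_t = rng.normal(size=(batch, D_ACTION))
s_next = rng.normal(size=(batch, D_STATE))
cost = jepa_cost(s_t, a_t, s_next)
```

Note that this cost alone can be driven to zero by an encoder that outputs a constant, which is why the collapse issue listed under Architectural Choices matters.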

Inference: Autoregressive Music Rollout

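Inference diagram: \(f_{\theta}\) encodes \(s_t\) into \(z_t\); \(g_{\phi}\) predicts \(\hat{z}_{t+1}\) from \(z_t\) and \(a_t\); the decoder \(d_{\psi}\) maps \(\hat{z}_{t+1}\) to the pose \(\hat{s}_{t+1}\).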
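The rollout loop can be sketched as follows. This is an illustration under stated assumptions: random linear maps stand in for trained \(f_{\theta}\), \(g_{\phi}\), and \(d_{\psi}\), and the loop feeds the predicted latent straight back into the predictor. Re-encoding each decoded pose instead would be an equally plausible reading of the diagram.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STATE, D_ACTION, D_LATENT = 72, 35, 16   # placeholder sizes

# Hypothetical stand-ins for trained weights.
W_enc = rng.normal(scale=0.1, size=(D_STATE, D_LATENT))
W_pred = rng.normal(scale=0.1, size=(D_LATENT + D_ACTION, D_LATENT))
W_dec = rng.normal(scale=0.1, size=(D_LATENT, D_STATE))

def encode(s):
    return np.tanh(s @ W_enc)                        # f_theta

def predict(z, a):
    return np.tanh(np.concatenate([z, a]) @ W_pred)  # g_phi

def decode(z):
    return z @ W_dec                                 # d_psi

def rollout(s_0, music_actions):
    """Generate one pose per music action, starting from the seed state s_0."""
    z = encode(s_0)
    poses = []
    for a in music_actions:
        z = predict(z, a)          # advance one step in latent space
        poses.append(decode(z))    # map the predicted latent back to a pose
    return np.stack(poses)

s_0 = rng.normal(size=D_STATE)
music_actions = rng.normal(size=(5, D_ACTION))
generated = rollout(s_0, music_actions)   # shape (5, D_STATE)
```

Errors compound over steps in any autoregressive loop like this one, so the quality of long rollouts depends on how stable the predictor is on its own outputs.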

Architectural Choices

Variance collapse: VICReg instead of SIGReg

RNN replaced with transformer
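The anti-collapse part of VICReg mentioned above can be sketched as follows. This is a minimal version of the variance and covariance terms as defined in the VICReg paper, not the project's training code. The function name and the test batches are illustrative, and in this setting the invariance term would presumably be the latent prediction cost.

```python
import numpy as np

def vicreg_regularizer(z, gamma=1.0, eps=1e-4):
    """Variance and covariance terms for a batch of embeddings z: (batch, dim)."""
    n, d = z.shape
    z = z - z.mean(axis=0)
    # Variance term: hinge that pushes each dimension's std up to gamma.
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    variance_term = float(np.mean(np.maximum(0.0, gamma - std)))
    # Covariance term: penalize off-diagonal covariance to decorrelate dims.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance_term = float(np.sum(off_diag ** 2) / d)
    return variance_term, covariance_term

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 16))      # spread-out embeddings
collapsed = np.full((256, 16), 0.5)       # every sample maps to the same point
v_healthy, c_healthy = vicreg_regularizer(healthy)
v_collapsed, c_collapsed = vicreg_regularizer(collapsed)
```

A collapsed batch has near-zero standard deviation in every dimension, so its variance term sits close to `gamma`, while a well-spread batch scores close to zero.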


Thank You