A paper that adds interactive motion control to real-time video generation, which could accelerate industry applications such as AI game generation.
The ControlNet approach is replaced with a track head based on sinusoidal embeddings → 40x faster.
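As a rough illustration of what a sinusoidal embedding looks like, here is a minimal sketch that encodes a scalar (for example a track coordinate or a track identifier) as sin/cos features at geometrically spaced frequencies. The function names, the `dim` and `max_period` parameters, and the choice to embed x and y separately are all assumptions made for illustration; the note does not say how the paper's track head builds its embeddings.

```python
import math

def sinusoidal_embedding(value, dim, max_period=10000.0):
    """Encode one scalar as `dim` interleaved sin/cos features (hypothetical helper)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    features = []
    for i in range(half):
        # Frequencies decay geometrically from 1 down towards 1 / max_period.
        freq = max_period ** (-i / half)
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features

def embed_track_point(x, y, dim=8):
    """Concatenate the embeddings of x and y (an assumed layout, not the paper's)."""
    return sinusoidal_embedding(x, dim) + sinusoidal_embedding(y, dim)

# A point at the origin maps to alternating sin(0) = 0 and cos(0) = 1.
origin = embed_track_point(0.0, 0.0, dim=4)
```

Because the embedding is a closed-form function of its input, it needs no separate conditioning network, which is one plausible reason such a head can be cheaper than a ControlNet-style branch.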
- Handles scene changes poorly because the attention sink is fixed.
- Unrealistic or abrupt trajectories cause distortion.
- The smaller backbone (Wan 2.1 1.3B) preserves structure better than the larger one (Wan 2.2 5B).
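To make the attention-sink point above more concrete, the sketch below builds a generic causal attention mask in which every query can see a few fixed "sink" tokens at the start of the sequence plus a sliding window of recent tokens. This is a common streaming-attention pattern, not necessarily the paper's exact configuration; `sink_window_mask`, `num_sink`, and `window` are hypothetical names.

```python
def sink_window_mask(num_tokens, num_sink, window):
    """Return mask[q][k] = True where query q may attend to key k."""
    mask = []
    for q in range(num_tokens):
        row = []
        for k in range(num_tokens):
            causal = k <= q               # never attend to the future
            is_sink = k < num_sink        # fixed tokens from the start of the stream
            is_recent = (q - k) < window  # sliding window of recent tokens
            row.append(causal and (is_sink or is_recent))
        mask.append(row)
    return mask

mask = sink_window_mask(num_tokens=6, num_sink=1, window=2)
# The last query sees the sink (token 0) and the two most recent tokens (4 and 5).
last_row = mask[5]
```

Under this pattern the sink tokens never change, so after a scene change every later query still attends to content from the original scene. That is one plausible reading of why a fixed sink hurts at scene changes.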
About 70% of MotionStream's total latency (≈ 0.39 s) comes from VAE decoding, which is the bottleneck and the reason a Tiny VAE is used.
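The arithmetic behind that latency figure can be sketched as follows. The 0.39 s total and the 70% share come from the note; the `decoder_speedup` factor is purely hypothetical and only shows how total latency would shrink if decoding alone got faster.

```python
TOTAL_LATENCY_S = 0.39  # total latency quoted in the note
VAE_SHARE = 0.70        # approximate share spent in VAE decoding

def latency_split(total_s, vae_share):
    """Split total latency into (VAE decoding, everything else)."""
    vae_s = total_s * vae_share
    return vae_s, total_s - vae_s

def latency_with_faster_decoder(total_s, vae_share, decoder_speedup):
    """Total latency if only the decoder is sped up by a hypothetical factor."""
    vae_s, other_s = latency_split(total_s, vae_share)
    return other_s + vae_s / decoder_speedup

vae_s, other_s = latency_split(TOTAL_LATENCY_S, VAE_SHARE)
print(f"VAE decoding: {vae_s:.3f} s, everything else: {other_s:.3f} s")
```

With these numbers, decoding accounts for roughly 0.27 s and everything else for roughly 0.12 s, so a faster decoder is the single change with the most room to reduce latency.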

Seonglae Cho