A unified model that simultaneously reconstructs and tracks 3D+time (4D) scenes from a single video. It uses transformer attention with queries of the form "where is this pixel in 3D space at a specific time and camera viewpoint?" and is reported to be 18-300× faster than existing methods (reported example: ~1 minute of video processed in ~5 seconds). Given an input video, the model can answer: for a pixel in a source frame at time t_src, what 3D position does it correspond to at target time t_tgt, expressed in the camera coordinate frame of time t_cam?
1) Global Scene Representation
- Process the video once through a global self-attention encoder (video ViT family) to create a global latent (the encoder output), which is then kept fixed for all subsequent queries (sketched below).
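A minimal sketch of this one-shot encoding, assuming a plain ViT-style encoder with global self-attention over all space-time patch tokens; the patchification scheme, omitted positional embeddings, and layer sizes are illustrative guesses, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GlobalVideoEncoder(nn.Module):
    """Encode the whole clip once; the resulting latent is reused by every query."""

    def __init__(self, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        # Simple per-frame patchification; the paper's exact space-time
        # patch scheme and positional embeddings are omitted for brevity.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        # Global self-attention over all space-time tokens at once.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video):                       # video: (T, 3, H, W)
        tok = self.patch_embed(video)               # (T, dim, H/p, W/p)
        tok = tok.flatten(2).transpose(1, 2)        # (T, N, dim) per-frame tokens
        tok = tok.reshape(1, -1, tok.shape[-1])     # (1, T*N, dim) one sequence
        return self.encoder(tok)                    # global latent, kept fixed

# latent = GlobalVideoEncoder()(torch.randn(8, 3, 224, 224))  # (1, 8*196, 384)
```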
2) Query Definition (Key Point)
Query is a single token built by summing embeddings (a construction sketch follows this list):
- Fourier features for the pixel location (u, v)
- Learned discrete embeddings for the times t_src, t_tgt, and t_cam
- Embedding of the local 9×9 RGB patch centered at the query pixel (the ablation study shows this is crucial)
Note: t_cam is the reference camera coordinate system in which the output 3D position is expressed, not a viewpoint to render from. This enables changing reference frames and world-coordinate tracking without explicit pose at query time.
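A minimal sketch of the query-token construction under the assumptions above; whether the three times share one embedding table, the number of Fourier bands, and the projection layers are all illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class QueryToken(nn.Module):
    """Build one query token by summing the embeddings listed above."""

    def __init__(self, dim=384, n_frames=64, bands=16):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(bands))  # Fourier bands
        self.pix_proj = nn.Linear(4 * bands, dim)    # sin/cos for u and for v
        self.time_emb = nn.Embedding(n_frames, dim)  # one table for all 3 times (assumption)
        self.patch_proj = nn.Linear(9 * 9 * 3, dim)  # local 9x9 RGB patch

    def forward(self, uv, t_src, t_tgt, t_cam, patch):
        # uv: (B, 2) normalized pixel coords; times: (B,) long tensors;
        # patch: (B, 9, 9, 3) RGB crop around the query pixel
        ang = uv[:, :, None] * self.freqs                            # (B, 2, bands)
        pix = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)   # (B, 4*bands)
        tok = self.pix_proj(pix)
        tok = tok + self.time_emb(t_src) + self.time_emb(t_tgt) + self.time_emb(t_cam)
        tok = tok + self.patch_proj(patch.flatten(1))
        return tok                                                   # (B, dim)

# tok = QueryToken()(torch.rand(4, 2), torch.zeros(4, dtype=torch.long),
#                    torch.full((4,), 5), torch.full((4,), 5), torch.rand(4, 9, 9, 3))
```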
3) Decoder: Independent Queries, Cross-Attention Only
- In the decoder, each query cross-attends to the global latent to regress that point's 3D position.
- Queries do NOT self-attend to each other. The authors explicitly state that enabling self-attention between queries significantly degraded performance. (This independence is the key to parallelism, speed, and stability.)
Why is it fast? It abandons dense per-frame decoding and queries only the points that are needed; see the sketch below.
D4RT
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
https://d4rt-paper.github.io/
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
https://arxiv.org/abs/2512.08924

D4RT: Unified, Fast 4D Scene Reconstruction & Tracking
Meet D4RT, a unified AI model for 4D scene reconstruction and tracking.
https://deepmind.google/blog/d4rt-teaching-ai-to-see-the-world-in-four-dimensions/

Seonglae Cho