D4RT

Creator
Seonglae Cho
Created
2026 Feb 4 14:43
Editor
Edited
2026 Feb 4 14:53
Refs
A unified model that simultaneously reconstructs and tracks 3D+time (4D) scenes from a single video. It uses transformer attention with queries of the form "where is this pixel in 3D space at a specific time and camera viewpoint?", and it is 18-300× faster than existing methods (reported example: ~1 minute of video processed in ~5 seconds). Given an input video, the model can answer: for a pixel in a source frame at time t_src, what 3D position does it correspond to at target time t_tgt, expressed in the camera coordinate frame of time t_cam?
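As a rough interface sketch (every name and type below is assumed for illustration, not the paper's API), a single query bundles one pixel with three timestamps:

```python
# Hypothetical illustration only: the query tuple the model answers.
from typing import NamedTuple

class D4RTQuery(NamedTuple):
    uv: tuple[float, float]  # pixel location (u, v) in the source frame
    t_src: int               # frame index the pixel is taken from
    t_tgt: int               # time at which we want the point's 3D position
    t_cam: int               # camera whose coordinate frame the answer is expressed in

# Conceptually: model(video, D4RTQuery(...)) -> (x, y, z) in the t_cam camera frame.
```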

1) Global Scene Representation

  • Process the video once through a global self-attention encoder (video ViT family) to create a global latent (the encoder output), which is then kept fixed for all subsequent queries.
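A minimal sketch of this one-pass encoding, assuming a PyTorch video-ViT-style encoder (layer sizes are placeholders and positional/time embeddings are omitted; this is not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class GlobalVideoEncoder(nn.Module):
    """Video-ViT-style encoder: patchify every frame, then self-attend globally
    over all frames at once. Positional/time embeddings omitted for brevity."""
    def __init__(self, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, video):                          # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        x = self.patch_embed(video.flatten(0, 1))      # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)               # (B*T, N, dim)
        x = x.reshape(B, T * x.shape[1], x.shape[-1])  # one token sequence per clip
        return self.blocks(x)                          # global latent, computed once

# Usage: encode the clip once, then reuse the fixed latent for every query.
# latent = GlobalVideoEncoder()(video)                 # (B, T*N, dim)
```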

2) Query Definition (Key Point)

A query is a single token built by summing embeddings:
  • Fourier features for the pixel location (u, v)
  • Learned discrete embeddings for the times t_src, t_tgt, and t_cam
  • An embedding of the local 9×9 RGB patch centered at (u, v) (the ablation study shows this is crucial)
Note: t_cam specifies the reference camera coordinate system in which the output 3D position is expressed, not a viewpoint to render from. This enables changing reference frames and world-coordinate tracking without explicit pose at query time.
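A minimal sketch of this query construction, assuming PyTorch (the helper name fourier_feats, the band count, and the embedding sizes are my assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

def fourier_feats(uv, num_bands=16):
    """Fourier features for a pixel location uv normalized to [0, 1]^2."""
    freqs = 2.0 ** torch.arange(num_bands)                              # (num_bands,)
    angles = uv[..., None] * freqs * torch.pi                           # (..., 2, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)  # (..., 4*num_bands)

class QueryBuilder(nn.Module):
    def __init__(self, dim=768, num_frames=64, num_bands=16):
        super().__init__()
        self.pos_proj   = nn.Linear(4 * num_bands, dim)  # Fourier features -> model dim
        self.t_src_emb  = nn.Embedding(num_frames, dim)  # learned discrete time embeddings
        self.t_tgt_emb  = nn.Embedding(num_frames, dim)
        self.t_cam_emb  = nn.Embedding(num_frames, dim)
        self.patch_proj = nn.Linear(9 * 9 * 3, dim)      # local 9x9 RGB patch around (u, v)

    def forward(self, uv, t_src, t_tgt, t_cam, patch):   # patch: (..., 3, 9, 9)
        # The query token is the SUM of all component embeddings.
        return (self.pos_proj(fourier_feats(uv))
                + self.t_src_emb(t_src)
                + self.t_tgt_emb(t_tgt)
                + self.t_cam_emb(t_cam)
                + self.patch_proj(patch.flatten(-3)))
```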

3) Decoder: Independent Queries, Cross-Attention Only

  • In the decoder, each query performs cross-attention over the global latent to regress the 3D position of that point.
  • Queries do NOT self-attend to each other. The authors explicitly state that enabling self-attention between queries significantly degraded performance. (This is the key to parallelism, speed, and stability.)
"Why is it fast?": it abandons dense per-frame decoding and queries only the points that are needed; a minimal sketch of the decoder follows below.

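A minimal sketch of this cross-attention-only decoder, assuming PyTorch (depth, dimensions, and the residual/norm layout are my assumptions; only the "cross-attention, no query self-attention" structure comes from the description above):

```python
import torch.nn as nn

class CrossAttentionDecoder(nn.Module):
    def __init__(self, dim=768, heads=12, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))
        self.head = nn.Linear(dim, 3)       # regress the 3D point (x, y, z)

    def forward(self, queries, latent):     # queries: (B, Q, dim), latent: (B, L, dim)
        x = queries
        for attn, norm in zip(self.layers, self.norms):
            # Cross-attention only: queries attend to the fixed latent, never to each
            # other, so every query is decoded independently and in parallel.
            x = norm(x + attn(x, latent, latent, need_weights=False)[0])
        return self.head(x)                 # (B, Q, 3): one 3D position per query
```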