Sam3D

Creator: Seonglae Cho
Created: 2025 Dec 17 12:08
Edited: 2025 Dec 17 12:26
For human 3D models it's not that good, but for objects it's amazing. The most impressive part is that it recovers accurate relative object positions from a single image.

Camera-relative layout

Generates 3D shape/geometry, texture, and camera-relative layout (6D rotation R, translation t, scale s) from just a single image plus an object mask M
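
A minimal sketch of that input/output contract, assuming hypothetical names (`reconstruct_object`, `Layout`); the released model's actual API will differ:

```python
# Minimal I/O sketch: one RGB image + one object mask in, textured mesh
# + camera-relative layout out. Names are hypothetical, not the real API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Layout:
    R: np.ndarray  # (3, 3) camera-relative rotation
    t: np.ndarray  # (3,) translation in camera coordinates
    s: float       # object scale

def reconstruct_object(image: np.ndarray, mask: np.ndarray):
    """image: (H, W, 3) RGB; mask: (H, W) bool selecting one object."""
    assert image.shape[:2] == mask.shape
    # Stage 1 (geometry model): coarse shape + pose/layout from the masked image.
    # Stage 2 (texture & refinement model): high-res detail on the coarse voxels.
    vertices = np.zeros((0, 3))               # placeholder mesh geometry
    faces = np.zeros((0, 3), dtype=np.int64)  # placeholder topology
    layout = Layout(R=np.eye(3), t=np.zeros(3), s=1.0)
    return (vertices, faces), layout
```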

Breaking the 3D "data barrier"

Since 3D ground truth for natural images is scarce, the approach uses an LLM-style multi-stage recipe: synthetic pretraining → semi-synthetic mid-training (render-paste) → real-image post-training. In particular, instead of having humans author meshes directly, a model-in-the-loop (MITL) + human pipeline builds large-scale datasets by having annotators select, rank, and pose-fit model-generated candidates (sketched below). It also proposes and releases new "in-the-wild" 3D benchmarks such as SA-3DAO (1K artist-made ground truth)
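
What one MITL collection round could look like; every name here (`mitl_round`, `model.sample`, `annotate`) is a hypothetical illustration, not the paper's actual tooling:

```python
# One hypothetical MITL round: the model proposes candidates, humans
# select/rank them, and the results feed SFT targets and DPO pairs.
def mitl_round(model, examples, annotate, num_candidates=4):
    sft_data, preference_pairs = [], []
    for image, mask in examples:
        candidates = [model.sample(image, mask) for _ in range(num_candidates)]
        ranking = annotate(image, mask, candidates)  # human: best-to-worst indices
        if not ranking:
            continue  # annotator rejected all candidates
        best = candidates[ranking[0]]
        sft_data.append((image, mask, best))  # best candidate -> SFT target
        for idx in ranking[1:]:
            # (preferred, rejected) pairs -> DPO preference data
            preference_pairs.append((image, mask, best, candidates[idx]))
    return sft_data, preference_pairs
```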

Architecture

  1. Geometry model first predicts coarse shape + pose/layout (approximately 1.2B params, flow transformer + Mixture-of-Transformers (MoT))
  2. Texture & Refinement model then enhances detail/texture at high resolution based on the coarse voxels (approximately 600M params, sparse latent flow transformer); see the dataflow sketch after this list
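
A dataflow sketch of this two-stage cascade; `TwoStagePipeline` and the model call signatures are invented for illustration:

```python
# Two-stage cascade: coarse geometry + layout first, then sparse
# high-resolution texture/detail refinement. Names are hypothetical.
class TwoStagePipeline:
    def __init__(self, geometry_model, refinement_model):
        self.geometry = geometry_model  # ~1.2B flow transformer + MoT
        self.refine = refinement_model  # ~600M sparse latent flow transformer

    def __call__(self, image, mask):
        # Stage 1: coarse voxel occupancy plus camera-relative layout.
        coarse_voxels, layout = self.geometry(image, mask)
        # Stage 2: refine only occupied voxels, so compute scales with the
        # object's surface rather than a full dense grid.
        mesh = self.refine(image, mask, coarse_voxels)
        return mesh, layout
```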

Training

  1. Pretraining: large-scale training on Iso-3DO (rendered), built from Objaverse-XL and similar sources (e.g., 2.5T tokens mentioned)
  2. Mid-training: learning occlusion/scene robustness on RP-3DO, which composites 3D assets into natural images (e.g., 61M samples, 2.8M meshes)
  3. Post-training: iteratively applies SFT + DPO (Direct Preference Optimization) on MITL-collected data to align with real-world conditions and human-preferred shape/texture quality; see the loss sketch after this list
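
For reference, the standard DPO objective (Rafailov et al., 2023) that this stage builds on, assuming per-output log-probabilities are available; for a flow model a diffusion-DPO-style variant would be needed, and SAM 3D's exact formulation is in the paper:

```python
# Standard DPO loss: widen the log-likelihood margin between the
# human-preferred and rejected outputs, regularized toward a frozen
# reference model via beta.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """*_w: log p of the preferred output; *_l: log p of the rejected one,
    under the trained policy and the frozen reference model respectively."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Example with dummy values:
lw, ll = torch.tensor([-1.0]), torch.tensor([-2.0])
rw, rl = torch.tensor([-1.5]), torch.tensor([-1.5])
print(dpo_loss(lw, ll, rw, rl))  # < log(2) ~ 0.693: preferred is already ahead
```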
 
 
paper
demo
models
 
 
 
