Sam3D

Creator: Seonglae Cho
Created: 2025 Dec 17 12:08
Edited: 2025 Dec 17 12:26
For human 3D models it's not that good, but for objects it's amazing. The most impressive part is that it recovers accurate relative object positions from a single image.

Camera-relative layout

Generates 3D shape/geometry, texture, and camera-relative layout (6D rotation R, translation t, scale s) from just a single image plus an object mask M
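
A minimal sketch of that input/output contract, assuming hypothetical names (`reconstruct_object`, `Layout`); the released model's actual API will differ:

```python
# Minimal I/O sketch: one RGB image + one object mask in, textured mesh
# + camera-relative layout out. Names are hypothetical, not the real API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Layout:
    R: np.ndarray  # (3, 3) camera-relative rotation
    t: np.ndarray  # (3,) translation in camera coordinates
    s: float       # object scale

def reconstruct_object(image: np.ndarray, mask: np.ndarray):
    """image: (H, W, 3) RGB; mask: (H, W) bool selecting one object."""
    assert image.shape[:2] == mask.shape
    # Stage 1 (geometry model): coarse shape + pose/layout from the masked image.
    # Stage 2 (texture & refinement model): high-res detail on the coarse voxels.
    vertices = np.zeros((0, 3))               # placeholder mesh geometry
    faces = np.zeros((0, 3), dtype=np.int64)  # placeholder topology
    layout = Layout(R=np.eye(3), t=np.zeros(3), s=1.0)
    return (vertices, faces), layout
```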

Breaking the 3D "data barrier"

Since 3D ground truth for natural images is scarce, the approach uses an LLM-style multi-stage recipe: synthetic pretraining → semi-synthetic mid-training (render-paste) → real-image post-training. In particular, instead of having humans author meshes directly, a model-in-the-loop (MITL) + human pipeline builds large-scale datasets by having annotators select, rank, and pose-fit model-generated candidates (sketched below). It also proposes and releases new "in-the-wild" 3D benchmarks such as SA-3DAO (1K artist-made ground truth)
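
What one MITL collection round could look like; every name here (`mitl_round`, `model.sample`, `annotate`) is a hypothetical illustration, not the paper's actual tooling:

```python
# One hypothetical MITL round: the model proposes candidates, humans
# select/rank them, and the results feed SFT targets and DPO pairs.
def mitl_round(model, examples, annotate, num_candidates=4):
    sft_data, preference_pairs = [], []
    for image, mask in examples:
        candidates = [model.sample(image, mask) for _ in range(num_candidates)]
        ranking = annotate(image, mask, candidates)  # human: best-to-worst indices
        if not ranking:
            continue  # annotator rejected all candidates
        best = candidates[ranking[0]]
        sft_data.append((image, mask, best))  # best candidate -> SFT target
        for idx in ranking[1:]:
            # (preferred, rejected) pairs -> DPO preference data
            preference_pairs.append((image, mask, best, candidates[idx]))
    return sft_data, preference_pairs
```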

Architecture

  1. Geometry model first predicts coarse shape + pose/layout (approximately 1.2B params, flow transformer + Mixture-of-Transformers (MoT))
  2. Texture & Refinement model then enhances detail/texture at high resolution based on the coarse voxels (approximately 600M params, sparse latent flow transformer); see the dataflow sketch after this list
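
A dataflow sketch of this two-stage cascade; `TwoStagePipeline` and the model call signatures are invented for illustration:

```python
# Two-stage cascade: coarse geometry + layout first, then sparse
# high-resolution texture/detail refinement. Names are hypothetical.
class TwoStagePipeline:
    def __init__(self, geometry_model, refinement_model):
        self.geometry = geometry_model  # ~1.2B flow transformer + MoT
        self.refine = refinement_model  # ~600M sparse latent flow transformer

    def __call__(self, image, mask):
        # Stage 1: coarse voxel occupancy plus camera-relative layout.
        coarse_voxels, layout = self.geometry(image, mask)
        # Stage 2: refine only occupied voxels, so compute scales with the
        # object's surface rather than a full dense grid.
        mesh = self.refine(image, mask, coarse_voxels)
        return mesh, layout
```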

Training

  1. Pretraining: large-scale training on Iso-3DO (rendered), built from Objaverse-XL and similar sources (e.g., 2.5T tokens mentioned)
  2. Mid-training: learning occlusion/scene robustness on RP-3DO, which composites 3D assets into natural images (e.g., 61M samples, 2.8M meshes)
  3. Post-training: iteratively applies SFT + DPO (Direct Preference Optimization) on MITL-collected data to align with real-world conditions and human-preferred shape/texture quality; see the loss sketch after this list
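
For reference, the standard DPO objective (Rafailov et al., 2023) that this stage builds on, assuming per-output log-probabilities are available; for a flow model a diffusion-DPO-style variant would be needed, and SAM 3D's exact formulation is in the paper:

```python
# Standard DPO loss: widen the log-likelihood margin between the
# human-preferred and rejected outputs, regularized toward a frozen
# reference model via beta.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """*_w: log p of the preferred output; *_l: log p of the rejected one,
    under the trained policy and the frozen reference model respectively."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Example with dummy values:
lw, ll = torch.tensor([-1.0]), torch.tensor([-2.0])
rw, rl = torch.tensor([-1.5]), torch.tensor([-1.5])
print(dpo_loss(lw, ll, rw, rl))  # < log(2) ~ 0.693: preferred is already ahead
```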
 
 
paper
demo
models
 
 
 
