Map-Anything is a universal, feed-forward transformer for metric 3D reconstruction that predicts a scene’s geometry and camera parameters directly from visual inputs. Instead of stitching together many task-specific models, it uses a single architecture that supports a wide range of 3D tasks—multi-image structure-from-motion, multi-view stereo, monocular metric depth, registration, depth completion, and more. The model flexibly accepts different input combinations (images, intrinsics, poses, sparse or dense depth) and produces a rich set of outputs including per-pixel 3D points, camera intrinsics, camera poses, ray directions, confidence maps, and validity masks. Its inference path is fully feed-forward with optional mixed-precision and memory-efficient modes, making it practical to scale to long image sequences while keeping latency predictable.
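To make the single-pass flow concrete, here is a minimal usage sketch. The class name `MapAnything`, the `from_pretrained` checkpoint id, the `infer` signature, and the output dictionary keys are all assumptions chosen for illustration; consult the repository docs for the actual API.

```python
import torch

# Hypothetical import path; the real package layout may differ.
from mapanything.models import MapAnything

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint id for illustration only.
model = MapAnything.from_pretrained("facebook/map-anything").to(device).eval()

# One dict per view; only RGB is required, other modalities are optional.
views = [
    {"img": torch.rand(1, 3, 518, 518, device=device)},  # placeholder image tensor
    {"img": torch.rand(1, 3, 518, 518, device=device)},
]

with torch.inference_mode():
    predictions = model.infer(
        views,
        use_amp=True,                     # assumed flag: mixed-precision inference
        memory_efficient_inference=True,  # assumed flag: lower peak memory on long sequences
    )

# Per-view outputs (assumed keys, mirroring the outputs listed above).
for pred in predictions:
    pts3d = pred["pts3d"]             # (B, H, W, 3) per-pixel metric 3D points
    intrinsics = pred["intrinsics"]   # (B, 3, 3) pinhole intrinsics
    pose = pred["camera_poses"]       # (B, 4, 4) camera-to-world pose
    conf = pred["conf"]               # (B, H, W) confidence map
```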
## Features
- One feed-forward transformer that covers >10 reconstruction tasks
- Multi-modal inputs (images, calibration, poses, depth) through a unified API (see the sketch after this list)
- Dense metric outputs: 3D points, depth (z-depth and along-ray), intrinsics, poses, ray directions, confidence maps, and validity masks
- Turnkey demos plus exporters to COLMAP and Gaussian splatting pipelines
- Mixed-precision and memory-efficient inference for long sequences
- Modular “building blocks” (UniCeption, WAI) to scale data and models
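To illustrate the multi-modal input handling, the sketch below shows how heterogeneous per-view inputs might be mixed in one call, reusing `model` from the earlier example. The dictionary keys (`intrinsics`, `camera_poses`, `depth_z`) are assumptions chosen to mirror the output names above; the actual accepted schema is defined by the package.

```python
import torch

H, W = 518, 518

def rand_img() -> torch.Tensor:
    """Placeholder RGB tensor standing in for a loaded, preprocessed image."""
    return torch.rand(1, 3, H, W)

# Any subset of modalities may be provided per view; the model conditions
# on whatever is available (keys are hypothetical, check the repo schema).
views = [
    {   # View 0: image + known calibration + metric camera pose
        "img": rand_img(),
        "intrinsics": torch.eye(3).unsqueeze(0),    # (1, 3, 3) placeholder K
        "camera_poses": torch.eye(4).unsqueeze(0),  # (1, 4, 4) camera-to-world
    },
    {   # View 1: image + sparse metric depth (zeros where unobserved)
        "img": rand_img(),
        "depth_z": torch.zeros(1, H, W),
    },
    {   # View 2: image only
        "img": rand_img(),
    },
]

predictions = model.infer(views)  # same unified entry point as the sketch above
```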