SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts such as “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks), and returns high-quality masks, boxes, and scores for the requested concepts.

Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short phrase or exemplars, scaling to a vastly larger set of categories than traditional closed-set models. This capability is grounded in a new data engine that automatically annotated over four million unique concepts, producing a massive open-vocabulary segmentation dataset and enabling the model to reach 75–80% of human performance on the SA-CO benchmark, which itself spans 270K unique concepts.
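The prompt and output shapes described above can be sketched as plain Python data structures. This is an illustrative model only: the names (`TextPrompt`, `VisualPrompt`, `Instance`, `filter_instances`) and fields are assumptions for exposition, not the actual package API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative data shapes only -- names and fields are assumptions,
# not the real SAM 3 Python API.

@dataclass
class TextPrompt:
    """An open-vocabulary concept, e.g. "red car"."""
    phrase: str

@dataclass
class VisualPrompt:
    """Points, boxes, or a mask marking an example of the target."""
    points: List[Tuple[float, float]] = field(default_factory=list)
    box: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1

@dataclass
class Instance:
    """One segmented object: binary mask, bounding box, and confidence score."""
    mask: List[List[bool]]                    # H x W binary mask
    box: Tuple[float, float, float, float]    # x0, y0, x1, y1
    score: float

def filter_instances(instances: List[Instance], min_score: float) -> List[Instance]:
    """Keep only instances at or above a confidence threshold."""
    return [inst for inst in instances if inst.score >= min_score]
```

In practice a caller would pass a `TextPrompt` or `VisualPrompt` to the model and receive a list of `Instance`-like results, thresholding on `score` to select confident detections.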
Features
- Unified model for promptable segmentation and tracking in both images and videos using text or visual prompts
- Open-vocabulary instance segmentation that can exhaustively find all instances of a concept specified by short text or exemplars
- Massive underlying data engine with millions of automatically annotated concepts and the SA-CO benchmark for evaluation
- New architecture with a presence token to better disambiguate similar text prompts and a decoupled detector–tracker design
- Python package with APIs for inference, fine-tuning, and integration into larger applications or agents
- Rich examples and notebooks for image and video prompting, batched inference, and SA-CO evaluation workflows