CLIP: predict the most relevant text snippet given an image
Audio foundation model excelling in audio understanding
Production-tested AI infrastructure tools
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Open Source Speech Language Model
Multimodal embedding and reranking models built on Qwen3-VL
Video understanding codebase from FAIR for reproducing video models
A Family of Open Foundation Models for Code Intelligence
A repository of trained models
Versatile 8B-base multimodal LLM, flexible foundation for custom AI