VGGSfM: Visual Geometry Grounded Deep Structure From Motion
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Inference framework for 1-bit LLMs
Capable of understanding text, audio, vision, and video
The official PyTorch implementation of Google's Gemma models
Inference code for scalable emulation of protein equilibrium ensembles
CodeGeeX2: A More Powerful Multilingual Code Generation Model
Chat & pretrained large vision language model
Repo of the Qwen2-Audio chat & pretrained large audio language model
Implementation of "MobileCLIP" (CVPR 2024)
VMZ: Model Zoo for Video Modeling
High-resolution models for human tasks
Video understanding codebase from FAIR for reproducing video models
Towards Real-World Vision-Language Understanding
CLIP: Predict the most relevant text snippet given an image
Ling is an MoE LLM provided and open-sourced by InclusionAI
A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
Multimodal Diffusion with Representation Alignment
Personalize Any Characters with a Scalable Diffusion Transformer
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Official code for Style Aligned Image Generation via Shared Attention