Inference framework for 1-bit LLMs
Implementation of "MobileCLIP" CVPR 2024
Chat & pretrained large vision-language models
Fast and universal 3D reconstruction model for versatile tasks
A PyTorch library for implementing flow matching algorithms (see the training-objective sketch after this list)
PyTorch code and models for the DINOv2 self-supervised learning method (see the feature-extraction sketch after this list)
Memory-efficient and performant finetuning of Mistral's models
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Phi-3.5 for Mac: Locally Run Vision and Language Models
Revolutionizing Database Interactions with Private LLM Technology
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
The ChatGPT Retrieval Plugin lets you easily find personal or work documents by asking questions in natural language
Tiny vision language model
The official PyTorch implementation of Google's Gemma models
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Diversity-driven optimization combined with large-model reasoning ability
A state-of-the-art open visual language model
High-resolution models for human tasks
Towards Real-World Vision-Language Understanding
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image (see the zero-shot matching sketch after this list)
Ling is an MoE LLM developed and open-sourced by InclusionAI
Multimodal-Driven Architecture for Customized Video Generation
Personalize Any Characters with a Scalable Diffusion Transformer
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
GLM-4.6V/4.5V/4.1V-Thinking: towards versatile multimodal reasoning
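For the flow-matching entry, the library packages these algorithms as a full framework; as a rough illustration of the core objective only, here is a minimal conditional flow matching sketch in plain PyTorch, assuming a hypothetical 2-D toy setup (`velocity_net`, random data) rather than the library's actual API:

```python
import torch
import torch.nn as nn

# Hypothetical toy velocity network: input is (x_t, t), output is a 2-D velocity.
velocity_net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear probability path.

    x_t = (1 - t) * x0 + t * x1 moves noise x0 toward data x1,
    so the target velocity along the path is simply x1 - x0.
    """
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1)                    # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the path
    pred = velocity_net(torch.cat([xt, t], dim=-1))  # predicted velocity
    return ((pred - (x1 - x0)) ** 2).mean()          # regress onto target velocity

# One illustrative optimization step on random stand-in "data".
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)
opt.zero_grad()
loss = flow_matching_loss(torch.randn(128, 2))
loss.backward()
opt.step()
```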
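For the DINOv2 entry, the published checkpoints are loadable through torch.hub; a minimal feature-extraction sketch, assuming the ViT-S/14 variant and a random tensor standing in for a preprocessed image:

```python
import torch

# Load the ViT-S/14 backbone from the official repo via torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Input sides must be multiples of the 14-pixel patch size; 224 = 16 * 14.
x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized image tensor
with torch.no_grad():
    feats = model(x)             # (1, 384) global embedding for ViT-S/14

print(feats.shape)
```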
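For the CLIP entry, zero-shot image-text matching looks roughly like this with the openai/CLIP package; the image path `cat.jpg` is hypothetical:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a few candidate captions, then compare them.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # hypothetical file
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-to-text similarity logits
    probs = logits_per_image.softmax(dim=-1)  # probability over the candidate captions

print(probs)  # the most relevant snippet gets the highest probability
```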