Phi-3.5 for Mac: Locally-run Vision and Language Models
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
ICLR 2024 Spotlight: curation/training code, metadata, distribution, and pre-trained models for MetaCLIP
State-of-the-art Image & Video CLIP, Multimodal Large Language Models
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
Reference PyTorch implementation and models for DINOv3
Towards Real-World Vision-Language Understanding
Large language model & vision-language model based on linear attention
This repository contains the official implementation of FastVLM
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model for generalized humanoid robot reasoning and skills
Chat & pretrained large vision language model
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
A state-of-the-art open visual language model
PyTorch code and models for the DINOv2 self-supervised learning method
High-resolution models for human tasks
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud
Chinese and English multimodal conversational language model
Unified Multimodal Understanding and Generation Models
4M: Massively Multimodal Masked Modeling
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Official inference repo for FLUX.2 models
OCR expert VLM powered by Hunyuan's native multimodal architecture
Multimodal model achieving SOTA performance