Phi-3.5 for Mac: Locally-run Vision and Language Models
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Reference PyTorch implementation and models for DINOv3
Towards Real-World Vision-Language Understanding
Large-language-model & vision-language-model based on Linear Attention
This repository contains the official implementation of FastVLM
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model
PyTorch code and models for the DINOv2 self-supervised learning
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
High-resolution models for human tasks
Unified Multimodal Understanding and Generation Models
4M: Massively Multimodal Masked Modeling
Official inference repo for FLUX.2 models
Contexts Optical Compression
Multimodal model achieving SOTA performance
Tiny vision language model
Official DeiT repository
Foundational Models for State-of-the-Art Speech and Text Translation
Implementation of the Surya Foundation Model for Heliophysics
Large Multimodal Models for Video Understanding and Editing
BlazeFace is a lightweight model that detects faces in images
This repository contains the official implementation of research