Large Multimodal Models for Video Understanding and Editing
Large language model & vision-language model based on linear attention (see the mechanism sketch after this list)
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Phi-3.5 for Mac: Locally Run Vision and Language Models
Revolutionizing Database Interactions with Private LLM Technology
The official PyTorch implementation of Google's Gemma models
High-resolution models for human-centric vision tasks
CLIP: predict the most relevant text snippet given an image (see the usage sketch after this list)
Ling is an MoE LLM provided and open-sourced by InclusionAI
Multimodal-Driven Architecture for Customized Video Generation
Personalize Any Character with a Scalable Diffusion Transformer
Inference script for Oasis 500M
4M: Massively Multimodal Masked Modeling
FAIR Sequence Modeling Toolkit 2
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Official DeiT repository
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI
A Customizable Image-to-Video Model based on HunyuanVideo
Open-source large language model family from Tencent Hunyuan
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model for generalized humanoid robot reasoning and skills
Stable Diffusion WebUI Forge is a platform on top of Stable Diffusion WebUI
DeepMind model for tracking arbitrary points across video, with applications in robotics
Tooling for the Common Objects In 3D dataset
Code for Mesh R-CNN, ICCV 2019
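
For the linear-attention entry above, here is a minimal, generic sketch of the mechanism: the softmax kernel is replaced by a positive feature map so attention can be computed as phi(Q)(phi(K)^T V) in time linear in sequence length. The elu+1 feature map follows Katharopoulos et al. (2020); this is an illustration of the technique, not code from that repository.

```python
# Generic (non-causal) linear attention sketch, assuming the elu+1 feature
# map from Katharopoulos et al. (2020). Not taken from the listed repo.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (batch, seq, dim). Returns (batch, seq, dim)."""
    q = F.elu(q) + 1  # positive feature map phi(.)
    k = F.elu(k) + 1
    # phi(K)^T V, computed once: (batch, dim, dim) instead of (seq, seq)
    kv = torch.einsum("bsd,bse->bde", k, v)
    # Per-position normalizer: q_s . sum_t phi(k_t)
    z = torch.einsum("bsd,bd->bs", q, k.sum(dim=1))
    # phi(Q) (phi(K)^T V), normalized
    return torch.einsum("bsd,bde->bse", q, kv) / (z.unsqueeze(-1) + eps)

q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```

Because phi(K)^T V is a fixed-size (dim x dim) matrix, doubling the sequence length roughly doubles the cost, rather than quadrupling it as in softmax attention.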
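And for the CLIP entry, a minimal zero-shot image-text matching sketch using the Hugging Face `transformers` port of CLIP; the checkpoint name, image URL, and candidate captions are illustrative assumptions, not part of the list above.

```python
# Zero-shot image-text matching with CLIP via Hugging Face transformers.
# Checkpoint, URL, and captions below are stand-ins for demonstration.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this URL is a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate text snippets; CLIP scores each one against the image.
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores, one per snippet.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```

The highest-probability caption is the snippet CLIP judges most relevant to the image, which is exactly the behavior the entry above describes.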