A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
Qwen3 is the large language model series developed by the Qwen team
The official repo of Qwen chat & pretrained large language models
OCR expert VLM powered by Hunyuan's native multimodal architecture
LLM-based reinforcement learning audio editing model
Implementation of "MobileCLIP" (CVPR 2024)
Ultra-Efficient LLMs on End Device
Generate Any 3D Scene in Seconds
CogView4, CogView3-Plus and CogView3 (ECCV 2024)
Memory-efficient and performant finetuning of Mistral's models
Block Diffusion for Ultra-Fast Speculative Decoding
Qwen-Image-Layered: Layered Decomposition for Inherent Editability
Repo of Qwen2-Audio chat & pretrained large audio language models
Audio foundation model excelling in audio understanding
Phi-3.5 for Mac: Locally-run Vision and Language Models
Fast-stable-diffusion + DreamBooth
Multimodal Diffusion with Representation Alignment
Multi-modal large language model designed for audio understanding
Large Multimodal Models for Video Understanding and Editing
Open-weight, large-scale hybrid-attention reasoning model
Renderer for the harmony response format to be used with gpt-oss
The official PyTorch implementation of Google's Gemma models
Open-source industrial-grade ASR models
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
FAIR Sequence Modeling Toolkit 2