Multimodal-Driven Architecture for Customized Video Generation
Tiny vision language model
VGGSfM: Visual Geometry Grounded Deep Structure From Motion
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Generate Any 3D Scene in Seconds
Large language model & vision-language model based on linear attention
Uncommon Objects in 3D dataset
High-resolution models for human tasks
Towards Real-World Vision-Language Understanding
Code for Mesh R-CNN, ICCV 2019
Memory-efficient and performant finetuning of Mistral's models
A SOTA open-source image editing model
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
The ChatGPT Retrieval Plugin lets you easily find personal documents
High-Fidelity and Controllable Generation of Textured 3D Assets
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
LLM-based reinforcement-learning audio editing model
MedicalGPT: Training Your Own Medical GPT Model with a ChatGPT Training Pipeline
Inference script for Oasis 500M
A Customizable Image-to-Video Model based on HunyuanVideo
Chinese LLaMA-2 & Alpaca-2 Large Model Phase II Project
GPT4V-level open-source multi-modal model based on Llama3-8B
Implementation of the Surya Foundation Model for Heliophysics
Chinese and English multimodal conversational language model
Multi-modal large language model designed for audio understanding