RGBD video generation model conditioned on camera input
Capable of understanding text, audio, vision, and video
Code for running inference and finetuning with the SAM 3 model
Benchmark LLMs by fighting in Street Fighter 3
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
OCR expert VLM powered by Hunyuan's native multimodal architecture
A Pioneering Open-Source Alternative to GPT-4o
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark