Chinese and English multimodal conversational language model
Unified Multimodal Understanding and Generation Models
UI-TARS-desktop version that can operate on your local personal device
SGLang is a fast serving framework for large language models
Pre-trained Deep Learning models and demos
4M: Massively Multimodal Masked Modeling
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML
Comprehensive and timely academic information on federated learning
Official inference repo for FLUX.2 models
Contexts Optical Compression
OCR expert VLM powered by Hunyuan's native multimodal architecture
A Modular Simulation Framework and Benchmark for Robot Learning
Sparsity-aware deep learning inference runtime for CPUs
Tiny vision language model
Open Source Computer Vision Library
Solve end-to-end problems using the Llama model family
Benchmarking Multimodal Agents for Open-Ended Tasks
Tooling for the Common Objects In 3D dataset
Code for Mesh R-CNN, ICCV 2019
A simple screen-parsing tool towards a pure vision-based GUI agent
Automate native Android apps with AI using accessibility APIs
Uncommon Objects in 3D dataset
GPT4V-level open-source multimodal model based on Llama3-8B