High-Resolution 3D Asset Generation with Large-Scale Diffusion Models
Python inference and LoRA trainer package for the LTX-2 audio–video model
RGBD video generation model conditioned on camera input
Qwen3-TTS is an open-source series of TTS models
This repository contains the official implementation of FastVLM
Official inference repo for FLUX.2 models
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Inference script for Oasis 500M
Recovering the Visual Space from Any Views
Designed for text embedding and ranking tasks
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
An experimental version of the DeepSeek model
Qwen2.5-VL is the multimodal large language model series
A Multi-Modal World Model for Reconstructing, Generating, and Simulating
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
Controllable & emotion-expressive zero-shot TTS
Tool for exploring and debugging transformer model behaviors
Ring is a reasoning MoE LLM developed and open-sourced by InclusionAI
Tiny vision-language model
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Multimodal-Driven Architecture for Customized Video Generation
Ling-V2 is a MoE LLM developed and open-sourced by InclusionAI