High-Resolution 3D Asset Generation with Large-Scale Diffusion Models
Python inference and LoRA trainer package for the LTX-2 audio–video model
RGBD video generation model conditioned on camera input
Qwen3-TTS is an open-source series of TTS models
This repository contains the official implementation of FastVLM
Official inference repo for FLUX.2 models
Inference script for Oasis 500M
Recovering the Visual Space from Any Views
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
An experimental version of the DeepSeek model
Qwen2.5-VL is the multimodal large language model series by Alibaba Cloud
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Qwen3-VL, the multimodal large language model series by Alibaba Cloud
Tool for exploring and debugging transformer model behaviors
Controllable & emotion-expressive zero-shot TTS
Code for Mesh R-CNN, ICCV 2019
Ring is a reasoning MoE LLM provided and open-sourced by InclusionAI
Tiny vision language model
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Multimodal-Driven Architecture for Customized Video Generation
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI
4M: Massively Multimodal Masked Modeling