AlphaFold 3 inference pipeline
High-Resolution 3D Asset Generation with Large-Scale Diffusion Models
Python inference and LoRA trainer package for the LTX-2 audio–video model
RGBD video generation model conditioned on camera input
Qwen3-TTS is an open-source series of TTS models
GPT-4V-level open-source multimodal model based on Llama3-8B
This repository contains the official implementation of FastVLM
Official inference repo for FLUX.2 models
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Open-source framework for intelligent speech interaction
Inference script for Oasis 500M
Recovering the Visual Space from Any Views
Designed for text embedding and ranking tasks
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
An experimental version of DeepSeek model
Qwen2.5-VL is the multimodal large language model series
State-of-the-art (SoTA) text-to-video pre-trained model
A trainable PyTorch reproduction of AlphaFold 3
A Multi-Modal World Model for Reconstruction, Generation, and Simulation
GLM-4-Voice | End-to-End Chinese-English Conversational Model
Controllable & emotion-expressive zero-shot TTS
Tool for exploring and debugging transformer model behaviors
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning
Code for Mesh R-CNN (ICCV 2019)