A Unified Framework for Text-to-3D and Image-to-3D Generation
Multimodal-Driven Architecture for Customized Video Generation
The Clay Foundation Model - An open source AI model and interface
Capable of understanding text, audio, vision, video
Release for Improved Denoising Diffusion Probabilistic Models
Chat & pretrained large audio language model proposed by Alibaba Cloud
Chinese LLaMA-2 & Alpaca-2 Large Model Phase II Project
A state-of-the-art open visual language model
Pushing the Limits of Mathematical Reasoning in Open Language Models
ICLR 2024 Spotlight: curation/training code, metadata, distribution
Qwen3-Omni is a natively end-to-end, omni-modal LLM
Towards Real-World Vision-Language Understanding
CLIP: predict the most relevant text snippet given an image
AlphaFold 3 inference pipeline
Chat & pretrained large vision language model
Repo of Qwen2-Audio chat & pretrained large audio language model
OCR expert VLM powered by Hunyuan's native multimodal architecture
High-Resolution Image Synthesis with Latent Diffusion Models
High-resolution models for human tasks
GPT-4V-level open-source multimodal model based on Llama3-8B
Example Discord bot written in Python that uses the completions API
Let's make video diffusion practical
VMZ: Model Zoo for Video Modeling
Python SDK for Claude Agent
Tool for exploring and debugging transformer model behaviors