CLIP: Predict the most relevant text snippet given an image
A Family of Open-Source Music Foundation Models
Recovering the Visual Space from Any Views
1B text generation model based on the HRM architecture
Official code base for LeWorldModel: Stable End-to-End Joint-Embedding
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
Implementation of the Surya Foundation Model for Heliophysics
Multimodal embedding and reranking models built on Qwen3-VL
Generate Any 3D Scene in Seconds
Diffusion Transformer with Fine-Grained Chinese Understanding
Language modeling in a sentence representation space
Large Multimodal Models for Video Understanding and Editing
Di♪♪Rhythm: Blazingly Fast & Simple End-to-End Song Generation
Official PyTorch Implementation of "Scalable Diffusion Models"
Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion
Learning embeddings for classification, retrieval and ranking
A library for Multilingual Unsupervised or Supervised word Embeddings
Code for reproducing key results in the paper
Open language model developed by NVIDIA as part of the Nemotron-3 family
Metric monocular depth estimation (vision model)
Unified multimodal Gemma model for local coding and reasoning