DiffusionGemma 26B A4B IT NVFP4 is NVIDIA’s quantized release of Google DeepMind’s DiffusionGemma 26B A4B IT model, produced with NVIDIA Model Optimizer. The underlying model is an open-weights multimodal generative model that accepts text, image, and video inputs and produces text output through discrete diffusion. Built on the Gemma 4 26B A4B Mixture-of-Experts architecture, it has 25.2B total parameters, of which 3.8B are active, which keeps per-token compute well below that of a dense model of the same total size. Its diffusion-based generation produces tokens in parallel 256-token blocks, with reported generation speeds above 1,100 tokens per second on an NVIDIA H100 (Hopper) GPU in FP8. The model supports a 256K-token context window, a configurable thinking mode, native function calling, structured JSON output, and multilingual inference across more than 35 languages. NVFP4 quantization reduces weights and activations from 16-bit to 4-bit precision, lowering disk size and GPU memory requirements for deployment with vLLM.
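To make the memory saving concrete, the following is a rough back-of-the-envelope sketch in Python, not a measurement of the released checkpoint. It assumes every one of the 25.2B parameters is quantized, and that NVFP4 stores 4-bit values in blocks of 16 with one 8-bit scale per block (about 4.5 effective bits per weight). The helper `weight_gb` and the constant names are illustrative only.

```python
# Back-of-the-envelope weight footprint; not a measurement of the released checkpoint.
TOTAL_PARAMS = 25.2e9  # total parameter count quoted in the description

# Assumption: NVFP4 packs 4-bit values in blocks of 16 with one 8-bit scale per block.
NVFP4_BITS_PER_WEIGHT = 4 + 8 / 16  # 4.5 effective bits per weight
BF16_BITS_PER_WEIGHT = 16


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, BF16_BITS_PER_WEIGHT)
nvfp4_gb = weight_gb(TOTAL_PARAMS, NVFP4_BITS_PER_WEIGHT)

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f"NVFP4 weights:  ~{nvfp4_gb:.1f} GB")
print(f"reduction:      ~{bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the estimate is roughly 50 GB of weights at 16-bit versus roughly 14 GB in NVFP4. The real checkpoint may be larger, since quantized releases often keep some layers at higher precision, and the KV cache and activations add to GPU memory use at run time.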
Features
- NVFP4 4-bit quantization for lower memory usage
- 25.2B total parameters with 3.8B active parameters
- Multimodal input support for text, images, and video
- Discrete diffusion generation with parallel token blocks
- 256K-token context window for long-context workflows
- Native function calling and structured JSON output
- Multilingual inference across more than 35 languages
- Optimized for vLLM on NVIDIA Hopper and Blackwell GPUs
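As a small illustration of the parallel token blocks mentioned above, the sketch below counts how many 256-token blocks are needed to cover a response of a given length. It assumes output is produced one fixed-size block at a time; how many denoising steps each block takes is not stated here and is not modeled. `blocks_needed` is a hypothetical helper, not part of any library.

```python
import math

BLOCK_TOKENS = 256  # block size quoted in the description


def blocks_needed(output_tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Number of fixed-size blocks required to cover output_tokens (hypothetical helper)."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return math.ceil(output_tokens / block_tokens)


for n in (100, 256, 1000):
    print(n, "tokens ->", blocks_needed(n), "block(s)")
```

Short responses fit in a single block, so the benefit of block-parallel generation is most visible on longer outputs that span several blocks.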