NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 is an open large language model developed and released by NVIDIA as part of its Nemotron 3 family, built for efficient inference and strong reasoning performance. It is the post-trained, FP8-quantized variant of the Nemotron 3 Nano model: its weights and activations are represented in 8-bit floating point (FP8), which reduces memory usage and computational cost relative to higher-precision formats while retaining high accuracy. The underlying Nano architecture is a hybrid Mamba-Transformer Mixture-of-Experts (MoE) design that activates only a small fraction of its 31.6 billion parameters per token (roughly 3 billion, the "A3B" in the model name), which improves speed and efficiency without sacrificing quality on complex queries. The model supports a context length of up to 1 million tokens, making it suitable for long-context reasoning, agentic tasks, extended dialogues, and applications such as code generation or document summarization.
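To make the memory and sparsity claims concrete, the following is a rough back-of-the-envelope sketch rather than an official sizing guide. It assumes every weight is stored at the stated precision (real FP8 checkpoints often keep some layers in higher precision), ignores activations and cache memory, and reads "A3B" in the model name as about 3 billion active parameters per token. The helper name `weight_gib` is hypothetical.

```python
# Back-of-the-envelope estimate of weight storage and per-token sparsity.
# Assumptions (not from an official spec): all weights use the listed dtype,
# and "A3B" means roughly 3 billion parameters are active per token.

TOTAL_PARAMS = 31.6e9    # total parameter count stated in the description
ACTIVE_PARAMS = 3.0e9    # assumed active parameters per token ("A3B")

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


bf16_gib = weight_gib(TOTAL_PARAMS, "bf16")
fp8_gib = weight_gib(TOTAL_PARAMS, "fp8")
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:  {fp8_gib:.1f} GiB")
print(f"Active fraction per token: {active_fraction:.1%}")
```

Under these assumptions, FP8 halves the weight footprint compared with a 16-bit format, and a little under a tenth of the parameters take part in computing any single token.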
Features
- Mixture-of-Experts (MoE) architecture with hybrid Mamba-Transformer design
- FP8 quantization for efficient memory and compute usage
- Supports extremely long context windows (up to 1 million tokens)
- Configurable reasoning trace generation before final answers
- Strong general-purpose reasoning, chat, and code generation capabilities
- Safety-oriented training with guard mechanisms, released under an open model license
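The Mixture-of-Experts feature listed above can be illustrated with a toy top-k router. This is a generic sketch of the technique, not the model's actual routing code: the expert count, the value of k, the router scores, and the function names `top_k_route` and `moe_forward` are all hypothetical and chosen only to show how a token ends up using a small subset of experts.

```python
import math

# Toy top-k Mixture-of-Experts routing (generic illustration, hypothetical sizes).
NUM_EXPERTS = 8
TOP_K = 2


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, routes


# Each toy "expert" just scales its input by a different factor (1x to 8x).
experts = [lambda x, scale=scale: scale * x for scale in range(1, NUM_EXPERTS + 1)]

# Hypothetical router scores for one token; experts 1 and 3 tie for the top.
router_logits = [0.0, 3.0, -1.0, 3.0, 0.5, -2.0, 1.0, 0.2]

output, routes = moe_forward(1.0, experts, router_logits, TOP_K)
used_experts = [i for i, _ in routes]
print(used_experts, output)  # [1, 3] 3.0
```

Only two of the eight toy experts run for this token, which is the mechanism that lets an MoE model keep a large total parameter count while spending compute on a small active subset.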