Gemma 4 12B is Google DeepMind’s unified open-weight multimodal model, designed for efficient local reasoning, coding, and multimodal understanding. Unlike other Gemma 4 models, which rely on separate encoders, the 12B Unified model uses an encoder-free architecture that projects raw image patches and audio waveforms directly into the language model’s embedding space, which reduces multimodal latency and simplifies fine-tuning. It accepts text, image, audio, and video inputs and produces text output, making it useful for transcription, image understanding, video analysis, coding, and agentic workflows. The model has 11.95B parameters, 48 layers, a 256K-token context window, and support for over 140 languages. It also offers configurable thinking modes, native system prompt support, and function calling, and it performs strongly on benchmarks for its size. It is optimized for local deployment on consumer GPUs and workstations.
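To make the encoder-free idea more concrete, here is a minimal sketch of how raw image patches could be mapped into an embedding space with a single linear projection instead of a separate vision encoder. It is an illustration only, not the model's actual implementation: the helper names (`patchify`, `project`), the toy sizes, and the random weights are all assumptions chosen for readability.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append this pixel's channel values
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """Apply one linear layer: each flattened patch becomes one embedding vector.

    weight[j] holds the input weights for output dimension j.
    """
    return [
        [sum(x * w for x, w in zip(p, row)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


# Toy sizes (hypothetical, far smaller than any real model).
rng = random.Random(0)
H, W, C, PATCH, D_MODEL = 4, 4, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patch_dim = PATCH * PATCH * C
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

tokens = project(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 patches, each an 8-dimensional embedding
```

Each resulting vector has the same width as a text token embedding, so in a design like this the image patches can be placed in the same input sequence as text tokens.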
Features
- Encoder-free unified multimodal architecture
- Supports text, image, audio, and video inputs
- 11.95B-parameter dense transformer model
- 256K-token context window for long-context tasks
- Configurable thinking mode for reasoning workflows
- Native function calling for agentic applications
- Supports over 140 languages
- Optimized for consumer GPUs and local deployment
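
As a rough guide to what local deployment involves, the sketch below estimates the memory needed to hold the weights alone, using the 11.95B parameter count stated above. The precisions listed are assumptions for illustration and are not confirmed release formats. The estimate excludes the KV cache, activations, and any quantization overhead, all of which add to real memory use, especially with long contexts.

```python
PARAMS = 11.95e9  # parameter count stated in the description

# Assumed storage precisions (illustrative; not confirmed release formats).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params, bytes_per_param):
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>8}: {weight_gb(PARAMS, size):.1f} GB (weights only)")
```

At 2 bytes per parameter the weights come to roughly 23.9 GB, so fitting the model on a typical consumer GPU would likely depend on a lower-precision format.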