Gemma 4 12B
Unified multimodal Gemma model for local coding and reasoning
Gemma 4 12B is Google DeepMind’s unified open-weight multimodal model, designed for efficient local reasoning, coding, and multimodal understanding. Unlike other Gemma 4 models, which rely on separate encoders, the 12B Unified model is encoder-free: it projects raw image patches and audio waveforms directly into the language model’s embedding space, which reduces multimodal latency and simplifies fine-tuning. It accepts text, image, audio, and video inputs and produces text output, making it suitable for transcription, image understanding, video analysis, coding, and agentic workflows. ...
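To make "projecting raw image patches and audio waveforms directly into the embedding space" concrete, here is a minimal sketch of encoder-free tokenization in plain Python. It is not Gemma's implementation, which the description does not detail. It assumes non-overlapping square image patches, fixed-length non-overlapping audio frames, and one linear projection per modality. All function names and sizes (`patchify`, `frame_waveform`, `linear_projection`, patch size 4, frame length 160, `d_model` of 8) are hypothetical, and random weights stand in for trained ones.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch x patch tiles,
    in row-major order. Assumes H and W are multiples of `patch`."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append every channel of this pixel
            patches.append(flat)
    return patches


def frame_waveform(samples, frame_len):
    """Cut a 1-D waveform into non-overlapping frames, dropping any trailing remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def linear_projection(vectors, weight, bias):
    """Map each raw input vector x to an embedding y = W x + b (no encoder network)."""
    return [[sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
             for row, b_i in zip(weight, bias)]
            for x in vectors]


def random_matrix(rows, cols, rng):
    """Stand-in for learned projection weights (hypothetical, untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model = 8    # hypothetical embedding width; real models are far wider
    patch = 4      # hypothetical patch size
    channels = 3

    # A toy 8x8 RGB image -> (8/4) * (8/4) = 4 patches, each 4*4*3 = 48 values long.
    image = [[[rng.random() for _ in range(channels)] for _ in range(8)] for _ in range(8)]
    img_tokens = linear_projection(
        patchify(image, patch),
        random_matrix(d_model, patch * patch * channels, rng),
        [0.0] * d_model,
    )

    # A toy 1,000-sample waveform -> 6 frames of 160 samples (remainder dropped).
    frame_len = 160
    audio = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
    audio_tokens = linear_projection(
        frame_waveform(audio, frame_len),
        random_matrix(d_model, frame_len, rng),
        [0.0] * d_model,
    )

    print(len(img_tokens), len(img_tokens[0]))      # 4 8
    print(len(audio_tokens), len(audio_tokens[0]))  # 6 8
```

Both modalities end up as sequences of `d_model`-wide vectors that a language model could consume alongside text token embeddings. The design trade-off the description points to is visible here: there is no separate encoder network to run or fine-tune, only a projection, at the cost of leaving all feature extraction to the language model itself.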