OmniGen2 is a powerful, efficient open-source multimodal generation model designed for diverse tasks involving both images and text. It improves on its predecessor by introducing separate decoding pathways for text and image, with unshared parameters and a decoupled image tokenizer, which enhances flexibility and performance. Built on a strong Qwen2.5-VL foundation, OmniGen2 excels at visual understanding, high-quality text-to-image generation, and instruction-guided image editing. It also supports in-context generation, combining multiple inputs such as people, objects, and scenes into novel, coherent visuals. The project provides ready-to-use models, extensive Gradio demos, and resource-efficient options such as CPU offloading for devices with limited VRAM. Users can tune generation results with hyperparameters such as text and image guidance scales, maximum image resolution, and negative prompts.
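
As a rough illustration of how these hyperparameters fit together, here is a minimal sketch assuming a diffusers-style pipeline interface. The class name `OmniGen2Pipeline`, the import path, the model id, and the argument names (`text_guidance_scale`, `image_guidance_scale`, `negative_prompt`, `max_input_image_pixels`) are assumptions based on the description above, not a verified API; consult the repository's inference scripts for the exact names.

```python
# Hypothetical sketch: a diffusers-style text-to-image call with OmniGen2.
# Class, import path, model id, and argument names are assumptions, not a verified API.
import torch
from omnigen2 import OmniGen2Pipeline  # assumed import path

pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2",                 # assumed model id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A red fox resting in a snowy birch forest at dawn",
    negative_prompt="blurry, low quality, extra limbs",
    text_guidance_scale=5.0,             # how strongly to follow the text prompt
    image_guidance_scale=2.0,            # how strongly to follow input images (editing/in-context)
    max_input_image_pixels=1024 * 1024,  # cap on input image resolution
    num_inference_steps=50,
).images[0]

image.save("fox.png")
```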
Features
- Unified multimodal model with distinct decoding paths for text and images
- Built on Qwen2.5-VL for strong visual understanding
- Generates high-fidelity images from text prompts with fine control
- Instruction-guided image editing for precise modifications
- Supports in-context generation combining diverse inputs into coherent outputs
- Resource-efficient, with CPU offload options for devices with limited VRAM (see the sketch after this list)
- Comprehensive Gradio demos and example scripts for quick experimentation
- Open-source under Apache 2.0 license with training code and data forthcoming
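
For low-VRAM devices, the sketch below shows how offloading and instruction-guided editing might be combined. `enable_model_cpu_offload()` follows the common diffusers convention, and the `input_images` parameter is likewise an assumption about how reference images are passed; both should be checked against the project's example scripts.

```python
# Hypothetical sketch: trading speed for memory on a low-VRAM GPU while editing an image.
# The offload method and the input_images parameter are assumptions, not a confirmed API.
import torch
from omnigen2 import OmniGen2Pipeline  # assumed import path

pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2",            # assumed model id
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()     # keep only the active sub-module on the GPU

image = pipe(
    prompt="Replace the background with a beach at sunset",
    input_images=["photo.png"],     # assumed parameter for instruction-guided editing
    text_guidance_scale=5.0,
    image_guidance_scale=1.8,
).images[0]

image.save("edited.png")
```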