Pixtral Large
Pixtral Large is a 124-billion-parameter open-weight multimodal model developed by Mistral AI, building upon their Mistral Large 2 architecture. It integrates a 123-billion-parameter multimodal decoder with a 1-billion-parameter vision encoder, enabling advanced understanding of documents, charts, and natural images while maintaining leading text comprehension capabilities. With a context window of 128,000 tokens, Pixtral Large can process at least 30 high-resolution images simultaneously. The model has demonstrated state-of-the-art performance on benchmarks such as MathVista, DocVQA, and VQAv2, surpassing models like GPT-4o and Gemini-1.5 Pro. Pixtral Large is available under the Mistral Research License for research and educational use, and under the Mistral Commercial License for commercial applications.
Learn more
Qwen3.6-35B-A3B
Qwen3.5-35B-A3B is part of the Qwen3.5 “Medium” model series, designed as a highly efficient, multimodal foundation model that balances strong reasoning ability with practical deployment requirements. It uses a Mixture-of-Experts (MoE) architecture with 35 billion total parameters but activates only about 3 billion per token, allowing it to deliver performance comparable to much larger models while significantly reducing computational cost. The model integrates a hybrid attention mechanism that combines linear attention with standard attention layers, enabling efficient long-context processing and improved scalability for complex tasks. As a native vision-language model, it can process both text and visual inputs, supporting use cases such as multimodal reasoning, coding, and agent-based workflows. It is designed to function as a general-purpose “AI agent,” capable of planning, tool use, and structured problem solving rather than just conversational responses.
Learn more
GLM-4.1V
GLM-4.1V is a vision-language model, providing a powerful, compact multimodal model designed for reasoning and perception across images, text, and documents. The 9-billion-parameter variant (GLM-4.1V-9B-Thinking) is built on the GLM-4-9B foundation and enhanced through a specialized training paradigm using Reinforcement Learning with Curriculum Sampling (RLCS). It supports a 64k-token context window and accepts high-resolution inputs (up to 4K images, any aspect ratio), enabling it to handle complex tasks such as optical character recognition, image captioning, chart and document parsing, video and scene understanding, GUI-agent workflows (e.g., interpreting screenshots, recognizing UI elements), and general vision-language reasoning. In benchmark evaluations at the 10 B-parameter scale, GLM-4.1V-9B-Thinking achieved top performance on 23 of 28 tasks.
Learn more
Molmo 2
Molmo 2 is a new suite of state-of-the-art open vision-language models with fully open weights, training data, and training code that extends the original Molmo family’s grounded image understanding to video and multi-image inputs, enabling advanced video understanding, pointing, tracking, dense captioning, and question-answering capabilities; all with strong spatial and temporal reasoning across frames. Molmo 2 includes three variants: an 8 billion-parameter model optimized for overall video grounding and QA, a 4 billion-parameter version designed for efficiency, and a 7 billion-parameter Olmo-backed model offering a fully open end-to-end architecture including the underlying language model. These models outperform earlier Molmo versions on core benchmarks and set new open-model high-water marks for image and video understanding tasks, often competing with substantially larger proprietary systems while training on a fraction of the data used by comparable closed models.
Learn more