Alternatives to Phi-3
Compare Phi-3 alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to Phi-3 in 2026. Compare features, ratings, user reviews, pricing, and more from Phi-3 competitors and alternatives in order to make an informed decision for your business.
1
TinyLlama
TinyLlama
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, this can be achieved within a span of "just" 90 days using 16 A100-40G GPUs. TinyLlama adopts exactly the same architecture and tokenizer as Llama 2, so it can be dropped into many open-source projects built upon Llama. In addition, TinyLlama is compact, with only 1.1B parameters, which allows it to serve the many applications that demand a restricted computation and memory footprint.
Starting Price: Free
2
OpenELM
Apple
OpenELM is an open-source language model family developed by Apple. It uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy compared to existing open language models of similar size. OpenELM is trained on publicly available datasets and achieves state-of-the-art performance for its size.
3
Phi-4
Microsoft
Phi-4 is a 14B parameter state-of-the-art small language model (SLM) that excels at complex reasoning in areas such as math, in addition to conventional language processing. Phi-4 is the latest member of Microsoft's Phi family of small language models and demonstrates what's possible as the company continues to probe the boundaries of SLMs. Phi-4 is currently available on Azure AI Foundry under a Microsoft Research License Agreement (MSRLA) and will be available on Hugging Face. Phi-4 outperforms comparable and larger models on math-related reasoning due to advancements throughout the training process, including the use of high-quality synthetic datasets, curation of high-quality organic data, and post-training innovations. Phi-4 continues to push the frontier of size vs. quality.
4
Qwen2
Alibaba
Qwen2 is a series of large language models developed by the Qwen team at Alibaba Cloud. It includes both base language models and instruction-tuned models, ranging from 0.5 billion to 72 billion parameters, and features both dense models and a Mixture-of-Experts model. The Qwen2 series is designed to surpass most previous open-weight models, including its predecessor Qwen1.5, and to compete with proprietary models across a broad spectrum of benchmarks in language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning.
Starting Price: Free
5
Hunyuan-TurboS
Tencent
Tencent's Hunyuan-TurboS is a next-generation AI model designed to offer rapid responses and outstanding performance in various domains such as knowledge, mathematics, and creative tasks. Unlike previous models that require "slow thinking," Hunyuan-TurboS enhances response speed, doubling word output speed and reducing first-word latency by 44%. Through innovative architecture, it provides superior performance while lowering deployment costs. This model combines fast thinking (intuition-based responses) with slow thinking (logical analysis), ensuring quicker, more accurate solutions across diverse scenarios. Hunyuan-TurboS excels in benchmarks, competing with leading models like GPT-4 and DeepSeek V3, making it a breakthrough in AI-driven performance.
6
Gemini Flash
Google
Gemini Flash is an advanced large language model (LLM) from Google, specifically designed for high-speed, low-latency language processing tasks. Part of Google DeepMind's Gemini series, Gemini Flash is tailored to provide real-time responses and handle large-scale applications, making it ideal for interactive AI-driven experiences such as customer support, virtual assistants, and live chat solutions. Despite its speed, Gemini Flash doesn't compromise on quality; it's built on sophisticated neural architectures that ensure responses remain contextually relevant, coherent, and precise. Google has incorporated rigorous ethical frameworks and responsible AI practices into Gemini Flash, equipping it with guardrails to manage and mitigate biased outputs, ensuring it aligns with Google's standards for safe and inclusive AI. With Gemini Flash, Google empowers businesses and developers to deploy responsive, intelligent language tools that can meet the demands of fast-paced environments.
7
Arcee-SuperNova
Arcee.ai
Arcee-SuperNova is a small language model (SLM) with all the power and performance of leading closed-source LLMs. It excels at generalized tasks, instruction-following, and human preference alignment, and is positioned as the best 70B model on the market. SuperNova can be utilized for any generalized task, much like OpenAI's GPT-4o, Claude 3.5 Sonnet, and Cohere's models. Trained with the most advanced learning & optimization techniques, SuperNova generates highly accurate responses in human-like text. It's the most flexible, secure, and cost-effective language model on the market, saving customers up to 95% on total deployment costs vs. traditional closed-source models. Use SuperNova to integrate AI into apps and products, for general chat purposes, and for diverse use cases. Regularly update your models with the latest open-source tech, ensuring you're never locked into any one solution. Protect your data with industry-leading privacy measures.
Starting Price: Free
8
Claude Haiku 4.5
Anthropic
Anthropic has launched Claude Haiku 4.5, its latest small language model designed to deliver near-frontier performance at significantly lower cost. The model provides coding and reasoning quality similar to the company's mid-tier Sonnet 4, yet it runs at roughly one-third of the cost and more than twice the speed. In benchmarks cited by Anthropic, Haiku 4.5 meets or exceeds Sonnet 4's performance in key tasks such as code generation and multi-step "computer use" workflows. It is optimized for real-time, low-latency scenarios such as chat assistants, customer service agents, and pair-programming support. Haiku 4.5 is made available via the Claude API under the identifier "claude-haiku-4-5" and supports large-scale deployments where cost, responsiveness, and near-frontier intelligence matter. Claude Haiku 4.5 is available now on Claude Code and Anthropic's apps. Its efficiency means you can accomplish more within your usage limits while maintaining premium model performance.
Starting Price: $1 per million input tokens
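Since the listing gives both the API identifier and a per-token rate, here is a minimal sketch of what a request body for Claude Haiku 4.5 might look like under the Anthropic Messages API. The model string "claude-haiku-4-5" comes from the description above; the request shape follows the public Messages API, but treat the details as an assumption and confirm against Anthropic's documentation before use.

```python
import json

def build_haiku_request(prompt: str, max_tokens: int = 512) -> dict:
    """Assemble a Messages API request body (constructed only, not sent)."""
    return {
        "model": "claude-haiku-4-5",  # identifier quoted in the listing
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_haiku_request("Summarize this support ticket in one sentence.")
print(json.dumps(body, indent=2))
```

In practice this body would be sent via the official SDK or an HTTPS POST with an API key; only the payload construction is shown here.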
9
Tiny Aya
Cohere AI
Tiny Aya is a family of open-weight multilingual language models from Cohere Labs designed to deliver powerful, adaptable AI that can run efficiently on local devices, including phones and laptops, without requiring constant cloud connectivity. It focuses on enabling high-quality text understanding and generation across more than 70 languages, including many lower-resource languages that are often underserved by mainstream models. Built with lightweight architectures around 3.35 billion parameters, Tiny Aya is optimized for balanced multilingual representation and realistic compute constraints, making it suitable for edge deployment and offline use. The models support downstream adaptation and instruction tuning, allowing developers to customize behavior for specific applications while maintaining strong cross-lingual performance.
Starting Price: Free
10
Cohere
Cohere
Cohere is an enterprise AI platform that enables developers and businesses to build powerful language-based applications. Specializing in large language models (LLMs), Cohere provides solutions for text generation, summarization, and semantic search. Their model offerings include the Command family for high-performance language tasks and Aya Expanse for multilingual applications across 23 languages. Focused on security and customization, Cohere allows flexible deployment across major cloud providers, private cloud environments, or on-premises setups to meet diverse enterprise needs. The company collaborates with industry leaders like Oracle and Salesforce to integrate generative AI into business applications, improving automation and customer engagement. Additionally, Cohere For AI, their research lab, advances machine learning through open-source projects and a global research community.
Starting Price: Free
11
Sarvam 30B
Sarvam
Sarvam-30B is an open source, next-generation large language model designed as a unified system for both real-time conversational AI and deep reasoning workloads, built with a strong focus on multilingual intelligence and practical deployment. The 30B model is optimized for speed and efficiency, using a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per request, enabling high throughput, low latency, and deployment even in resource-constrained environments such as local machines or edge systems. It delivers strong performance in conversational tasks, coding, and reasoning while achieving state-of-the-art results across more than 20 Indian languages, making it highly effective for multilingual applications and voice-based systems. Within Sarvam's dual-tier architecture, it serves as the fast, deployable "conversational workhorse," leveraging MoE designs to reduce compute cost while maintaining high performance.
Starting Price: Free
12
Jamba
AI21 Labs
Jamba is the most powerful & efficient long-context model, open for builders and built for the enterprise. Jamba's latency outperforms all leading models of comparable sizes. Jamba's 256k context window is the longest openly available. Jamba's Mamba-Transformer MoE architecture is designed for cost & efficiency gains. Jamba offers key features out of the box, including function calls, JSON mode output, document objects, and citation mode. Jamba 1.5 models maintain high performance across the full length of their context window and achieve top scores across common quality benchmarks. Secure deployment suits your enterprise: seamlessly start using Jamba on AI21's production-grade SaaS platform, or deploy the Jamba model family across AI21's strategic partners. VPC & on-prem deployments are available for enterprises that require custom solutions, and for enterprises with unique, bespoke requirements, AI21 offers hands-on management, continuous pre-training, and more.
13
GPT-4.1 nano
OpenAI
GPT-4.1 nano is the smallest and most efficient version of OpenAI's GPT-4.1 model, optimized for low-latency, cost-effective AI processing. Despite its compact size, GPT-4.1 nano delivers strong performance with a 1 million token context window, making it ideal for applications like classification, autocompletion, and smaller-scale tasks that require fast responses. It provides a highly efficient solution for businesses and developers who need an AI model that balances speed, cost, and performance.
Starting Price: $0.10 per 1M tokens (input)
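The listed rate of $0.10 per 1M input tokens makes back-of-envelope budgeting straightforward. A rough sketch, using only the input-side rate quoted above (output-token pricing is not given in the listing, so it is deliberately excluded; the workload numbers are illustrative assumptions):

```python
def input_cost_usd(input_tokens: int, rate_per_million: float = 0.10) -> float:
    """Estimated input-token cost at the listed $0.10 per 1M tokens."""
    return input_tokens / 1_000_000 * rate_per_million

# Hypothetical classification workload: 50,000 requests at ~400 input tokens each.
tokens = 50_000 * 400                       # 20M input tokens
print(f"${input_cost_usd(tokens):.2f}")     # prints $2.00
```

At this rate, even high-volume classification or autocomplete workloads stay in the low single digits of dollars, which is the point of the nano tier.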
14
GLM-4.6V
Zhipu AI
GLM-4.6V is a state-of-the-art open source multimodal vision-language model from the Z.ai (GLM-V) family designed for reasoning, perception, and action. It ships in two variants: a full-scale version (106B parameters) for cloud or high-performance clusters, and a lightweight "Flash" variant (9B) optimized for local deployment or low-latency use. GLM-4.6V supports a native context window of up to 128K tokens during training, enabling it to process very long documents or multimodal inputs. Crucially, it integrates native Function Calling, meaning the model can take images, screenshots, documents, or other visual media as input directly (without manual text conversion), reason about them, and trigger tool calls, bridging "visual perception" with "executable action." This enables a wide spectrum of capabilities, such as interleaved image-and-text content generation (for example, combining document understanding with text summarization or generation of image-annotated responses).
Starting Price: Free
15
DeepSeek V3.1
DeepSeek
DeepSeek V3.1 is a groundbreaking open-weight large language model featuring a massive 685 billion parameters and an extended 128,000-token context window, enabling it to process documents equivalent to 400-page books in a single prompt. It delivers integrated capabilities for chat, reasoning, and code generation within a unified hybrid architecture, seamlessly blending these functions into one coherent model. V3.1 supports a variety of tensor formats to give developers flexibility in optimizing performance across different hardware. Early benchmark results show robust performance, including a 71.6% score on the Aider coding benchmark, putting it on par with or ahead of systems like Claude Opus 4 at a far lower cost. Made available under an open source license on Hugging Face with minimal fanfare, DeepSeek V3.1 is poised to reshape access to high-performance AI, challenging traditional proprietary models.
Starting Price: Free
16
Gemini 2.0
Google
Gemini 2.0 is an advanced AI-powered model developed by Google, designed to offer groundbreaking capabilities in natural language understanding, reasoning, and multimodal interactions. Building on the success of its predecessor, Gemini 2.0 integrates large language processing with enhanced problem-solving and decision-making abilities, enabling it to interpret and generate human-like responses with greater accuracy and nuance. Unlike traditional AI models, Gemini 2.0 is trained to handle multiple data types simultaneously, including text, images, and code, making it a versatile tool for research, business, education, and creative industries. Its core improvements include better contextual understanding, reduced bias, and a more efficient architecture that ensures faster, more reliable outputs. Gemini 2.0 is positioned as a major step forward in the evolution of AI, pushing the boundaries of human-computer interaction.
Starting Price: Free
17
Amazon Nova
Amazon
Amazon Nova is a new generation of state-of-the-art (SOTA) foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance, available exclusively on Amazon Bedrock. Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro are understanding models that accept text, image, or video inputs and generate text output. They provide a broad selection of operating points across capability, accuracy, speed, and cost. Amazon Nova Micro is a text-only model that delivers the lowest-latency responses at very low cost. Amazon Nova Lite is a very low-cost multimodal model that is lightning fast for processing image, video, and text inputs. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Pro's capabilities, coupled with its industry-leading speed and cost efficiency, make it a compelling model for almost any task, including video summarization, Q&A, and math.
18
GPT-5.4 mini
OpenAI
GPT-5.4 mini is a fast and efficient AI model designed for high-performance tasks such as coding, reasoning, and multimodal understanding. It delivers strong capabilities similar to larger models while maintaining lower latency and cost. The model is optimized for responsive applications where speed is critical, including coding assistants and real-time workflows. GPT-5.4 mini supports advanced features such as tool use, function calling, and image interpretation. It performs well on complex tasks while running significantly faster than previous mini models. The model is also suitable for subagent systems, where it handles smaller tasks within larger AI workflows. By combining speed, efficiency, and strong performance, GPT-5.4 mini enables scalable AI applications across various use cases.
19
GPT-5.2 Thinking
OpenAI
GPT-5.2 Thinking is the highest-capability configuration in OpenAI's GPT-5.2 model family, engineered for deep, expert-level reasoning, complex task execution, and advanced problem solving across long contexts and professional domains. Built on the foundational GPT-5.2 architecture with improvements in grounding, stability, and reasoning quality, this variant applies more compute and reasoning effort to generate responses that are more accurate, structured, and contextually rich when handling highly intricate workflows, multi-step analysis, and domain-specific challenges. GPT-5.2 Thinking excels at tasks that require sustained logical coherence, such as detailed research synthesis, advanced coding and debugging, complex data interpretation, strategic planning, and sophisticated technical writing, and it outperforms lighter variants on benchmarks that test professional skills and deep comprehension.
20
Kimi K2 Thinking
Moonshot AI
Kimi K2 Thinking is an advanced open source reasoning model developed by Moonshot AI, designed specifically for long-horizon, multi-step workflows where the system interleaves chain-of-thought processes with tool invocation across hundreds of sequential tasks. The model uses a mixture-of-experts architecture with a total of 1 trillion parameters, yet only about 32 billion parameters are activated per inference pass, optimizing efficiency while maintaining vast capacity. It supports a context window of up to 256,000 tokens, enabling the handling of extremely long inputs and reasoning chains without losing coherence. Native INT4 quantization is built in, which reduces inference latency and memory usage without performance degradation. Kimi K2 Thinking is explicitly built for agentic workflows; it can autonomously call external tools, manage sequential logic steps (typically 200-300 tool calls in a single chain), and maintain consistent reasoning.
Starting Price: Free
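The efficiency claims above follow directly from the quoted numbers. A quick sanity check of the 1T total / ~32B active MoE split and the INT4 weight storage (using only figures from the description; the fp16 baseline of 2 bytes per parameter is a standard assumption, not stated in the listing):

```python
total_params  = 1_000_000_000_000   # 1 trillion parameters (quoted)
active_params = 32_000_000_000      # ~32 billion active per pass (quoted)

# Only a small slice of the expert weights participates in each token.
active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per inference pass")  # 3.2%

# INT4 packs two 4-bit weights per byte (0.5 bytes/param) vs. 2 bytes/param for fp16.
fp16_gb = total_params * 2   / 1e9
int4_gb = total_params * 0.5 / 1e9
print(f"weights: fp16 ~ {fp16_gb:.0f} GB, INT4 ~ {int4_gb:.0f} GB")  # 2000 GB vs 500 GB
```

So despite the trillion-parameter capacity, per-token compute scales with the ~3% active slice, and native INT4 cuts weight storage by 4x versus an fp16 baseline.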
21
GPT-5.4 nano
OpenAI
GPT-5.4 nano is a lightweight and highly efficient AI model designed for fast, cost-effective task execution. It is optimized for simple and high-volume tasks such as classification, data extraction, and basic coding support. The model delivers quick responses with minimal latency, making it ideal for real-time and large-scale applications. GPT-5.4 nano improves significantly over previous nano models in both performance and efficiency. It supports essential capabilities like tool use and structured data processing. The model is commonly used as a supporting component within larger AI systems. By focusing on speed and affordability, GPT-5.4 nano enables scalable automation across various workflows.
22
GPT‑5.4 Thinking
OpenAI
GPT-5.4 Thinking is an advanced reasoning-focused AI model available within ChatGPT, designed to help users complete complex professional tasks more effectively. It combines improvements in reasoning, coding, and agent-based workflows to provide more accurate and reliable outputs. The model can present an upfront outline of its reasoning process, allowing users to adjust instructions while it is generating a response. This capability helps produce results that better align with user goals without requiring multiple follow-up prompts. GPT-5.4 Thinking also improves deep web research, enabling it to locate and synthesize information from multiple sources more efficiently. With stronger context management, it can handle longer conversations and complex problem-solving tasks with greater coherence. These capabilities make GPT-5.4 Thinking well suited for professional knowledge work and advanced analytical tasks.
23
GPT-4o mini
OpenAI
A small model with superior textual intelligence and multimodal reasoning. GPT-4o mini enables a broad range of tasks with its low cost and latency, such as applications that chain or parallelize multiple model calls (e.g., calling multiple APIs), pass a large volume of context to the model (e.g., full code base or conversation history), or interact with customers through fast, real-time text responses (e.g., customer support chatbots). Today, GPT-4o mini supports text and vision in the API, with support for text, image, video and audio inputs and outputs coming in the future. The model has a context window of 128K tokens, supports up to 16K output tokens per request, and has knowledge up to October 2023. Thanks to the improved tokenizer shared with GPT-4o, handling non-English text is now even more cost effective.
24
GLM-4.7-FlashX
Z.ai
GLM-4.7 FlashX is a lightweight, high-speed version of the GLM-4.7 large language model created by Z.ai. It balances efficiency and performance for real-time AI tasks across English and Chinese while offering the core capabilities of the broader GLM-4.7 family in a more resource-friendly package. Positioned alongside GLM-4.7 and GLM-4.7 Flash, it delivers optimized agentic coding and general language understanding with faster response times and lower resource needs, making it suitable for applications that require rapid inference without heavy infrastructure. As part of the GLM-4.7 model series, it inherits the model's strengths in programming, multi-step reasoning, and robust conversational understanding, and it supports long contexts for complex tasks while remaining lightweight enough for deployment with constrained compute budgets.
Starting Price: $0.07 per 1M tokens
25
Ministral 8B
Mistral AI
Mistral AI has introduced two advanced models for on-device computing and edge applications, named "les Ministraux": Ministral 3B and Ministral 8B. These models excel in knowledge, commonsense reasoning, function-calling, and efficiency within the sub-10B parameter range. They support up to 128k context length and are designed for various applications, including on-device translation, offline smart assistants, local analytics, and autonomous robotics. Ministral 8B features an interleaved sliding-window attention pattern for faster and more memory-efficient inference. Both models can function as intermediaries in multi-step agentic workflows, handling tasks like input parsing, task routing, and API calls based on user intent with low latency and cost. Benchmark evaluations indicate that les Ministraux consistently outperform comparable models across multiple tasks. As of October 16, 2024, both models are available, with Ministral 8B priced at $0.10 per million tokens.
Starting Price: Free
26
Orpheus TTS
Canopy Labs
Canopy Labs has introduced Orpheus, a family of state-of-the-art speech large language models (LLMs) designed for human-level speech generation. These models are built on the Llama-3 architecture and are trained on over 100,000 hours of English speech data, enabling them to produce natural intonation, emotion, and rhythm that surpasses current state-of-the-art closed-source models. Orpheus supports zero-shot voice cloning, allowing users to replicate voices without prior fine-tuning, and offers guided emotion and intonation control through simple tags. The models achieve low latency, with approximately 200ms streaming latency for real-time applications, reducible to around 100ms with input streaming. Canopy Labs has released both pre-trained and fine-tuned 3B-parameter models under the permissive Apache 2.0 license, with plans to release smaller models of 1B, 400M, and 150M parameters for use on resource-constrained devices.
27
Seed2.0 Mini
ByteDance
Seed2.0 Mini is the smallest member of ByteDance's Seed2.0 series of general-purpose multimodal agent models, designed for high-throughput inference and dense deployment while retaining the core strengths of its larger siblings in multimodal understanding and instruction following. Part of a family that also includes Pro and Lite, the Mini variant is optimized for high-concurrency and batch generation workloads, making it suitable for applications where efficient processing of many requests at scale matters as much as capability. Like other Seed2.0 models, it benefits from systematic enhancements in visual reasoning, motion perception, structured extraction from complex inputs like text and images, and reliable execution of multi-step instructions, but it trades some raw reasoning and output quality for faster, more cost-effective inference and better deployment efficiency.
28
Grok 4 Fast
xAI
Grok 4 Fast is the latest AI model from xAI, engineered to deliver rapid and efficient query processing. It improves upon earlier versions with faster response times, lower latency, and higher accuracy across a variety of topics. With enhanced natural language understanding, the model excels in both casual conversation and complex problem-solving. A key feature is its real-time data analysis capability, ensuring users receive up-to-date insights when needed. Grok 4 Fast is accessible across multiple platforms, including Grok, X, and mobile apps for iOS and Android. By combining speed, reliability, and scalability, it offers an ideal solution for anyone seeking instant, intelligent answers.
29
GPT-4.1 mini
OpenAI
GPT-4.1 mini is a compact version of OpenAI's powerful GPT-4.1 model, designed to provide high performance while significantly reducing latency and cost. With a smaller size and optimized architecture, GPT-4.1 mini still delivers impressive results in tasks such as coding, instruction following, and long-context processing. It supports up to 1 million tokens of context, making it an efficient solution for applications that require fast responses without sacrificing accuracy or depth.
Starting Price: $0.40 per 1M tokens (input)
30
Grounded Language Model (GLM)
Contextual AI
Contextual AI introduces its Grounded Language Model (GLM), engineered specifically to minimize hallucinations and deliver highly accurate, source-based responses for retrieval-augmented generation (RAG) and agentic applications. The GLM prioritizes faithfulness to the provided data, ensuring responses are grounded in specific knowledge sources and backed by inline citations. With state-of-the-art performance on the FACTS groundedness benchmark, the GLM outperforms other foundation models in scenarios requiring high accuracy and reliability. The model is designed for enterprise use cases like customer service, finance, and engineering, where trustworthy and precise responses are critical to minimizing risks and improving decision-making.
31
LFM2
Liquid AI
LFM2 is a next-generation series of on-device foundation models built to deliver the fastest generative-AI experience across a wide range of endpoints. It employs a new hybrid architecture that achieves up to 2x faster decode and prefill performance than comparable models, and up to 3x improvements in training efficiency compared to the previous generation. These models strike an optimal balance of quality, latency, and memory for deployment on embedded systems, allowing real-time, on-device AI across smartphones, laptops, vehicles, wearables, and other endpoints, enabling millisecond inference, device resilience, and full data sovereignty. Available in three dense checkpoints (0.35B, 0.7B, and 1.2B parameters), LFM2 demonstrates benchmark performance that outperforms similarly sized models in tasks such as knowledge recall, mathematics, multilingual instruction-following, and conversational dialogue evaluations.
32
Ministral 3B
Mistral AI
Mistral AI introduced two state-of-the-art models for on-device computing and edge use cases, named "les Ministraux": Ministral 3B and Ministral 8B. These models set a new frontier in knowledge, commonsense reasoning, function-calling, and efficiency in the sub-10B category. They can be used or tuned for various applications, from orchestrating agentic workflows to creating specialist task workers. Both models support up to 128k context length (currently 32k on vLLM), and Ministral 8B features a special interleaved sliding-window attention pattern for faster and memory-efficient inference. These models were built to provide a compute-efficient and low-latency solution for scenarios such as on-device translation, internet-less smart assistants, local analytics, and autonomous robotics. Used in conjunction with larger language models like Mistral Large, les Ministraux also serve as efficient intermediaries for function-calling in multi-step agentic workflows.
Starting Price: Free
33
Seed2.0 Lite
ByteDance
Seed2.0 Lite is part of ByteDance's Seed2.0 family of general-purpose multimodal AI agent models designed to handle complex, real-world tasks with a balanced focus on performance and efficiency. It offers enhanced multimodal understanding and instruction-following capabilities compared with earlier Seed models, enabling it to process and reason about text, visual elements, and structured information reliably for production-grade applications. As a mid-sized model in the series, Lite is optimized to deliver good quality outputs with responsive performance at lower cost and faster inference than the Pro variant while surpassing the previous generation's capabilities, making it suitable for workflows that require stable reasoning, long-context understanding, and multimodal task execution without needing the highest possible raw performance.
34
Qwen3
Alibaba
Qwen3, the latest iteration of the Qwen family of large language models, introduces groundbreaking features that enhance performance across coding, math, and general capabilities. With models like the Qwen3-235B-A22B and Qwen3-30B-A3B, Qwen3 achieves impressive results compared to top-tier models, thanks to its hybrid thinking modes that allow users to control the balance between deep reasoning and quick responses. The platform supports 119 languages and dialects, making it an ideal choice for global applications. Its pre-training process, which uses 36 trillion tokens, enables robust performance, and advanced reinforcement learning (RL) techniques continue to refine its capabilities. Available on platforms like Hugging Face and ModelScope, Qwen3 offers a powerful tool for developers and researchers working in diverse fields.
Starting Price: Free
35
Reka Flash 3
Reka
Reka Flash 3 is a 21-billion-parameter multimodal AI model developed by Reka AI, designed to excel in general chat, coding, instruction following, and function calling. It processes and reasons with text, images, video, and audio inputs, offering a compact, general-purpose solution for various applications. Trained from scratch on diverse datasets, including publicly accessible and synthetic data, Reka Flash 3 underwent instruction tuning on curated, high-quality data to optimize performance. The final training stage involved reinforcement learning using REINFORCE Leave One-Out (RLOO) with both model-based and rule-based rewards, enhancing its reasoning capabilities. With a context length of 32,000 tokens, Reka Flash 3 performs competitively with proprietary models like OpenAI's o1-mini, making it suitable for low-latency or on-device deployments. The model's full precision requires 39GB (fp16), but it can be compressed to as small as 11GB using 4-bit quantization.
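The 39GB and 11GB figures quoted above follow from the parameter count and bytes-per-weight arithmetic. A quick check, using the standard 2 bytes/parameter for fp16 and 0.5 bytes/parameter for 4-bit weights (the small gap to the quoted 11GB is plausibly quantization metadata and unquantized layers, which is an assumption, not something the listing states):

```python
params = 21_000_000_000  # 21B parameters, per the description

# fp16: 2 bytes per parameter; 4-bit: half a byte per parameter.
fp16_gib = params * 2 / 2**30    # ~39.1 GiB, matching the quoted 39GB
q4_gib   = params * 0.5 / 2**30  # ~9.8 GiB; the quoted 11GB includes overhead
print(f"fp16 ~ {fp16_gib:.1f} GiB, 4-bit ~ {q4_gib:.1f} GiB")
```

This is why 4-bit quantization is what brings a 21B model within reach of a single consumer GPU or high-memory laptop.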
36
GLM-5-Turbo
Z.ai
GLM-5-Turbo is a high-speed variant of Z.ai's GLM-5 model, designed to deliver efficient and stable performance in agent-driven environments while maintaining strong reasoning and coding capabilities. It is optimized for high-throughput workloads, particularly long-chain agent tasks where multiple steps, tools, and decisions must be executed in sequence with reliability and low latency. It supports advanced agentic workflows, enabling systems to perform multi-step planning, tool calling, and task execution with improved responsiveness compared to larger flagship models. GLM-5-Turbo inherits core capabilities from the GLM-5 family, including strong reasoning, coding performance, and support for long-context processing, while prioritizing core production requirements such as speed, efficiency, and stability. It is designed to integrate with agent frameworks like OpenClaw, where it can coordinate actions, process inputs, and execute tasks.
Starting Price: Free
37
Sarvam 105B
Sarvam
Sarvam-105B is the flagship large language model in Sarvam's open source model family, designed to deliver high-performance reasoning, multilingual understanding, and agent-based execution within a single scalable system. Built as a Mixture-of-Experts (MoE) model with approximately 105 billion total parameters, of which only a fraction are activated per token, it achieves strong computational efficiency while maintaining high capability across complex tasks. The model is optimized for advanced reasoning, coding, mathematics, and agentic workflows, making it suitable for tasks that require multi-step problem solving and structured outputs rather than simple conversational responses. Sarvam-105B supports long-context processing of up to around 128K tokens, enabling it to handle large documents, extended conversations, and deep analytical queries without losing coherence.
Starting Price: Free
38
Marco-o1
AIDC-AI
Marco-o1 is a robust, next-generation AI model tailored for high-performance natural language processing and real-time problem-solving. It is engineered to deliver precise and contextually rich responses, combining deep language comprehension with a streamlined architecture for speed and efficiency. Marco-o1 excels in a variety of applications, including conversational AI, content creation, technical support, and decision-making tasks, adapting seamlessly to diverse user needs. With a focus on intuitive interactions, reliability, and ethical AI principles, Marco-o1 stands out as a cutting-edge solution for individuals and organizations seeking intelligent, adaptive, and scalable AI-driven tools. Internally, Marco-o1 uses Monte Carlo Tree Search (MCTS) to explore multiple reasoning paths, with confidence scores derived from the softmax-applied log probabilities of the top-k alternative tokens guiding the model toward optimal solutions.Starting Price: Free -
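The confidence score described above can be sketched as follows: renormalize the top-k token log probabilities with a softmax and take the resulting mass of the chosen (highest-probability) token. This is an illustrative reconstruction from the description, not Marco-o1's exact implementation:

```python
import math

def topk_confidence(logprobs, k=5):
    """Confidence of the chosen token: apply a softmax over the top-k
    alternatives' log probabilities and return the renormalized
    probability mass of the best candidate."""
    top = sorted(logprobs, reverse=True)[:k]
    exps = [math.exp(lp) for lp in top]
    return exps[0] / sum(exps)
```

When the top-k candidates already account for nearly all probability mass, this score approaches the raw token probability; when the alternatives are close in likelihood, it drops toward 1/k, flagging an uncertain reasoning step for the search to explore.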
39
Llama 3.3
Meta
Llama 3.3 is the latest iteration in the Llama series of language models, developed to push the boundaries of AI-powered understanding and communication. With enhanced contextual reasoning, improved language generation, and advanced fine-tuning capabilities, Llama 3.3 is designed to deliver highly accurate, human-like responses across diverse applications. This version features a larger training dataset, refined algorithms for nuanced comprehension, and reduced biases compared to its predecessors. Llama 3.3 excels in tasks such as natural language understanding, creative writing, technical explanation, and multilingual communication, making it an indispensable tool for businesses, developers, and researchers. Its modular architecture allows for customizable deployment in specialized domains, ensuring versatility and performance at scale.Starting Price: Free -
40
Qwen3.6-27B
Alibaba
Qwen3.6-27B is a dense, open source multimodal language model in the Qwen3.6 series, designed to deliver flagship-level performance in coding, reasoning, and agent-based workflows while maintaining a relatively efficient parameter size of 27 billion. It is positioned as a high-performance general model that “punches above its weight,” achieving results competitive with or superior to significantly larger models on key benchmarks, particularly in agentic coding tasks. It supports both thinking and non-thinking modes, allowing it to dynamically balance deep reasoning with fast responses depending on the task, and integrates capabilities across text and multimodal inputs such as images and video. Built as part of the Qwen3.6 family, the model emphasizes real-world usability, stability, and developer productivity, incorporating improvements driven by community feedback and practical deployment needs.Starting Price: Free -
41
Qwen3-Max-Thinking
Alibaba
Qwen3-Max-Thinking is Alibaba’s latest flagship reasoning-enhanced large language model, built as an extension of the Qwen3-Max family and designed to deliver state-of-the-art analytical performance and multi-step reasoning. It scales up from one of the largest parameter bases in the Qwen ecosystem and incorporates advanced reinforcement learning and adaptive tool integration, allowing the model to invoke search, memory, and code interpreter functions dynamically during inference. This lets it address difficult multi-stage tasks with higher accuracy and contextual depth than standard generative responses. Qwen3-Max-Thinking introduces a Thinking Mode that exposes deliberate, step-by-step reasoning before final outputs, enabling transparency and traceability of logical chains, and can be tuned with configurable “thinking budgets” to balance output quality against computational cost. -
42
MiniMax-M2.1
MiniMax
MiniMax-M2.1 is an open-source, agentic large language model designed for advanced coding, tool use, and long-horizon planning. It was released to the community to make high-performance AI agents more transparent, controllable, and accessible. The model is optimized for robustness in software engineering, instruction following, and complex multi-step workflows. MiniMax-M2.1 supports multilingual development and performs strongly across real-world coding scenarios. It is suitable for building autonomous applications that require reasoning, planning, and execution. The model weights are fully open, enabling local deployment and customization. MiniMax-M2.1 represents a major step toward democratizing top-tier agent capabilities.Starting Price: Free -
43
Gemini 2.0 Flash
Google
The Gemini 2.0 Flash AI model represents the next generation of high-speed, intelligent computing, designed to set new benchmarks in real-time language processing and decision-making. Building on the robust foundation of its predecessor, it incorporates enhanced neural architecture and breakthrough advancements in optimization, enabling even faster and more accurate responses. Gemini 2.0 Flash is designed for applications requiring instantaneous processing and adaptability, such as live virtual assistants, automated trading systems, and real-time analytics. Its lightweight, efficient design ensures seamless deployment across cloud, edge, and hybrid environments, while its improved contextual understanding and multitasking capabilities make it a versatile tool for tackling complex, dynamic workflows with precision and speed. -
44
Tune AI
NimbleBox
Leverage the power of custom models to build your competitive advantage. With our enterprise Gen AI stack, offload manual tasks to powerful assistants instantly – the sky is the limit. For enterprises where data security is paramount, fine-tune and deploy generative AI models securely on your own cloud. -
45
LFM-40B
Liquid AI
LFM-40B offers a new balance between model size and output quality. Of its 40B total parameters, only 12B are activated at inference time. Its performance is comparable to that of larger models, while its MoE architecture enables higher throughput and deployment on more cost-effective hardware. -
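The Mixture-of-Experts tradeoff behind both LFM-40B and Sarvam-105B above can be sketched numerically: per-token compute scales with the activated parameters, while memory scales with the total, so a sparse model pays for its extra capacity mainly in RAM. The figures below use LFM-40B's stated 12B-of-40B activation:

```python
def moe_stats(total_b: float, active_b: float) -> dict:
    """Rough MoE accounting: per-token FLOPs track the active parameter
    count; weight memory tracks the total parameter count."""
    return {
        # fraction of the compute a dense model of the same total size would need
        "compute_vs_dense": active_b / total_b,
        "active_params_b": active_b,
        "total_params_b": total_b,
    }

stats = moe_stats(40, 12)  # 12B of 40B parameters active per token
```

Under this rough accounting, each forward pass costs about 30% of an equally sized dense model's compute, which is the source of the higher-throughput, cheaper-hardware claim.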
46
LearnLM
Google
LearnLM is an experimental, task-specific model designed to align with learning science principles for teaching and learning applications. It is trained to respond to system instructions like "You are an expert tutor," and is capable of inspiring active learning by encouraging practice and providing timely feedback. The model effectively manages cognitive load by presenting relevant, well-structured information across multiple modalities, while dynamically adapting to the learner’s goals and needs, grounding responses in appropriate materials. LearnLM also stimulates curiosity, motivating learners throughout their educational journey, and supports metacognition by helping learners plan, monitor, and reflect on their progress. This innovative model is available for experimentation in AI Studio.Starting Price: Free -
47
Octave TTS
Hume AI
Hume AI has introduced Octave (Omni-capable Text and Voice Engine), a groundbreaking text-to-speech system that leverages large language model technology to understand and interpret the context of words, enabling it to generate speech with appropriate emotions, rhythm, and cadence. Unlike traditional TTS models that merely read text, Octave acts like a human actor, delivering lines with nuanced expression based on the content. Users can create diverse AI voices by providing descriptive prompts, such as "a sarcastic medieval peasant," allowing for tailored voice generation that aligns with specific character traits or scenarios. Additionally, Octave offers the flexibility to modify the emotional delivery and speaking style through natural language instructions, enabling commands like "sound more enthusiastic" or "whisper fearfully" to fine-tune the output.Starting Price: $3 per month -
48
Selene 1
atla
Atla's Selene 1 API offers state-of-the-art AI evaluation models, enabling developers to define custom evaluation criteria and obtain precise judgments on their AI applications' performance. Selene outperforms frontier models on commonly used evaluation benchmarks, ensuring accurate and reliable assessments. Users can customize evaluations to their specific use cases through the Alignment Platform, allowing for fine-grained analysis and tailored scoring formats. The API provides actionable critiques alongside accurate evaluation scores, facilitating seamless integration into existing workflows. Pre-built metrics, such as relevance, correctness, helpfulness, faithfulness, logical coherence, and conciseness, are available to address common evaluation scenarios, including detecting hallucinations in retrieval-augmented generation applications or comparing outputs to ground truth data. -
49
BitNet
Microsoft
The BitNet b1.58 2B4T is a cutting-edge 1-bit Large Language Model (LLM) developed by Microsoft, designed to enhance computational efficiency while maintaining high performance. This model, built with approximately 2 billion parameters and trained on 4 trillion tokens, uses innovative quantization techniques to optimize memory usage, energy consumption, and latency. The model is particularly valuable for applications in AI-powered text generation, offering substantial efficiency gains compared to full-precision models.Starting Price: Free -
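The "b1.58" in the name refers to ternary weights: each weight is constrained to {-1, 0, +1}, and a three-valued symbol carries log2(3) ≈ 1.58 bits of information. A quick sketch of what that implies for the 2B-parameter weight footprint (a theoretical lower bound; real packing schemes and non-quantized layers add overhead):

```python
import math

# Ternary weights in {-1, 0, +1}: log2(3) ≈ 1.58 bits per weight,
# hence "b1.58" in the model name.
ternary_bits = math.log2(3)

def footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Theoretical weight-storage footprint in decimal GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

ternary_gb = footprint_gb(2, ternary_bits)  # ~0.40 GB for 2B ternary weights
fp16_gb = footprint_gb(2, 16)               # 4.0 GB at full fp16 precision
```

That roughly 10x reduction versus fp16 is where the memory, energy, and latency gains claimed above originate.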
50
MiniMax M2.5
MiniMax
MiniMax M2.5 is a frontier AI model engineered for real-world productivity across coding, agentic workflows, search, and office tasks. Extensively trained with reinforcement learning in hundreds of thousands of real-world environments, it achieves state-of-the-art performance in benchmarks such as SWE-Bench Verified and BrowseComp. The model demonstrates strong architectural thinking, decomposing complex problems before generating code across more than ten programming languages. M2.5 operates at high throughput speeds of up to 100 tokens per second, enabling faster completion of multi-step tasks. It is optimized for efficient reasoning, reducing token usage and execution time compared to previous versions. With dramatically lower pricing than competing frontier models, it delivers powerful performance at minimal cost. Integrated into MiniMax Agent, M2.5 supports professional-grade office workflows, financial modeling, and autonomous task execution.Starting Price: Free