Alternatives to ALBERT

Compare ALBERT alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to ALBERT in 2026. Compare features, ratings, user reviews, pricing, and more from ALBERT competitors and alternatives in order to make an informed decision for your business.

  • 1
    Google AI Studio
    Google AI Studio is a unified development platform that helps teams explore, build, and deploy applications using Google’s most advanced AI models, including Gemini 3. It brings text, image, audio, and video models together in one interactive playground. With vibe coding, developers can use natural language to quickly turn ideas into working AI applications. The platform reduces friction by generating functional apps that are ready for deployment with minimal setup. Built-in integrations like Google Search enhance real-world use cases. Google AI Studio also centralizes API key management, usage monitoring, and billing. It offers a fast, intuitive path from prompt to production powered by vibe coding workflows.
  • 2
    RoBERTa
    RoBERTa builds on BERT’s language masking strategy, wherein the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa, which was implemented in PyTorch, modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective, and training with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance. We also explore training RoBERTa on an order of magnitude more data than BERT, for a longer amount of time. We used existing unannotated NLP datasets as well as CC-News, a novel set drawn from public news articles.
    Starting Price: Free
  • 3
    InstructGPT
    InstructGPT is a family of language models from OpenAI, fine-tuned from GPT-3 to follow natural language instructions. Training combines supervised fine-tuning on human-written demonstrations with reinforcement learning from human feedback (RLHF), in which a reward model trained on human preference rankings guides further optimization. Compared with the base GPT-3 models, InstructGPT outputs were preferred by human evaluators and showed improvements in truthfulness and reductions in toxic output.
    Starting Price: $0.0200 per 1000 tokens
  • 4
    ERNIE 3.0 Titan
    Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts.
  • 5
    BERT

    Google

    BERT is a large language model and a method of pre-training language representations. Pre-training refers to how BERT is first trained on a large source of text, such as Wikipedia. You can then apply the training results to other Natural Language Processing (NLP) tasks, such as question answering and sentiment analysis. With BERT and AI Platform Training, you can train a variety of NLP models in about 30 minutes.
  • 6
    GPT-4

    OpenAI

    GPT-4 (Generative Pre-trained Transformer 4) is a large-scale language model released by OpenAI in March 2023. GPT-4 is the successor to GPT-3 and part of the GPT-n series of natural language processing models, trained to produce human-like text generation and understanding capabilities. Unlike most earlier NLP models, GPT-4 does not require additional training data for specific tasks. Instead, it can generate text or answer questions using only the prompt it is given as input. GPT-4 has been shown to perform a wide variety of tasks without any task-specific training data, such as translation, summarization, question answering, sentiment analysis, and more.
    Starting Price: $0.0200 per 1000 tokens
  • 7
    VideoPoet
    VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It contains a few simple components. An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence. A mixture of multimodal generative learning objectives are introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities. This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency.
  • 8
    Azure OpenAI Service
    Apply advanced coding and language models to a variety of use cases. Leverage large-scale, generative AI models with deep understandings of language and code to enable new reasoning and comprehension capabilities for building cutting-edge applications. Apply these coding and language models to a variety of use cases, such as writing assistance, code generation, and reasoning over data. Detect and mitigate harmful use with built-in responsible AI and access enterprise-grade Azure security. Gain access to generative models that have been pretrained with trillions of words. Apply them to new scenarios including language, code, reasoning, inferencing, and comprehension. Customize generative models with labeled data for your specific scenario using a simple REST API. Fine-tune your model's hyperparameters to increase accuracy of outputs. Use the few-shot learning capability to provide the API with examples and achieve more relevant results.
    Starting Price: $0.0004 per 1000 tokens
  • 9
    LUIS

    Microsoft

    Language Understanding (LUIS): A machine learning-based service to build natural language into apps, bots, and IoT devices. Quickly create enterprise-ready, custom models that continuously improve. Add natural language to your apps. Designed to identify valuable information in conversations, LUIS interprets user goals (intents) and distills valuable information from sentences (entities), for a high quality, nuanced language model. LUIS integrates seamlessly with the Azure Bot Service, making it easy to create a sophisticated bot. Powerful developer tools are combined with customizable pre-built apps and entity dictionaries, such as Calendar, Music, and Devices, so you can build and deploy a solution more quickly. Dictionaries are mined from the collective knowledge of the web and supply billions of entries, helping your model to correctly identify valuable information from user conversations. Active learning is used to continuously improve the quality of the models.
  • 10
    BLOOM

    BigScience

    BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.
  • 11
    T5

    Google

    With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). We can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
  • 12
    Qwen-7B

    Alibaba

    Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. The features of the Qwen-7B series include: Trained with high-quality pretraining data. We have pretrained Qwen-7B on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain text and code, and it covers a wide range of domains, including general domain data and professional domain data. Strong performance. Compared with models of similar size, Qwen-7B outperforms competitors on a series of benchmark datasets that evaluate natural language understanding, mathematics, coding, etc. And more.
    Starting Price: Free
  • 13
    Amazon Nova
    Amazon Nova is a new generation of state-of-the-art (SOTA) foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance, available exclusively on Amazon Bedrock. Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro are understanding models that accept text, image, or video inputs and generate text output. They provide a broad selection of capability, accuracy, speed, and cost operation points. Amazon Nova Micro is a text-only model that delivers the lowest latency responses at very low cost. Amazon Nova Lite is a very low-cost multimodal model that is lightning fast for processing image, video, and text inputs. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Pro’s capabilities, coupled with its industry-leading speed and cost efficiency, make it a compelling model for almost any task, including video summarization, Q&A, math & more.
  • 14
    XLNet

    XLNet

    XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.
    Starting Price: Free
  • 15
    mT5

    Google

    Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. This repo can be used to reproduce the experiments in the mT5 paper. mT5 is pretrained on the mC4 corpus, covering 101 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, and more.
    Starting Price: Free
  • 16
    Olmo 3
    Olmo 3 is a fully open model family spanning 7 billion and 32 billion parameter variants that delivers not only high-performing base, reasoning, instruction, and reinforcement-learning models, but also exposure of the entire model flow, including raw training data, intermediate checkpoints, training code, long-context support (65,536 token window), and provenance tooling. Starting with the Dolma 3 dataset (≈9 trillion tokens) and its disciplined mix of web text, scientific PDFs, code, and long-form documents, the pre-training, mid-training, and long-context phases shape the base models, which are then post-trained via supervised fine-tuning, direct preference optimisation, and RL with verifiable rewards to yield the Think and Instruct variants. The 32 B Think model is described as the strongest fully open reasoning model to date, competitively close to closed-weight peers in math, code, and complex reasoning.
    Starting Price: Free
  • 17
    Reka

    Reka

    Our enterprise-grade multimodal assistant is carefully designed with privacy, security, and efficiency in mind. We train Yasa to read text, images, videos, and tabular data, with more modalities to come. Use it to generate ideas for creative tasks, get answers to basic questions, or derive insights from your internal data. Generate, train, compress, or deploy on-premise with a few simple commands. Use our proprietary algorithms to personalize our model to your data and use cases. We design proprietary algorithms involving retrieval, fine-tuning, self-supervised instruction tuning, and reinforcement learning to tune our model on your datasets.
  • 18
    Baichuan-13B

    Baichuan Intelligent Technology

    Baichuan-13B is an open source and commercially available large-scale language model containing 13 billion parameters, developed by Baichuan Intelligent following Baichuan-7B. It has achieved the best results among models of the same size on authoritative Chinese and English benchmarks. This release contains two versions: a pre-trained model (Baichuan-13B-Base) and an aligned model (Baichuan-13B-Chat). Larger size, more data: Baichuan-13B expands the number of parameters to 13 billion on the basis of Baichuan-7B and is trained on 1.4 trillion tokens of high-quality corpus, which is 40% more than LLaMA-13B. It is currently the open source model with the largest amount of training data at the 13B size. It supports both Chinese and English, uses ALiBi position encoding, and has a context window length of 4096.
    Starting Price: Free
  • 19
    Yi-Lightning

    Yi-Lightning

    Yi-Lightning, developed by 01.AI under the leadership of Kai-Fu Lee, represents the latest advancement in large language models with a focus on high performance and cost-efficiency. It boasts a maximum context length of 16K tokens and is priced at $0.14 per million tokens for both input and output, making it remarkably competitive. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, incorporating fine-grained expert segmentation and advanced routing strategies, which contribute to its efficiency in training and inference. This model has excelled in various domains, achieving top rankings in categories like Chinese, math, coding, and hard prompts on the chatbot arena, where it secured the 6th position overall and 9th in style control. Its development included comprehensive pre-training, supervised fine-tuning, and reinforcement learning from human feedback, ensuring both performance and safety, with optimizations in memory usage and inference speed.
  • 20
    Alpa

    Alpa

    Alpa aims to automate large-scale distributed training and serving with just a few lines of code. Alpa was initially developed by researchers in the Sky Lab at UC Berkeley. Some advanced techniques used in Alpa are described in a paper published at OSDI 2022. The Alpa community is growing with new contributors from Google. A language model is a probability distribution over sequences of words. It predicts the next word based on all the previous words. It is useful for a variety of AI applications, such as auto-completion in your email or chatbot service. For more information, check out the language model Wikipedia page. GPT-3 is a very large language model, with 175 billion parameters, that uses deep learning to produce human-like text. Many researchers and news articles described GPT-3 as "one of the most interesting and important AI systems ever produced". GPT-3 is gradually being used as a backbone in the latest NLP research and applications.
    Starting Price: Free
  • 21
    AudioLM

    Google

    AudioLM is a pure audio language model that generates high‑fidelity, long‑term coherent speech and piano music by learning from raw audio alone, without requiring any text transcripts or symbolic representations. It represents audio hierarchically using two types of discrete tokens, semantic tokens extracted from a self‑supervised model to capture phonetic or melodic structure and global context, and acoustic tokens from a neural codec to preserve speaker characteristics and fine waveform details, and chains three Transformer stages to predict first semantic tokens for high‑level structure, then coarse and finally fine acoustic tokens for detailed synthesis. The resulting pipeline allows AudioLM to condition on a few seconds of input audio and produce seamless continuations that retain voice identity, prosody, and recording conditions in speech or melody, harmony, and rhythm in music. Human evaluations show that synthetic continuations are nearly indistinguishable from real recordings.
  • 22
    DeepSeek-V2

    DeepSeek

    DeepSeek-V2 is a state-of-the-art Mixture-of-Experts (MoE) language model introduced by DeepSeek-AI, characterized by its economical training and efficient inference capabilities. With a total of 236 billion parameters, of which only 21 billion are active per token, it supports a context length of up to 128K tokens. DeepSeek-V2 employs innovative architectures like Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache and DeepSeekMoE for cost-effective training through sparse computation. This model significantly outperforms its predecessor, DeepSeek 67B, by saving 42.5% in training costs, reducing the KV cache by 93.3%, and enhancing generation throughput by 5.76 times. Pretrained on an 8.1 trillion token corpus, DeepSeek-V2 excels in language understanding, coding, and reasoning tasks, making it a top-tier performer among open-source models.
    Starting Price: Free
  • 23
    Llama 3.2
    The open-source AI model you can fine-tune, distill and deploy anywhere is now available in more versions. Choose from 1B, 3B, 11B or 90B, or continue building with Llama 3.1. Llama 3.2 is a collection of large language models (LLMs) pretrained and fine-tuned in 1B and 3B sizes that are multilingual text only, and 11B and 90B sizes that take both text and image inputs and output text. Develop highly performant and efficient applications from our latest release. Use our 1B or 3B models for on-device applications such as summarizing a discussion from your phone or calling on-device tools like calendar. Use our 11B or 90B models for image use cases such as transforming an existing image into something new or getting more information from an image of your surroundings.
    Starting Price: Free
  • 24
    ByteDance Seed
    Seed Diffusion Preview is a large-scale, code-focused language model that uses discrete-state diffusion to generate code non-sequentially, achieving dramatically faster inference without sacrificing quality by decoupling generation from the token-by-token bottleneck of autoregressive models. It combines a two-stage curriculum, mask-based corruption followed by edit-based augmentation, to robustly train a standard dense Transformer, striking a balance between speed and accuracy and avoiding shortcuts like carry-over unmasking to preserve principled density estimation. The model delivers an inference speed of 2,146 tokens/sec on H20 GPUs, outperforming contemporary diffusion baselines while matching or exceeding their accuracy on standard code benchmarks, including editing tasks, thereby establishing a new speed-quality Pareto frontier and demonstrating discrete diffusion’s practical viability for real-world code generation.
    Starting Price: Free
  • 25
    OPT

    Meta

    Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
  • 26
    Llama 2
    The next generation of our open source large language model. This release includes model weights and starting code for pretrained and fine-tuned Llama language models, ranging from 7B to 70B parameters. Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length of Llama 1. Its fine-tuned models have been trained on over 1 million human annotations. Llama 2 outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests. Llama 2 was pretrained on publicly available online data sources. The fine-tuned model, Llama-2-chat, leverages publicly available instruction datasets and over 1 million human annotations. We have a broad range of supporters around the world who believe in our open approach to today’s AI: companies that have given early feedback and are excited to build with Llama 2.
    Starting Price: Free
  • 27
    ERNIE Bot
    ERNIE Bot is an AI-powered conversational assistant developed by Baidu, designed to facilitate seamless and natural interactions with users. Built on the ERNIE (Enhanced Representation through Knowledge Integration) model, ERNIE Bot excels at understanding complex queries and generating human-like responses across various domains. Its capabilities include processing text, generating images, and engaging in multimodal communication, making it suitable for a wide range of applications such as customer support, virtual assistants, and enterprise automation. With its advanced contextual understanding, ERNIE Bot offers an intuitive and efficient solution for businesses seeking to enhance their digital interactions and automate workflows.
    Starting Price: Free
  • 28
    CodeQwen

    Alibaba

    CodeQwen is the code version of Qwen, the large language model series developed by the Qwen team, Alibaba Cloud. It is a transformer-based decoder-only language model pre-trained on a large amount of code data. It offers strong code generation capabilities and competitive performance across a series of benchmarks, and it supports long-context understanding and generation with a context length of 64K tokens. CodeQwen supports 92 coding languages and provides excellent performance in text-to-SQL, bug fixes, etc. You can write just a few lines of code with transformers to chat with CodeQwen. Essentially, we build the tokenizer and the model from pre-trained methods, and we use the generate method to perform chatting with the help of the chat template provided by the tokenizer. We apply the ChatML template for chat models following our previous practice. The model completes the code snippets according to the given prompts, without any additional formatting.
    Starting Price: Free
  • 29
    Qwen2.5-Max
    Qwen2.5-Max is a large-scale Mixture-of-Experts (MoE) model developed by the Qwen team, pretrained on over 20 trillion tokens and further refined through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). In evaluations, it outperforms models like DeepSeek V3 in benchmarks such as Arena-Hard, LiveBench, LiveCodeBench, and GPQA-Diamond, while also demonstrating competitive results in other assessments, including MMLU-Pro. Qwen2.5-Max is accessible via API through Alibaba Cloud and can be explored interactively on Qwen Chat.
    Starting Price: Free
  • 30
    Giga ML

    Giga ML

    We just launched the X1 Large series of models. Giga ML's most powerful model is available for pre-training and fine-tuning with on-prem deployment. Since we are OpenAI compatible, your existing integrations with LangChain, LlamaIndex, and all others work seamlessly. You can continue pre-training of LLMs with domain-specific data such as books, docs, or company docs. The world of large language models (LLMs) is rapidly expanding, offering unprecedented opportunities for natural language processing across various domains. However, some critical challenges have remained unaddressed. At Giga ML, we proudly introduce the X1 Large 32k model, a pioneering on-premise LLM solution that addresses these critical issues.
  • 31
    GPT-3

    OpenAI

    Our GPT-3 models can understand and generate natural language. We offer four main models with different levels of power suitable for different tasks. Davinci is the most capable model, and Ada is the fastest. The main GPT-3 models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with other endpoints. Davinci is the most capable model family and can perform any task the other models can perform and often with less instruction. For applications requiring a lot of understanding of the content, like summarization for a specific audience and creative content generation, Davinci is going to produce the best results. These increased capabilities require more compute resources, so Davinci costs more per API call and is not as fast as the other models.
    Starting Price: $0.0200 per 1000 tokens
  • 32
    GPT-NeoX

    EleutherAI

    An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library. This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.
    Starting Price: Free
  • 33
    GPT-3.5

    OpenAI

    GPT-3.5 is the next evolution of the GPT-3 large language model from OpenAI. GPT-3.5 models can understand and generate natural language. We offer four main models with different levels of power suitable for different tasks. The main GPT-3.5 models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with other endpoints. Davinci is the most capable model family and can perform any task the other models can perform and often with less instruction. For applications requiring a lot of understanding of the content, like summarization for a specific audience and creative content generation, Davinci is going to produce the best results. These increased capabilities require more compute resources, so Davinci costs more per API call and is not as fast as the other models.
    Starting Price: $0.0200 per 1000 tokens
  • 34
    Teuken 7B

    OpenGPT-X

    Teuken-7B is a multilingual, open source language model developed under the OpenGPT-X initiative, specifically designed to cater to Europe's diverse linguistic landscape. It has been trained on a dataset comprising over 50% non-English texts, encompassing all 24 official languages of the European Union, ensuring robust performance across these languages. A key innovation in Teuken-7B is its custom multilingual tokenizer, optimized for European languages, which enhances training efficiency and reduces inference costs compared to standard monolingual tokenizers. The model is available in two versions, Teuken-7B-Base, the foundational pre-trained model, and Teuken-7B-Instruct, which has undergone instruction tuning for improved performance in following user prompts. Both versions are accessible on Hugging Face, promoting transparency and collaboration within the AI community. The development of Teuken-7B underscores a commitment to creating AI models that reflect Europe's diversity.
    Starting Price: Free
  • 35
    Orpheus TTS

    Canopy Labs

    Canopy Labs has introduced Orpheus, a family of state-of-the-art speech large language models (LLMs) designed for human-level speech generation. These models are built on the Llama-3 architecture and are trained on over 100,000 hours of English speech data, enabling them to produce natural intonation, emotion, and rhythm that surpasses current state-of-the-art closed source models. Orpheus supports zero-shot voice cloning, allowing users to replicate voices without prior fine-tuning, and offers guided emotion and intonation control through simple tags. The models achieve low latency, with approximately 200ms streaming latency for real-time applications, reducible to around 100ms with input streaming. Canopy Labs has released both pre-trained and fine-tuned 3B-parameter models under the permissive Apache 2.0 license, with plans to release smaller models of 1B, 400M, and 150M parameters for use on resource-constrained devices.
  • 36
    AI21 Studio

    AI21 Studio

    AI21 Studio provides API access to Jurassic-1 large-language-models. Our models power text generation and comprehension features in thousands of live applications. Take on any language task. Our Jurassic-1 models are trained to follow natural language instructions and require just a few examples to adapt to new tasks. Use our specialized APIs for common tasks like summarization, paraphrasing and more. Access superior results at a lower cost without reinventing the wheel. Need to fine-tune your own custom model? You're just 3 clicks away. Training is fast, affordable and trained models are deployed immediately. Give your users superpowers by embedding an AI co-writer in your app. Drive user engagement and success with features like long-form draft generation, paraphrasing, repurposing and custom auto-complete.
    Starting Price: $29 per month
  • 37
    Llama

    Meta

    Llama (Large Language Model Meta AI) is a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. Smaller, more performant models such as Llama enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field. Training smaller foundation models like Llama is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others’ work, and explore new use cases. Foundation models train on a large set of unlabeled data, which makes them ideal for fine-tuning for a variety of tasks. We are making Llama available at several sizes (7B, 13B, 33B, and 65B parameters) and also sharing a Llama model card that details how we built the model in keeping with our approach to Responsible AI practices.
  • 38
    PanGu-α

    Huawei

    PanGu-α is developed under MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-α, we collect 1.1 TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-α in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings.
  • 39
    Lemonfox.ai

    Lemonfox.ai

    Our models are deployed around the world to give you the best possible response times. Integrate our OpenAI-compatible API into your application within minutes and scale seamlessly to serve millions of users. Benefit from our extensive scale and performance optimizations, which make our API 4 times more affordable than OpenAI's GPT-3.5 API. Generate text and chat with our AI model, which delivers ChatGPT-level performance at a fraction of the cost. Harness the power of one of the most advanced AI image models to craft stunning, high-quality images, graphics, and illustrations in a few seconds.
    Starting Price: $5 per month
  • 40
    MAI-1-preview

    Microsoft

    MAI-1 Preview is Microsoft AI’s first end-to-end trained foundation model, built entirely in-house as a mixture-of-experts architecture. Pre-trained and post-trained on approximately 15,000 NVIDIA H100 GPUs, it is designed to follow instructions and generate helpful, responsive text for everyday user queries, representing a prototype of future Copilot capabilities. Now available for public testing on LMArena, MAI-1 Preview delivers an early glimpse into the platform’s trajectory, with plans to roll out select text-based applications within Copilot over the coming weeks to gather user feedback and refine performance. Microsoft reinforces that it will continue combining its own models, partner models, and developments from the open-source community to flexibly power experiences across millions of unique interactions each day.
  • 41
    GPT-J

    EleutherAI

    GPT-J is a cutting-edge language model created by the research organization EleutherAI. GPT-J performs comparably to OpenAI's GPT-3 on a range of zero-shot tasks and has demonstrated the ability to surpass GPT-3 on code generation tasks. The latest iteration, GPT-J-6B, is trained on The Pile, a publicly available dataset of 825 gibibytes of language data organized into 22 distinct subsets. While GPT-J shares certain capabilities with ChatGPT, it is not designed to operate as a chatbot; its primary function is to predict text. In March 2023, Databricks introduced Dolly, an Apache-licensed instruction-following model built on GPT-J.
    Starting Price: Free
  • 42
    Jamba

    AI21 Labs

    Jamba is the most powerful & efficient long context model, open for builders and built for the enterprise. Jamba's latency outperforms all leading models of comparable sizes. Jamba's 256k context window is the longest openly available. Jamba's Mamba-Transformer MoE architecture is designed for cost & efficiency gains. Jamba offers key features out of the box (OOTB), including function calling, JSON mode output, document objects, and citation mode. Jamba 1.5 models maintain high performance across the full length of their context window and achieve top scores across common quality benchmarks. Secure deployment that suits your enterprise. Seamlessly start using Jamba on our production-grade SaaS platform. The Jamba model family is also available for deployment through our strategic partners. We offer VPC & on-prem deployments for enterprises that require custom solutions. For enterprises with unique, bespoke requirements, we offer hands-on management, continuous pre-training, and more.
  • 43
    Mercury Coder

    Inception Labs

    Mercury, the latest innovation from Inception Labs, is the first commercial-scale diffusion large language model (dLLM), offering a 10x speed increase and significantly lower costs compared to traditional autoregressive models. Built for high-performance reasoning, coding, and structured text generation, Mercury processes over 1000 tokens per second on NVIDIA H100 GPUs, making it one of the fastest LLMs available. Unlike conventional models that generate text one token at a time, Mercury refines responses using a coarse-to-fine diffusion approach, improving accuracy and reducing hallucinations. With Mercury Coder, a specialized coding model, developers can experience cutting-edge AI-driven code generation with superior speed and efficiency.
    Starting Price: Free
  • 44
    Inception Labs

    Inception Labs

    Inception Labs is pioneering the next generation of AI with diffusion-based large language models (dLLMs), a breakthrough in AI that offers 10x faster performance and 5-10x lower cost than traditional autoregressive models. Inspired by the success of diffusion models in image and video generation, Inception’s dLLMs introduce enhanced reasoning, error correction, and multimodal capabilities, allowing for more structured and accurate text generation. With applications spanning enterprise AI, research, and content generation, Inception’s approach sets a new standard for speed, efficiency, and control in AI-driven workflows.
  • 45
    CodeGemma
    CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following. CodeGemma has 3 model variants: a 7B pre-trained variant that specializes in code completion and generation from code prefixes and/or suffixes; a 7B instruction-tuned variant for natural language-to-code chat and instruction following; and a state-of-the-art 2B pre-trained variant that provides up to 2x faster code completion. Complete lines and functions, and even generate entire blocks of code, whether you're working locally or using Google Cloud resources. Trained on 500 billion tokens of primarily English language data from web documents, mathematics, and code, CodeGemma models generate code that's not only more syntactically correct but also semantically meaningful, reducing errors and debugging time.
  • 46
    Hippocratic AI

    Hippocratic AI

    Hippocratic AI is the new state-of-the-art (SOTA) model, outperforming GPT-4 on 105 of 114 healthcare exams and certifications. It outperformed GPT-4 by a margin of five percent or more on 74 of the certifications and by a margin of ten percent or more on 43 of them. Most language models pre-train on the common crawl of the Internet, which may include incorrect and misleading information. Unlike these LLMs, Hippocratic AI is investing heavily in legally acquiring evidence-based healthcare content. We’re conducting a unique Reinforcement Learning with Human Feedback process using healthcare professionals to train and validate the model’s readiness for deployment. We call this RLHF-HP. Hippocratic AI will not release the model until a large number of these licensed professionals deem it safe.
  • 47
    Sparrow

    DeepMind

    Sparrow is a research model and proof of concept, designed with the goal of training dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful – and ultimately, to help build safer and more useful artificial general intelligence (AGI). Sparrow is not yet available for public use. Training a conversational AI is an especially challenging problem because it’s difficult to pinpoint what makes a dialogue successful. To address this problem, we turn to a form of reinforcement learning (RL) based on people's feedback, using the study participants’ preference feedback to train a model of how useful an answer is. To get this data, we show our participants multiple model answers to the same question and ask them which answer they like the most.
  • 48
    Cohere

    Cohere

    Cohere is an enterprise AI platform that enables developers and businesses to build powerful language-based applications. Specializing in large language models (LLMs), Cohere provides solutions for text generation, summarization, and semantic search. Their model offerings include the Command family for high-performance language tasks and Aya Expanse for multilingual applications across 23 languages. Focused on security and customization, Cohere allows flexible deployment across major cloud providers, private cloud environments, or on-premises setups to meet diverse enterprise needs. The company collaborates with industry leaders like Oracle and Salesforce to integrate generative AI into business applications, improving automation and customer engagement. Additionally, Cohere For AI, their research lab, advances machine learning through open-source projects and a global research community.
  • 49
    NVIDIA NeMo Megatron
    NVIDIA NeMo Megatron is an end-to-end framework for training and deploying LLMs with billions and trillions of parameters. NVIDIA NeMo Megatron, part of the NVIDIA AI platform, offers an easy, efficient, and cost-effective containerized framework to build and deploy LLMs. Designed for enterprise application development, it builds upon the most advanced technologies from NVIDIA research and provides an end-to-end workflow for automated distributed data processing, training large-scale customized GPT-3, T5, and multilingual T5 (mT5) models, and deploying models for inference at scale. Harnessing the power of LLMs is made easy through validated and converged recipes with predefined configurations for training and inference. Customizing models is simplified by the hyperparameter tool, which automatically searches for the best hyperparameter configurations and performance for training and inference on any given distributed GPU cluster configuration.
  • 50
    Qwen3-Max

    Alibaba

    Qwen3-Max is Alibaba’s latest trillion-parameter large language model, designed to push performance in agentic tasks, coding, reasoning, and long-context processing. It is built atop the Qwen3 family and benefits from the architectural, training, and inference advances introduced there: mixed thinking and non-thinking modes, a “thinking budget” mechanism, and support for dynamic mode switching based on complexity. The model reportedly processes extremely long inputs (hundreds of thousands of tokens), supports tool invocation, and exhibits strong performance on coding, multi-step reasoning, and agent benchmarks (e.g., Tau2-Bench). While its initial variant emphasizes instruction following (non-thinking mode), Alibaba plans to bring reasoning capabilities online to enable autonomous agent behavior. Qwen3-Max inherits multilingual support and extensive pretraining on trillions of tokens, and it is delivered via API interfaces compatible with OpenAI-style functions.
    Starting Price: Free