v4.55.0: New openai GPT OSS model!

Welcome GPT OSS, the new open-source model family from OpenAI!

For more detailed information about these models, we recommend reading the following blog post: https://huggingface.co/blog/welcome-openai-gpt-oss

GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a larger one with 117B parameters (gpt-oss-120b) and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoE) models and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters; see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16 GB of memory and is perfect for consumer hardware and on-device applications.
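
As a rough sanity check of those memory figures: MXFP4 stores 4-bit values with a shared 8-bit scale for every 32-element block, i.e. roughly 4.25 bits per quantized weight. The back-of-envelope estimate below is illustrative only; it counts weights alone and ignores the bf16 attention/embedding parameters, activations, and the KV cache.

:::py
GiB = 1024**3

def approx_weight_gib(num_params, bits_per_param):
    # Weight storage only; real memory use is higher (activations, KV cache, bf16 layers).
    return num_params * bits_per_param / 8 / GiB

print(f"gpt-oss-120b @ mxfp4 (~4.25 bits): ~{approx_weight_gib(117e9, 4.25):.0f} GiB")  # fits on one 80 GB H100
print(f"gpt-oss-20b  @ mxfp4 (~4.25 bits): ~{approx_weight_gib(21e9, 4.25):.0f} GiB")   # fits within 16 GB
print(f"gpt-oss-20b  @ bf16  (16 bits)   : ~{approx_weight_gib(21e9, 16):.0f} GiB")     # weights alone, before runtime overhead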

Overview of Capabilities and Architecture

  • 21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
  • 4-bit quantization scheme using the mxfp4 format, applied only to the MoE weights. As stated, the 120B model fits on a single 80 GB GPU and the 20B model fits on a single 16 GB GPU.
  • Reasoning, text-only models, with chain-of-thought and adjustable reasoning effort levels (see the sketch after this list).
  • Instruction following and tool use support.
  • Inference implementations using transformers, vLLM, llama.cpp, and ollama.
  • Responses API is recommended for inference.
  • License: Apache 2.0, with a small complementary use policy.
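
Reasoning effort is selected at prompt time rather than through a model config. A minimal sketch, assuming the chat template follows the harmony format described in the blog post above, where a "Reasoning: high" (or medium/low) line in the system message sets the effort level:

:::py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    # Assumption: the harmony chat template picks up the reasoning level from the system message.
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Explain why the sky is blue."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))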

Architecture

  • Token-choice MoE with SwiGLU activations.
  • When calculating the MoE weights, a softmax is taken over the selected experts (softmax-after-topk; see the sketch after this list).
  • Each attention layer uses RoPE with 128K context.
  • Alternating attention layers: full-context, and sliding 128-token window.
  • Attention layers use a learned attention sink per head, where the denominator of the softmax has an additional additive value (illustrated in the sketch after this list).
  • It uses the same tokenizer as GPT-4o and other OpenAI API models.
  • Some new tokens have been incorporated to enable compatibility with the Responses API.
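
Two of the design choices above (softmax-after-topk routing and the learned attention sink) are easy to misread, so here is a minimal, schematic sketch of both. This is an illustration only, not the library implementation; the shapes and names are assumptions made for the example.

:::py
import torch

def route_softmax_after_topk(router_logits, top_k):
    # Token-choice routing: select the top-k experts first, then take the
    # softmax over only the selected logits (softmax-after-topk).
    top_logits, top_experts = router_logits.topk(top_k, dim=-1)
    weights = torch.softmax(top_logits, dim=-1)
    return weights, top_experts

def attention_with_sink(q, k, v, sink_logit):
    # q: (heads, q_len, d), k and v: (heads, kv_len, d), sink_logit: (heads,) learned per head.
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    sink = sink_logit.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    # The sink adds one extra logit column: it only inflates the softmax
    # denominator (no value is attached), letting a head attend to "nothing".
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)[..., :-1]
    return probs @ v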

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or needs ~48 GB in bfloat16.

:::py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Flash Attention 3

The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to run pip install --upgrade kernels and add the following line to your snippet:

:::diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:

:::py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",    # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
     {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Other optimizations

If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!

[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:

:::diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.

transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. Launch the server with just: transformers serve

You can then send requests to it using the Responses API:

:::shell
# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'

You can also send requests using the standard Chat Completions API:

:::shell
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'

Command A Vision

Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.

The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.

Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
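
As an illustration of how a multimodal model like this is typically called from transformers, here is a hedged sketch for visual question answering. The checkpoint id and the example image URL are assumptions made for the example; check the Cohere2 Vision PR and the Hub for the real checkpoints.

:::py
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed checkpoint id for illustration only.
model_id = "CohereLabs/command-a-vision-07-2025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))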

  • [Model] Cohere2 Vision by @zucchini-nlp in [#39810]

MM Grounding DINO

The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang.

MM Grounding DINO improves upon Grounding DINO by enhancing the contrastive class head and removing parameter sharing in the decoder, boosting zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).

You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
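
A hedged usage sketch follows, assuming MM Grounding DINO exposes the same zero-shot object detection API as Grounding DINO in transformers; the checkpoint id is illustrative, so pick a real one from the collections mentioned above.

:::py
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

# Illustrative checkpoint id; see the MM Grounding DINO collection for real ones.
model_id = "openmmlab-community/mm_grounding_dino_tiny_o365v1_goldg"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat. a remote control."  # classes separated by periods, lowercased

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[(image.height, image.width)]
)
print(results[0])  # boxes, scores and the matched phrases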

  • Add MM Grounding DINO by @rziga in [#37925]

Bugfixes and improvements

  • More robust tied weight test by @Cyrilvallez in [#39681]
  • fix missing model._tp_size from ep refactor by @winglian in [#39688]
  • Fix missing initialization of FastSpeech2Conformer by @bvantuan in [#39689]
  • fix(tokenization): check token.content for trie by @pjo256 in [#39587]
  • xpu optimization for generation case by @sywangyi in [#39573]
  • [processors] add tests for helper fn by @zucchini-nlp in [#39629]
  • update ernie model card by @jzhang533 in [#39657]
  • [configuration] remove redundant classmethod by @zucchini-nlp in [#38812]
  • Add self-hosted runner scale set workflow for mi325 CI by @jitesh-gupta in [#39651]
  • PATCH: add back n-dim device-mesh + fix tp trainer saving by @S1ro1 in [#39693]
  • [CI] Add Eric to comment slow ci by @vasqu in [#39601]
  • Remove all expired deprecation cycles by @Cyrilvallez in [#39725]
  • mllama outputs refactor by @itazap in [#39643]
  • Update QAPipelineTests::test_large_model_course after [#39193] by @ydshieh in [#39666]
  • skip Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in [#39670]
  • Fix Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in [#39503]
  • Fix Layer device placement in Caches by @Cyrilvallez in [#39732]
  • Fix cache-related tests by @zucchini-nlp in [#39676]
  • Fix AMD dockerfile for audio models by @remi-or in [#39669]
  • Superpoint fast image processor by @arkhamHack in [#37804]
  • Add Fast Segformer Processor by @capnmav77 in [#37024]
  • BLIPs clean-up by @zucchini-nlp in [#35560]
  • extend more trainer test cases to XPU, all pass by @yao-matrix in [#39652]
  • fix cache inheritance by @ArthurZucker in [#39748]
  • [Fix] import two missing typos in models/__init__.py for typo checking by @hebangwen in [#39745]
  • Fix: add back base model plan by @S1ro1 in [#39733]
  • update GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in [#39731]
  • Update IMPORTANT_MODELS list by @ivarflakstad in [#39734]
  • Fix mamba regression by @manueldeprada in [#39728]
  • Apply several ruff SIM rules by @cyyever in [#37283]
  • Use --gpus all in workflow files by @ydshieh in [#39752]
  • AMD disable torchcodec by @ivarflakstad in [#39757]
  • Avoid OOM when other tests are failing by @ydshieh in [#39758]
  • Fix GPT2 with cross attention by @zucchini-nlp in [#39754]
  • Support loading Qwen3 MoE GGUF by @ctcanbol in [#39638]
  • Enable xpu allocator on caching_allocator_warmup by @jiqing-feng in [#39654]
  • Fix version issue in modeling_utils.py by @Cyrilvallez in [#39759]
  • add libcst to extras["testing"] in setup.py by @ydshieh in [#39761]
  • [modenbert] fix regression by @zucchini-nlp in [#39750]
  • 🌐 [i18n-KO] Translated main_classes/peft.md by @luckyvickyricky in [#39515]
  • 🌐 [i18n-KO] Translated albert.md to Korean by @ahnjj in [#39524]
  • 🌐 [i18n-KO] Translated tvp.md to Korean by @Kim-Ju-won in [#39578]
  • 🌐 [i18n-KO] Translated tokenizer.md to Korean by @seopp in [#39532]
  • 🌐 [i18n-KO] Translated pipeline_gradio.md to Korean by @AhnJoonSung in [#39520]
  • 🌐 [i18n-KO] Translated perf_train_gpu_one.md to Korean by @D15M4S in [#39552]
  • 🌐 [i18n-KO] Translated how_to_hack_models.md to Korean by @skwh54 in [#39536]
  • fix(trainer): Correct loss scaling for incomplete gradient accumulation steps by @hutaiHang in [#39659]
  • Fix Cache.max_cache_len max value for Hybrid models by @manueldeprada in [#39737]
  • [docs] Ko doc fixes after toc update by @gante in [#39660]
  • Remove python3.7 reference from doc link by @st81 in [#39706]
  • Fix OmDet test after arg deprecation by @Cyrilvallez in [#39766]
  • docs: Update EfficientLoFTR documentation by @sbucaille in [#39620]
  • Standardize CLAP model card format by @yanamis in [#39738]
  • Don't set run_name when none by @qgallouedec in [#39695]
  • Fix Evolla and xLSTM tests by @Cyrilvallez in [#39769]
  • enable static cache on vision encoder decoder by @jiqing-feng in [#39773]
  • [ASR pipline] fix with datasets 4.0 by @eustlb in [#39504]
  • more info in model_results.json by @ydshieh in [#39783]
  • Super tiny update by @zucchini-nlp in [#39727]
  • fix chameleonvision UT failure by @yao-matrix in [#39646]
  • Fix an invalid condition by @cyyever in [#39762]
  • Simplify conditional code by @cyyever in [#39781]
  • Fix re-compilations for cross attention cache by @zucchini-nlp in [#39788]
  • standardized BARThez model card by @EthanV431 in [#39701]
  • Update model card for Cohere2 (Command R7B) by @arpon-kapuria in [#39604]
  • Update mT5 model card by @dross20 in [#39702]
  • Add callback to monitor progress in whisper transcription by @poke1024 in [#37483]
  • fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test by @gante in [#39300]
  • feat(tokenization): add encode_message to tokenize messages one by one by @pco111 in [#39507]
  • [docs] fix korean docs yet again by @gante in [#39813]
  • Update documentation for Cohere2Vision models by @kyle-cohere in [#39817]
  • [cohere2 vision] move doc to multimodal section by @zucchini-nlp in [#39820]
  • Fix broken links by @oToToT in [#39809]
  • Fix bad markdown links by @ebezzam in [#39819]
  • Fix tp cb by @ArthurZucker in [#39838]
  • [VLMs] split out "get placeholder mask" to helper by @zucchini-nlp in [#39777]
  • [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in [#39823]
  • [typecheck] proper export of private symbols by @cyyever in [#39729]
  • Update ux cb by @ArthurZucker in [#39845]
  • Fix responses add tests by @LysandreJik in [#39848]
  • Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid by @yonigozlan in [#39739]
  • [image-processing] deprecate plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in [#39830]
  • Allow TrackioCallback to work when pynvml is not installed by @qgallouedec in [#39851]
  • remove dtensors, not explicit by @ArthurZucker in [#39840]
  • Improve is_wandb_available function to verify WandB installation by @qgallouedec in [#39875]
  • Refactor label name handling for PEFT models in Trainer class by @qgallouedec in [#39265]
  • Use comment to build doc on PRs by @ydshieh in [#39846]
  • Add support for including in-memory videos (not just files/urls) in apply_chat_template by @akibjawad in [#39494]
  • [core] Fix attn_implementation setter with missing sub_configs by @qubvel in [#39855]
  • Fix quant docker for fp-quant by @SunMarc in [#39641]
  • Rework add-new-model-like with modular and make test filenames coherent by @Cyrilvallez in [#39612]
  • Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in [#39858]
  • Set torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in [#39885]
  • [typing] better return type hint for AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in [#39881]
  • Fix link to models in README by @qubvel in [#39880]
  • [DOCS] : Improved mimi model card by @rohitthewanderer in [#39824]
  • Update cohere2 vision test by @ydshieh in [#39888]
  • send some feedback when manually building doc via comment by @ydshieh in [#39889]
  • Add support for ModernBertForMultipleChoice by @netique in [#39232]
  • chore: update DETR model card by @arpon-kapuria in [#39822]
  • Reorder serving docs by @LysandreJik in [#39634]
  • [Exaone4] Fixes the attn implementation! by @ArthurZucker in [#39906]
  • fix test_working_of_tp failure of accelerate ut by @yao-matrix in [#39828]
  • [qwen] remove unnecessary CUDA sync in qwen2_5_vl by @cyyever in [#39870]
  • Avoid aliasing in cond's branches for torch 2.8 by @ydwu4 in [#39488]
  • Fix misleading WandB error when WANDB_DISABLED is set by @notkisk in [#39891]
  • Replace video_fps with fps in tests by @cyyever in [#39898]
  • Fix eval thread fork bomb by @JustinVanHeek in [#39717]
  • Fix aria tests by @zucchini-nlp in [#39879]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @capnmav77
    • Add Fast Segformer Processor (#37024)
  • @cyyever
    • Apply several ruff SIM rules (#37283)
    • Fix an invalid condition (#39762)
    • Simplify conditional code (#39781)
    • [typecheck] proper export of private symbols (#39729)
    • [qwen] remove unnecessary CUDA sync in qwen2_5_vl (#39870)
    • Replace video_fps with fps in tests (#39898)
  • @rziga
    • Add MM Grounding DINO (#37925)