What's changed?
🚨 Breaking changes
- Default `HiddenAct::Gelu` to GeLU + tanh in favour of GeLU erf by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/753
  The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), so that CPU and CUDA embeddings match (cuBLASLt only supports GeLU + tanh). This is a slight misalignment with how Transformers handles it, where GeLU erf is used when `hidden_act="gelu"` is set in `config.json`. The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.
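To see how close the two variants are, here is a minimal sketch of both formulas in plain Python (illustrative only, not TEI's Rust/CUDA implementation):

```python
import math

def gelu_erf(x: float) -> float:
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # tanh approximation of GeLU (the variant cuBLASLt supports).
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# The two variants agree to within ~1e-3 over typical activation ranges.
for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    assert abs(gelu_erf(x) - gelu_tanh(x)) < 1e-3
```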
- Set `--auto-truncate` to `true` by default by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/829
  `--auto-truncate` now defaults to `true`, meaning sequences will be truncated to the lower of `--max-batch-tokens` and the maximum model length, guarding against the case where `--max-batch-tokens` is lower than the model's actual maximum supported length.
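The truncation rule described above boils down to taking the minimum of the two limits; a sketch (the function name is ours, not TEI's):

```python
def effective_max_length(max_batch_tokens: int, model_max_length: int) -> int:
    # Truncate to the lower of the two limits, so a single sequence can
    # never exceed the batch token budget.
    return min(max_batch_tokens, model_max_length)

# --max-batch-tokens below the model limit: the CLI flag wins.
assert effective_max_length(4096, 8192) == 4096
# --max-batch-tokens above the model limit: the model limit wins.
assert effective_max_length(16384, 8192) == 8192
```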
🎉 Additions
- Add `--served-model-name` for OpenAI requests via HTTP by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/685
- Extend `download_onnx` to download sharded ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/817
- Add support for Llama 2 by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/802
- Add support for the Blackwell architecture (sm100, sm120) by @danielealbano in https://github.com/huggingface/text-embeddings-inference/pull/735
- Add support for Llama 3 and Nemotron by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/805
- Add support for DebertaV2 by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/746
- Add bidirectional attention and projection-layer support for Qwen3-based models by @williambarberjr in https://github.com/huggingface/text-embeddings-inference/pull/808
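The difference between causal and bidirectional attention (the Qwen3 change above) comes down to the attention mask. A minimal illustration in plain Python, not TEI's implementation:

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    # mask[i][j] == 1 means position i may attend to position j.
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)

assert causal[0][2] == 0          # a causal (decoder) mask hides future tokens
assert bidirectional[0][2] == 1   # an embedding model attends in both directions
```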
🐛 Fixes
- Fix reading non-standard config for `past_key_values` in ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/751
- Fix `TruncationDirection` to deserialize from both lowercase and capitalized values by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/755
- Fix `sagemaker-entrypoint*` & remove SageMaker and Vertex from `Dockerfile*` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/699
- Fix critical accuracy bugs for `model_type=qwen2`: no causal attention and wrong tokenizer by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/762
- Fix `config.json` reading w/ aliases for ORT by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/786
- Fix HTTP error code for validation by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/818
- Fix to acquire the permit in a blocking way by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/726
- Read Hugging Face Hub token from cache if not provided by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/814
- Align the `normalize` param between the gRPC and HTTP `/embed` interfaces by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/810
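The `normalize` param controls L2 normalization of the returned embeddings. For reference, here is what that operation is, as a plain-Python sketch (not TEI's code):

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    # Scale the embedding to unit length; a zero vector is returned as-is.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)

unit = l2_normalize([3.0, 4.0])
assert unit == [0.6, 0.8]
# The normalized vector has unit L2 norm, so dot products become
# cosine similarities.
assert abs(sum(x * x for x in unit) - 1.0) < 1e-12
```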
⚡ Improvements
- Serialize in a Tokio task instead of a blocking thread, a ~50% latency reduction for small models by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/767
- Remove the default `--model-id` argument by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/679
- Better heuristic for the number of tokenization workers by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/766
- Add a faster index-select kernel by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/773
- Speed up parallel safetensors download by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/765
- Clone the tokenizer at startup, saving ~1-20 s of cold-start time by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/772
- Adjust the warmup phase for CPU by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/792
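For orientation on the index-select improvement above: index select is the gather that maps token ids to rows of an embedding table. A plain-Python sketch of the operation (not the CUDA kernel itself):

```python
def index_select(table: list[list[float]], ids: list[int]) -> list[list[float]]:
    # Gather one row of the embedding table per token id; ids may repeat.
    return [table[i] for i in ids]

table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
assert index_select(table, [2, 0, 2]) == [[2.0, 2.1], [0.0, 0.1], [2.0, 2.1]]
```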
📄 Other
- Skip Gemma3 tests when `HF_TOKEN` is not set by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/812
- Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add `Dockerfile-cuda-blackwell-all` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/823
- Update `rustc` version to 1.92.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/826
- Add `use_flash_attn` for better FA + FA2 feature gating by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/825
- Update CUDA to 12.9 w/ `cuda-compat-12-9` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/828
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/782
- Fix `cargo fmt` and `clippy` warnings by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/776
- Fix `rustfmt` on `backend/candle/tests/*.rs` files by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/800
- Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/783
- Update `version` to 1.9.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/830
🆕 New Contributors
- @salmanmkc made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/782
- @danielealbano made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/735
- @williambarberjr made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/808
Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0