This release introduces two new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, plus optimization and quantization options that allow for speedups of 2x-3x. It also improves the hard negatives mining strategies and includes minor improvements.
Install this version with:
:::bash
# Training + Inference
pip install sentence-transformers[train]==4.1.0
# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0
## Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)
Introducing a new `backend` keyword argument to the CrossEncoder initialization, allowing values of `"torch"` (default), `"onnx"`, and `"openvino"`. These require installing `sentence-transformers` with specific extras:
:::bash
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]
It's as simple as:
:::python
from sentence_transformers import CrossEncoder
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
If you specify a `backend` and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to `model.push_to_hub` or `model.save_pretrained` into the same model repository or directory to avoid having to re-export the model every time.
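For instance, after the automatic export you can persist the ONNX model locally or on the Hub so that later loads skip the export step. A minimal sketch; the save path and repository id are placeholders:

:::python
from sentence_transformers import CrossEncoder

# Loading with backend="onnx" exports an ONNX model if the repository doesn't contain one yet
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Save the exported ONNX model so it doesn't have to be re-exported next time
model.save_pretrained("ms-marco-MiniLM-L6-v2-onnx")  # placeholder local directory
# or push it to your own repository on the Hugging Face Hub
# model.push_to_hub("your-username/ms-marco-MiniLM-L6-v2-onnx")  # placeholder repo id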
All keyword arguments passed via `model_kwargs` will be passed on to `ORTModelForSequenceClassification.from_pretrained` or `OVModelForSequenceClassification.from_pretrained`. The most useful arguments are:

* `provider`: (Only if `backend="onnx"`) ONNX Runtime provider to use for loading the model, e.g. `"CPUExecutionProvider"`. See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (e.g. `"CUDAExecutionProvider"`) will be used.
* `file_name`: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
* `export`: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.
For example:
:::python
from sentence_transformers import CrossEncoder
model = CrossEncoder(
"cross-encoder/ms-marco-MiniLM-L6-v2",
backend="onnx",
model_kwargs={
"file_name": "model_O3.onnx",
"provider": "CPUExecutionProvider",
}
)
query = "Which planet is known as the Red Planet?"
passages = [
"Venus is often called Earth's twin because of its similar size and proximity.",
"Mars, known for its reddish appearance, is often referred to as the Red Planet.",
"Jupiter, the largest planet in our solar system, has a prominent red spot.",
"Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
scores = model.predict([(query, passage) for passage in passages])
print(scores)
### Benchmarks
We ran benchmarks for CPU and GPU, averaging the findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. These findings resulted in the following recommendations:

For GPU, you can expect a 1.88x speedup with fp16 at no accuracy cost, and for CPU you can expect a ~3x speedup at no accuracy cost in our evaluation. Your mileage with the accuracy hit from quantization may vary, but it seems to remain very small.
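For instance, the GPU fp16 setting can be obtained by loading the model in half precision via `model_kwargs`, which is forwarded to the underlying `from_pretrained` call. A minimal sketch; the query-passage pair is a placeholder:

:::python
import torch
from sentence_transformers import CrossEncoder

# Load the default torch backend in half precision on GPU
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16},
)
scores = model.predict([
    ("Which planet is known as the Red Planet?",
     "Mars, known for its reddish appearance, is often referred to as the Red Planet."),
])
print(scores)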
Read the [Speeding up Inference documentation](https://sbert.net/docs/cross_encoder/usage/efficiency.html) for more details.
## ONNX & OpenVINO Optimization and Quantization
In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models:

### ONNX Optimization

[`export_optimized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_optimized_onnx_model): This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options [here](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization#optimizing-a-model-during-the-onnx-export). This function accepts:

* `model`: A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`.
* `optimization_config`: ["O1", "O2", "O3", or "O4" from 🤗 Optimum](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization) or a custom [`OptimizationConfig`](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig) instance.
* `model_name_or_path`: The directory or model repository where the optimized model will be saved.
* `push_to_hub`: Whether to push the exported model to the Hub, with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
* `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
* `file_suffix`: The suffix to add to the optimized model file name. Will use the `optimization_config` string or `"optimized"` if not set.

The usage is like this:

:::python
from sentence_transformers import CrossEncoder, export_optimized_onnx_model

onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(
    model=onnx_model,
    optimization_config="O4",
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

:::python
from sentence_transformers import CrossEncoder

pull_request_nr = 2  # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
    revision=f"refs/pr/{pull_request_nr}",
)

or when it gets merged:

:::python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_O4.onnx"},
)

### ONNX Quantization

[`export_dynamic_quantized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_dynamic_quantized_onnx_model): This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts:

* `model`: A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`.
* `quantization_config`: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from [AutoQuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.AutoQuantizationConfig), or a [QuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.QuantizationConfig) instance.
* `model_name_or_path`: The directory or model repository where the quantized model will be saved.
* `push_to_hub`: Whether to push the exported model to the Hub, with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`.
* `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
* `file_suffix`: The suffix to add to the quantized model file name. Will use the `quantization_config` string or e.g. `"int8_quantized"` if not set.

The usage is like this:

:::python
from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(
    model,
    "avx512_vnni",
    "sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

:::python
from sentence_transformers import CrossEncoder

pull_request_nr = 2  # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
    revision=f"refs/pr/{pull_request_nr}",
)

or when it gets merged:

:::python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)

## OpenVINO Quantization

OpenVINO models can be quantized to int8 precision using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/index) to speed up inference. To do this, you can use the [export_static_quantized_openvino_model()](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_static_quantized_openvino_model) function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects:

* `model`: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
* `quantization_config`: (Optional) The quantization configuration. This parameter accepts either: None for the default 8-bit quantization, a dictionary representing quantization configurations, or an OVQuantizationConfig instance.
* `model_name_or_path`: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
* `dataset_name`: (Optional) The name of the dataset to load for calibration. If not specified, defaults to the sst2 subset of the glue dataset.
* `dataset_config_name`: (Optional) The specific configuration of the dataset to load.
* `dataset_split`: (Optional) The split of the dataset to load (e.g., 'train', 'test').
* `column_name`: (Optional) The column name in the dataset to use for calibration.
* `push_to_hub`: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
* `create_pr`: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don't have write access to the repository.
* `file_suffix`: (Optional) a string to append to the model name when saving it. If not specified, `"qint8_quantized"` will be used.

The usage is like this:

:::python
from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(
    model,
    quantization_config=None,
    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
    push_to_hub=True,
    create_pr=True,
)

After which you can load the model with:

:::python
from sentence_transformers import CrossEncoder

pull_request_nr = 2  # TODO: Update this to the number of your pull request
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
    revision=f"refs/pr/{pull_request_nr}",
)

or when it gets merged:

:::python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)

Read the [Speeding up Inference documentation](https://sbert.net/docs/cross_encoder/usage/efficiency.html) for more details.
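Whichever optimization or quantization you choose, it can be worth sanity-checking that the exported model's scores stay close to the original backend's. A minimal sketch, reusing the quantized OpenVINO file name from the example above (once it has been saved or merged):

:::python
from sentence_transformers import CrossEncoder

pairs = [(
    "Which planet is known as the Red Planet?",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
)]

# Original torch backend as the reference
baseline = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# Quantized OpenVINO model from the example above
quantized = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="openvino",
    model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"},
)

print(baseline.predict(pairs))   # a single relevance score
print(quantized.predict(pairs))  # should be close to the baseline score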
## Relative Margin in Hard Negatives Mining (#3321)

This PR softly deprecates the `margin` option in `mine_hard_negatives` in favor of `absolute_margin` and `relative_margin`. In short:

* `absolute_margin`: Discards negative candidates whose `anchor_negative_similarity` score is greater than or equal to `anchor_positive_similarity - absolute_margin`. With an `absolute_margin` of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
* `relative_margin`: Discards negative candidates whose `anchor_negative_similarity` score is greater than or equal to `anchor_positive_similarity * (1 - relative_margin)`. With a `relative_margin` of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).
This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:
:::python
from sentence_transformers.util import mine_hard_negatives
...
dataset = mine_hard_negatives(
dataset=dataset,
model=model,
relative_margin=0.05, # 0.05 means that the negative is at most 95% as similar to the anchor as the positive
num_negatives=num_negatives, # 10 or less is recommended
sampling_strategy="top", # "top" means that we sample the top candidates as negatives
batch_size=batch_size, # Adjust as needed
use_faiss=True, # Optional: Use faiss/faiss-gpu for faster similarity search
)
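If you prefer a fixed threshold instead, the same call accepts `absolute_margin`. A minimal sketch under the same setup as above, with 0.1 chosen only for illustration:

:::python
from sentence_transformers.util import mine_hard_negatives
...
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    absolute_margin=0.1,          # negatives must score at least 0.1 below the positive
    num_negatives=num_negatives,  # 10 or less is recommended
    sampling_strategy="top",      # sample the top candidates as negatives
    batch_size=batch_size,        # adjust as needed
    use_faiss=True,               # optional: use faiss/faiss-gpu for faster similarity search
)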
## Minor Changes

- Add `margin` and `margin_strategy` to GISTEmbedLoss and CachedGISTEmbedLoss (#3299, #3323)
- Support `activation_function=None` in the Dense module (#3316); see the sketch after this list
- Update how `all_layer_embeddings` outputs are determined (#3320)
- Avoid an error with `SentenceTransformer.encode` if `prompts` are provided and `output_value=None` (#3327)
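As an illustration of the Dense module change, an identity (no-op) projection layer can now be configured by passing `activation_function=None`. A minimal sketch; the base model name and output dimension are placeholders:

:::python
from sentence_transformers import SentenceTransformer, models

# Placeholder base model; the projection dimension (256) is arbitrary
word_embedding = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=None,  # identity activation, newly supported
)
model = SentenceTransformer(modules=[word_embedding, pooling, dense])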
## All Changes

- [`docs`] Update a removed article with a new source by @lakshminarasimmanv in https://github.com/UKPLab/sentence-transformers/pull/3309
- CachedGISTEmbedLoss Adding Margin by @daegonYu in https://github.com/UKPLab/sentence-transformers/pull/3299
- Support activation_function=None in Dense module by @OsamaS99 in https://github.com/UKPLab/sentence-transformers/pull/3316
- [`typing`] Fix typing for CrossEncoder.to by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3324
- Update (C)GIST losses to support "relative" margin instead of "percentage" by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3323
- [`feat`] hard neg mining: deprecate margin in favor of absolute_margin & relative_margin by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3321
- [`fix`] Use return_dict=True in Transformer; improve how all_layer_embeddings are determined by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3320
- [`fix`] Avoid error if prompts & output_value=None by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3327
- [`backend`] Add ONNX & OpenVINO support for Cross Encoder (reranker) models by @tomaarsen in https://github.com/UKPLab/sentence-transformers/pull/3319
## New Contributors
- @lakshminarasimmanv made their first contribution in https://github.com/UKPLab/sentence-transformers/pull/3309
Full Changelog: https://github.com/UKPLab/sentence-transformers/compare/v4.0.2...v4.1.0