v4.1.0 - ONNX and OpenVINO backends offering 2-3x speedups, improved hard negatives mining

This release introduces 2 new efficient computing backends for CrossEncoder (reranker) models, ONNX and OpenVINO, along with optimization and quantization helpers that allow for speedups of up to 2x-3x; it also improves the hard negatives mining strategies and includes some minor improvements.

Install this version with

:::bash
# Training + Inference
pip install sentence-transformers[train]==4.1.0

# Inference only, use one of:
pip install sentence-transformers==4.1.0
pip install sentence-transformers[onnx-gpu]==4.1.0
pip install sentence-transformers[onnx]==4.1.0
pip install sentence-transformers[openvino]==4.1.0
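
If needed, a quick check that the expected version is installed:

:::python
import sentence_transformers

print(sentence_transformers.__version__)  # expected: 4.1.0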

Faster ONNX and OpenVINO Backends for CrossEncoder (#3319)

Introducing a new backend keyword argument to the CrossEncoder initialization, allowing values of "torch" (default), "onnx", and "openvino". These require installing sentence-transformers with specific extras:

:::bash
pip install sentence-transformers[onnx-gpu]
# or ONNX for CPU only:
pip install sentence-transformers[onnx]
# or
pip install sentence-transformers[openvino]

It's as simple as:

:::python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to call model.push_to_hub or model.save_pretrained to store the exported model in the same model repository or directory, so you avoid having to re-export the model every time.
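
For instance, a minimal sketch of that workflow (the local directory name is just illustrative):

:::python
from sentence_transformers import CrossEncoder

# First load: exports an ONNX model automatically if the repository
# doesn't already contain one
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")

# Save the exported model locally (or use model.push_to_hub for a repository)
model.save_pretrained("local/ms-marco-MiniLM-L6-v2-onnx")

# Later loads reuse the saved ONNX file instead of re-exporting
model = CrossEncoder("local/ms-marco-MiniLM-L6-v2-onnx", backend="onnx")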

All keyword arguments passed via model_kwargs will be passed on to ORTModelForSequenceClassification.from_pretrained or OVModelForSequenceClassification.from_pretrained. The most useful arguments are:

  • provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider". See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest available provider (e.g. "CUDAExecutionProvider") will be used.
  • file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and to "openvino_model.xml" or otherwise "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
  • export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.

For example:

:::python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={
        "file_name": "model_O3.onnx",
        "provider": "CPUExecutionProvider",
    }
)

query = "Which planet is known as the Red Planet?"
passages = [
   "Venus is often called Earth's twin because of its similar size and proximity.",
   "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
   "Jupiter, the largest planet in our solar system, has a prominent red spot.",
   "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)

Benchmarks

We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes.

These findings resulted in the following recommendations:

For GPU, you can expect a 1.88x speedup with fp16 at no accuracy cost, and for CPU you can expect a ~3x speedup with quantization at no accuracy cost in our evaluation. Your mileage with the accuracy impact of quantization may vary, but it appears to remain very small.
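
As a rough sketch of the GPU fp16 route with the default torch backend (an assumption here: model_kwargs with torch_dtype is forwarded to the underlying Transformers from_pretrained call, and a CUDA device is available):

:::python
import torch

from sentence_transformers import CrossEncoder

# Assumption: float16 weights via model_kwargs; requires a CUDA device
model = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L6-v2",
    device="cuda",
    model_kwargs={"torch_dtype": torch.float16},
)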

Read the Speeding up Inference documentation for more details.

ONNX & OpenVINO Optimization and Quantization In addition to exporting default ONNX and OpenVINO models, you can also use one of the helper methods for optimizing and quantizing ONNX models: ### ONNX Optimization [`export_optimized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_optimized_onnx_model): This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options [here](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization#optimizing-a-model-during-the-onnx-export). This function accepts: * `model` A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`. * `optimization_config`: ["O1", "O2", "O3", or "O4" from 🤗 Optimum](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/optimization) or a custom [`OptimizationConfig`](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig) instance. * `model_name_or_path`: The directory or model repository where the optimized model will be saved. * `push_to_hub`: Whether the push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`. * `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to. * `file_suffix`: The suffix to add to the optimized model file name. Will use the `optimization_config` string or `"optimized"` if not set. The usage is like this: :::python from sentence_transformers import SentenceTransformer, export_optimized_onnx_model onnx_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx") export_optimized_onnx_model( model=onnx_model, optimization_config="O4", model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2", push_to_hub=True, create_pr=True, ) After which you can load the model with: :::python from sentence_transformers import CrossEncoder pull_request_nr = 2 # TODO: Update this to the number of your pull request model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "onnx/model_O4.onnx"}, revision=f"refs/pr/{pull_request_nr}" ) or when it gets merged: :::python from sentence_transformers import CrossEncoder model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "onnx/model_O4.onnx"}, ) ### ONNX Quantization [`export_dynamic_quantized_onnx_model`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_dynamic_quantized_onnx_model): This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts * `model` A SentenceTransformer or CrossEncoder model loaded with `backend="onnx"`. 
* `quantization_config`: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from [AutoQuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.AutoQuantizationConfig), or an [QuantizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.QuantizationConfig) instance. * `model_name_or_path`: The directory or model repository where the optimized model will be saved. * `push_to_hub`: Whether the push the exported model to the hub with `model_name_or_path` as the repository name. If False, the model will be saved in the directory specified with `model_name_or_path`. * `create_pr`: If `push_to_hub`, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to. * `file_suffix`: The suffix to add to the optimized model file name. Will use the `quantization_config` string or e.g. `"int8_quantized"` if not set. The usage is like this: :::python from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx") export_dynamic_quantized_onnx_model( model, "avx512_vnni", "sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2", push_to_hub=True, create_pr=True, ) After which you can load the model with: :::python from sentence_transformers import CrossEncoder pull_request_nr = 2 # TODO: Update this to the number of your pull request model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"}, revision=f"refs/pr/{pull_request_nr}", ) or when it gets merged: :::python from sentence_transformers import CrossEncoder model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"}, ) ## OpenVINO Quantization OpenVINO models can be quantized to int8 precision using [Optimum Intel](https://huggingface.co/docs/optimum/main/en/intel/index) to speed up inference. To do this, you can use the [export_static_quantized_openvino_model()](https://sbert.net/docs/package_reference/util.html#sentence_transformers.backend.export_static_quantized_openvino_model) function, which saves the quantized model in a directory or model repository that you specify. Post-Training Static Quantization expects: * `model`: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend. * `quantization_config`: (Optional) The quantization configuration. This parameter accepts either: None for the default 8-bit quantization, a dictionary representing quantization configurations, or an OVQuantizationConfig instance. * `model_name_or_path`: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub. * `dataset_name`: (Optional) The name of the dataset to load for calibration. If not specified, defaults to sst2 subset from the glue dataset. * `dataset_config_name`: (Optional) The specific configuration of the dataset to load. * `dataset_split`: (Optional) The split of the dataset to load (e.g., ‘train’, ‘test’). * `column_name`: (Optional) The column name in the dataset to use for calibration. * `push_to_hub`: (Optional) a boolean to push the quantized model to the Hugging Face Hub. 
* `create_pr`: (Optional) a boolean to create a pull request when pushing to the Hugging Face Hub. Useful when you don’t have write access to the repository. * `file_suffix`: (Optional) a string to append to the model name when saving it. If not specified, "qint8_quantized" will be used. The usage is like this: :::python from sentence_transformers import CrossEncoder, export_static_quantized_openvino_model model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino") export_static_quantized_openvino_model( model, quantization_config=None, model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2", push_to_hub=True, create_pr=True, ) After which you can load the model with: :::python from sentence_transformers import CrossEncoder pull_request_nr = 2 # TODO: Update this to the number of your pull request model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino", model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"}, revision=f"refs/pr/{pull_request_nr}" ) or when it gets merged: :::python from sentence_transformers import CrossEncoder model = CrossEncoder( "cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino", model_kwargs={"file_name": "openvino/openvino_model_qint8_quantized.xml"}, ) Read the [Speeding up Inference documentation](https://sbert.net/docs/cross_encoder/usage/efficiency.html) for more details.

Relative Margin in Hard Negatives Mining (#3321)

This PR softly deprecates the margin option in mine_hard_negatives in favor of absolute_margin and relative_margin. In short:

  • absolute_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity - absolute_margin. With an absolute_margin of 0.1 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.76.
  • relative_margin: Discards negative candidates whose anchor_negative_similarity score is greater than or equal to anchor_positive_similarity * (1 - relative_margin). With a relative_margin of 0.05 and an anchor-positive similarity of 0.86, the maximum anchor-negative similarity for that anchor (e.g. query) is 0.817 (i.e. 95% of the anchor-positive similarity).
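
A quick sketch of the cutoffs these two options imply, using the numbers from the bullets above:

:::python
anchor_positive_similarity = 0.86

# absolute_margin=0.1: negatives must score below positive - margin
max_negative_absolute = anchor_positive_similarity - 0.1          # 0.76

# relative_margin=0.05: negatives must score below 95% of the positive
max_negative_relative = anchor_positive_similarity * (1 - 0.05)   # 0.817

print(max_negative_absolute, max_negative_relative)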

This means that we now support the recommended hard negatives mining strategy from the excellent NV-Retriever paper, a.k.a. the TopK-PercPos (95%) strategy:

:::python
from sentence_transformers.util import mine_hard_negatives

...

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    relative_margin=0.05,         # 0.05 means that the negative is at most 95% as similar to the anchor as the positive
    num_negatives=num_negatives,  # 10 or less is recommended
    sampling_strategy="top",      # "top" means that we sample the top candidates as negatives
    batch_size=batch_size,        # Adjust as needed
    use_faiss=True,               # Optional: Use faiss/faiss-gpu for faster similarity search
)
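
For context, a self-contained sketch of the same call with toy data (the pairs and the embedding model here are placeholders; real mining needs a large corpus for meaningful negatives):

:::python
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Toy (anchor, positive) pairs; by default the first column is treated
# as the anchor and the second as the positive
dataset = Dataset.from_dict({
    "query": [
        "Which planet is known as the Red Planet?",
        "What is the largest planet in our solar system?",
        "Which planet is famous for its rings?",
    ],
    "answer": [
        "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
        "Jupiter is the largest planet in our solar system.",
        "Saturn is famous for its prominent ring system.",
    ],
})
model = SentenceTransformer("all-MiniLM-L6-v2")

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    relative_margin=0.05,
    num_negatives=1,
    sampling_strategy="top",
    batch_size=32,
)
print(dataset)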

Minor Changes

  • Add margin and margin_strategy to GISTEmbedLoss and CachedGISTEmbedLoss (#3299, #3323)
  • Support activation_function=None in the Dense module (#3316) (see the sketch after this list)
  • Update how all_layer_embeddings outputs are determined (#3320)
  • Avoid error with SentenceTransformer.encode if prompts are provided and output_value=None (#3327)
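
For the Dense change, a minimal sketch of a model stack that now accepts activation_function=None (the base checkpoint and output size are illustrative):

:::python
from sentence_transformers import SentenceTransformer, models

word = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
pooling = models.Pooling(word.get_word_embedding_dimension())
dense = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=None,  # newly supported: no activation applied
)
model = SentenceTransformer(modules=[word, pooling, dense])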

All Changes

Full Changelog: https://github.com/UKPLab/sentence-transformers/compare/v4.0.2...v4.1.0
