TorchServe v0.10.0 Release Notes

This is the release of TorchServe v0.10.0.

Highlights include:

  • Extended support for PyTorch 2.x inference
  • C++ backend
  • GenAI fast series torch.compile showcase examples
  • Token authentication support for enhanced security

C++ Backend

TorchServe presented the experimental C++ backend at the PyTorch Conference 2022. Like the Python backend, the C++ backend runs as a separate process and uses a BaseHandler to define the APIs for customizing the handler. By providing a backend and handler written in pure C++, it is now possible to deploy PyTorch models with TorchServe without any Python overhead. This release officially promoted the experimental branch to master and included additional examples and Docker images for development.

  • Refactored C++ backend branch and promoted it to master [#2840] [#2927] [#2937] [#2953] [#2975] [#2980] [#2958] [#3006] [#3012] [#3014] [#3018] @mreso
  • C++ backend examples:
     a. Example Baby Llama [#2903] [#2911] @shrinath-suresh @mreso
     b. Example Llama2 [#2904] @shrinath-suresh @mreso
  • C++ dev Docker for CPU and GPU [#2976] [#3015] @namannandan

torch.compile

With the launch of PT2 Inference at the PyTorch Conference 2023, we have added several key examples showcasing out-of-the-box speedups for torch.compile and AOT Compile. Since there is no new development being done in TorchScript, starting with this release TorchServe is preparing a migration path for customers to switch from TorchScript to torch.compile.
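
As a rough illustration of the kind of out-of-the-box usage these examples showcase (the model and settings below are placeholders, not taken from the TorchServe examples themselves), a model can simply be wrapped with torch.compile before running inference:

```python
# Minimal sketch of the torch.compile path the examples exercise (illustrative only).
import torch
import torchvision.models as models

model = models.densenet161(weights=None).eval()            # densenet161 mirrors the example model
compiled_model = torch.compile(model, backend="inductor")  # default inductor backend

with torch.inference_mode():
    x = torch.randn(1, 3, 224, 224)    # dummy image batch
    out = compiled_model(x)            # first call triggers compilation
print(out.shape)
```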

GenAI torch.compile series

The fast series GenAI models - GPTFast, SegmentAnythingFast, DiffusionFast - with 3-10x speedups using torch.compile and native PyTorch optimizations:

  • Example GPT Fast [#2815] [#2834] [#2935] @mreso and deployment with KServe [#2966] [#2895] @agunapal
  • Example Segment Anything Fast [#2802] @agunapal
  • Example Diffusion Fast [#2902] @agunapal

Cold start problem solution

To address the cold start problem, an example is included that shows how torch._export.aot_load (an experimental API) can be used to load a pre-compiled model. TorchServe has also started benchmarking models with torch.compile and tracking their performance compared to TorchScript.
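
A minimal sketch of that flow, assuming the experimental torch._export.aot_compile and torch._export.aot_load APIs roughly as they existed around PyTorch 2.2 (signatures may differ; the linked example is the authoritative version):

```python
# Sketch: AOT-compile once offline, then load the shared library at serve time
# so the first request does not pay torch.compile warm-up cost (cold start).
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Offline step: export + AOT-compile to a shared object (experimental API).
so_path = torch._export.aot_compile(model, example_inputs)

# Serving step: load the pre-compiled artifact instead of recompiling.
loaded = torch._export.aot_load(so_path, device="cpu")
with torch.inference_mode():
    out = loaded(torch.randn(1, 3, 224, 224))
print(out.shape)
```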

The new TorchServe C++ backend also includes torch.compile and AOTInductor related examples for ResNet50, BERT and Llama2.

  1. torch.compile
     a. Example torch.compile with image classifier model densenet161 [#2915] @agunapal
     b. Example torch._export.aot_compile with image classification model ResNet-18 [#2832] [#2906] [#2932] [#2948] @agunapal
     c. Example torch inductor fx graph caching with image classification model densenet161 [#2925] @agunapal

  2. C++ AOTInductor
     a. Example AOT Inductor with Llama2 [#2913] @mreso
     b. Example AOT Inductor with ResNet-50 [#2944] @lxning
     c. Example AOT Inductor with BERTSequenceClassification [#2931] @lxning

Gen AI

  • Supported sequence batching for stateful inference in gRPC bi-directional streaming [#2513] @lxning
  • The fast series Gen AI models using torch.compile and native PyTorch optimizations.
  • Example Mistral 7B with vLLM [#2781] @agunapal
  • Example PyTorch native tensor parallel with Llama2 with continuous batching [#2709] @mreso @HamidShojanazeri
  • Supported inf2 Neuronx transformer continuous batching for both no-code and advanced customers, with a Llama2-70B example [#2803] [#3016] @lxning
  • Example deepspeed mii fastgen with Llama2-13B [#2779] @lxning

Security

TorchServe has implemented token authentication for the management and inference APIs. This is an optional configuration and can be enabled using the torchserve-endpoint-plugin, which can be downloaded from Maven. This further strengthens TorchServe's capability as a secure model serving solution. The security features of TorchServe are documented in the security README.
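
A hypothetical client-side sketch of calling an inference endpoint once token authentication is enabled; the bearer-token header scheme and the key value are assumptions for illustration, and the plugin documentation defines the actual token format:

```python
# Hypothetical call against a TorchServe instance with token auth enabled.
# Header scheme and key below are placeholders, not the plugin's documented contract.
import requests

INFERENCE_KEY = "<inference-key-issued-by-the-plugin>"  # placeholder, not a real key

resp = requests.post(
    "http://localhost:8080/predictions/my_model",           # standard inference API endpoint
    data=b"example payload",                                 # placeholder request body
    headers={"Authorization": f"Bearer {INFERENCE_KEY}"},    # assumed auth header scheme
)
print(resp.status_code, resp.text)
```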

  • Supported token authentication in management and inference APIs, [#2888] [#2970] [#3002] @udaij12

Apple Silicon Support

TorchServe is now supported on Apple Silicon Macs. The current support is CPU only. We have also posted an RFC for the deprecation of x86 Mac support.

  • Include arm64 mac in CI workflows [#2934] @udaij12
  • Conda binaries build support [#3013] @udaij12
  • Adding support for regression tests for binaries [#3019] @udaij12

KServe Updates

While serving large models, model loading can take some time even though the pod is running: TorchServe may be up, but the worker is not ready until the model is loaded. To address this, TorchServe now sets the model ready status in KServe only after the model has been loaded on the workers. TorchServe also includes native open inference protocol support in gRPC; this is an experimental feature.
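
For context, the KServe open inference (v2) protocol exposes a model readiness route, so a client can wait for the ready status TorchServe now reports before sending traffic. The sketch below polls the REST form of that route; the host, port, and model name are placeholders, and this illustrates the protocol generally rather than a TorchServe-specific API:

```python
# Poll the open inference protocol readiness endpoint until the model reports ready.
# Host, port, and model name are placeholders for illustration.
import time
import requests

URL = "http://localhost:8080/v2/models/my_model/ready"   # KServe v2 readiness route

for _ in range(30):                                       # wait up to ~30 seconds
    try:
        if requests.get(URL, timeout=2).status_code == 200:
            print("model is ready")
            break
    except requests.ConnectionError:
        pass                                              # server may still be starting
    time.sleep(1)
else:
    raise RuntimeError("model did not become ready in time")
```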

  • Supported native KServe open inference protocol in gRPC [#2609] @andyi2it
  • Refactored TorchServe configuration in KServe [#2995] @sgaist
  • Improved KServe protocol version handling [#2957] @sgaist
  • Updated KServe test script to return model version [#2973] @agunapal
  • Set model status using TorchServe API in KServe [#1878] @byeongjokim
  • Supported no-archive model archiver in KServe [#2839] @agunapal
  • How to deploy MNIST using KServe with minikube [#2718] @agunapal
  • Changes to support no-model archive mode with KServe [#2839] @agunapal

Metrics Updates

To extend backward compatibility for metrics, auto-detection of backend metrics now makes it possible to publish custom model metrics without explicitly declaring them in the metrics configuration file. A customized script for collecting system metrics is also now supported. A handler-side sketch follows the list below.

  • Supported backend metrics auto-detection [#2769] @namannandan
  • Fixed backend metrics backward compatibility [#2816] @namannandan
  • Supported customized system metrics script via config.properties [#3000] @lxning
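
As a minimal sketch of what backend metrics auto-detection enables, the custom handler below emits a counter that is not pre-declared in the metrics configuration file. The handler class, metric name, and the add_counter call pattern are written from memory of the ts metrics API and should be treated as illustrative rather than authoritative:

```python
# Sketch: emit a custom model metric from a handler without pre-declaring it
# in the metrics config file (relies on backend metrics auto-detection).
# Class and metric names are illustrative.
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):
    def postprocess(self, data):
        # context.metrics is the per-request metrics object passed to handlers;
        # add_counter records a counter-type metric under the given name.
        self.context.metrics.add_counter("PostprocessedItems", len(data))
        return data
```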

Improvements and Bug Fixing

  • Supported PyTorch 2.2.1 [#2959] [#2972] and updated the release version [#3010] @agunapal
  • Enabled option of installing a model's 3rd party dependencies in a Python virtual environment via the model config yaml file [#2910] [#2946] [#2954] @namannandan
  • Fixed worker auto recovery [#2746] @mreso
  • Fixed worker thread write and flush incomplete [#2833] @lxning
  • Fixed the priority of parameters defined in register curl vs model-config.yaml [#2858] @lxning
  • Refactored sanity check with pytest [#2221] @mreso
  • Fixed model state if runtime is null from model archiver [#2928] @mreso
  • Refactored benchmark script for LLM benchmark integration [#2897] @mreso
  • Added pytest for tensor parallel [#2741] @mreso
  • Fixed continuous batching unit test [#2847] @mreso
  • Added separate pytest for send_intermediate_prediction_response [#2896] @mreso
  • Fixed GPU ID in GPT Fast handler [#2872] @sachanub
  • Added model archiver API [#2751] @GeeCastro
  • Updated torch.compile in BaseHandler to accept kwargs via model config yaml file [#2796] @eballesteros
  • Integrated pytorch-probot into TorchServe [#2725] @atalman
  • Added queue time in benchmark report [#2854] @sachanub
  • Replaced no_grad with inference_mode in BaseHandler [#2804] @bryant1410
  • Fixed env var CUDA_VERSION conflict in Dockerfile [#2807] @rsbowman-striveworks
  • Fixed var USE_CUDA_VERSION in Dockerfile [#2982] @fyang93
  • Fixed BASE_IMAGE for k8s docker image [#2808] @rsbowman-striveworks
  • Fixed workflow store path in config.properties overwritten by the default workflow path [#2792] @udaij12
  • Removed invalid warning log [#2867] @lxning
  • Updated PyTorch nightly url and CPU version in install_dependency.py [#2971] [#3011] @agunapal
  • Deprecated Dockerfile.dev; dev and prod docker images are now built from a single source Dockerfile [#2782] @sachanub
  • Updated transformers version to >= 4.34.0 [#2703] @agunapal
  • Fixed Neuronx requirements [#2887] [#2900] @namannandan
  • Added neuron SDK installation in install_dependencies.py [#2893] @mreso
  • Updated ResNet-152 example output [#2745] @sachanub
  • Clarified that "Not Accepted" is a valid classification in Huggingface_Transformers Sequence Classification example [#2786] @nathanweeks
  • Added dead link checking in md files [#2984] @mreso
  • Added comments in model_service_worker.py [#2809] @InakiRaba91
  • Enabled new GitHub workflows or updated existing ones [#2726] [#2732] [#2737] [#2734] [#2750] [#2767] [#2778] [#2792] [#2835] [#2846] [#2848] [#2855] [#2856] [#2859] [#2864] [#2863] [#2891] [#2938] [#2939] [#2961] [#2960] [#2964] [#3009] @agunapal @udaij12 @namannandan @sachanub

Documentation

  • Updated security readme [#2773] [#3020] @agunapal @udaij12
  • Added security readme to TorchServe site [#2784] @sekyondaMeta
  • Refactored the README.md [#2729] @chauhang
  • Updated git clone instruction in gRPC api documentation [#2799] @bryant1410
  • Highlighted code in README [#2805] @bryant1410
  • Fixed typos in the README.md [#2806] [#2871] @bryant1410 @rafijacSense
  • Fixed dead links in documentation [#2936] @agunapal

Platform Support

Ubuntu 20.04; macOS 10.14+; Windows 10 Pro; Windows Server 2019; Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support Matrix

| TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
|---|---|---|---|---|
| 0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
| 0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
| 0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |

Inferentia2 Support Matrix

| TorchServe version | PyTorch version | Python | Neuron SDK |
|---|---|---|---|
| 0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
| 0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |