Name  Modified  Size
lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl 2026-06-13 13.2 MB
lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl 2026-06-13 13.3 MB
lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl 2026-06-13 13.3 MB
lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl 2026-06-13 13.3 MB
lmcache-0.4.7.tar.gz 2026-06-13 6.2 MB
README.md 2026-06-13 9.7 kB
v0.4.7 source code.tar.gz 2026-06-13 6.2 MB
v0.4.7 source code.zip 2026-06-13 7.3 MB
Totals: 8 items, 72.7 MB

LMCache v0.4.7 Release

Interface / Config / CLI / Build Changes

Breaking / behavior changes (action may be needed)

  • python_ops_fallback now requires completion recorder ops (added missing ops)
  • LMCacheGroupView renamed to EngineGroupInfo
  • report_status is now per-kernel-group
  • Per-group tokens_per_chunk / slots_per_chunk now used instead of inferring from cache_config.block_size
  • goblin is deprecated (documented)
  • Blend v2 CI removed; CacheBlend now uses Blend v3

New / additive (opt-in)

  • New mp_transfer_mode config option
  • New SHM-based data transfer path for GPUs/CPU/Accelerators (POSIX SHM infra for CPU KV-cache IPC)
  • New hybrid memory allocator (HMA) support, with per-group block sizes and Mamba/GDN hybrid model (Qwen3.5) support
  • New MP coordinator backbone: server registration, coordinator CLI, L2 quota/usage/eviction, global CacheBlend fingerprint directory
  • New CLI quota management commands (set/get/list/delete)
  • New runtime DAX hotplug HTTP API (MP)
  • New --mode cpu and --transfer-mode options in server_bench
  • New backends: NIXL DOCA_MEMOS (NVIDIA CMX), Cloud Bigtable remote storage, Moore Threads MUSA support, multipath KV-cache offloading in NIXL backend
  • New multi_layer_block_kv_transfer unified MP transfer primitive
  • LMCache startup banner now printed in CLI and vLLM connectors
  • vLLM CPU 2-fused KV layout support
  • Token-level matching for non-block-aligned KV reuse (CacheBlend)

MP (Multi-Process Mode)

  • [#3245] Retain CUDA IPC events in MP adapter
  • [#3359] SHM-based data transfer path for GPUs/CPU/Accelerators
  • [#3382] Fix GPU block exhaustion deadlock at high concurrency with chunked KV loading
  • [#3488] Add MP coordinator backbone
  • [#3513] Add mp_transfer_mode config option
  • [#3516] Register MP servers with the coordinator
  • [#3522] Add coordinator CLI and MP server registration
  • [#3531] Introduce create_cache_context factory
  • [#3557] Refactor LMCache layer group for better compat with hybrid models
  • [#3608] Introduce object_group_id into the ObjectKey
  • [#3352] Add SHM-based NonGpuContext (server-side copy)
  • [#3612] Implement interface for multi-object group and sliding window support (HMA)
  • [#3630] Coordinator L2 Quota, Usage, Eviction
  • [#3597] Global CacheBlend fingerprint directory on the MP coordinator
  • [#3264] Add runtime DAX hotplug HTTP API
  • [#3477] Add l2_evicted_object, add cachesalt to L1/L2 metrics
  • [#3478] Consolidate ParallelStrategy construction in vllm_multi_process_adapter
  • [#3558] Align MP server id with OTel service.instance.id
  • [#3508] Add multi_layer_block_kv_transfer Python fallback as unified MP transfer primitive
  • [#3563] Add POSIX SHM infra for CPU KV-cache IPC
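
As background for the SHM-based transfer entries above, the sketch below shows the general idea of passing bytes through a named shared-memory segment using Python's standard multiprocessing.shared_memory module (backed by POSIX shared memory on Linux). It is not LMCache's implementation. The payload is a made-up stand-in for a KV chunk, and both handles live in one process here only to keep the example short.

```python
from multiprocessing import shared_memory

payload = bytes(range(16))  # stand-in for a serialized KV chunk

# Producer side: create a named segment and copy the payload in.
# The OS may round the segment size up, so always slice by len(payload).
producer = shared_memory.SharedMemory(create=True, size=len(payload))
producer.buf[:len(payload)] = payload

# Consumer side: attach by name. Only the segment name has to be
# communicated; the payload bytes never travel over a socket or pipe.
consumer = shared_memory.SharedMemory(name=producer.name)
received = bytes(consumer.buf[:len(payload)])

# Every handle is closed, and the creator removes the segment.
consumer.close()
producer.close()
producer.unlink()
```

In a real multi-process setup the consumer would run in a different process and receive the segment name over a control channel.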

Core / HMA

  • [#3419] Add support for hybrid memory allocator
  • [#3491] Bitmap-based prefetch result + pluggable TrimPolicy
  • [#3503] Native bulk-set: build found bitmap via batched_set + gather
  • [#3492] Sparse prefetch via TrimPolicy.SPARSE + covered_keys
  • [#3521] Support different block size for different groups
  • [#3613] Support Mamba/GDN hybrid models (Qwen3.5)
  • [#3616] Per-group tokens_per_chunk and slots_per_chunk
  • [#3635] Optimize DSV4 store/load size
  • [#3589] Add GDS L1 slab-file tier (cuFile DMA) for MP mode
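
To illustrate what a bitmap-based prefetch result with a trim policy could look like, here is a toy sketch. Only the names TrimPolicy, SPARSE, and covered_keys come from the entries above. The PREFIX member, the function signatures, and the use of a Python int as the bitmap are assumptions made for illustration and may not match the LMCache code.

```python
from enum import Enum


class TrimPolicy(Enum):
    PREFIX = "prefix"  # hypothetical member: keep hits up to the first miss
    SPARSE = "sparse"  # keep every hit, including hits after a miss


def found_bitmap(keys, store):
    """Return an int whose bit i is set when keys[i] is in the store."""
    bitmap = 0
    for i, key in enumerate(keys):
        if key in store:
            bitmap |= 1 << i
    return bitmap


def covered_keys(keys, bitmap, policy):
    """Return the keys that a prefetch result covers under the policy."""
    covered = []
    for i, key in enumerate(keys):
        if (bitmap >> i) & 1:
            covered.append(key)
        elif policy is TrimPolicy.PREFIX:
            break  # a prefix policy stops at the first missing key
    return covered


keys = ["c0", "c1", "c2", "c3"]
store = {"c0", "c1", "c3"}  # "c2" is missing
bitmap = found_bitmap(keys, store)

prefix_hits = covered_keys(keys, bitmap, TrimPolicy.PREFIX)
sparse_hits = covered_keys(keys, bitmap, TrimPolicy.SPARSE)
```

With this input the prefix policy covers c0 and c1, while the sparse policy also keeps c3.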

CacheBlend

  • [#3364] Blend v3
  • [#3582] Token-level matching + per-token slot scatter for non-block-aligned KV reuse
  • [#3629] Reuse gpu_transfer.cache_contexts; drop CB GPU-context mirror
  • [#3541] Clean up/remove Blend v2 CI
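
The benefit of token-level matching over block-aligned matching can be seen in a small arithmetic sketch. This is a prefix-only toy, not the CacheBlend algorithm: the function names, the block size of 16, and the token values are all assumptions chosen for illustration.

```python
def common_prefix_len(cached, incoming):
    """Count leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def reusable_tokens(cached, incoming, block_size, token_level):
    """Tokens whose KV could be reused under each matching granularity."""
    matched = common_prefix_len(cached, incoming)
    if token_level:
        return matched  # every matched token is reusable
    # Block-aligned matching rounds down to a whole number of blocks.
    return (matched // block_size) * block_size


cached = list(range(100, 122))      # 22 cached tokens
incoming = cached[:21] + [7, 8, 9]  # diverges after 21 tokens

block_aligned = reusable_tokens(cached, incoming, 16, token_level=False)
token_level = reusable_tokens(cached, incoming, 16, token_level=True)
```

Here block-aligned matching reuses 16 tokens and token-level matching reuses 21, so the 5 tokens in the partial block are no longer recomputed.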

Storage / Backends

  • [#3486] NIXL DOCA_MEMOS storage backend (NVIDIA CMX)
  • [#3453] nixl_storage: use LocalCPUBackend if nixl_buffer_device=cpu
  • [#3263] Add HFbucket MP
  • [#2418] Add multipath KV-cache offloading support in LMCache NIXL backend
  • [#3404] Integrate native Cloud Bigtable remote storage connector
  • [#3483] Add Moore Threads MUSA support for LMCache v1
  • [#3568] nixl: create storage directory if it doesn't exist
  • [#3274] Add missing io_uring changes + NVMe io_uring_cmd passthrough

Observability

  • [#3384] Add NVTX annotations to LocalDiskBackend disk read path
  • [#3607] Blend server trace sub-spans + V3 hit-rate breakdown

Operator

  • [#3543] CacheBlend: CacheBlendEngine CRD + injection webhook
  • [#3647] Emit --engine-type blend for CacheBlend engine
  • [#3646] Install cert-manager in e2e smoke suite

CLI

  • [#3611] Print LMCache startup banner in CLI and vLLM connectors
  • [#3625] Refactor query and trace CLI
  • [#3623] Add quota management commands (set/get/list/delete)

XPU / Accelerators

  • [#3360] Add SYCL CacheGen + RoPE kernels and in-process blender XPU tests

Bugfixes

  • [#3327] gds: use parse_cache_key to handle LayerCacheEngineKey on restart
  • [#3441] Drop EngineArgs+asdict to fix vLLM 0.20+ pydantic error
  • [#3189] Fix LocalCPUBackend recovery when pinned CPU chunks block eviction
  • [#3469] Add missing completion recorder ops to python_ops_fallback
  • [#3463] Prevent stale prefetches and registry memory leaks by purging unregistered KV layouts
  • [#3410] Prevent negative pin count on unpinned remote memory objects
  • [#3278] PD: restore pin=True in PD sync backend dedup path
  • [#3525] Resolve AttributeError in test_execute_calls_run_http_server
  • [#3602] Handle NL_X_NB_NH_BS_TWO_HS in get_group_data_ptrs
  • [#3606] Add missing enum to GPUVKFormat
  • [#3325] Graceful skip on slot_mapping/token_ids desync in wait_for_save (fixes [#3318])
  • [#3648] Correct retrieve log label prefix -> non_shifted

Performance / Optimization

  • [#3413] Avoid redundant PCIe transfer on leader rank during retrieve
  • [#3591] Optimize Python fallback path for block transfer operations

Refactor / Cleanup

  • [#3460] Move serializer registry + encoder/decoder helpers to end of custom_types.py
  • [#3445] Simplify redundant conditions in RawBlockCore
  • [#3216] Put lmcache_frontend into lmcache repo
  • [#3514] Add set_shape_desc_dtype helper to avoid scattered try/except
  • [#3545] Normalize block_ids to tolerate legacy vLLM connectors
  • [#3577] Normalize flat/nested block_ids in flat_block_ids and connector str
  • [#3567] Support vLLM CPU 2-fused KV layout
  • [#3598] Rename LMCacheGroupView to EngineGroupInfo
  • [#3599] Change report_status to be per-kernel-group in LMCache
  • [#3581] Remove unnecessary global statement in cuda_extension
  • [#3600] Utilize multi_layer_block_kv_transfer ops for data transfer path
  • [#3524] Add transfer timing logs to non-GPU path similar to CUDA path

Benchmarking

  • [#3283] Support benchmark fs and hf3fs backend via storage_backend_io_benchmark
  • [#3528] server_bench supports --mode cpu and --transfer-mode
  • [#3603] Support aligned L1 buffers for L2 adapters

CI/CD & Build

  • [#3456] Add http_api e2e test for MP HTTP server endpoints and CLI commands
  • [#3498] Relax timeout to reduce flakiness of some CI/CD tests
  • [#3502] Force vLLM Model Runner V1 in the PD comprehensive test
  • [#3489] Add pickle/shm vLLM + LMCache e2e validation on CPU
  • [#3538] Hot fix for the CPU test in multiprocess mode CI
  • [#3507] Add parity test between c_ops and python_ops_fallback
  • [#3321] Add unit tests for v1/utils/bloom_filter
  • [#3556] Improve CI stability: gemma-4 test & serde test
  • [#3614] Reduce CI CPU e2e test memory request
  • [#3621] cu129 images: pin vllm to the cu129 index (drop unsafe-best-match)
  • [#3590] Add CPU e2e test (vLLM and bench server)

Docs

  • [#3457] Update and restructure CLI reference
  • [#3481] Fix Docker examples and build metadata
  • [#3501] Combined doc drift updates May 27-Jun 2
  • [#3504] KV Cache Size Calculator: add hybrid SWA, DSA, placeholders for Mamba / Linear
  • [#3506] Add recipe for Gemma 3
  • [#3518] Deprecate goblin in doc
  • [#3461] Update README.md
  • [#3433] Auto-select model in CPU-offloading example to fit GPU
  • [#3534] Add filesystem connector backend guide
  • [#3645] Recipe update for Qwen 3.6 27B and general guidelines for Mamba models
  • [#2834] kv_cache_calculator: add Hunyuan & DeepSeek models, fix head_dim/CLA, add i18n UI

Chinese Translation

  • [#3386] Update Chinese documentation translations
  • [#3482] Update Chinese documentation translations
  • [#3588] Update Chinese documentation translations
  • [#3592] Correct machine translation errors in documentation

Chore / Maintenance

  • [#3443] Convert loglevel_api f-strings to %-format
  • [#3447] Convert internal_api_server f-string log calls to %-format
  • [#3136] Bump go.opentelemetry.io/otel from 1.36.0 to 1.41.0 in /operator
  • [#3596] Bump sphinxcontrib-mermaid from 1.2.2 to 2.0.2
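
The f-string to %-format conversions listed above follow a general Python logging practice: with %-style arguments, the message is only rendered if the record is actually emitted. The sketch below demonstrates the difference with the standard logging module. The Expensive class and the logger name are made up for the demonstration and are not LMCache code.

```python
import logging


class Expensive:
    """Counts how often it is rendered to a string."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "expensive"


logger = logging.getLogger("lazy_format_demo")
logger.setLevel(logging.INFO)  # DEBUG records are dropped

eager, lazy = Expensive(), Expensive()

# The f-string is rendered before logger.debug is even called.
logger.debug(f"state={eager}")

# %-style arguments are only rendered if the record is emitted.
logger.debug("state=%s", lazy)
```

After both calls, eager has been rendered once and lazy not at all, even though neither message was logged.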

New Contributors

  • @ChiragB254 made their first contribution in [#3443]
  • @Alorun made their first contribution in [#3447]
  • @JinuJeong made their first contribution in [#3327]
  • @catyion made their first contribution in [#3460]
  • @3xdevv made their first contribution in [#3481]
  • @nayeonikim made their first contribution in [#3445]
  • @sihara made their first contribution in [#3384]
  • @XuanCS made their first contribution in [#3278]
  • @Lyj1007 made their first contribution in [#3507]
  • @kirklandsign made their first contribution in [#3321]
  • @superleo made their first contribution in [#3483]
  • @feixiangpeng made their first contribution in [#3263]
  • @sonimwang made their first contribution in [#3592]
  • @KimmoZAG made their first contribution in [#2834]
  • @Chris-Sigopt made their first contribution in [#3606]
  • @ekaynar made their first contribution in [#2418]
  • @dhruvatr made their first contribution in [#3581]
  • @Kushagra963-lab made their first contribution in [#3534]
Source: README.md, updated 2026-06-13