| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| lmcache-0.4.7-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl | 2026-06-13 | 13.2 MB | |
| lmcache-0.4.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl | 2026-06-13 | 13.3 MB | |
| lmcache-0.4.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl | 2026-06-13 | 13.3 MB | |
| lmcache-0.4.7-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl | 2026-06-13 | 13.3 MB | |
| lmcache-0.4.7.tar.gz | 2026-06-13 | 6.2 MB | |
| README.md | 2026-06-13 | 9.7 kB | |
| v0.4.7 source code.tar.gz | 2026-06-13 | 6.2 MB | |
| v0.4.7 source code.zip | 2026-06-13 | 7.3 MB | |
| Totals: 8 Items | 72.7 MB | 0 | |
LMCache v0.4.7 Release
Interface / Config / CLI / Build Changes
Breaking / behavior changes (action may be needed)
- python_ops_fallback now requires completion recorder ops (added missing ops)
- LMCacheGroupView renamed to EngineGroupInfo
- report_status is now per-kernel-group
- Per-group tokens_per_chunk/slots_per_chunk now used instead of inferring from cache_config.block_size
- goblin is deprecated (documented)
- Blend v2 CI removed; CacheBlend now uses Blend v3
New / additive (opt-in)
- New mp_transfer_mode config option
- New SHM-based data transfer path for GPUs/CPU/Accelerators (POSIX SHM infra for CPU KV-cache IPC)
- New hybrid memory allocator (HMA) support, with per-group block sizes and Mamba/GDN hybrid model (Qwen3.5) support
- New MP coordinator backbone: server registration, coordinator CLI, L2 quota/usage/eviction, global CacheBlend fingerprint directory
- New CLI quota management commands (set/get/list/delete)
- New runtime DAX hotplug HTTP API (MP)
- New --mode cpu and --transfer-mode options in server_bench
- New backends: NIXL DOCA_MEMOS (NVIDIA CMX), Cloud Bigtable remote storage, Moore Threads MUSA support, multipath KV-cache offloading in NIXL backend
- New multi_layer_block_kv_transfer unified MP transfer primitive
- LMCache startup banner now printed in CLI and vLLM connectors
- vLLM CPU 2-fused KV layout support
- Token-level matching for non-block-aligned KV reuse (CacheBlend)
MP (Multi-Process Mode)
- [#3245] Retain CUDA IPC events in MP adapter
- [#3359] SHM-based data transfer path for GPUs/CPU/Accelerators
- [#3382] Fix GPU block exhaustion deadlock at high concurrency with chunked KV loading
- [#3488] Add mp coordinator backbone
- [#3513] Add mp_transfer_mode config option
- [#3516] Register MP servers with the coordinator
- [#3522] Add coordinator CLI and mp server registration
- [#3531] Introduce create_cache_context factory
- [#3557] Refactor LMCache layer group for better compat with hybrid models
- [#3608] Introduce object_group_id into the ObjectKey
- [#3352] Add SHM-based NonGpuContext (server-side copy)
- [#3612] Implement interface for multi-object group and sliding window support (HMA)
- [#3630] Coordinator L2 Quota, Usage, Eviction
- [#3597] Global CacheBlend fingerprint directory on the MP coordinator
- [#3264] Add runtime DAX hotplug http API
- [#3477] Add l2_evicted_object, add cachesalt to L1/L2 metrics
- [#3478] Consolidate ParallelStrategy construction in vllm_multi_process_adapter
- [#3558] Align MP server id with OTel service.instance.id
- [#3508] Add multi_layer_block_kv_transfer Python fallback as unified MP transfer primitive
- [#3563] Add POSIX SHM infra for CPU KV-cache IPC
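Several entries above ([#3359], [#3352], [#3563]) concern passing KV-cache data between processes through POSIX shared memory. The general technique can be sketched with Python's standard `multiprocessing.shared_memory` module. This is an illustration only and not LMCache's implementation: the helper names `write_chunk` and `read_chunk`, the segment name, and the 8-byte length prefix are all assumptions made for the example.

```python
import os
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<Q")  # hypothetical layout: 8-byte little-endian payload length


def write_chunk(name: str, payload: bytes) -> shared_memory.SharedMemory:
    """Create a named segment and store a length-prefixed payload in it."""
    shm = shared_memory.SharedMemory(
        name=name, create=True, size=HEADER.size + len(payload)
    )
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    return shm  # the creator keeps the handle so it can unlink the segment later


def read_chunk(name: str) -> bytes:
    """Attach to an existing segment by name and copy the payload out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        (length,) = HEADER.unpack_from(shm.buf, 0)
        return bytes(shm.buf[HEADER.size:HEADER.size + length])
    finally:
        shm.close()


if __name__ == "__main__":
    segment_name = f"kv_sketch_{os.getpid()}"  # hypothetical segment name
    owner = write_chunk(segment_name, b"fake-kv-bytes")
    try:
        print(read_chunk(segment_name))
    finally:
        owner.close()
        owner.unlink()
```

The length prefix is there because the operating system may round a segment up to a page boundary, so a reader cannot rely on the segment size to know where the payload ends.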
Core / HMA
- [#3419] Add support for hybrid memory allocator
- [#3491] Bitmap-based prefetch result + pluggable TrimPolicy
- [#3503] Native bulk-set: build found bitmap via batched_set + gather
- [#3492] Sparse prefetch via TrimPolicy.SPARSE + covered_keys
- [#3521] Support different block size for different groups
- [#3613] Support Mamba/GDN hybrid models (Qwen3.5)
- [#3616] Per-group tokens_per_chunk and slots_per_chunk
- [#3635] Optimize DSV4 store/load size
- [#3589] Add GDS L1 slab-file tier (cuFile DMA) for MP mode
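The prefetch entries above ([#3491], [#3503], [#3492]) mention a bitmap-based result and a trim policy. One way to picture the idea is a bitmap with one bit per requested chunk key, which can then be interpreted either as a contiguous prefix of hits or as a sparse set of hits. The sketch below is a guess at the concept and not LMCache code: the function names and the prefix-versus-sparse interpretation are assumptions.

```python
from typing import List, Sequence, Set


def build_found_bitmap(keys: Sequence[str], stored: Set[str]) -> int:
    """Set bit i when keys[i] is present in the (hypothetical) store."""
    bitmap = 0
    for i, key in enumerate(keys):
        if key in stored:
            bitmap |= 1 << i
    return bitmap


def prefix_hit_count(bitmap: int, num_keys: int) -> int:
    """Count leading keys that were all found, stopping at the first miss."""
    count = 0
    while count < num_keys and (bitmap >> count) & 1:
        count += 1
    return count


def sparse_hits(bitmap: int, keys: Sequence[str]) -> List[str]:
    """Return every found key, including hits that follow a miss."""
    return [key for i, key in enumerate(keys) if (bitmap >> i) & 1]


if __name__ == "__main__":
    keys = ["c0", "c1", "c2", "c3"]
    bitmap = build_found_bitmap(keys, {"c0", "c1", "c3"})
    print(bin(bitmap), prefix_hit_count(bitmap, len(keys)), sparse_hits(bitmap, keys))
```

A prefix interpretation discards the hit on `c3` because `c2` is missing, while a sparse interpretation keeps it.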
CacheBlend
- [#3364] Blend v3
- [#3582] Token-level matching + per-token slot scatter for non-block-aligned KV reuse
- [#3629] Reuse gpu_transfer.cache_contexts; drop CB GPU-context mirror
- [#3541] Cleanup/remove blend v2 ci
Storage / Backends
- [#3486] NIXL DOCA_MEMOS storage backend (NVIDIA CMX)
- [#3453] nixl_storage: use LocalCPUBackend if nixl_buffer_device=cpu
- [#3263] Added HFbucket MP
- [#2418] Add multipath KV-cache offloading support in LMCache NIXL backend
- [#3404] Integrate native Cloud Bigtable remote storage connector
- [#3483] Add Moore Threads MUSA support for LMCache v1
- [#3568] nixl: create storage directory if it doesn't exist
- [#3274] Missing io_uring changes + nvme io_uring_cmd passthrough
Observability
- [#3384] Add NVTX annotations to LocalDiskBackend disk read path
- [#3607] Blend server trace sub-spans + V3 hit-rate breakdown
Operator
- [#3543] CacheBlend: CacheBlendEngine CRD + injection webhook
- [#3647] Emit --engine-type blend for CacheBlend engine
- [#3646] Install cert-manager in e2e smoke suite
CLI
- [#3611] Print LMCache startup banner in CLI and vLLM connectors
- [#3625] Refactor query and trace cli
- [#3623] Add quota management commands (set/get/list/delete)
XPU / Accelerators
- [#3360] Add SYCL CacheGen + RoPE kernels and in-process blender XPU tests
Bugfixes
- [#3327] gds: use parse_cache_key to handle LayerCacheEngineKey on restart
- [#3441] Drop EngineArgs+asdict to fix vLLM 0.20+ pydantic error
- [#3189] Fix LocalCPUBackend recovery when pinned CPU chunks block eviction
- [#3469] Add missing completion recorder ops to python_ops_fallback
- [#3463] Prevent stale prefetches and registry memory leaks by purging unregistered KV layouts
- [#3410] Prevent negative pin count on unpinned remote memory objects
- [#3278] PD restore pin=True in PD sync backend dedup path
- [#3525] Resolve AttributeError in test_execute_calls_run_http_server
- [#3602] Handle NL_X_NB_NH_BS_TWO_HS in get_group_data_ptrs
- [#3606] Add missing enum to GPUVKFormat
- [#3325] Graceful skip on slot_mapping/token_ids desync in wait_for_save (fixes [#3318])
- [#3648] Correct retrieve log label prefix -> non_shifted
Performance / Optimization
- [#3413] Avoid redundant PCIe transfer on leader rank during retrieve
- [#3591] Optimize Python fallback path for block transfer operations
Refactor / Cleanup
- [#3460] Move serializer registry + encoder/decoder helpers to end of custom_types.py
- [#3445] Simplify redundant conditions in RawBlockCore
- [#3216] Put lmcache_frontend into lmcache repo
- [#3514] Add set_shape_desc_dtype helper to avoid scattered try/except
- [#3545] Normalize block_ids to tolerate legacy vLLM connectors
- [#3577] Normalize flat/nested block_ids in flat_block_ids and connector str
- [#3567] Support vLLM CPU 2-fused KV layout
- [#3598] Rename LMCacheGroupView to EngineGroupInfo
- [#3599] Change report_status to be per-kernel-group in LMCache
- [#3581] Remove unnecessary global statement in cuda_extension
- [#3600] Utilize multi_layer_block_kv_transfer ops for data transfer path
- [#3524] Add transfer timing logs to non-GPU path similar to CUDA path
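Entries [#3545] and [#3577] above deal with block_ids arriving either as a flat list or as a nested list of per-group lists. A minimal sketch of that kind of normalization follows. It is not the code from those PRs: the function name `normalize_block_ids_sketch` is hypothetical, and it assumes nesting is at most one level deep.

```python
from typing import List, Sequence, Union

BlockIds = Union[Sequence[int], Sequence[Sequence[int]]]


def normalize_block_ids_sketch(block_ids: BlockIds) -> List[int]:
    """Return a flat list of ints from either a flat or a one-level nested input."""
    flat: List[int] = []
    for item in block_ids:
        if isinstance(item, int):
            flat.append(item)  # already a flat entry
        else:
            flat.extend(int(b) for b in item)  # one nested group of block ids
    return flat


if __name__ == "__main__":
    print(normalize_block_ids_sketch([1, 2, 3]))
    print(normalize_block_ids_sketch([[1, 2], [3]]))
```

Accepting both shapes at the boundary lets the rest of a data path work with a single representation.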
Benchmarking
- [#3283] Support benchmark fs and hf3fs backend via storage_backend_io_benchmark
- [#3528] server_bench supports --mode cpu and --transfer-mode
- [#3603] Support aligned L1 buffers for L2 adapters
CI/CD & Build
- [#3456] Add http_api e2e test for MP HTTP server endpoints and CLI commands
- [#3498] Relax timeout to reduce flakiness of some CI/CD tests
- [#3502] Force vLLM Model Runner V1 in the PD comprehensive test
- [#3489] Add pickle/shm vLLM + LMCache e2e validation on CPU
- [#3538] Hot fix for the CPU test in multiprocess mode CI
- [#3507] Add parity test between c_ops and python_ops_fallback
- [#3321] Add unit tests for v1/utils/bloom_filter
- [#3556] Improve CI stability: gemma-4 test & serde test
- [#3614] Reduce ci cpu e2e test memory request
- [#3621] cu129 images: pin vllm to the cu129 index (drop unsafe-best-match)
- [#3590] Add CPU e2e test (vLLM and bench server)
Docs
- [#3457] Update and restructure CLI reference
- [#3481] Fix Docker examples and build metadata
- [#3501] Combined doc drift updates May 27-Jun 2
- [#3504] KV Cache Size Calculator: add hybrid SWA, DSA, placeholders for Mamba / Linear
- [#3506] Add recipe for Gemma 3
- [#3518] Deprecate goblin in doc
- [#3461] Update README.md
- [#3433] Auto-select model in CPU-offloading example to fit GPU
- [#3534] Add filesystem connector backend guide
- [#3645] Recipe update for Qwen 3.6 27B and general guideline for mamba models
- [#2834] kv_cache_calculator: add Hunyuan & DeepSeek models, fix head_dim/CLA, add i18n UI
Chinese Translation
- [#3386] Update Chinese documentation translations
- [#3482] Update Chinese documentation translations
- [#3588] Update Chinese documentation translations
- [#3592] Correct machine translation errors in documentation
Chore / Maintenance
- [#3443] Convert loglevel_api f-strings to %-format
- [#3447] Convert internal_api_server f-string log calls to %-format
- [#3136] Bump go.opentelemetry.io/otel from 1.36.0 to 1.41.0 in /operator
- [#3596] Bump sphinxcontrib-mermaid from 1.2.2 to 2.0.2
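The f-string to %-format conversions in [#3443] and [#3447] follow a common Python logging practice: with %-style arguments, the message is only formatted if the record is actually emitted, whereas an f-string is formatted before the logger is even called. The sketch below demonstrates the difference with a hypothetical logger name and a helper class that counts how often it is converted to a string.

```python
import logging


class CountingStr:
    """Helper that records how many times it is rendered as a string."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "expensive-state"


logger = logging.getLogger("logging_format_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # DEBUG records are disabled
logger.addHandler(logging.NullHandler())


def log_eager(obj: CountingStr) -> None:
    # The f-string is evaluated before logger.debug() runs.
    logger.debug(f"state: {obj}")


def log_lazy(obj: CountingStr) -> None:
    # Formatting is deferred and skipped because DEBUG is disabled.
    logger.debug("state: %s", obj)


if __name__ == "__main__":
    eager, lazy = CountingStr(), CountingStr()
    log_eager(eager)
    log_lazy(lazy)
    print(eager.calls, lazy.calls)
```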
New Contributors
- @ChiragB254 made their first contribution in [#3443]
- @Alorun made their first contribution in [#3447]
- @JinuJeong made their first contribution in [#3327]
- @catyion made their first contribution in [#3460]
- @3xdevv made their first contribution in [#3481]
- @nayeonikim made their first contribution in [#3445]
- @sihara made their first contribution in [#3384]
- @XuanCS made their first contribution in [#3278]
- @Lyj1007 made their first contribution in [#3507]
- @kirklandsign made their first contribution in [#3321]
- @superleo made their first contribution in [#3483]
- @feixiangpeng made their first contribution in [#3263]
- @sonimwang made their first contribution in [#3592]
- @KimmoZAG made their first contribution in [#2834]
- @Chris-Sigopt made their first contribution in [#3606]
- @ekaynar made their first contribution in [#2418]
- @dhruvatr made their first contribution in [#3581]
- @Kushagra963-lab made their first contribution in [#3534]