Audience
AI engineers and infrastructure teams looking for a tool to lower LLM inference latency, reduce compute cost, and scale serving throughput.
About LMCache
LMCache is an open-source Knowledge Delivery Network (KDN): a caching layer for large language model serving that accelerates inference by reusing KV (key-value) caches across repeated or overlapping text. It enables fast prompt caching, so an LLM “prefills” recurring text only once and then reuses the stored KV cache, even at non-prefix positions, across multiple serving instances. This reduces time to first token, saves GPU cycles, and increases throughput in scenarios such as multi-round question answering and retrieval-augmented generation.

LMCache supports KV cache offloading (moving caches from GPU to CPU or disk), cache sharing across instances, and disaggregated prefill, which separates the prefill and decoding phases for resource efficiency. It integrates with inference engines such as vLLM and TGI and supports compressed storage, blending techniques for merging caches, and multiple storage backends.
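The vLLM integration mentioned above is typically enabled by routing vLLM's KV-cache handling through LMCache. The snippet below is a minimal sketch of that pattern, not a definitive setup: the connector name (`LMCacheConnectorV1`), the `KVTransferConfig` fields, and the `LMCACHE_*` environment variables are assumptions based on common LMCache/vLLM configurations and may differ across versions.

```python
# Minimal sketch: serve with vLLM while offloading KV caches to CPU via LMCache.
# Connector name, config fields, and LMCACHE_* variables are assumptions and may
# vary by LMCache/vLLM release -- check the LMCache docs for your versions.
import os

# Assumed LMCache settings: cache KV in 256-token chunks and keep up to 5 GB of
# KV cache in CPU memory so evicted GPU blocks can be reused instead of recomputed.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV-cache transfer through the LMCache connector (assumed name).
kv_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",  # this instance both stores and retrieves KV caches
)

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any vLLM-supported model
    kv_transfer_config=kv_config,
    gpu_memory_utilization=0.8,
)

# Requests that share a long context (e.g. a system prompt or retrieved documents)
# are prefilled once; later requests reuse the cached KV entries, cutting TTFT.
shared_context = "<shared document or RAG context>"
prompts = [
    shared_context + "\n\nQuestion 1: ...",
    shared_context + "\n\nQuestion 2: ...",
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
```

In multi-instance deployments, the same cached entries can be served from a shared storage backend instead of local CPU memory, which is how LMCache provides cache sharing across serving instances.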