TokenSpeed is an LLM inference engine designed for high-performance production agent workloads. It aims to combine TensorRT-LLM-level speed with vLLM-level usability, which makes it relevant for teams that need fast generation without sacrificing developer ergonomics. The project targets the specific needs of agentic systems, where latency, throughput, and efficient scheduling matter across many short or tool-heavy requests. It builds on ideas and components from the broader open-source inference ecosystem while providing its own execution stack. TokenSpeed is aimed at developers building local or server-side LLM infrastructure for agents, coding systems, and high-volume AI applications, offering an inference layer optimized for fast token generation under practical agent workloads.

Features

  • High-performance LLM inference engine
  • Designed for production agentic workloads
  • TensorRT-LLM-style performance goal
  • vLLM-style usability goal
  • Python package-oriented project structure
  • MIT-licensed open-source implementation

License

MIT License

Additional Project Details

Programming Language

Python

Related Categories

Python Large Language Models (LLM)

Registered

18 hours ago