mistral.rs is a fast, flexible LLM inference engine written in Rust, designed to run and serve modern language models with an emphasis on performance and practical deployment. It offers multiple entry points for developers: a CLI for running models locally, and an HTTP server that exposes an OpenAI-compatible API so existing clients can integrate without modification.

The project includes hardware-aware tooling that benchmarks the host system and selects sensible quantization and device-mapping strategies, helping users get strong performance without manual tuning. A single server process can also serve multiple models, enabling routing or quick switching between models depending on workload. For interactive testing, mistral.rs provides a built-in web UI, as well as a dedicated lightweight web chat interface that supports richer interaction patterns.
## Features
- High-performance Rust-based inference engine for running modern LLMs
- OpenAI-compatible HTTP server for drop-in client integration
- CLI tooling for local execution, configuration, and troubleshooting
- Hardware-aware tuning that selects quantization and device mapping strategies
- Multi-model serving support within a single server instance
- Built-in web UI plus a lightweight web chat interface for interactive testing and demos
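Because the server speaks the OpenAI chat-completions wire format, any OpenAI-compatible client can target it by overriding the base URL. The sketch below shows the shape of such a request; the port, path, and model id are illustrative assumptions, not documented defaults of mistral.rs.

```python
import json

# Assumed local endpoint for an OpenAI-compatible server (placeholder port).
BASE_URL = "http://localhost:1234/v1"

# A standard OpenAI-style chat completions payload. The model id here is a
# placeholder; use whatever id the running server reports for a loaded model.
payload = {
    "model": "default",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize Rust's ownership model in one sentence."},
    ],
    "temperature": 0.7,
}

# To actually send it, plain HTTP suffices, e.g. with the requests library:
#   resp = requests.post(f"{BASE_URL}/chat/completions", json=payload)
#   print(resp.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```

Official OpenAI SDKs work the same way: point the client at `BASE_URL` and call the chat completions method as usual, which is what makes the server a drop-in replacement for existing integrations.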