ds4.c
DeepSeek 4 Flash local inference engine for Metal
...Unlike general-purpose inference runtimes, the project is intentionally optimized for a specific model family, enabling highly efficient execution and simplified architecture. The engine includes DS4-specific model loading, KV cache management, prompt rendering, and OpenAI-compatible server APIs for local deployment workflows. Built as a native low-level implementation, it focuses on performance, reduced abstraction overhead, and direct integration with Apple GPU acceleration through Metal compute graphs. The project also supports streaming inference behavior and local API serving for integration with external tools and AI applications. ...