Noureddine RAMDI / vllm-mlx: Efficient LLM serving on Apple Silicon with SSD-tiered KV cache and continuous batching

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

waybarrios/vllm-mlx

Apple Silicon’s unified memory is shared between the CPU and GPU, so running large language models (LLMs) with long contexts can quickly exhaust RAM. vllm-mlx addresses this with an SSD-tiered key-value (KV) cache that spills the prefix cache to disk, enabling workflows that would otherwise run out of memory. It also serves OpenAI- and Anthropic-compatible APIs from a single process and supports continuous batching, making it a practical option for production agent workloads on M1 to M4 Macs.

What vllm-mlx does and how it works

vllm-mlx is a Python-based inference server designed for Apple Silicon (M1 through M4) that uses the native MLX framework with Metal kernels. This native integration means the server leverages Apple’s GPU acceleration for transformer inference, rather than relying on generic CPU or external GPU solutions.

The server exposes both OpenAI-compatible endpoints (under /v1/*) and Anthropic-compatible endpoints (under /v1/messages) within the same process. This dual API compatibility simplifies deployment for applications built around either vendor’s API standards without needing separate model conversions or tooling.
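To make the difference between the two API shapes concrete, here is a minimal sketch that builds the JSON body each endpoint family expects from the same prompt. The helper names are hypothetical, the model name "default" is borrowed from the quick start, and nothing is sent over the network. The main structural difference is that the OpenAI chat format carries the system prompt as a message, while the Anthropic Messages format carries it as a top-level field and requires max_tokens.

```python
import json


def openai_chat_body(model, system, user_text, max_tokens=256):
    """Body for POST /v1/chat/completions: the system prompt is a message."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_text},
        ],
    }


def anthropic_messages_body(model, system, user_text, max_tokens=256):
    """Body for POST /v1/messages: the system prompt is a top-level field."""
    return {
        "model": model,
        "max_tokens": max_tokens,  # required by the Anthropic Messages format
        "system": system,
        "messages": [{"role": "user", "content": user_text}],
    }


# Print both bodies side by side; a real client would POST them to the server.
for path, body in [
    ("/v1/chat/completions", openai_chat_body("default", "Be brief.", "Hi!")),
    ("/v1/messages", anthropic_messages_body("default", "Be brief.", "Hi!")),
]:
    print(path, json.dumps(body))
```

Because both bodies target the same process, an application can switch between SDKs by changing only the base URL and the request shape.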

Under the hood, vllm-mlx introduces several key architectural innovations:

  • Continuous batching: Inference requests are dynamically batched to maximize GPU utilization and throughput without increasing latency significantly.
  • Paged KV cache with prefix sharing: The attention key-value cache is stored in fixed-size pages, so requests that start with the same prefix (for example, a shared system prompt) can reuse the same pages instead of recomputing them.
  • SSD-tiered KV cache spilling: When the prefix cache grows beyond RAM capacity, it spills to SSD storage. This is a standout feature for Apple Silicon, where RAM limits can be restrictive for long context lengths.
  • Built-in MCP tool calling: The server integrates tool calling through the Model Context Protocol (MCP) and includes 12 parser implementations, so agent pipelines can invoke external tools.
  • Multimodal input support: The server can handle text, image, video, and audio inputs, along with native text-to-speech (TTS) and speech-to-text (STT) features.
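The idea behind a paged KV cache with prefix sharing can be sketched in a few lines. This is a toy model, not vllm-mlx's implementation: the class name, the block size, and the use of the full token prefix as the page key are all assumptions made for illustration (real systems typically key pages by a chained hash and store tensors rather than counters).

```python
BLOCK_SIZE = 4  # tokens per page; illustrative value, not the server's setting


class PagedPrefixCache:
    """Toy page table: pages are keyed by the token prefix they complete."""

    def __init__(self):
        self.refcount = {}  # page key -> number of requests using the page

    def allocate(self, tokens):
        """Register the full pages of `tokens`; return (page_keys, reused_pages)."""
        keys, reused = [], 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore the partial tail
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # identical prefixes map to the same page
            if key in self.refcount:
                reused += 1
            self.refcount[key] = self.refcount.get(key, 0) + 1
            keys.append(key)
        return keys, reused


cache = PagedPrefixCache()
system_prompt = list(range(8))              # 8 shared tokens = 2 full pages
_, reused_a = cache.allocate(system_prompt + [100, 101, 102, 103])
_, reused_b = cache.allocate(system_prompt + [200, 201, 202, 203])
print(reused_a, reused_b, len(cache.refcount))
```

The second request reuses the two pages that cover the shared system prompt and only adds one new page for its own suffix.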

The models supported include recent ones like Qwen3 and DeepSeek-R1, with advanced features such as reasoning extraction and mixture of experts (MoE) expert reduction to optimize inference.
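As a rough illustration of what reasoning extraction involves, the sketch below separates a model's reasoning trace from its final answer. It assumes the model wraps its reasoning in <think>...</think> tags, as DeepSeek-R1-style models commonly do; the function name is hypothetical and this is not the parser vllm-mlx ships.

```python
import re

# Non-greedy match so several reasoning spans in one output are handled.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    reasoning_parts = [part.strip() for part in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return "\n".join(reasoning_parts), answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(repr(reasoning))
print(repr(answer))
```

An API server can then return the two parts in separate response fields so clients do not have to strip the tags themselves.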

Technical strengths and tradeoffs

The most compelling innovation here is the SSD-tiered KV cache spilling. On Apple Silicon, unified memory is limited and shared between CPU and GPU. This creates a bottleneck for long-context LLM inference where the transformer attention’s key-value cache can grow very large.
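A back-of-the-envelope calculation shows how quickly the KV cache grows. The formula below is the standard size estimate for a transformer KV cache (keys plus values, per layer, per KV head); the model dimensions are illustrative assumptions rather than measurements of any model listed in this article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
long_context = kv_cache_bytes(128 * 1024, layers=32, kv_heads=8, head_dim=128)

print(per_token // 1024, "KiB per token")
print(long_context / 2**30, "GiB at 128K tokens")
```

With these assumed dimensions, each token costs 128 KiB, so a single 128K-token context needs 16 GiB for the cache alone, before counting model weights or other requests.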

By spilling the KV cache to SSD, vllm-mlx can sustain much longer context windows without running out of memory. This approach trades off some I/O latency but benefits from the high-speed NVMe SSDs common in modern Macs. Combined with warm prompt preloading, this yields a 1.3x to 2.25x improvement in time-to-first-token (TTFT), a critical metric for interactive applications.
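The tiering idea can be sketched as a two-level LRU cache: hot entries stay in RAM, and the least recently used ones are written to disk and read back on demand. This is a simplified illustration under assumed names (TieredCache, ram_capacity, spill_dir); it uses pickle files where a real server would store tensor pages, and it says nothing about how vllm-mlx lays out its own cache.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TieredCache:
    """LRU cache that keeps hot entries in RAM and spills cold ones to disk."""

    def __init__(self, ram_capacity, spill_dir):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # key -> value, least recently used first
        self.spill_dir = Path(spill_dir)

    def _path(self, key):
        # Keys must be filename-safe strings in this toy version.
        return self.spill_dir / f"{key}.pkl"

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        # Evict the coldest entries to disk once RAM is over capacity.
        while len(self.ram) > self.ram_capacity:
            cold_key, cold_value = self.ram.popitem(last=False)
            self._path(cold_key).write_bytes(pickle.dumps(cold_value))

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        path = self._path(key)
        if path.exists():
            # Disk hit: load the entry and promote it back into RAM.
            value = pickle.loads(path.read_bytes())
            path.unlink()
            self.put(key, value)
            return value
        return None  # miss in both tiers: the prefix must be recomputed


with tempfile.TemporaryDirectory() as spill_dir:
    cache = TieredCache(ram_capacity=2, spill_dir=spill_dir)
    for name in ("prompt_a", "prompt_b", "prompt_c"):
        cache.put(name, [name, "kv-pages"])
    print(list(cache.ram))            # prompt_a has been spilled to disk
    print(cache.get("prompt_a"))      # served from disk and promoted
    print(list(cache.ram))
```

A disk hit costs a file read, but it avoids recomputing the prefix from scratch, which is the tradeoff the paragraph above describes.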

Continuous batching is another practical strength. Instead of processing each request individually, vllm-mlx batches incoming requests dynamically, improving GPU utilization and throughput while keeping latency under control. This is particularly important for production workloads where many concurrent requests arrive at irregular intervals.
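The scheduling difference is easiest to see in a small simulation. The sketch below is a toy model of continuous batching, not the server's scheduler: the function name and the one-token-per-step cost model are assumptions. Waiting requests are admitted into free batch slots before every decode step instead of waiting for the current batch to drain.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps.

    `requests` maps name -> (arrival_step, tokens_to_generate).
    Returns name -> step at which the request finished.
    """
    pending = deque(sorted(requests.items(), key=lambda item: item[1][0]))
    running = {}   # name -> tokens still to generate
    finished = {}
    step = 0
    while pending or running:
        # Admit requests that have arrived while batch slots are free.
        while pending and pending[0][1][0] <= step and len(running) < max_batch:
            name, (_, n_tokens) = pending.popleft()
            running[name] = n_tokens
        step += 1
        # One decode step yields one token for every running request.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished[name] = step
    return finished


jobs = {"long": (0, 8), "short": (1, 2)}
print(continuous_batching(jobs, max_batch=2))
print(continuous_batching(jobs, max_batch=1))
```

With two slots, the short request joins the batch one step after it arrives and finishes at step 3 while the long request is still decoding. With a single slot, it has to wait for the long request to complete and finishes at step 10.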

Supporting both OpenAI and Anthropic APIs in one server process is a developer-friendly design choice. It avoids the overhead and complexity of running separate inference servers or translating formats between them.

The multimodal input and native speech capabilities are ambitious features that extend the server beyond simple text generation. However, these may come with tradeoffs in complexity and resource usage, especially for real-time audio processing.

Performance metrics from the README give concrete numbers:

LLM decode (M4 Max, 128 GB, greedy, single stream):
| Model                      | Tok/s | Memory |
|----------------------------|-------|--------|
| Qwen3-0.6B-8bit            | 417.9 | 0.7 GB |
| Llama-3.2-3B-Instruct-4bit | 205.6 | 1.8 GB |
| Qwen3-30B-A3B-4bit         | 127.7 | ~18 GB |

Audio speech-to-text (M4 Max, RTF = real-time factor):
| Model                  | RTF  | Use case                |
|------------------------|------|-------------------------|
| whisper-tiny           | 197x | Real-time / low latency |
| whisper-large-v3-turbo | 55x  | Quality + speed         |
| whisper-large-v3       | 24x  | Highest accuracy        |

These figures show that vllm-mlx decodes small to mid-sized quantized models at roughly 128 to 418 tokens per second while using between 0.7 GB and about 18 GB of memory, and that its Whisper STT runs at 24x to 197x real time on the same M4 Max.
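To translate the benchmark numbers into wall-clock time, the short calculation below applies them to two everyday cases. It assumes that an RTF of "197x" means audio is processed 197 times faster than real time, and it ignores prompt processing and time-to-first-token, so the decode figures are lower bounds. The helper names are hypothetical.

```python
def transcription_seconds(audio_seconds, rtf):
    """Time to transcribe audio at a given speed-up over real time."""
    return audio_seconds / rtf


def decode_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


one_hour = 3600
print(round(transcription_seconds(one_hour, 197), 1))  # whisper-tiny
print(round(transcription_seconds(one_hour, 24), 1))   # whisper-large-v3
print(round(decode_seconds(500, 417.9), 2))            # Qwen3-0.6B-8bit
print(round(decode_seconds(500, 127.7), 2))            # Qwen3-30B-A3B-4bit
```

Under these assumptions, an hour of audio takes about 18 seconds with whisper-tiny and 150 seconds with whisper-large-v3, and a 500-token reply takes about 1.2 seconds on the smallest model versus about 3.9 seconds on the 30B MoE model.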

The codebase is Python-centric, which makes integration and extension accessible, but it relies heavily on the MLX framework and Metal kernels, meaning its portability is limited to Apple Silicon platforms.

Quick start

pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hi!"}])
print(r.choices[0].message.content)

Anthropic SDK / Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

The flow is short: install the package with pip, launch the server with a model, and send requests through the familiar OpenAI or Anthropic SDK interfaces.

Verdict

vllm-mlx is a solid choice if you want to run LLM inference natively on Apple Silicon with a focus on long-context applications. Its SSD-tiered KV cache spilling addresses a practical hardware limitation on Macs and enables agent workflows that would otherwise be infeasible.

The continuous batching and dual API support make it a credible production option for developers targeting Apple platforms who want to avoid the overhead of cloud-based inference or multi-server setups.

However, the reliance on Apple’s MLX and Metal frameworks means this solution is exclusive to Apple Silicon. If you need cross-platform support, or want to run models whose weights do not fit in unified memory (as described here, the SSD tier holds prefix KV cache rather than model weights), you’ll need to look elsewhere.

In summary, vllm-mlx is worth exploring if your workload fits the Apple Silicon ecosystem and you want a performant, native inference server with advanced caching strategies and API compatibility baked in.


→ GitHub Repo: waybarrios/vllm-mlx ⭐ 1,083 · Python