Noureddine RAMDI / Hunting Tokens/sec: 4 LLM Backends, 1 Hard Ceiling (Part 2/4)

Created Tue, 28 Apr 2026 00:00:00 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

My goal was simple: ≥ 100 tok/s of generation on Qwen3.6-27B quantized to Q4_K_M, with speculative decoding turned on for a 2-3× speedup.

Six hours, four different backends, and I landed at the same ~66 tok/s in every case that actually works. Here’s the lab notebook.

Part 2 of a 4-part series on running an LLM locally:

  1. Why I serve Qwen3.6 locally on my RTX 5090
  2. Hunting tokens/sec: 4 backends, 1 ceiling (this article)
  3. Speculative decoding meets hybrid architectures: why it breaks
  4. The NixOS setup, declarative and reproducible

What I bench, and how

Every round, the same test: a short prompt ("Write a Python function that computes the Fibonacci sequence."), 2 silent warmups to absorb the CUDA-kernel JIT compile, then the real measurement. I read three things:

  • The predicted_tokens_seconds exposed on /metrics (Prometheus).
  • The print_timing lines in the systemd journal.
  • When spec-dec is supposed to run: the draft acceptance rate and the time spent inside the draft model.

# Warmup
for i in 1 2; do
  curl -s http://127.0.0.1:11435/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"x","messages":[{"role":"user","content":"hi"}]}' >/dev/null
done

# Measurement
time curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwopus3.6-27B","messages":[{"role":"user","content":"..."}],"stream":false}' >/dev/null

curl -s http://127.0.0.1:11435/metrics | grep predicted_tokens_seconds
journalctl -u llama-server -n 30 --no-pager | grep -E 'eval time|draft'
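
To pull the throughput figure out of the /metrics text programmatically, a small filter is enough. This is a minimal sketch: the helper name metric_value is mine, and the llamacpp: prefix and the numbers in the sample are assumptions made for illustration, not captured output.

```shell
# Print the value of the first Prometheus sample whose metric name ends
# with the given suffix. Reads the /metrics text from stdin.
# Assumes label values contain no spaces and no timestamp column is used.
metric_value() {
  awk -v suffix="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)           # drop any {label="..."} part
      n = length(name); s = length(suffix)
      if (n >= s && substr(name, n - s + 1) == suffix) { print $2; exit }
    }'
}

# Illustrative input shaped like Prometheus text exposition
# (metric prefix and values are assumptions, not real server output).
sample='# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 66.21
llamacpp:prompt_tokens_seconds 447.0'

printf '%s\n' "$sample" | metric_value predicted_tokens_seconds
```

Against the live server, the same filter would sit at the end of the curl pipeline above in place of grep.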

The reference ExecStart stays constant across rounds:

-m Qwopus3.6-27B-v1-preview-Q4_K_M.gguf
-md Qwen3-1.7B-Q4_K_M.gguf       (only when spec-dec is on)
-ngl 99 -ngld 99
-c 262144                          (the model's native context)
-fa on -ctk q4_0 -ctv q4_0
--draft-max 12 --draft-min 3 --draft-p-min 0.6
--parallel 1

Round 1: nixpkgs llama.cpp 8770 — the baseline

The llama-cpp-8770 binary that nixpkgs unstable ships, recompiled locally with cudaSupport = true and built with sm_120 (Blackwell) baked into CMAKE_CUDA_ARCHITECTURES. Quick to get running because the store path is cached.

Without spec-dec : 66.21 tok/s
With spec-dec    : 66.13 tok/s   (← almost identical)

The equality is suspicious. I grep the logs:

common_speculative_is_compat: the target context does not support
                              partial sequence removal
srv    load_model: speculative decoding not supported by this context

llama.cpp refuses to enable spec-dec at startup. The draft model is loaded into VRAM, but inert. That’s why both numbers match: no acceleration ever happened.

That line makes me suspect the issue is the model architecture itself (more on this in part 3). But first I want to see what other builds do.

Round 2: llama.cpp upstream master — DFlash + checkpointing

I’m chasing a fresher build. Upstream ggml-org/llama.cpp is at build b8951 when I test, 181 commits ahead of nixpkgs. Two recent PRs catch my eye:

  • PR #19493: server: speculative checkpointing — a workaround designed specifically for contexts that don’t support partial sequence removal. Exactly my problem.
  • PR #22105: feat: add DFlash support — fast attention kernels for hybrid SSM models.

I add a flake input:

llama-cpp-upstream = {
  url = "github:ggml-org/llama.cpp";
};

Build: ~10 minutes (CUDA from source). Service starts:

common_context_can_seq_rm: the target context does not support
                           partial sequence removal
srv    load_model: speculative decoding will use checkpoints   ← ✓
slot   load_model: id 0 | task -1 | speculative decoding context initialized

It boots. Spec-dec is supposed to run via the checkpoint mechanism. Bench:
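
Since a loaded draft model proves nothing, it helps to classify the startup log before trusting any benchmark. The sketch below matches only the three messages quoted in this article; the helper name specdec_status is hypothetical, and other builds may word these lines differently.

```shell
# Classify llama-server startup logs (read from stdin) by spec-dec state.
# Only the messages quoted in this article are recognized.
specdec_status() {
  awk '
    /speculative decoding not supported/        { state = "disabled" }
    /speculative decoding will use checkpoints/ { state = "checkpoints" }
    /speculative decoding context initialized/  { if (state == "") state = "enabled" }
    END { print (state == "" ? "unknown" : state) }'
}

# Round 1 log line: the draft model is loaded but inert.
printf '%s\n' \
  'srv    load_model: speculative decoding not supported by this context' \
  | specdec_status

# Round 2 log lines: spec-dec runs through the checkpoint path.
printf '%s\n' \
  'srv    load_model: speculative decoding will use checkpoints' \
  'slot   load_model: id 0 | task -1 | speculative decoding context initialized' \
  | specdec_status
```

On a running machine the input would come from journalctl -u llama-server --no-pager instead of printf.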

Without spec-dec : 66 tok/s   (matches round 1)
With spec-dec    : 51 tok/s   (← WORSE than without)

I peel back the draft statistics:

draft acceptance rate = 1.00000 (1369 / 1369)
#gen drafts = 520, #acc drafts = 331    → 64% draft acceptance
#gen tokens = 3017, #acc tokens = 1518  → 50% token acceptance
dur(g) = 62 499 ms / 80 347 ms total    → 78% of time inside the draft

Acceptance is fine (50% of draft tokens kept). But the 1.7B draft takes 78% of total eval time. The checkpoint mechanism allocates snapshots that scale with the main context, so at 262K it’s heavy. Net: more time goes into proposing and verifying than the accepted tokens save. Spec-dec works, but it costs more than it returns.
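
The three percentages come straight from the counters in the log. A minimal sketch of the arithmetic, with pct as a hypothetical helper name:

```shell
# Percentage of a over b, one decimal place.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'; }

pct 331 520       # accepted drafts / generated drafts -> 63.7
pct 1518 3017     # accepted tokens / generated tokens -> 50.3
pct 62499 80347   # ms inside the draft / total ms     -> 77.8
```

The third ratio is the one that matters: however good acceptance is, the draft phase alone takes more than three quarters of the wall-clock time.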

Side note: this upstream binary has another problem. Its flake pins a nixpkgs from November 2024, which doesn’t know Blackwell yet, so CMAKE_CUDA_ARCHITECTURES tops out at sm_90 (Hopper). On the 5090, the kernels therefore go through PTX→SASS JIT compilation on first call: the very first prefill takes 33 seconds for 24 tokens. Post-warmup it’s fine, but it shows that an old pinned flake can cost you.

Round 3: ik_llama.cpp — the “mature spec-dec” fork

There’s a popular fork, ikawrakow/ik_llama.cpp, whose upstream description literally reads:

ik_llama.cpp: llama.cpp fork with better CPU performance

It also has its own speculative decoding implementation, advertised as more mature. Its flake pins nixpkgs from March 2026 — so Blackwell is known, sm_120 native compile.

Build: 45 minutes (CUDA + custom kernels from source). Startup is clean: speculative decoding context initialized. Bench:

With spec-dec    : 34 tok/s   (← MUCH worse)
draft acceptance = 12% (8 accepted / 63 generated)

Roughly 12% acceptance (8 of 63 drafted tokens). That’s abnormally low: the draft proposes, and the main model rejects almost everything. Either ik’s acceptance heuristic is stricter, or its implementation doesn’t get along with the qwen35 architecture.

I disable spec-dec to check that ik’s vanilla CUDA kernels at least hold:

Without spec-dec : 66.73 tok/s

Within 1% of mainline’s 66.21 tok/s. So:

  • ik’s kernels are not faster than mainline on this model.
  • Its spec-dec implementation performs worse than mainline’s on this architecture.
  • Bottom line: 45 minutes of compilation for nothing.

Round 4: back to baseline, reading the ceiling

I revert to pkgs.unstable.llama-cpp.override { cudaSupport = true; }, spec-dec off, 262K context, parallel 1. Rebuild is instant (the store path was still there).

prompt eval time =  2.24 ms/token →  447 tok/s
       eval time = 14.99 ms/token →   66 tok/s

66 tok/s. The same number as every other round that didn’t run useless spec-dec. Whether nixpkgs, upstream master, or ik_llama.cpp without spec-dec, the value converges.
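
The journal reports latency per token, so the throughput is just the reciprocal. A minimal sketch, with ms_to_tok_s as a hypothetical helper name:

```shell
# Convert a per-token latency in milliseconds to tokens per second.
ms_to_tok_s() { awk -v ms="$1" 'BEGIN { printf "%.1f\n", 1000 / ms }'; }

ms_to_tok_s 14.99   # generation  -> 66.7
ms_to_tok_s 2.24    # prompt eval -> 446.4
```

The recomputed prompt figure lands a hair under the logged 447 tok/s, presumably because the ms/token value in the log is itself rounded.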

Why 66 tok/s, and not more?

Single-stream LLM generation is bound by memory bandwidth, not compute: to generate each token, the GPU has to re-read all of the model’s weights from VRAM.

5090 bandwidth      :  ~1700 GB/s
Q4_K_M model in VRAM:    15.3 GB
Theoretical ceiling : 1700 / 15.3 = 111 tok/s

At 66 tok/s observed, I’m at 60% of the theoretical ceiling. The rest is overhead: Flash Attention, KV quant, SSM state read/write, CUDA synchronizations, etc. Roughly 60% of the bandwidth ceiling seems to be typical for hybrid models in llama.cpp today (GitHub issues report similar figures for Mamba and Mamba-2).
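
The ceiling and the utilization figure can be recomputed in two lines. A minimal sketch using the numbers above; the function names are hypothetical, and the model treats each token as exactly one full read of the weights:

```shell
# Bandwidth-bound ceiling: GB/s divided by GB read per token.
ceiling_tok_s() { awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'; }

# Share of that ceiling actually reached, in percent.
utilization_pct() {
  awk -v obs="$1" -v bw="$2" -v gb="$3" 'BEGIN { printf "%.1f\n", 100 * obs * gb / bw }'
}

ceiling_tok_s 1700 15.3         # -> 111.1 tok/s
utilization_pct 66 1700 15.3    # -> 59.4 %
```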

To break this ceiling, you need either:

  1. More bandwidth: an H100 PCIe (~2 TB/s) barely changes anything, while a B200 (8 TB/s) would make a clear difference.
  2. A smaller model — Q3_K_M = ~12 GB → ~85 tok/s, Q3_K_S = ~10 GB → ~100 tok/s, at the cost of degradation.
  3. Speculative decoding that actually works — which isn’t the case here.
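
The estimates in the first two options follow from the same ratio, assuming utilization stays at the 60% I measured (an assumption: it may well shift with another quantization or another GPU). A minimal sketch, with predict_tok_s as a hypothetical helper name:

```shell
# Predicted tok/s = bandwidth (GB/s) x utilization / model size (GB).
predict_tok_s() {
  awk -v bw="$1" -v util="$2" -v gb="$3" 'BEGIN { printf "%.0f\n", bw * util / gb }'
}

predict_tok_s 1700 0.60 15.3   # Q4_K_M on the 5090  -> 67
predict_tok_s 1700 0.60 12     # ~12 GB (Q3_K_M)     -> 85
predict_tok_s 1700 0.60 10     # ~10 GB (Q3_K_S)     -> 102
predict_tok_s 8000 0.60 15.3   # Q4_K_M at 8 TB/s    -> 314
```

The first line reproduces the measured 66 to 67 tok/s, which is what makes the other three worth believing as rough estimates.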

What I take away

  • On recent hybrid models (Qwen3.5, Qwen3.6, Mamba-derived…), speculative decoding is broken or unprofitable in every public llama.cpp build I tested. It’s a generic problem, not a config problem.
  • The tok/s of a properly-wired local setup is highly predictable: it’s (memory bandwidth × utilization) / model size. Any claim that exceeds this ratio implies either spec-dec, or a smaller model than advertised.
  • When you read impressive numbers on Twitter/X, verify three things first: the exact model size (Q4_K_M? Q4_K_S? a MoE variant?), whether spec-dec actually works (check the n_drafted / acceptance rate counters!), and the hardware (memory bandwidth ≠ TFLOPS, don’t conflate them).

I should have computed that ceiling before I started. Lesson learned: on LLM inference, physics decides much faster than benchmarking does.

→ For why-spec-dec-doesn’t-actually-work, see Part 3 — Speculative decoding meets hybrid architectures.