My goal was simple: ≥ 100 tok/s of generation on Qwen3.6-27B quantized Q4_K_M, with speculative decoding turned on to grab a 2-3× factor.
Six hours, four different backends, and I landed at exactly 66 tok/s in every case that actually works. Here’s the lab notebook.
Part 2 of a 4-part series on running an LLM locally:
- Why I serve Qwen3.6 locally on my RTX 5090
- Hunting tokens/sec: 4 backends, 1 ceiling (this article)
- Speculative decoding meets hybrid architectures: why it breaks
- The NixOS setup, declarative and reproducible
What I bench, and how
Every round, the same test: a short prompt ("Write a Python function that computes the Fibonacci sequence."), 2 silent warmups to absorb the CUDA-kernel JIT compile, then the real measurement. I read three things:
- The predicted_tokens_seconds value exposed on /metrics (Prometheus).
- The print_timing lines in the systemd journal.
- When spec-dec is supposed to run: the draft acceptance rate and the time spent inside the draft model.
# Warmup
for i in 1 2; do
curl -s http://127.0.0.1:11435/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"x","messages":[{"role":"user","content":"hi"}]}' >/dev/null
done
# Measurement
time curl -s http://127.0.0.1:11435/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"Qwopus3.6-27B","messages":[{"role":"user","content":"..."}],"stream":false}' >/dev/null
curl -s http://127.0.0.1:11435/metrics | grep predicted_tokens_seconds
journalctl -u llama-server -n 30 --no-pager | grep -E 'eval time|draft'
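To avoid eyeballing the grep output on every round, the /metrics scrape can be parsed in a few lines of Python. This is only a minimal sketch: the sample text is illustrative rather than captured output (only the 66.21 value comes from round 1), the `llamacpp:` prefix is an assumption about how the server names its metrics, and `read_metric` is a hypothetical helper that assumes no spaces inside label values.

```python
from typing import Optional

# Illustrative scrape, not captured output: only the 66.21 value comes from round 1.
# The "llamacpp:" prefix is an assumption about how the server names its metrics.
SAMPLE_METRICS = """\
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 66.21
"""


def read_metric(metrics_text: str, name: str) -> Optional[float]:
    """Return the value of the first sample whose metric name ends with `name`."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        fields = line.split()
        if len(fields) < 2:
            continue
        metric_name = fields[0].split("{", 1)[0]  # drop a {label="..."} suffix if present
        if metric_name.endswith(name):
            return float(fields[1])
    return None


tok_s = read_metric(SAMPLE_METRICS, "predicted_tokens_seconds")
print(f"generation throughput: {tok_s} tok/s")
```

Matching on the end of the metric name keeps the helper working whether or not the exporter adds a prefix.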
The reference ExecStart stays constant across rounds:
-m Qwopus3.6-27B-v1-preview-Q4_K_M.gguf
-md Qwen3-1.7B-Q4_K_M.gguf (only when spec-dec is on)
-ngl 99 -ngld 99
-c 262144 (the model's native context)
-fa on -ctk q4_0 -ctv q4_0
--draft-max 12 --draft-min 3 --draft-p-min 0.6
--parallel 1
Round 1: nixpkgs llama.cpp 8770 — the baseline
The llama-cpp-8770 binary that nixpkgs unstable ships, recompiled locally with cudaSupport = true. Built with sm_120 (Blackwell) baked into CMAKE_CUDA_ARCHITECTURES. Fast to start because the store path is cached.
Without spec-dec : 66.21 tok/s
With spec-dec : 66.13 tok/s (← almost identical)
The equality is suspicious. I grep the logs:
common_speculative_is_compat: the target context does not support
partial sequence removal
srv load_model: speculative decoding not supported by this context
llama.cpp refuses to enable spec-dec at startup. The draft model is loaded into VRAM, but inert. That’s why both numbers match: no acceleration ever happened.
That line makes me suspect the issue is the model architecture itself (more on this in part 3). But first I want to see what other builds do.
Round 2: llama.cpp upstream master — DFlash + checkpointing
I’m chasing a fresher build. Upstream ggml-org/llama.cpp is at build b8951 when I test, 181 commits ahead of nixpkgs. Two recent PRs catch my eye:
- PR #19493 (server: speculative checkpointing): a workaround designed specifically for contexts that don’t support partial sequence removal. Exactly my problem.
- PR #22105 (feat: add DFlash support): fast attention kernels for hybrid SSM models.
I add a flake input:
llama-cpp-upstream = {
url = "github:ggml-org/llama.cpp";
};
Build: ~10 minutes (CUDA from source). Service starts:
common_context_can_seq_rm: the target context does not support
partial sequence removal
srv load_model: speculative decoding will use checkpoints ← ✓
slot load_model: id 0 | task -1 | speculative decoding context initialized
It boots. Spec-dec is supposed to run via the checkpoint mechanism. Bench:
Without spec-dec : 66 tok/s (matches round 1)
With spec-dec : 51 tok/s (← WORSE than without)
I peel back the draft statistics:
draft acceptance rate = 1.00000 (1369 / 1369)
#gen drafts = 520, #acc drafts = 331 → 64% draft acceptance
#gen tokens = 3017, #acc tokens = 1518 → 50% token acceptance
dur(g) = 62 499 ms / 80 347 ms total → 78% of time inside the draft
Acceptance is fine (50 % of draft tokens kept). But the 1.7B draft takes 78 % of total eval time. The checkpoint mechanism allocates snapshots that scale with the main context, so at 262K it’s heavy. Net: we spend more time proposing and verifying than we save. Spec-dec is functional but a net loss.
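The three percentages above are plain ratios of the counters in the log. Here is a small sketch that recomputes them, so the same check can be rerun on any other round; the `DraftStats` class and its field names are hypothetical, only the numbers come from the log.

```python
from dataclasses import dataclass


@dataclass
class DraftStats:
    """Counters copied by hand from the round 2 log (field names are mine)."""
    gen_drafts: int
    acc_drafts: int
    gen_tokens: int
    acc_tokens: int
    draft_ms: float
    total_ms: float

    @property
    def draft_acceptance(self) -> float:
        return self.acc_drafts / self.gen_drafts

    @property
    def token_acceptance(self) -> float:
        return self.acc_tokens / self.gen_tokens

    @property
    def draft_time_share(self) -> float:
        return self.draft_ms / self.total_ms

    @property
    def rejected_tokens(self) -> int:
        # Draft tokens that were generated, then thrown away by the target model.
        return self.gen_tokens - self.acc_tokens


round2 = DraftStats(520, 331, 3017, 1518, 62_499, 80_347)
print(f"draft acceptance : {round2.draft_acceptance:.0%}")
print(f"token acceptance : {round2.token_acceptance:.0%}")
print(f"time in draft    : {round2.draft_time_share:.0%}")
print(f"wasted tokens    : {round2.rejected_tokens}")
```

The last counter is the one that hurts: about half of the draft's work is discarded, while the draft itself dominates the wall-clock time.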
Side note: this upstream binary has another problem. Its flake pins nixpkgs from November 2024, which doesn’t know Blackwell yet, so CMAKE_CUDA_ARCHITECTURES tops out at sm_90 (Hopper). On the 5090, kernels therefore go through PTX→SASS JIT on first call. The very first prefill takes 33 seconds for 24 tokens. Post-warmup it’s fine, but it illustrates that an old pinned flake can cost you.
Round 3: ik_llama.cpp — the “mature spec-dec” fork
There’s a popular fork, ikawrakow/ik_llama.cpp, whose upstream description literally reads:
ik_llama.cpp: llama.cpp fork with better CPU performance
It also has its own speculative decoding implementation, advertised as more mature. Its flake pins nixpkgs from March 2026 — so Blackwell is known, sm_120 native compile.
Build: 45 minutes (CUDA + custom kernels from source). Startup is clean: speculative decoding context initialized. Bench:
With spec-dec : 34 tok/s (← MUCH worse)
draft acceptance = 12% (8 accepted / 63 generated)
12 % acceptance. That’s abnormally low. The draft proposes, the main rejects almost everything. Either ik’s acceptance heuristic is stricter, or its impl doesn’t get along with the qwen35 architecture.
I disable spec-dec to check that ik’s vanilla CUDA kernels at least hold:
Without spec-dec : 66.73 tok/s
Effectively the same number as mainline (66.73 vs. 66.21 tok/s). So:
- ik’s kernels are not faster than mainline on this model.
- Its spec-dec implementation performs far worse on this architecture.
- Bottom line: 45 minutes of compilation for nothing.
Round 4: back to baseline, reading the ceiling
I revert to pkgs.unstable.llama-cpp.override { cudaSupport = true; }, spec-dec off, 262K context, parallel 1. Rebuild is instant (the store path was still there).
prompt eval time = 2.24 ms/token → 447 tok/s
eval time = 14.99 ms/token → 66 tok/s
66 tok/s. The same number as every other round that didn’t run useless spec-dec. Whether nixpkgs, upstream master, or ik_llama.cpp without spec-dec, the value converges.
Why 66 tok/s, and not more?
Single-stream token generation is bound by memory bandwidth: to generate each token, the GPU re-reads all of the model weights.
5090 bandwidth : ~1700 GB/s
Q4_K_M model in VRAM: 15.3 GB
Theoretical ceiling : 1700 / 15.3 = 111 tok/s
At 66 tok/s observed, I’m at 60 % of the theoretical ceiling. The rest is overhead: Flash Attention, KV quant, SSM state read/write, CUDA synchronizations, etc. 60 % of optimum on a hybrid model is the current llama.cpp standard (GitHub issues report it for Mamba and Mamba-2 too).
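The arithmetic behind that ceiling fits in a few lines. A minimal sketch, assuming every generated token reads all the weights exactly once and nothing else; the function names are mine, the inputs are the figures quoted above.

```python
def bandwidth_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tok/s if each token costs one full read of the weights."""
    return bandwidth_gb_s / model_gb


def tok_s_from_ms_per_token(ms_per_token: float) -> float:
    """Convert the ms/token figure from the timing log into tok/s."""
    return 1000.0 / ms_per_token


ceiling = bandwidth_ceiling(1700, 15.3)     # RTX 5090, Q4_K_M weights
observed = tok_s_from_ms_per_token(14.99)   # eval time from round 4
utilization = observed / ceiling

print(f"ceiling     : {ceiling:.0f} tok/s")
print(f"observed    : {observed:.1f} tok/s")
print(f"utilization : {utilization:.0%}")
```

The model ignores KV-cache and SSM-state traffic, which is part of why the real number sits well under the bound.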
To break this ceiling, you need either:
- More bandwidth: an H100 PCIe (~2 TB/s) barely changes anything, a B200 (8 TB/s) would make a clear difference.
- A smaller model: Q3_K_M = ~12 GB → ~85 tok/s, Q3_K_S = ~10 GB → ~100 tok/s, at the cost of degradation.
- Speculative decoding that actually works, which isn’t the case here.
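The projections in that list come from the same ratio, scaled by the utilization I measured. A rough sketch, with one loud assumption: that the 60 % utilization observed on the 5090 with Q4_K_M carries over unchanged to other quantizations and other GPUs, which I have not verified. The sizes and bandwidths are the approximate figures quoted above.

```python
def projected_tok_s(bandwidth_gb_s: float, model_gb: float,
                    utilization: float = 0.60) -> float:
    """Expected tok/s, ASSUMING the measured 60 % utilization stays constant."""
    return bandwidth_gb_s * utilization / model_gb


# (memory bandwidth in GB/s, weights in GB): approximate figures from the article
scenarios = {
    "RTX 5090, Q4_K_M (15.3 GB)": (1700, 15.3),
    "RTX 5090, Q3_K_M (~12 GB)": (1700, 12.0),
    "RTX 5090, Q3_K_S (~10 GB)": (1700, 10.0),
    "B200, Q4_K_M (15.3 GB)": (8000, 15.3),
}

for label, (bandwidth, size) in scenarios.items():
    print(f"{label:<28} ~{projected_tok_s(bandwidth, size):.0f} tok/s")
```

Treat the output as an order of magnitude: smaller quantizations change the dequantization cost per byte, so the utilization factor may well move.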
What I take away
- On recent hybrid models (Qwen3.5, Qwen3.6, Mamba-derived…), speculative decoding is broken or unprofitable in every public llama.cpp build I tested. It’s a generic problem, not a config problem.
- The tok/s of a properly-wired local setup is highly predictable: it’s (memory bandwidth × utilization) / model size. Any claim that exceeds this ratio implies either spec-dec, or a smaller model than advertised.
- When you read impressive numbers on Twitter/X, verify three things first: the exact model size (Q4_K_M? Q4_K_S? a MoE variant?), whether spec-dec actually works (check the n_drafted / acceptance rate counters!), and the hardware (memory bandwidth ≠ TFLOPS, don’t conflate them).
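The ratio can also be turned around to sanity-check someone else's number: given a claimed tok/s and the card's bandwidth, how many gigabytes can the model read per token at most? A small sketch; the 150 tok/s claim is a made-up example, and `max_model_gb` is a hypothetical helper.

```python
def max_model_gb(claimed_tok_s: float, bandwidth_gb_s: float,
                 utilization: float = 1.0) -> float:
    """Largest amount of weights (GB) readable per token at the claimed speed."""
    return bandwidth_gb_s * utilization / claimed_tok_s


claimed = 150  # hypothetical claim for a ~1700 GB/s card, without spec-dec

best_case = max_model_gb(claimed, 1700)        # perfect bandwidth utilization
realistic = max_model_gb(claimed, 1700, 0.60)  # the utilization I measured

print(f"at 100 % utilization: weights <= {best_case:.1f} GB")
print(f"at  60 % utilization: weights <= {realistic:.1f} GB")
```

If the advertised model is bigger than the first figure, something other than plain decoding is going on: spec-dec, a MoE with fewer active parameters, or a different quantization than stated.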
I should have computed that ceiling before I started. Lesson learned: on LLM inference, physics decides much faster than benchmarking does.
→ For why-spec-dec-doesn’t-actually-work, see Part 3 — Speculative decoding meets hybrid architectures.