In the previous part, I observed three things:
- nixpkgs llama.cpp refuses to enable speculative decoding (partial sequence removal not supported).
- llama.cpp upstream enables it but it’s slower than without (51 vs 66 tok/s).
- ik_llama.cpp enables it too, and lands at 12 % acceptance — dropping to 34 tok/s.
This article explains why. It’s a bit more technical than the previous two, but worth it because it’s exactly the kind of trap that wrecks your optimization for hours if you don’t know about it.
Part 3 of a 4-part series on running an LLM locally:
- Why I serve Qwen3.6 locally on my RTX 5090
- Hunting tokens/sec: 4 backends, 1 ceiling
- Speculative decoding meets hybrid architectures: why it breaks (this article)
- The NixOS setup, declarative and reproducible
Refresher: speculative decoding in two sentences
Standard LLM inference is autoregressive decoding: one token at a time, each requiring a complete forward pass through the model. For a 27B Q4_K_M on a 5090, that’s ~15 ms/token → 66 tok/s.
Spec-dec’s clever trick is splitting the work between two models:
- A fast draft model (1.7B here) proposes 12 tokens ahead in sequence.
- The main model (27B) verifies them in parallel (one forward pass for all 12, because parallel attention can process the whole sequence at once).
If the draft is right often enough (~70-85 % acceptance), we average 5-8 tokens per main model forward pass. Theoretical speedup: 5-8×. In practice: 3-4×, because the draft also costs something.
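To get a feel for how acceptance rate translates into tokens per pass, here is a minimal back-of-the-envelope sketch. It assumes every drafted token is accepted independently with the same probability, which is a simplification, and `draft_cost` (the cost of one draft pass relative to one main-model pass) is a hypothetical parameter I picked for illustration, not something I measured.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per main-model forward pass.

    Simplifying assumption: each drafted token is accepted independently
    with probability `acceptance`, and the main model always contributes
    one token of its own after the accepted prefix.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return sum(acceptance ** i for i in range(draft_len + 1))


def net_speedup(acceptance: float, draft_len: int, draft_cost: float) -> float:
    """Speedup over plain decoding, counted in main-model-pass units.

    `draft_cost` is hypothetical: one draft pass relative to one main pass.
    """
    tokens = expected_tokens_per_pass(acceptance, draft_len)
    cost = 1.0 + draft_len * draft_cost  # one verify pass + draft_len draft passes
    return tokens / cost


for a in (0.12, 0.50, 0.80):
    print(f"acceptance={a:.2f}: "
          f"{expected_tokens_per_pass(a, 12):.2f} tokens/pass, "
          f"speedup x{net_speedup(a, 12, 0.1):.2f}")
```

The useful part is the shape of the curve rather than the exact values: at low acceptance, the fixed cost of drafting 12 tokens is paid almost for nothing.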
It’s a classic technique, well-implemented in llama.cpp since late 2023. It works very well on classic transformers.
The concrete mechanism: partial sequence removal
When the draft proposes 12 tokens and the main model only accepts the first 7, llama.cpp has to “undo” the remaining 5 tokens from the KV cache. Concretely, undoing means:
Erase positions [N+8 ; N+12] from caches K and V, and continue from N+8.
This is what’s called partial sequence removal in llama.cpp jargon: removing a suffix from a sequence in the KV cache.
On an attention-only transformer, that’s trivial: KV is just matrices indexed by position, you can truncate a row on the fly. That’s what makes spec-dec possible.
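A toy illustration of why this is cheap on attention layers. This is not llama.cpp’s data structure, just a sketch: `ToyKVCache` and `demo` are names I made up, and the code uses 0-based list indices rather than the N+8 notation above.

```python
class ToyKVCache:
    """Toy KV cache for one attention layer: one K row and one V row per position."""

    def __init__(self) -> None:
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def append(self, key: list[float], value: list[float]) -> None:
        self.keys.append(key)
        self.values.append(value)

    def remove_suffix(self, first_removed: int) -> None:
        """Drop every row at index >= first_removed (0-based)."""
        del self.keys[first_removed:]
        del self.values[first_removed:]

    def __len__(self) -> int:
        return len(self.keys)


def demo(context_len: int = 100, drafted: int = 12, accepted: int = 7) -> int:
    cache = ToyKVCache()
    # The context and all drafted tokens have been written to the cache.
    for pos in range(context_len + drafted):
        cache.append([float(pos)], [float(pos)])
    # Keep the context and the accepted prefix, drop the rejected suffix.
    cache.remove_suffix(context_len + accepted)
    return len(cache)


print(demo())  # rows left after rejecting the last 5 of 12 drafted tokens
```

Removing the suffix never touches the rows that stay, which is the property the SSM layers lack.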
Qwen3.6 is different: hybrid architecture
I posted this log in part 2 without dwelling on it:
qwen35.block_count = 64
qwen35.full_attention_interval = 4 ← 1 attention layer in 4
qwen35.ssm.conv_kernel = 4
qwen35.ssm.state_size = 128
qwen35.ssm.group_count = 16
qwen35.ssm.inner_size = 6144
Translation: out of the model’s 64 layers, only 1 in 4 does full attention (16 attention layers). The other 48 layers are state-space models (SSM) — a Mamba/S4-inspired mechanism that maintains a fixed recurrent state instead of a growing KV cache.
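The layer count can be checked with a few lines. One assumption here: I place the full-attention layer at every 4th position, which matches the interval in the metadata, but the exact indices are my guess and not something the log states.

```python
def split_layers(block_count: int, full_attention_interval: int) -> tuple[list[int], list[int]]:
    """Split layer indices into full-attention and recurrent (SSM) layers.

    Assumption: the last layer of each group of `full_attention_interval`
    layers is the full-attention one. The counts do not depend on this choice.
    """
    attention = [i for i in range(block_count) if (i + 1) % full_attention_interval == 0]
    recurrent = [i for i in range(block_count) if (i + 1) % full_attention_interval != 0]
    return attention, recurrent


attention, recurrent = split_layers(block_count=64, full_attention_interval=4)
print(len(attention), "attention layers,", len(recurrent), "recurrent layers")
```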
This architecture is intentional. It lets Qwen3.6 have a 262K native context with a reasonable memory footprint (a KV cache that grows linearly with context on all 64 layers would explode at those sizes). The trade-off: the SSM portion doesn’t behave like attention.
Why SSM breaks spec-dec
An SSM layer maintains a recurrent state that summarizes all the context seen so far. To generate token N+1, the SSM reads the state at time N and produces one at time N+1. That state is overwritten in place — not “appended” to a cache like attention does.
So rolling back from N+12 to continue from N+8 requires reconstructing the state as it was just before N+8, from scratch (or from an earlier snapshot). You can’t just “truncate”: there’s nothing to truncate, it’s a fixed-size state that has already absorbed the entire sequence.
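A toy recurrence makes the problem concrete. The update rule below is invented for illustration (a real SSM layer is far more involved); what matters is that each step replaces the previous state, so the only way back is to replay.

```python
def step(state: list[float], token_value: float, decay: float = 0.9) -> list[float]:
    """One toy recurrent update: the new state mixes the old state and the input."""
    return [decay * s + token_value * (i + 1) for i, s in enumerate(state)]


def run(tokens: list[float], state_size: int = 4) -> list[float]:
    state = [0.0] * state_size
    for t in tokens:
        state = step(state, t)  # the previous state is discarded here
    return state


tokens = [float(t) for t in range(1, 13)]  # 12 drafted tokens
state_after_12 = run(tokens)

# The state after 7 tokens is not stored anywhere inside state_after_12.
# Without a snapshot, the only way to get it back is to replay from the start.
state_after_7 = run(tokens[:7])

print(len(state_after_12), len(state_after_7))  # same size, different contents
```

Note that both states have the same size whatever the sequence length: that is what keeps memory flat at long context, and also why there is no suffix to cut off.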
That’s exactly the message llama.cpp displays at startup:
common_speculative_is_compat: the target context does not support
partial sequence removal
srv load_model: speculative decoding not supported by this context
llama.cpp detects the SSM (via GGUF metadata), notices it doesn’t know how to truncate its state, and disables spec-dec cleanly.
The workarounds: checkpointing
Upstream merged PR #19493 — server: speculative checkpointing shortly before my session. The idea: instead of truncating the KV cache (and the SSM state), take periodic snapshots, and when a draft is rejected, roll back to the last snapshot.
slot load_model: speculative decoding will use checkpoints ← PR #19493 active
slot load_model: speculative decoding context initialized
It works technically. What doesn’t work is the economics: SSM snapshots are expensive. Every spec-dec iteration, llama.cpp has to copy the full SSM state in VRAM. When you do that on every proposal, you end up spending more bandwidth on snapshot copies than you save through parallelism.
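To show where the copies come from, here is a deliberately naive snapshot-and-restore sketch. It is not how PR #19493 is implemented: `ToyRecurrentState`, `verify_draft` and the 4-bytes-per-value figure are all assumptions of mine, chosen only to make the copy cost visible.

```python
class ToyRecurrentState:
    BYTES_PER_VALUE = 4  # assumption: 32-bit floats

    def __init__(self, size: int) -> None:
        self.values = [0.0] * size
        self.bytes_copied = 0  # running total of snapshot/restore traffic

    def update(self, token_value: float, decay: float = 0.9) -> None:
        # Toy recurrence: overwrites the state.
        self.values = [decay * v + token_value for v in self.values]

    def snapshot(self) -> list[float]:
        self.bytes_copied += self.BYTES_PER_VALUE * len(self.values)
        return list(self.values)

    def restore(self, saved: list[float]) -> None:
        self.bytes_copied += self.BYTES_PER_VALUE * len(saved)
        self.values = list(saved)


def verify_draft(state: ToyRecurrentState, drafted: list[float], n_accepted: int) -> None:
    """Process all drafted tokens, then roll back to the accepted prefix."""
    saved = state.snapshot()            # full copy before every verification
    for t in drafted:
        state.update(t)
    if n_accepted < len(drafted):
        state.restore(saved)            # second full copy on rejection
        for t in drafted[:n_accepted]:  # replay only the accepted tokens
            state.update(t)


state = ToyRecurrentState(size=8)
verify_draft(state, [float(i) for i in range(1, 13)], n_accepted=7)
print(state.bytes_copied, "bytes copied for one verification")
```

In this toy the copy cost scales with the state size and is paid on every iteration, regardless of how many tokens end up accepted.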
My exact measurements:
Total eval time : 80 347 ms / 4102 tokens
Draft duration : 62 499 ms → 78% of total time
Draft acceptance : 50 % of tokens
Net throughput : 51 tok/s (vs 66 without spec-dec)
The draft works — 50 % acceptance is decent. But 78 % of time goes into the draft model, and the checkpoint copies saturate the bandwidth. The net speedup is negative.
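The percentages above come straight from the raw counters. The baseline of 66 tok/s is the figure without spec-dec quoted earlier; the variable names are mine.

```python
total_eval_ms = 80_347
tokens_generated = 4_102
draft_ms = 62_499
baseline_tok_s = 66.0  # throughput without spec-dec

throughput_tok_s = tokens_generated / (total_eval_ms / 1000)
draft_share = draft_ms / total_eval_ms
relative_to_baseline = throughput_tok_s / baseline_tok_s

print(f"throughput  : {throughput_tok_s:.1f} tok/s")
print(f"draft share : {draft_share:.0%} of total time")
print(f"vs baseline : x{relative_to_baseline:.2f}")
```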
And ik_llama.cpp?
ik_llama.cpp has its own spec-dec implementation, presented as “mature” in its README. I hoped it would have smarter heuristics for hybrid models. Answer:
draft acceptance = 12 %
Net throughput = 34 tok/s
12 % acceptance. The draft proposes tokens that the main model rejects 88 % of the time. Either ik’s acceptance algorithm is too strict, or its SSM state handling loses information from the main context between verifications. I didn’t dig further because the result is already clear: it’s even worse than mainline.
The big question: when will this be solved?
There are two open paths upstream:
- Optimize the cost of SSM snapshots — make them incremental (delta of state instead of full copy). Non-trivial work, no PR open that I know of.
- Self-speculative decoding (Medusa, EAGLE): no separate draft model; extra prediction heads attached to the main model predict ahead. This sidesteps the SSM issue but requires a dedicated fine-tune.
Realistic outlook: several months before spec-dec is productively usable on Qwen3.x. For now, on this kind of model, the physical ceiling of memory bandwidth is what you can expect, period.
The physical ceiling, in numbers
RTX 5090 memory bandwidth ≈ 1700 GB/s
27B Q4_K_M model in VRAM ≈ 15 GB
Theoretical ceiling = 1700 / 15 ≈ 113 tok/s
Achieved (60% efficiency) ≈ 66 tok/s
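The same calculation as a reusable helper, assuming each generated token requires reading the full set of weights from VRAM once (the usual approximation for single-stream decoding). The function name is mine.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


ceiling = bandwidth_ceiling_tok_s(bandwidth_gb_s=1700, model_size_gb=15)
measured_tok_s = 66.0
efficiency = measured_tok_s / ceiling

print(f"ceiling    : {ceiling:.0f} tok/s")
print(f"efficiency : {efficiency:.0%}")
```

Swap in your own card’s bandwidth and your model’s file size to get the number to compare benchmarks against.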
To go above that, you need:
- A smaller model in VRAM (more aggressive quant: Q3_K_M → ~85 tok/s, Q3_K_S → ~100 tok/s, at the cost of quality).
- More bandwidth (B200 does 8 TB/s — theoretical 530 tok/s for the same model).
- Spec-dec that finally works (~3-4× when it does, so ~250 tok/s — but unreachable today on this architecture).
What I want you to take away
- Before benchmarking, compute the physical ceiling. It’s bw / size, takes 5 seconds, and aligns your expectations.
- When a benchmark claims to exceed this ceiling, either spec-dec is effective (check the counters!), or the model is smaller than announced, or something fishy is going on. Reading the context of a benchmark matters more than reading its number.
- Models that carry a recurrent state, whether hybrid attention+SSM (Qwen3.5/3.6) or fully recurrent (Mamba-2, RWKV), are intrinsically harder to accelerate than classic transformers. The “more context, less KV” trade-off costs you on the inference side.
→ For the actual NixOS config that runs all this, see Part 4 — The NixOS setup, declarative and reproducible.