I use Claude Code every day. It’s an excellent tool. But there are moments where I’d rather not depend on an external API: a flaky network, the token limits that compound across long sessions, the bill that builds up, or simply wanting to test things on proprietary code that I’d rather not ship to a US datacenter.
I have an RTX 5090 in my workstation. 32 GB of VRAM. ~1.7 TB/s of memory bandwidth. That’s plenty to run a 27B model locally. The question was whether the quality would hold up.
This is part 1 of a 4-part series on running an LLM locally:
- Why I serve Qwen3.6 locally on my RTX 5090 (this article)
- Hunting tokens/sec: 4 backends, 1 ceiling
- Speculative decoding meets hybrid architectures: why it breaks
- The NixOS setup, declarative and reproducible
The model: Qwopus3.6-27B
Qwen3.6-27B was released by Alibaba in early April. It’s a hybrid attention + SSM (state-space model) — I’ll come back to that in part 3 — with a native context window of 262K tokens, which is comfortable. On reasoning benchmarks, it lands among the best “small” models of the moment.
I went with a community fine-tune published by Jackrong as Qwopus3.6-27B-v1-preview. The premise: the model was distilled on Claude Opus reasoning traces, which gives it a response style and structure reminiscent of Claude. The tokenizer stays Qwen3.6’s (so 100 % plumbing-compatible) — only the weights are fine-tuned.
HF repo : Jackrong/Qwopus3.6-27B-v1-preview-GGUF
Quant : Q4_K_M (16.5 GB on disk, ~15 GB in VRAM)
Context : 262144 tokens (native)
The hardware
GPU : NVIDIA RTX 5090 (Blackwell, sm_120)
VRAM : 32 086 MiB
Mem BW : ~1.7 TB/s
CPU : ryzen 9800X3D, 16 threads
RAM : 48 GiB
OS : NixOS 25.11 unstable
The 5090 is a “big” GPU but not server-class. For scale:
- 4090 → ~1.0 TB/s mem bw
- 5090 → ~1.7 TB/s
- H100 → ~3.0 TB/s
- B200 → ~8.0 TB/s
LLM inference is typically memory-bandwidth-bound (we re-read the model weights for every token generated). So the 5090 is ~70 % faster than a 4090 for this kind of workload, at equivalent model and quant. This order-of-magnitude is going to be useful when comparing my results to others'.
Why NixOS
My whole system has been declarative for two years now. For a local AI setup, that means:
- Everything sits in
flake.nix+ a systemd module. The llama-server service restarts automatically, is versioned alongside the rest of my config, and I can roll back in 30 seconds if something breaks. - No Docker. NixOS handles the CUDA stack with my NVIDIA drivers better than any Docker image would. No layered hacks.
- Reproducible. I can replay the exact same setup on another machine by cloning the repo.
The downside: packaging recent compiled stuff (CUDA-enabled llama.cpp, for instance) sometimes means juggling overlays and overrides. But once it’s in place, it’s in place.
The final stack
┌─────────────────────────────────────────┐
│ claude-local │ ← wrapper that launches Claude Code
│ │ │ against the local backend
│ ▼ │
│ claude-code-router (npx) │ ← OpenAI-compatible routing
│ │ │
│ ▼ http://127.0.0.1:11435/v1 │
│ llama-server (CUDA) │ ← inference
│ │ │
│ ▼ │
│ /var/lib/llama/models/Qwopus3.6-27B... │ ← 16.5 GB GGUF
└─────────────────────────────────────────┘
llama.cpp exposes an OpenAI-compatible API. claude-code-router is an npm proxy that translates Claude Code requests into OpenAI calls. Net result: Claude Code’s official CLI talks to my local GPU natively, no patching required. Elegant.
The whole thing is wrapped in a single NixOS module, with a llama-pull command I can run to download or update weights from HuggingFace. The service refuses to start if the weights are missing (ConditionPathExists) — no crash loops in my logs.
What I wanted
Before starting any benchmarks, I had a simple goal in mind:
Hold at least 100 tok/s of generation on a long context window (>200K). Ideally more, ideally with speculative decoding turned on to scrape a 2-3× factor on top.
I got the answer to both expectations. Just not at all the answer I was imagining.
→ Continued in Part 2 — Hunting tokens/sec: 4 backends, 1 ceiling.