Noureddine RAMDI / Why I Serve Qwen3.6 Locally on My RTX 5090 (Part 1/4)

Created Tue, 28 Apr 2026 00:00:00 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

I use Claude Code every day. It’s an excellent tool. But there are moments where I’d rather not depend on an external API: a flaky network, the token limits that compound across long sessions, the bill that builds up, or simply wanting to test things on proprietary code that I’d rather not ship to a US datacenter.

I have an RTX 5090 in my workstation. 32 GB of VRAM. ~1.7 TB/s of memory bandwidth. That’s plenty to run a 27B model locally. The question was whether the quality would hold up.

This is part 1 of a 4-part series on running an LLM locally:

  1. Why I serve Qwen3.6 locally on my RTX 5090 (this article)
  2. Hunting tokens/sec: 4 backends, 1 ceiling
  3. Speculative decoding meets hybrid architectures: why it breaks
  4. The NixOS setup, declarative and reproducible

The model: Qwopus3.6-27B

Qwen3.6-27B was released by Alibaba in early April. It’s a hybrid attention + SSM (state-space model) — I’ll come back to that in part 3 — with a native context window of 262K tokens, which is comfortable. On reasoning benchmarks, it lands among the best “small” models of the moment.

I went with a community fine-tune published by Jackrong as Qwopus3.6-27B-v1-preview. The premise: the model was distilled on Claude Opus reasoning traces, which gives it a response style and structure reminiscent of Claude. The tokenizer stays Qwen3.6’s (so 100 % plumbing-compatible) — only the weights are fine-tuned.

HF repo : Jackrong/Qwopus3.6-27B-v1-preview-GGUF
Quant   : Q4_K_M (16.5 GB on disk, ~15 GB in VRAM)
Context : 262144 tokens (native)

The hardware

GPU      : NVIDIA RTX 5090 (Blackwell, sm_120)
VRAM     : 32 086 MiB
Mem BW   : ~1.7 TB/s
CPU      : ryzen 9800X3D, 16 threads
RAM      : 48 GiB
OS       : NixOS 25.11 unstable

The 5090 is a “big” GPU but not server-class. For scale:

  • 4090 → ~1.0 TB/s mem bw
  • 5090 → ~1.7 TB/s
  • H100 → ~3.0 TB/s
  • B200 → ~8.0 TB/s

LLM inference is typically memory-bandwidth-bound (we re-read the model weights for every token generated). So the 5090 is ~70 % faster than a 4090 for this kind of workload, at equivalent model and quant. This order-of-magnitude is going to be useful when comparing my results to others'.

Why NixOS

My whole system has been declarative for two years now. For a local AI setup, that means:

  • Everything sits in flake.nix + a systemd module. The llama-server service restarts automatically, is versioned alongside the rest of my config, and I can roll back in 30 seconds if something breaks.
  • No Docker. NixOS handles the CUDA stack with my NVIDIA drivers better than any Docker image would. No layered hacks.
  • Reproducible. I can replay the exact same setup on another machine by cloning the repo.

The downside: packaging recent compiled stuff (CUDA-enabled llama.cpp, for instance) sometimes means juggling overlays and overrides. But once it’s in place, it’s in place.

The final stack

┌─────────────────────────────────────────┐
│  claude-local                           │  ← wrapper that launches Claude Code
│         │                               │     against the local backend
│         ▼                               │
│  claude-code-router (npx)               │  ← OpenAI-compatible routing
│         │                               │
│         ▼ http://127.0.0.1:11435/v1     │
│  llama-server (CUDA)                    │  ← inference
│         │                               │
│         ▼                               │
│  /var/lib/llama/models/Qwopus3.6-27B... │  ← 16.5 GB GGUF
└─────────────────────────────────────────┘

llama.cpp exposes an OpenAI-compatible API. claude-code-router is an npm proxy that translates Claude Code requests into OpenAI calls. Net result: Claude Code’s official CLI talks to my local GPU natively, no patching required. Elegant.

The whole thing is wrapped in a single NixOS module, with a llama-pull command I can run to download or update weights from HuggingFace. The service refuses to start if the weights are missing (ConditionPathExists) — no crash loops in my logs.

What I wanted

Before starting any benchmarks, I had a simple goal in mind:

Hold at least 100 tok/s of generation on a long context window (>200K). Ideally more, ideally with speculative decoding turned on to scrape a 2-3× factor on top.

I got the answer to both expectations. Just not at all the answer I was imagining.

→ Continued in Part 2 — Hunting tokens/sec: 4 backends, 1 ceiling.