The previous three articles described why, how I tested, and why-it-doesn’t-always-work. This one is the concrete what: the Nix files actually running on my machine, annotated, copy-paste-able.
Everything is in my config repo (Nix flake mono-repo, multi-host). I’m showing the relevant pieces.
Part 4 of a 4-part series on running an LLM locally:
- Why I serve Qwen3.6 locally on my RTX 5090
- Hunting tokens/sec: 4 backends, 1 ceiling
- Speculative decoding meets hybrid architectures: why it breaks
- The NixOS setup, declarative and reproducible (this article)
The systemd module: nixos/llama-server.nix
This is the core. A standard NixOS module with a services.llama-server namespace, defining:
- The systemd service that launches llama-server with the right flags.
- A dedicated llama system user, owner of /var/lib/llama/.
- A llama-pull helper that bootstraps GGUFs from HuggingFace.
- All the tuning options (model, context, port, KV quant…).
{ config, pkgs, lib, ... }:
let
  cfg = config.services.llama-server;

  # llama.cpp from nixpkgs unstable. No external flake input: the CUDA
  # build with sm_120 (Blackwell) is already wired in nixpkgs.
  llamaCppCuda = pkgs.unstable.llama-cpp.override { cudaSupport = true; };

  # Python env to download GGUFs from HuggingFace.
  hfEnv = pkgs.python3.withPackages (ps: [ ps.huggingface-hub ]);

  # One-shot helper: pull main + draft GGUFs to /var/lib/llama/models/.
  llama-pull = pkgs.writeShellScriptBin "llama-pull" ''
    set -euo pipefail
    if [ "$(id -u)" -ne 0 ]; then echo "✗ Run as root: sudo llama-pull"; exit 1; fi
    MODELS_DIR="${cfg.modelsDir}"
    install -d -m 0755 -o llama -g llama "$MODELS_DIR"
    pull() {
      local repo="$1" file="$2"
      [ -f "$MODELS_DIR/$file" ] && { echo "✓ $file already present"; return; }
      ${hfEnv}/bin/huggingface-cli download "$repo" "$file" \
        --local-dir "$MODELS_DIR" --local-dir-use-symlinks False
      chown llama:llama "$MODELS_DIR/$file"
    }
    pull "${cfg.mainRepo}" "${cfg.mainModel}"
    pull "${cfg.draftRepo}" "${cfg.draftModel}"
    # Auto-(re)start after pull. With ConditionPathExists, the service
    # is "skipped" until the GGUF is present; the pull is what
    # unblocks the first start.
    if ${pkgs.systemd}/bin/systemctl is-enabled --quiet llama-server.service 2>/dev/null; then
      ${pkgs.systemd}/bin/systemctl restart llama-server
    fi
  '';
in
{
  options.services.llama-server = {
    enable = lib.mkEnableOption "llama.cpp server (CUDA)";
    package = lib.mkOption { default = llamaCppCuda; };
    modelsDir = lib.mkOption { default = "/var/lib/llama/models"; };
    mainRepo = lib.mkOption { default = "Jackrong/Qwopus3.6-27B-v1-preview-GGUF"; };
    mainModel = lib.mkOption { default = "Qwopus3.6-27B-v1-preview-Q4_K_M.gguf"; };
    draftRepo = lib.mkOption { default = "lmstudio-community/Qwen3-1.7B-GGUF"; };
    draftModel = lib.mkOption { default = "Qwen3-1.7B-Q4_K_M.gguf"; };
    contextSize = lib.mkOption { default = 262144; }; # native Qwen3.6
    kvQuant = lib.mkOption { default = "q4_0"; };
    port = lib.mkOption { default = 11435; };
    host = lib.mkOption { default = "127.0.0.1"; };
    useSpeculativeDecoding = lib.mkOption { default = false; }; # cf. part 3
    # … (other options: draftPMin, ropeScaling, extraArgs, etc.)
  };

  config = lib.mkIf cfg.enable {
    environment.systemPackages = [ cfg.package llama-pull ];

    systemd.services.llama-server = {
      description = "llama.cpp inference server (CUDA)";
      wantedBy = [ "multi-user.target" ];
      # No crash loop: if the GGUF is missing, the service is skipped.
      # `llama-pull` is what unblocks it.
      unitConfig.ConditionPathExists = "${cfg.modelsDir}/${cfg.mainModel}";
      serviceConfig = {
        User = "llama";
        Group = "llama";
        Restart = "on-failure";
        ProtectSystem = "strict";
        ReadOnlyPaths = [ cfg.modelsDir ];
        ExecStart = lib.escapeShellArgs (
          [
            "${cfg.package}/bin/llama-server"
            "-m" "${cfg.modelsDir}/${cfg.mainModel}"
            "-ngl" "99"
            "-c" (toString cfg.contextSize)
            "-fa" "on"
            "-ctk" cfg.kvQuant "-ctv" cfg.kvQuant
            "--metrics"
            "--parallel" "1" # single-user, frees VRAM
            "--host" cfg.host
            "--port" (toString cfg.port)
          ]
          ++ lib.optionals cfg.useSpeculativeDecoding [
            "-md" "${cfg.modelsDir}/${cfg.draftModel}"
            "-ngld" "99"
            # … spec-dec flags
          ]
        );
      };
    };

    users.users.llama = { isSystemUser = true; group = "llama"; };
    users.groups.llama = { };

    systemd.tmpfiles.rules = [
      "d /var/lib/llama 0755 llama llama -"
      "d ${cfg.modelsDir} 0755 llama llama -"
    ];
  };
}
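The module reads llama.cpp from pkgs.unstable, which stock nixpkgs does not provide: it has to come from an overlay. The following is a minimal sketch of how a flake could wire that up. The input name nixpkgs-unstable, the stable branch, the x86_64-linux system and the module list layout are all assumptions for illustration, not a copy of the actual repo.

```nix
# flake.nix (sketch; input names, branch and file layout are assumptions)
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";
    nixpkgs-unstable.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = { self, nixpkgs, nixpkgs-unstable, ... }:
    let
      system = "x86_64-linux";

      # Overlay exposing the unstable package set as `pkgs.unstable`,
      # which is what the module's `pkgs.unstable.llama-cpp` relies on.
      unstableOverlay = final: prev: {
        unstable = import nixpkgs-unstable {
          inherit system;
          config.allowUnfree = true; # the CUDA toolkit is unfree
        };
      };
    in
    {
      nixosConfigurations.dinoxy-nvidia = nixpkgs.lib.nixosSystem {
        inherit system;
        modules = [
          { nixpkgs.overlays = [ unstableOverlay ]; }
          ./nixos/llama-server.nix
          ./hosts/dinoxy-nvidia/configuration.nix
        ];
      };
    };
}
```

Without an overlay like this, evaluation fails on the missing unstable attribute as soon as the module is imported.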
A few points that cost me time and are worth copying:
- ConditionPathExists instead of a preStart shell check. Without it, the service crash-loops every 5 seconds while you haven’t pulled, and nixos-rebuild switch exits with code 4.
- --parallel 1. The default is 4, which multiplies the KV cache size by 4. On 32 GB VRAM with a 27B Q4, that’s an OOM.
- Models outside /nix/store. In /var/lib/llama/models/, managed by llama-pull (so independent of Nix rebuilds). Putting a 16 GB GGUF in the store would re-hash on every change and bloat derivation closures.
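The --parallel point can be made concrete with a back-of-the-envelope estimate. This is a sketch, assuming a standard attention KV cache in which only the attention layers of a hybrid model hold keys and values, and assuming (as observed above) that each slot reserves a full context. The numeric values in the example are illustrative placeholders, not the real architecture of Qwopus3.6.

```latex
% KV cache size. b is the storage cost per element in bytes;
% q4_0 packs 32 elements into 18 bytes, so b = 18/32 = 0.5625.
\[
  \text{KV bytes} \approx 2 \cdot L_{\text{attn}} \cdot n_{\text{kv}} \cdot d_{\text{head}}
  \cdot n_{\text{ctx}} \cdot b \cdot n_{\text{slots}}
\]

% Illustrative values only: L_attn = 16, n_kv = 4, d_head = 256, n_ctx = 262144.
\[
  2 \cdot 16 \cdot 4 \cdot 256 \cdot 262144 \cdot 0.5625
  = 4\,831\,838\,208 \ \text{bytes} = 4.5\ \text{GiB} \quad (n_{\text{slots}} = 1)
\]
\[
  4.5\ \text{GiB} \cdot 4 = 18\ \text{GiB} \quad (n_{\text{slots}} = 4)
\]
```

With numbers of that order, a single slot sits comfortably next to a roughly 16 GB GGUF on a 32 GB card, while four slots do not.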
Activation on the host
Enabling it takes a single setting in hosts/dinoxy-nvidia/configuration.nix:
services.llama-server = {
  enable = true;
  # The defaults are already correct for Qwopus3.6.
  # Override here only when changing model.
};
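For the case where the defaults are not what you want, an override could look like the sketch below. It only uses options the module declares; the repository name, file name and chosen values are hypothetical placeholders.

```nix
# hosts/dinoxy-nvidia/configuration.nix (sketch; repo and file names are placeholders)
{ ... }:
{
  services.llama-server = {
    enable = true;

    # Hypothetical model: replace with a real HuggingFace repo and GGUF file.
    mainRepo = "some-org/Some-Model-GGUF";
    mainModel = "Some-Model-Q4_K_M.gguf";

    # Smaller context and a less aggressive KV quant, as an example trade-off.
    contextSize = 131072;
    kvQuant = "q8_0";
  };
}
```

Because the unit's ConditionPathExists points at the new mainModel, the service is skipped after the rebuild until llama-pull has fetched that file.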
Claude Code routing
llama-server exposes an OpenAI-compatible API. claude-code-router (ccr) is an npm proxy that translates Claude Code calls into that format. The whole thing is packaged via home-manager:
# home/programs/claude-code-router/default.nix
{
  home.file.".claude-code-router/config.json".text = builtins.toJSON {
    Providers = [{
      name = "llama-local";
      api_base_url = "http://127.0.0.1:11435/v1/chat/completions";
      api_key = "llama"; # llama-server ignores the key
      models = [ "Qwopus3.6-27B" ];
    }];
    Router = {
      default = "llama-local,Qwopus3.6-27B";
      # … other tasks (background, think, longContext, webSearch)
    };
  };
}
And a claude-local wrapper that runs Claude Code with ccr:
# home/scripts/claude-local/default.nix
claude-local = pkgs.writeShellScriptBin "claude-local" ''
  if ! ${pkgs.curl}/bin/curl -sf http://127.0.0.1:11435/v1/models >/dev/null; then
    echo "✗ llama-server not reachable" ; exit 1
  fi
  exec ${pkgs.nodejs}/bin/npx -y @musistudio/claude-code-router@latest code "$@"
'';
The official claude CLI stays plugged into Anthropic. Only claude-local routes locally.
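The port 11435 is hardcoded in the ccr config and in the wrapper, separately from the port option of the NixOS module. One way to keep them in sync is to read the system option from home-manager. This is a sketch that assumes home-manager is loaded as a NixOS module, which is what makes the osConfig argument available. It is not how the repo currently does it.

```nix
# home/scripts/claude-local/default.nix (sketch; assumes home-manager runs as a NixOS module)
{ pkgs, osConfig, ... }:
let
  # Read host and port from the system-level module instead of repeating them.
  llama = osConfig.services.llama-server;
  baseUrl = "http://${llama.host}:${toString llama.port}";

  claude-local = pkgs.writeShellScriptBin "claude-local" ''
    if ! ${pkgs.curl}/bin/curl -sf ${baseUrl}/v1/models >/dev/null; then
      echo "✗ llama-server not reachable at ${baseUrl}"
      exit 1
    fi
    exec ${pkgs.nodejs}/bin/npx -y @musistudio/claude-code-router@latest code "$@"
  '';
in
{
  home.packages = [ claude-local ];
}
```

The same baseUrl binding could feed api_base_url in the ccr config, so that changing services.llama-server.port on the host propagates everywhere on the next rebuild.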
The user workflow
Once the config is in place:
# First setup
sudo nixos-rebuild switch --flake ~/.config/nixos#dinoxy-nvidia
sudo llama-pull # downloads ~17 GB of GGUFs → /var/lib/llama/models/
# auto-starts the service when done
# Verify
systemctl status llama-server
curl -sf http://127.0.0.1:11435/v1/models
# Use
claude-local # Claude Code routed to the local GPU
# Free VRAM
sudo systemctl stop llama-server # frees 26 GB of VRAM
nvidia-smi # confirm
# Switch model
$EDITOR hosts/dinoxy-nvidia/configuration.nix # change mainRepo / mainModel
sudo nixos-rebuild switch --flake .
sudo llama-pull # downloads the new one, auto-restart
# Switch quant (Q4_K_M → Q3_K_M for more speed)
# Same workflow: edit mainModel, rebuild, llama-pull.
Everything is versioned in git. If I break something, nixos-rebuild switch --rollback brings me back to the previous state in 30 seconds.
Series wrap-up
I started the night with one expectation — triple my 66 tok/s thanks to speculative decoding. I end it with a cleaner config, a clear understanding of why the speed plateaus, and a setup I can version and share.
What works:
- Qwopus3.6-27B locally on RTX 5090, 66 tok/s, 262K context.
- Transparent Claude Code routing (claude-local).
- 100 % declarative config, 30s rollback.
- ~26 GB VRAM used, instantly releasable with systemctl stop.
What doesn’t work (yet):
- Speculative decoding on hybrid attention+SSM architectures.
- Going past 66 tok/s without an aggressive quant.
- 500K+ context simultaneously with spec-dec (VRAM limit).
All the code is in my public NixOS config repo if you want inspiration.
Thanks for reading.