Noureddine RAMDI / The NixOS Setup for llama.cpp: Declarative and Reproducible (Part 4/4)

Created Tue, 28 Apr 2026 00:00:00 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

The previous three articles covered why I did this, how I tested it, and why it doesn’t always work. This one is the concrete what: the Nix files actually running on my machine, annotated and ready to copy-paste.

Everything lives in my config repo (a multi-host Nix flake mono-repo). I’m showing only the relevant pieces here.

Part 4 of a 4-part series on running an LLM locally:

  1. Why I serve Qwen3.6 locally on my RTX 5090
  2. Hunting tokens/sec: 4 backends, 1 ceiling
  3. Speculative decoding meets hybrid architectures: why it breaks
  4. The NixOS setup, declarative and reproducible (this article)

The systemd module: nixos/llama-server.nix

This is the core. A standard NixOS module with a services.llama-server namespace, defining:

  • The systemd service that launches llama-server with the right flags.
  • A dedicated llama system user, owner of /var/lib/llama/.
  • A llama-pull helper that bootstraps GGUFs from HuggingFace.
  • All the tuning options (model, context, port, KV quant…).

{ config, pkgs, lib, ... }:

let
  cfg = config.services.llama-server;

  # llama.cpp from nixpkgs unstable. No external flake input: the CUDA
  # build with sm_120 (Blackwell) is already wired in nixpkgs.
  llamaCppCuda = pkgs.unstable.llama-cpp.override { cudaSupport = true; };

  # Python env to download GGUFs from HuggingFace.
  hfEnv = pkgs.python3.withPackages (ps: [ ps.huggingface-hub ]);

  # One-shot helper: pull main + draft GGUFs to /var/lib/llama/models/.
  llama-pull = pkgs.writeShellScriptBin "llama-pull" ''
    set -euo pipefail
    if [ "$(id -u)" -ne 0 ]; then echo "✗ Run as root: sudo llama-pull"; exit 1; fi

    MODELS_DIR="${cfg.modelsDir}"
    install -d -m 0755 -o llama -g llama "$MODELS_DIR"

    pull() {
      local repo="$1" file="$2"
      [ -f "$MODELS_DIR/$file" ] && { echo "✓ $file already present"; return; }
      ${hfEnv}/bin/huggingface-cli download "$repo" "$file" \
        --local-dir "$MODELS_DIR" --local-dir-use-symlinks False
      chown llama:llama "$MODELS_DIR/$file"
    }

    pull "${cfg.mainRepo}"  "${cfg.mainModel}"
    pull "${cfg.draftRepo}" "${cfg.draftModel}"

    # Auto-(re)start after pull. With ConditionPathExists, the service
    # is "skipped" until the GGUF is present — the pull is what
    # unblocks the first start.
    if ${pkgs.systemd}/bin/systemctl is-enabled --quiet llama-server.service 2>/dev/null; then
      ${pkgs.systemd}/bin/systemctl restart llama-server
    fi
  '';
in
{
  options.services.llama-server = {
    enable = lib.mkEnableOption "llama.cpp server (CUDA)";

    package      = lib.mkOption { default = llamaCppCuda; };
    modelsDir    = lib.mkOption { default = "/var/lib/llama/models"; };
    mainRepo     = lib.mkOption { default = "Jackrong/Qwopus3.6-27B-v1-preview-GGUF"; };
    mainModel    = lib.mkOption { default = "Qwopus3.6-27B-v1-preview-Q4_K_M.gguf"; };
    draftRepo    = lib.mkOption { default = "lmstudio-community/Qwen3-1.7B-GGUF"; };
    draftModel   = lib.mkOption { default = "Qwen3-1.7B-Q4_K_M.gguf"; };
    contextSize  = lib.mkOption { default = 262144; };  # native Qwen3.6
    kvQuant      = lib.mkOption { default = "q4_0"; };
    port         = lib.mkOption { default = 11435; };
    host         = lib.mkOption { default = "127.0.0.1"; };
    useSpeculativeDecoding = lib.mkOption { default = false; };  # cf. part 3
    # … (other options: draftPMin, ropeScaling, extraArgs, etc.)
  };

  config = lib.mkIf cfg.enable {
    environment.systemPackages = [ cfg.package llama-pull ];

    systemd.services.llama-server = {
      description = "llama.cpp inference server (CUDA)";
      wantedBy = [ "multi-user.target" ];

      # No crash loop: if the GGUF is missing, the service is skipped.
      # `llama-pull` is what unblocks it.
      unitConfig.ConditionPathExists = "${cfg.modelsDir}/${cfg.mainModel}";

      serviceConfig = {
        User = "llama";
        Group = "llama";
        Restart = "on-failure";
        ProtectSystem = "strict";
        ReadOnlyPaths = [ cfg.modelsDir ];

        ExecStart = lib.escapeShellArgs (
          [
            "${cfg.package}/bin/llama-server"
            "-m" "${cfg.modelsDir}/${cfg.mainModel}"
            "-ngl" "99"
            "-c" (toString cfg.contextSize)
            "-fa" "on"
            "-ctk" cfg.kvQuant "-ctv" cfg.kvQuant
            "--metrics"
            "--parallel" "1"     # single-user, frees VRAM
            "--host" cfg.host
            "--port" (toString cfg.port)
          ]
          ++ lib.optionals cfg.useSpeculativeDecoding [
            "-md" "${cfg.modelsDir}/${cfg.draftModel}"
            "-ngld" "99"
            # … spec-dec flags
          ]
        );
      };
    };

    users.users.llama  = { isSystemUser = true; group = "llama"; };
    users.groups.llama = { };

    systemd.tmpfiles.rules = [
      "d /var/lib/llama 0755 llama llama -"
      "d ${cfg.modelsDir} 0755 llama llama -"
    ];
  };
}
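
The option declarations in the module above carry no type, so a typo such as a string port is only caught when llama-server starts. Below is a minimal sketch of typed variants for a few of them, using the standard lib.types helpers. The descriptions and the list of accepted kvQuant values are my assumptions, not taken from the original module, and these declarations would replace the matching untyped ones rather than sit beside them.

```nix
# Sketch only: typed variants of a few options from nixos/llama-server.nix.
# Replace the matching untyped declarations; do not declare them twice.
{ lib, ... }:
{
  options.services.llama-server = {
    modelsDir = lib.mkOption {
      type = lib.types.path;
      default = "/var/lib/llama/models";
      description = "Directory holding the GGUF files (kept outside /nix/store).";
    };

    contextSize = lib.mkOption {
      type = lib.types.ints.positive;
      default = 262144;
      description = "Context window passed to llama-server via -c.";
    };

    kvQuant = lib.mkOption {
      # Assumption: an illustrative subset of the cache types accepted by -ctk/-ctv.
      type = lib.types.enum [ "f16" "q8_0" "q4_0" ];
      default = "q4_0";
      description = "KV cache quantization, used for both -ctk and -ctv.";
    };

    port = lib.mkOption {
      type = lib.types.port;
      default = 11435;
      description = "TCP port llama-server listens on.";
    };

    useSpeculativeDecoding = lib.mkOption {
      type = lib.types.bool;
      default = false;
      description = "Whether to load the draft model (see part 3).";
    };
  };
}
```

With types in place, a bad value fails at evaluation time during nixos-rebuild instead of at service start.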

A few points that cost me time and are worth copying:

  • ConditionPathExists instead of a preStart shell check. Without it, the service crash-loops every 5 seconds until the model has been pulled, and nixos-rebuild switch exits with code 4.
  • --parallel 1. The default is 4, which multiplies the KV cache size by 4. On 32 GB VRAM with a 27B Q4, that’s an OOM.
  • Models outside /nix/store. They live in /var/lib/llama/models/ and are managed by llama-pull, so they are independent of Nix rebuilds. Putting a 16 GB GGUF in the store would re-hash it on every change and bloat derivation closures.
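
ConditionPathExists only covers the missing-model case. If the server fails for another reason, Restart = "on-failure" still applies. Here is a minimal sketch of bounding those retries from the host configuration, assuming you would rather have the unit give up than keep retrying. The numbers are illustrative, not values I measured.

```nix
# Hypothetical host-level addition; merges with the module's unit definition.
{
  systemd.services.llama-server = {
    # Give up after 3 failed starts within 5 minutes.
    startLimitIntervalSec = 300;
    startLimitBurst = 3;

    # Pause between attempts so a repeated failure does not become a tight loop.
    serviceConfig.RestartSec = "10s";
  };
}
```

Because serviceConfig is merged attribute by attribute, this leaves the User, ProtectSystem and ExecStart settings from the module untouched.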

Activation on the host

A single block in hosts/dinoxy-nvidia/configuration.nix:

services.llama-server = {
  enable = true;
  # The defaults are already correct for Qwopus3.6.
  # Override here only when changing model.
};
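
For that block to evaluate, the host also has to import the module, and pkgs.unstable has to exist. The sketch below shows one common way to wire both. It assumes a flake input named nixpkgs-unstable passed to the host through specialArgs as inputs, and that both directories sit at the repo root. The Q3_K_M file name is hypothetical and only illustrates an override.

```nix
# hosts/dinoxy-nvidia/configuration.nix (sketch under the assumptions above)
{ inputs, ... }:
{
  imports = [ ../../nixos/llama-server.nix ];

  # Assumption: this overlay is what makes `pkgs.unstable` resolve.
  nixpkgs.overlays = [
    (final: prev: {
      unstable = import inputs.nixpkgs-unstable {
        inherit (prev.stdenv.hostPlatform) system;
        config.allowUnfree = true; # the CUDA toolkit is unfree
      };
    })
  ];

  services.llama-server = {
    enable = true;
    # Hypothetical file name for a smaller quant; check the repo listing first.
    mainModel = "Qwopus3.6-27B-v1-preview-Q3_K_M.gguf";
  };
}
```

After changing mainModel, the unit's ConditionPathExists points at the new file, so the service stays skipped until llama-pull has fetched it.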

Claude Code routing

llama-server exposes an OpenAI-compatible API. claude-code-router (ccr) is a proxy, distributed via npm, that translates Claude Code calls into that format. The whole thing is packaged via home-manager:

# home/programs/claude-code-router/default.nix
{
  home.file.".claude-code-router/config.json".text = builtins.toJSON {
    Providers = [{
      name = "llama-local";
      api_base_url = "http://127.0.0.1:11435/v1/chat/completions";
      api_key = "llama";  # llama-server ignores the key
      models = [ "Qwopus3.6-27B" ];
    }];
    Router = {
      default = "llama-local,Qwopus3.6-27B";
      # … other tasks (background, think, longContext, webSearch)
    };
  };
}
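
The URL above repeats the host and port already set in services.llama-server. Here is a minimal sketch of deriving it from the system configuration instead, assuming home-manager is used as a NixOS module (that is what provides the osConfig argument; it is not available in a standalone home-manager setup).

```nix
# home/programs/claude-code-router/default.nix (variant, sketch)
{ osConfig, ... }:
let
  llama = osConfig.services.llama-server;
  # Built from the NixOS options, so changing the port in one place is enough.
  baseUrl = "http://${llama.host}:${toString llama.port}";
in
{
  home.file.".claude-code-router/config.json".text = builtins.toJSON {
    Providers = [{
      name = "llama-local";
      api_base_url = "${baseUrl}/v1/chat/completions";
      api_key = "llama"; # llama-server ignores the key
      models = [ "Qwopus3.6-27B" ];
    }];
    Router = {
      default = "llama-local,Qwopus3.6-27B";
      # … other tasks, unchanged
    };
  };
}
```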

And a claude-local wrapper that runs Claude Code with ccr:

# home/scripts/claude-local/default.nix
claude-local = pkgs.writeShellScriptBin "claude-local" ''
  if ! ${pkgs.curl}/bin/curl -sf http://127.0.0.1:11435/v1/models >/dev/null; then
    echo "✗ llama-server not reachable" ; exit 1
  fi
  exec ${pkgs.nodejs}/bin/npx -y @musistudio/claude-code-router@latest code "$@"
'';
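
The snippet above is only the binding. Here is a sketch of how it could sit in a complete home-manager module. I swapped in pkgs.writeShellApplication, which puts runtimeInputs on PATH and checks the script with shellcheck at build time. That swap is my choice, not necessarily what the original file does.

```nix
# home/scripts/claude-local/default.nix (sketch of the surrounding module)
{ pkgs, ... }:
let
  claude-local = pkgs.writeShellApplication {
    name = "claude-local";
    # curl and npx are resolved from these packages, no absolute paths needed.
    runtimeInputs = [ pkgs.curl pkgs.nodejs ];
    text = ''
      if ! curl -sf http://127.0.0.1:11435/v1/models >/dev/null; then
        echo "✗ llama-server not reachable" >&2
        exit 1
      fi
      exec npx -y @musistudio/claude-code-router@latest code "$@"
    '';
  };
in
{
  home.packages = [ claude-local ];
}
```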

The official claude CLI stays plugged into Anthropic. Only claude-local routes locally.

The user workflow

Once the config is in place:

# First setup
sudo nixos-rebuild switch --flake ~/.config/nixos#dinoxy-nvidia
sudo llama-pull       # downloads ~17 GB of GGUFs → /var/lib/llama/models/
                      # auto-starts the service when done

# Verify
systemctl status llama-server
curl -sf http://127.0.0.1:11435/v1/models

# Use
claude-local         # Claude Code routed to the local GPU

# Free VRAM
sudo systemctl stop llama-server     # frees 26 GB of VRAM
nvidia-smi                           # confirm

# Switch model
$EDITOR hosts/dinoxy-nvidia/configuration.nix    # change mainRepo / mainModel
sudo nixos-rebuild switch --flake .
sudo llama-pull       # downloads the new one, auto-restart

# Switch quant (Q4_K_M → Q3_K_M for more speed)
# Same workflow: edit mainModel, rebuild, llama-pull.

Everything is versioned in git. If I break something, nixos-rebuild switch --rollback brings me back to the previous state in 30 seconds.

Series wrap-up

I started the night with one expectation: tripling my 66 tok/s with speculative decoding. I end it with a cleaner config, a clear understanding of why the speed plateaus, and a setup I can version and share.

What works:

  • Qwopus3.6-27B locally on RTX 5090, 66 tok/s, 262K context.
  • Transparent Claude Code routing (claude-local).
  • 100% declarative config, 30-second rollback.
  • ~26 GB VRAM used, instantly releasable with systemctl stop.

What doesn’t work (yet):

  • Speculative decoding on hybrid attention+SSM architectures.
  • Going past 66 tok/s without an aggressive quant.
  • 500K+ context simultaneously with spec-dec (VRAM limit).

All the code is in my public NixOS config repo if you want inspiration.

Thanks for reading.