Zinc: A Zig-based LLM inference engine optimized for AMD RDNA and Apple Silicon GPUs

Large language model (LLM) inference engines typically rely on CUDA or ROCm to run on GPUs, locking developers into specific hardware ecosystems. Zinc takes a different path: it’s a hand-tuned LLM inference engine built in Zig, targeting AMD RDNA3/RDNA4 GPUs via Vulkan and Apple Silicon through Metal. This means it sidesteps the usual CUDA/ROCm stack, bringing competitive throughput to hardware often neglected by mainstream ML frameworks.

zinc’s architecture and core capabilities

Zinc compiles to a single binary that embeds platform-specific shader pipelines. On AMD RDNA4 GPUs, it uses Vulkan compute shaders optimized with wave64 and cooperative matrix operations. For Apple Silicon, it leverages native Metal Shading Language (MSL) kernels with simdgroup operations and zero-copy memory mapping to handle model data efficiently.

The engine supports GGUF quantized models ranging from Q4_K up to full F32 precision. This flexibility allows balancing memory footprint and inference accuracy. Zinc exposes an OpenAI-compatible /v1 API endpoint with support for streaming responses, making it relatively straightforward to integrate with existing LLM applications.

Model management is handled via a CLI with commands to list available models, pull new ones, switch active models, and remove unused ones. The engine also includes a built-in browser chat UI, which serves as a lightweight interface for testing and demos.

The codebase is written entirely in Zig, a language chosen for its low-level control and performance characteristics. Zinc deliberately avoids CUDA, ROCm, or MLX, focusing instead on Vulkan and Metal to cover AMD RDNA and Apple Silicon, respectively.

technical strengths and tradeoffs

One of Zinc’s distinguishing features is its hand-crafted GPU shader pipelines tailored for each target architecture. The RDNA4 path uses wave64 and cooperative matrix shaders, which are advanced GPU programming techniques to maximize throughput by harnessing the hardware’s wavefront and matrix multiplication capabilities. On the Apple Silicon side, the use of native Metal shaders with simdgroup ops and zero-copy memory mapping is a smart optimization to reduce overhead and latency.

The choice to avoid CUDA and ROCm is both a strength and a limitation. It means Zinc can run on consumer AMD GPUs and Apple Silicon without relying on NVIDIA-centric toolchains, which are often the default in ML workloads. However, this also means the project doesn’t benefit from the mature ecosystem and tooling available for CUDA. Performance tuning is more mature on the AMD RDNA4 side, with ongoing work to optimize the Apple Silicon path further.

Code quality reflects a low-level systems approach typical of Zig projects: it’s explicit, performance-focused, and avoids unnecessary dependencies. This makes Zinc a good study in custom GPU compute pipeline design outside the usual CUDA ecosystem. The API compatibility with OpenAI’s interface is a pragmatic choice that lowers integration friction.

Benchmarks from the author show Zinc achieving 115.8 tokens per second prefill throughput on the Qwen 3 8B dense model running on AMD RDNA4 Radeon AI PRO R9700. This figure is comparable to other popular inference engines like llama.cpp on the same hardware, indicating Zinc is competitive despite its narrower platform focus.

quick start

prerequisites

Tool	Install
Zig 0.15.2+	ziglang.org/download
Vulkan loader + tools	`apt install libvulkan-dev vulkan-tools` (Linux) or `brew install vulkan-loader vulkan-headers` (macOS)
`glslc` on Linux	`apt install glslc`
Bun for tests and the docs site	`curl -fsSL https://bun.sh/install \| bash`

Important: On Linux with RDNA4, newer glslc releases can cause a large regression. Use the system package version.

build zinc

git clone https://github.com/zolotukhin/zinc.git
cd zinc

### Prerequisites

| Tool | Install |
|------|---------|
| Zig 0.15.2+ | ziglang.org/download |
| Vulkan loader + tools | `apt install libvulkan-dev vulkan-tools` (Linux) or `brew install vulkan-loader vulkan-headers` (macOS) |
| `glslc` on Linux | `apt install glslc` |
| Bun for tests and the docs site | `curl -fsSL https://bun.sh/install \| bash` |

**Important**: On Linux with RDNA4, newer `glslc` releases can cause a large regression. Use the system package version.

exploring the project structure

The source code is organized with a focus on platform-specific shader implementations and core inference logic. Key areas include Vulkan compute shaders for AMD GPUs and Metal Shading Language kernels for Apple Silicon. The CLI tools for model management and the embedded chat UI are also part of the repo.

The README provides extensive documentation on the supported quantization formats, shader pipeline design, and API usage. Benchmarks and performance tuning notes are available on the author’s website at zolotukhin.ai/zinc/benchmarks.

verdict

Zinc is a niche but well-executed LLM inference engine that fills a gap for AMD RDNA and Apple Silicon GPU users who want to avoid NVIDIA’s CUDA ecosystem. Its use of Zig and hand-tuned GPU shaders shows a deep understanding of low-level GPU programming and platform-specific optimization.

It’s particularly relevant for developers with the expertise and interest to dive into Vulkan and Metal shader design, or those targeting consumer AMD GPUs and Apple Silicon hardware for LLM inference. The project is less suitable if you need broad hardware support or rely heavily on existing CUDA-based ML tooling.

The uneven optimization status—more mature on AMD RDNA4 than on Apple Silicon—means that while Zinc is promising, it might require patience and community contributions to reach full potential on all platforms.

Overall, Zinc is worth exploring if you’re looking to understand alternative GPU inference paths or want a compact, single-binary engine focused on these specific architectures.

vLLM: Efficient large language model serving with paged attention and continuous batching — vLLM is a Python library for high-throughput LLM inference using paged attention and continuous batching. It supports qu
LlamaFactory: modular, extensible fine-tuning framework for large language models — LlamaFactory offers a modular Python framework for fine-tuning 100+ LLMs with diverse algorithms and optimizations, incl
Jan: a local-first desktop app for large language models with Tauri and Rust — Jan is an open-source desktop app that runs large language models locally using Tauri, Node.js, and Rust. It offers priv

→ GitHub Repo: zolotukhin/zinc ⭐ 371 · Zig

Noureddine RAMDI / Zinc: A Zig-based LLM inference engine optimized for AMD RDNA and Apple Silicon GPUs