Noureddine RAMDI / Lucebox Hub: hand-optimized CUDA kernels for efficient LLM inference on RTX 3090 and beyond

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

Luce-Org/luce-megakernel

Lucebox Hub tackles a tough problem: squeezing the most out of consumer Nvidia GPUs, especially the RTX 3090, for large language model (LLM) inference. Instead of waiting for new hardware to improve performance, it rewrites CUDA kernels for each GPU generation, aiming to hit the power limit of the device rather than the compute limit. The approach goes deep into GPU programming and model quantization, trading flexibility for raw throughput.

What Lucebox Hub does and its architecture

Lucebox Hub is a collection of hand-optimized LLM inference projects targeting specific Nvidia GPUs, primarily the RTX 3090 (Ampere architecture), but also supporting newer architectures including Ada (RTX 40xx), Blackwell (RTX 50xx), and even Jetson AGX Thor.

The repo features two main projects:

  • Megakernel: A single CUDA kernel dispatch that runs all 24 transformer layers of the Qwen3.5 0.8B model in one go. This eliminates the overhead of launching multiple kernels per token, cutting roughly 100 kernel launches per token down to one. The kernel runs with 82 blocks of 512 threads each (82 matches the RTX 3090's streaming multiprocessor count), using a persistent kernel and cooperative grid synchronization to maximize utilization within VRAM constraints.

  • DFlash DDtree: Implements speculative decoding for larger Qwen3.5/3.6 27B models in GGUF format. Speculative decoding accelerates autoregressive inference by proposing multiple tokens at once with a draft model and verifying them with a full model in a single forward pass. DFlash uses a compressed KV cache with a custom quantization scheme (TQ3_0) to enable long context windows (up to 256K tokens in 24GB VRAM) while maintaining decoding throughput.
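To make the draft-then-verify loop concrete, here is a minimal sketch of one speculative decoding step with greedy acceptance. It is a toy under stated assumptions, not DFlash's implementation: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, the "models" are simple integer functions, and the proposal is a linear chain rather than the tree that the DDtree name suggests. In a real system the target model scores all proposed positions in one batched forward pass; the verification loop below only stands in for that.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens committed by one draft-then-verify step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted

# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after a 3.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

step = speculative_step([0], draft_next, target_next, k=4)  # [1, 2, 3, 4]
```

Here the draft gets three tokens right and one wrong, so a single verification step still commits four tokens: three accepted drafts plus the target's correction. The output always matches what the target alone would have produced, which is why the speedup costs no accuracy under greedy decoding.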

The core innovations revolve around rewriting kernels per GPU generation rather than relying on generic CUDA kernels. This includes persistent kernels that stay resident on the GPU, cooperative grid synchronization techniques, and custom quantization formats tailored for efficient KV cache compression.

The codebase is written in C++/CUDA with ggml integration, targeting CUDA 12+ and modern Nvidia GPUs from Ampere through Blackwell, including DGX Spark environments.

What makes Lucebox Hub’s megakernel approach interesting

The megakernel concept is the standout technical feature here. By fitting all transformer layers of the model into a single CUDA dispatch, it sidesteps the classic bottleneck of kernel launch overhead, which can be a big deal when you have dozens of layers and hundreds of tokens to process.

This approach requires deep knowledge of CUDA programming patterns:

  • Persistent kernel: The kernel is launched once and stays resident, avoiding repeated launches.
  • Cooperative grid sync: Blocks communicate and synchronize efficiently inside the kernel to implement the transformer logic across layers.
  • Per-chip tuning: Each GPU architecture gets its own carefully tuned kernel, exploiting hardware specifics like streaming multiprocessor counts and cache sizes.
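The first two patterns can be illustrated with a CPU analogy. The sketch below is not CUDA and not the project's code: Python threads stand in for thread blocks, `threading.Barrier` stands in for a grid-wide sync, and the "layer" is a made-up neighbour sum. All function names are hypothetical. It shows the shape of the idea: workers are started once, loop over every layer themselves, and wait at a barrier between layers instead of returning to the host for a new launch.

```python
import threading
from typing import List

def layer_reference(values: List[int]) -> List[int]:
    """One toy 'layer': each element adds its right-hand neighbour."""
    n = len(values)
    return [values[i] + values[(i + 1) % n] for i in range(n)]

def run_launch_per_layer(values: List[int], num_layers: int) -> List[int]:
    """Baseline: the host loop issues one 'launch' per layer."""
    for _ in range(num_layers):
        values = layer_reference(values)
    return values

def run_persistent(values: List[int], num_layers: int) -> List[int]:
    """Workers start once and stay resident across all layers."""
    n = len(values)
    buffers = [list(values), [0] * n]  # double buffer: read one, write the other
    barrier = threading.Barrier(n)     # stand-in for a grid-wide sync

    def worker(rank: int) -> None:
        for layer in range(num_layers):
            src = buffers[layer % 2]
            dst = buffers[(layer + 1) % 2]
            dst[rank] = src[rank] + src[(rank + 1) % n]
            # Nobody starts the next layer until every worker has written.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffers[num_layers % 2]

persistent_result = run_persistent([1, 2, 3, 4], num_layers=3)
baseline_result = run_launch_per_layer([1, 2, 3, 4], num_layers=3)
```

Both paths compute the same values. The difference is where the layer loop lives: in the baseline it sits on the host, and in the persistent version it sits inside the workers. The barrier is needed because each worker reads a neighbour's result from the previous layer.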

The tradeoff is less flexibility — this megakernel is tightly coupled to a specific model size and GPU architecture. But the benchmarks reflect the payoff:

  • Megakernel achieves 21,347 tokens per second prefill, 413 tokens per second decode, and 1.87 tokens per joule at 220W power draw on RTX 3090.
  • This outperforms llama.cpp BF16, which gets 11,247 tokens per second prefill, 267 tokens per second decode, and 0.76 tokens per joule at 350W on the same hardware.
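The tokens-per-joule figures can be sanity-checked from the other numbers. Since a watt is a joule per second, dividing decode throughput by power draw gives tokens per joule. The short sketch below assumes the efficiency figures are derived from decode throughput, which the article does not state outright but which the arithmetic supports. Variable names are illustrative.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """(tokens / s) / (J / s) = tokens / J."""
    return tokens_per_second / watts

# Figures quoted in the article for the RTX 3090.
megakernel_eff = tokens_per_joule(413, 220)  # quoted as 1.87
llama_cpp_eff = tokens_per_joule(267, 350)   # quoted as 0.76

prefill_speedup = 21347 / 11247
decode_speedup = 413 / 267
efficiency_gain = megakernel_eff / llama_cpp_eff

print(f"megakernel: {megakernel_eff:.3f} tok/J, llama.cpp: {llama_cpp_eff:.3f} tok/J")
print(f"prefill x{prefill_speedup:.2f}, decode x{decode_speedup:.2f}, "
      f"efficiency x{efficiency_gain:.2f}")
```

Run as written, the ratios come out to roughly 1.9× for prefill, 1.5× for decode, and about 2.5× for tokens per joule. The efficiency gain is larger than the decode speedup because the megakernel figure is measured at a lower power draw.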

For speculative decoding, the DFlash DDtree project reaches up to 207 tokens per second in a demo run, with a mean throughput of 129.5 tokens per second on HumanEval, about 3.4× faster than plain autoregressive decoding.

The custom quantization schemes (TQ3_0 KV cache) allow efficient compression of key-value caches, enabling longer context windows without exploding VRAM usage.
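To see why KV cache precision matters at long context, here is a back-of-the-envelope sizing sketch. Everything in it is an assumption for illustration: the layer count, KV head count, and head dimension are a hypothetical model shape and not the real Qwen dimensions, and the article does not state the effective bits per value of TQ3_0, so the 3-bit figure is only a stand-in that ignores any per-block scale overhead.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bits_per_value: float) -> float:
    """Size of a dense KV cache: one K and one V vector per layer, head, and token."""
    values = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8

GIB = 2 ** 30

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(num_layers=32, num_kv_heads=4, head_dim=128,
             context_tokens=256 * 1024)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / GIB
three_bit_gib = kv_cache_bytes(**shape, bits_per_value=3) / GIB

print(f"16-bit cache: {fp16_gib:.1f} GiB, 3-bit cache: {three_bit_gib:.1f} GiB")
```

For this made-up shape, a 16-bit cache at 256K tokens would take 16 GiB, while a 3-bit cache would take 3 GiB. The cache grows linearly with context length, so a lower bit width translates directly into a longer window for the same VRAM budget.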

Under the hood, the code is surprisingly clean for such low-level optimization, showing an opinionated but pragmatic approach to CUDA kernel design and quantization.

Explore the project

The repo’s README provides a comprehensive list of supported GPUs and CUDA version requirements:

  • Ampere (RTX 3090 and A-series), CUDA 12+
  • Ada (RTX 40xx), CUDA 12+ (unverified)
  • Blackwell (RTX 50xx), CUDA 12.8+
  • DGX Spark / GB10, CUDA 12.9+
  • Jetson AGX Thor, CUDA 13+
  • Turing (RTX 2080), CUDA 12+

You’ll find the main projects under megakernel/ and dflash/ directories.

The megakernel project auto-detects GPU capabilities at build time using PyTorch’s torch.cuda.get_device_capability(), so no manual tuning is needed for different GPUs within the supported architectures.
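As a rough illustration of what such detection can look like, the sketch below calls `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple, and maps it to an architecture name and an `sm_XY` tag. The mapping table and the helper names (`detect_capability`, `arch_name`, `sm_tag`) are assumptions for this example, not the project's build logic. The table only lists compute capabilities for Turing, Ampere, Ada, and consumer Blackwell, and anything else is reported as unknown.

```python
from typing import Optional, Tuple

Capability = Tuple[int, int]

# Illustrative subset of compute capabilities; not the project's own table.
ARCH_BY_CAPABILITY = {
    (7, 5): "Turing",     # e.g. RTX 2080
    (8, 0): "Ampere",     # e.g. A100
    (8, 6): "Ampere",     # e.g. RTX 3090
    (8, 9): "Ada",        # e.g. RTX 40xx
    (12, 0): "Blackwell", # e.g. RTX 50xx
}

def detect_capability() -> Optional[Capability]:
    """Return the current GPU's (major, minor), or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()

def arch_name(capability: Capability) -> str:
    return ARCH_BY_CAPABILITY.get(tuple(capability), "unknown")

def sm_tag(capability: Capability) -> str:
    """Format the capability the way nvcc architecture names are written."""
    major, minor = capability
    return f"sm_{major}{minor}"

cap = detect_capability()
if cap is None:
    print("No CUDA device visible; falling back to a default build target.")
else:
    print(f"Detected {arch_name(cap)} ({sm_tag(cap)})")
```

A build script could use a tag like this to pick which tuned kernel variant to compile. On a machine without PyTorch or a GPU, the sketch simply reports that nothing was detected.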

The dflash/ directory requires CMake 3.18+ and pulls in a pinned fork of llama.cpp with custom ggml ops through recursive git submodules.

If you want to understand or contribute, start with these directories and the README’s detailed explanation of the quantization formats and kernel design. The repo does not provide a simple install script or quickstart commands, reflecting its niche focus on low-level CUDA kernel optimization rather than plug-and-play deployment.

Verdict

Lucebox Hub is a treasure trove for anyone interested in GPU kernel optimization, transformer inference, or pushing consumer GPUs beyond their usual limits.

The repo’s megakernel approach shows how software engineering can extract more performance from existing hardware by rethinking the kernel launch model and memory usage patterns. Its speculative decoding with DFlash complements this by tackling throughput bottlenecks in large-model inference.

That said, this project is not a general-purpose LLM inference framework. It demands a compatible Nvidia GPU, CUDA 12+, and a willingness to dig into CUDA C++ code and custom quantization schemes. It is not plug-and-play, but it is a great resource if you want to understand or build highly optimized inference engines targeting Ampere through Blackwell architectures.

For production use, the tradeoffs are clear: specialized kernels tuned for specific GPUs and models offer big performance gains but require ongoing maintenance as hardware evolves.

If you’re working on LLM inference pipelines where throughput and power efficiency on consumer GPUs matter, Lucebox Hub is worth diving into for ideas and code.


→ GitHub Repo: Luce-Org/luce-megakernel ⭐ 1,642 · C++