Gemma-gem: running large language models in Chrome with WebGPU acceleration

Running large language models (LLMs) in the browser is no small feat. The gemma-gem repository tackles this challenge by harnessing WebGPU in Chrome to accelerate model inference directly on the GPU, all within a TypeScript-based Chrome extension. This approach brings LLM execution closer to the user without relying on server round-trips, but it comes with technical tradeoffs and clear prerequisites.

What gemma-gem does and how it’s built

At its core, gemma-gem is a Chrome extension written in TypeScript that runs large language models like E2B and E4B using the WebGPU API. WebGPU is a modern browser API that exposes the GPU for both rendering and general-purpose compute, and the compute side is what makes performant machine learning inference possible in the browser.
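Before any model work can start, a WebGPU application has to confirm that a GPU device is actually reachable. The sketch below shows the standard probe sequence (`navigator.gpu`, then `requestAdapter()`, then `requestDevice()`). The function name `probeWebGPU` and the small structural interfaces are assumptions made for illustration so the example compiles on its own; they are not taken from the gemma-gem source.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterLike {
  requestDevice(): Promise<unknown>;
}
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

type GpuProbe =
  | { ok: true; device: unknown }
  | { ok: false; reason: string };

// Probe for a usable WebGPU device. `gpu` is the value a page sees as
// `navigator.gpu`; it is undefined where WebGPU is not available.
async function probeWebGPU(gpu: GpuLike | undefined): Promise<GpuProbe> {
  if (!gpu) {
    return { ok: false, reason: "navigator.gpu is not available" };
  }
  // requestAdapter() resolves to null when no suitable GPU is found.
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: "no suitable GPU adapter" };
  }
  try {
    const device = await adapter.requestDevice();
    return { ok: true, device };
  } catch (err) {
    return { ok: false, reason: `requestDevice failed: ${String(err)}` };
  }
}
```

In a page or extension context this would be called with the browser's own `navigator.gpu` object. Passing the object in as a parameter keeps the function easy to exercise with a stub.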

The project works with two model sizes: E2B needs about 500MB of disk and E4B about 1.5GB, and both are cached after the first run. Inference runs on the GPU through WebGPU compute shaders. This setup sidesteps the usual bottleneck of CPU-bound JavaScript execution for ML tasks, aiming for more responsive and scalable model runs.
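Given cache sizes of this order, it can be worth checking free storage before fetching a model. The following is a minimal sketch, assuming the cache lives in origin storage whose quota is reported by `navigator.storage.estimate()`. The byte figures are the approximate sizes quoted above; `hasRoomFor` and the 10% headroom are hypothetical choices, not something gemma-gem is known to do.

```typescript
// Approximate cache sizes quoted in the article, in bytes (decimal units).
const MODEL_BYTES = {
  E2B: 500 * 1000 * 1000,
  E4B: 1.5 * 1000 * 1000 * 1000,
} as const;

type ModelId = keyof typeof MODEL_BYTES;

// Shape of the object that navigator.storage.estimate() resolves to.
interface StorageEstimateLike {
  quota?: number;
  usage?: number;
}

// True when the remaining quota covers the model plus some headroom.
// Returns false when the browser does not report quota or usage.
function hasRoomFor(
  model: ModelId,
  estimate: StorageEstimateLike,
  headroom: number = 0.1,
): boolean {
  if (estimate.quota === undefined || estimate.usage === undefined) {
    return false;
  }
  const free = estimate.quota - estimate.usage;
  return free >= MODEL_BYTES[model] * (1 + headroom);
}
```

Browsers report quota and usage as estimates rather than exact figures, so a check like this is best treated as an early warning, not a guarantee.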

Architecturally, gemma-gem is structured as a Chrome Manifest V3 extension. The source compiles with pnpm into the .output/chrome-mv3-dev/ directory, which is then loaded as a developer extension in Chrome. The core codebase is in TypeScript, leveraging modern JS bundling and build tools.

The extension contains the WebGPU shader code and the runtime logic that loads, caches, and executes the model files. It depends heavily on Chrome’s WebGPU support. WebGPU has been enabled by default in desktop Chrome since version 113, but availability still varies with operating system, GPU, and driver, so a recent Chromium-based browser is the safe baseline.
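The load-and-cache behaviour described here follows a familiar cache-first pattern: look in local storage, and only fetch when nothing is there. The sketch below illustrates that pattern under stated assumptions. `ByteStore`, `Fetcher`, `loadModelBytes`, and `memoryStore` are hypothetical names; the real extension may use a different storage mechanism entirely.

```typescript
// Hypothetical byte store. In a browser this could wrap the Cache API
// or IndexedDB; here it is just an interface.
interface ByteStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

// Hypothetical downloader, e.g. a thin wrapper around fetch().
type Fetcher = (url: string) => Promise<Uint8Array>;

interface LoadResult {
  bytes: Uint8Array;
  fromCache: boolean;
}

// Cache-first load: return stored bytes if present, otherwise
// download once and store the result for later runs.
async function loadModelBytes(
  url: string,
  store: ByteStore,
  fetchBytes: Fetcher,
): Promise<LoadResult> {
  const cached = await store.get(url);
  if (cached) {
    return { bytes: cached, fromCache: true };
  }
  const bytes = await fetchBytes(url);
  await store.put(url, bytes);
  return { bytes, fromCache: false };
}

// In-memory store, useful for demonstrating the flow without a browser.
function memoryStore(): ByteStore {
  const map = new Map<string, Uint8Array>();
  return {
    get: async (key) => map.get(key),
    put: async (key, bytes) => {
      map.set(key, bytes);
    },
  };
}
```

With this shape, the first call pays the download cost and every later call is served locally, which matches the "cached after first run" behaviour the project describes.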

The use of WebGPU for browser-based LLM inference

What distinguishes gemma-gem is its use of WebGPU to run large language models fully on the client side. Traditional browser ML usually relies on CPU or WebGL with limited performance, or offloads inference to remote APIs. Here, WebGPU exposes compute shaders that can directly operate on GPU buffers for matrix multiplications and other tensor operations needed by LLMs.
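To make the compute-shader idea concrete, here is a minimal sketch of a naive matrix multiplication kernel in WGSL, embedded as a TypeScript string, alongside a CPU reference that computes the same result. This is an illustration of the technique only: the shader, the binding layout, and the names `MATMUL_WGSL`, `matmulReference`, and `dispatchSize` are assumptions, and production LLM kernels are usually tiled, quantized, and far more elaborate.

```typescript
// Workgroup edge length; one GPU thread computes one output element.
const TILE = 8;

// Naive row-major matmul: c (m x n) = a (m x k) * b (k x n).
const MATMUL_WGSL = /* wgsl */ `
struct Dims { m: u32, k: u32, n: u32 }

@group(0) @binding(0) var<uniform> dims: Dims;
@group(0) @binding(1) var<storage, read> a: array<f32>;
@group(0) @binding(2) var<storage, read> b: array<f32>;
@group(0) @binding(3) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(${TILE}, ${TILE})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  // Threads past the matrix edge have nothing to do.
  if (row >= dims.m || col >= dims.n) { return; }
  var acc: f32 = 0.0;
  for (var i = 0u; i < dims.k; i = i + 1u) {
    acc = acc + a[row * dims.k + i] * b[i * dims.n + col];
  }
  c[row * dims.n + col] = acc;
}
`;

// CPU reference with the same indexing, handy for validating GPU output.
function matmulReference(
  a: Float32Array,
  b: Float32Array,
  m: number,
  k: number,
  n: number,
): Float32Array {
  const c = new Float32Array(m * n);
  for (let row = 0; row < m; row++) {
    for (let col = 0; col < n; col++) {
      let acc = 0;
      for (let i = 0; i < k; i++) {
        acc += a[row * k + i] * b[i * n + col];
      }
      c[row * n + col] = acc;
    }
  }
  return c;
}

// Number of workgroups along x (columns) and y (rows) to cover the output.
function dispatchSize(m: number, n: number): [number, number] {
  return [Math.ceil(n / TILE), Math.ceil(m / TILE)];
}
```

In a real pipeline the shader string would be passed to `device.createShaderModule({ code: MATMUL_WGSL })` and the values from `dispatchSize` to `dispatchWorkgroups`. The CPU reference is useful mainly for checking a small GPU result during development.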

This approach offers a lower-latency, potentially offline-capable experience because model execution is local. The tradeoff is that it requires a machine with a capable GPU and a browser where WebGPU is available, which is not yet true everywhere.

The code quality appears solid, with a clear separation between build steps, extension packaging, and runtime logic. The TypeScript typing helps maintain clarity in dealing with complex GPU buffers and shader pipelines. However, this architecture also means the extension is relatively heavy (roughly 500MB to 1.5GB of model data) and likely constrained by browser memory and GPU limits.
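One concrete example of such a GPU limit: WebGPU caps how many bytes a single storage buffer binding may cover (`maxStorageBufferBindingSize`, 128 MiB by default) and requires binding offsets to be aligned (`minStorageBufferOffsetAlignment`, 256 bytes by default). The sketch below plans how a large block of weights could be split into bindable chunks. `planChunks` is a hypothetical helper for illustration; nothing here claims gemma-gem splits its weights this way.

```typescript
// WebGPU default limits; a real adapter may report larger values.
const DEFAULT_MAX_BINDING_BYTES = 134217728; // 128 MiB
const DEFAULT_OFFSET_ALIGNMENT = 256;

interface Chunk {
  offset: number; // byte offset into the full weight blob
  size: number; // byte length of this chunk
}

// Split totalBytes into chunks that each fit one storage binding and
// start at an aligned offset. Only the last chunk may be shorter.
function planChunks(
  totalBytes: number,
  maxBindingBytes: number = DEFAULT_MAX_BINDING_BYTES,
  alignment: number = DEFAULT_OFFSET_ALIGNMENT,
): Chunk[] {
  if (totalBytes < 0) {
    throw new RangeError("totalBytes must be non-negative");
  }
  // Round the chunk size down so every offset stays aligned.
  const chunkBytes = Math.floor(maxBindingBytes / alignment) * alignment;
  if (chunkBytes <= 0) {
    throw new RangeError("maxBindingBytes is smaller than the alignment");
  }
  const chunks: Chunk[] = [];
  for (let offset = 0; offset < totalBytes; offset += chunkBytes) {
    chunks.push({ offset, size: Math.min(chunkBytes, totalBytes - offset) });
  }
  return chunks;
}
```

Under the default limits, a 500MB blob would need four bindings. In practice the limits should be read from the adapter at runtime and, where the hardware allows, raised when requesting the device.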

The project openly states the disk requirements upfront, which is refreshing. It’s also worth noting that WebGPU is an evolving API, so some instability or API surface changes might impact long-term maintenance.

Quick start

To try gemma-gem locally, you’ll need a Chrome browser with WebGPU support enabled and enough disk space for caching models.

The setup steps are:

pnpm install
pnpm build

Once built, load the extension manually in Chrome: open chrome://extensions, enable Developer mode, choose "Load unpacked", and select the .output/chrome-mv3-dev/ folder.

This process builds the TypeScript source and packages the extension for Chrome Manifest V3, ready for testing.

Verdict

Gemma-gem is a technically interesting project that demonstrates running large language models in the browser with GPU acceleration via WebGPU. It’s relevant for developers and researchers experimenting with client-side AI inference and browser-based ML applications.

The main limitation is the hardware and browser support requirement. You need a recent Chromium-based browser with WebGPU enabled and sufficient disk space for the model cache. This restricts practical use to more advanced users and recent machines.

The codebase is clean and leverages modern TypeScript practices, making it approachable for contributors familiar with browser extensions and GPU programming. For anyone curious about pushing AI workloads into the browser without server dependency, gemma-gem offers a working example worth exploring.

It’s not a drop-in solution for production use but a solid technical exploration of WebGPU’s potential in ML. If you’re building browser-centric AI tools or want to understand how LLMs can run on client GPUs, this repo is a valuable resource.

→ GitHub Repo: kessler/gemma-gem ⭐ 873 · TypeScript

Noureddine RAMDI
