Noureddine RAMDI / Passmark: AI-driven browser regression testing with multi-model consensus and caching

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

bug0inc/passmark

Anyone who maintains end-to-end tests knows the pain of brittle selectors breaking with every UI tweak. Passmark tries to solve this by replacing CSS and XPath selectors with natural language step descriptions executed by large language models (LLMs). It’s a TypeScript library built on Playwright that aims to make regression tests more stable and cheaper to maintain.

What Passmark does and how it works

Passmark extends Playwright rather than replacing it. Instead of defining interactions with selectors, you write test steps as natural language descriptions like “Click Acme Circles T-Shirt” or “Select color”. These steps are interpreted and executed by large language models such as Anthropic’s Claude, Google’s Gemini, or OpenAI’s models.

Under the hood, Passmark sends each natural language step to multiple AI models concurrently and has each one generate browser interaction commands. It then applies a multi-model consensus mechanism (for example, combining outputs from Claude and Gemini with an arbiter model) to increase confidence in the generated steps. This reduces the chance of flaky or incorrect test actions.
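To make the consensus idea concrete, here is a minimal sketch of two proposers plus an arbiter. It is not Passmark’s implementation: the Action shape and the names Proposer, Arbiter, sameAction, and resolveStep are all hypothetical, and the real library may compare model outputs very differently.

```typescript
// Hypothetical types for illustration only; not Passmark's API.
type Action = { kind: "click" | "fill" | "navigate"; target: string; value?: string };
type Proposer = (step: string) => Promise<Action>;
type Arbiter = (step: string, candidates: [Action, Action]) => Promise<Action>;

// Two actions agree when they do the same thing to the same target.
function sameAction(a: Action, b: Action): boolean {
  return a.kind === b.kind && a.target === b.target && a.value === b.value;
}

// Ask both proposers concurrently. Accept the result when they agree,
// otherwise let the arbiter pick between the two candidates.
async function resolveStep(
  step: string,
  first: Proposer,
  second: Proposer,
  arbiter: Arbiter,
): Promise<{ action: Action; agreed: boolean }> {
  const [a, b] = await Promise.all([first(step), second(step)]);
  if (sameAction(a, b)) {
    return { action: a, agreed: true };
  }
  return { action: await arbiter(step, [a, b]), agreed: false };
}
```

In this sketch the arbiter is only called on disagreement, so the common case costs two model calls rather than three.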

The library supports caching executed steps and their results in Redis. This cache speeds up subsequent test runs by reusing previously verified step outputs, reducing API call costs and latency. When cached steps fail due to UI changes, Passmark can auto-heal by invoking the AI models again to regenerate working steps.
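The cache-then-heal flow described above can be sketched as follows. This is an assumption-laden illustration: an in-memory Map stands in for Redis, and StepCache, MemoryCache, CachedAction, and runWithCache are hypothetical names, not part of Passmark.

```typescript
// Hypothetical shapes for illustration; a real setup would back StepCache with Redis.
type CachedAction = { selector: string; op: "click" | "fill" };

interface StepCache {
  get(key: string): Promise<CachedAction | undefined>;
  set(key: string, action: CachedAction): Promise<void>;
}

// In-memory stand-in for Redis so the sketch runs without a server.
class MemoryCache implements StepCache {
  private store = new Map<string, CachedAction>();
  async get(key: string): Promise<CachedAction | undefined> {
    return this.store.get(key);
  }
  async set(key: string, action: CachedAction): Promise<void> {
    this.store.set(key, action);
  }
}

// Try the cached action first. If it is missing or no longer works,
// regenerate it (the expensive AI call), verify it, and store the new version.
async function runWithCache(
  key: string,
  cache: StepCache,
  execute: (action: CachedAction) => Promise<boolean>, // resolves true on success
  regenerate: () => Promise<CachedAction>,
): Promise<"cache-hit" | "healed" | "generated"> {
  const cached = await cache.get(key);
  if (cached && (await execute(cached))) {
    return "cache-hit";
  }
  const fresh = await regenerate();
  if (!(await execute(fresh))) {
    throw new Error(`step failed after regeneration: ${key}`);
  }
  await cache.set(key, fresh);
  return cached ? "healed" : "generated";
}
```

Only the "healed" and "generated" paths call the models, which is where the latency and cost savings on repeated runs would come from.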

Passmark also offers a hybrid CUA (computer-use agent) mode which supports visual interactions, beyond just DOM-based steps. This helps navigate more complex or dynamic UI elements that require pixel-based input.

The architecture integrates with AI gateways such as Vercel AI Gateway, OpenRouter, and Cloudflare AI Gateway to route requests to various AI providers. Routing through a gateway centralizes API key management and can add observability or caching layers. Per-step AI configuration overrides allow fine-tuning behavior for specific test steps.
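One plausible way to think about per-step overrides is as a shallow merge over project-wide defaults. The sketch below assumes that model; the AiConfig fields, the ai property on a step, and effectiveConfig are hypothetical and do not reflect Passmark’s actual option names.

```typescript
// Hypothetical config shape; field names are placeholders, not Passmark options.
type AiConfig = { model: string; temperature: number; useVision: boolean };
type Step = { description: string; ai?: Partial<AiConfig> };

// Per-step values win; anything the step leaves out falls back to the defaults.
function effectiveConfig(defaults: AiConfig, step: Step): AiConfig {
  return { ...defaults, ...step.ai };
}

const defaults: AiConfig = { model: "model-a", temperature: 0, useVision: false };
const visualStep: Step = { description: "Drag the slider to 50%", ai: { useVision: true } };

// The visual step keeps the default model but switches vision on.
const resolved = effectiveConfig(defaults, visualStep);
```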

This design targets teams who want stable, maintainable end-to-end tests without the overhead of constantly updating selectors. It leverages modern LLM capabilities, multi-model consensus, and caching to optimize for reliability and speed.

Technical strengths and tradeoffs

The standout technical strength is the multi-model consensus approach. By combining outputs from multiple AI providers and using an arbiter model to reach agreement, Passmark reduces the flakiness common in AI-generated browser automation steps. This is not a trivial feature: it requires careful orchestration of concurrent AI calls and result aggregation.

Caching test steps and their results in Redis is another key feature. It improves the developer experience by reducing both latency and API costs on repeated test runs. The auto-healing mechanism, which regenerates steps when a cached one fails, adds robustness to the testing process.

The hybrid CUA mode adds flexibility by supporting visual-based interactions when DOM semantics aren’t enough. This fills a gap many AI testing tools overlook.

Architecturally, Passmark is built in TypeScript, which aligns well with Playwright’s ecosystem and modern JavaScript tooling. The codebase is clean and modular, reflecting a design that leaves room for extension and further AI integration.

The main tradeoff is the complexity of managing multiple AI providers and their API keys, which can be a barrier for some teams. Also, while caching and auto-healing mitigate costs, running AI models on uncached steps is still more expensive than running traditional selector-based tests. Finally, debugging AI-driven tests can require a shift in mindset, as failures may stem from AI misinterpretation rather than code bugs.

Quick start with Passmark

Getting started with Passmark is straightforward if you already have a Playwright project.

npm init playwright@latest passmark-project # select the default options and set language to TypeScript
cd passmark-project
npm install passmark

You need API keys for both Anthropic and Google to enable the multi-model consensus feature. Set these in a .env file:

ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=AIza...

Alternatively, you can route calls through AI gateways like Vercel AI Gateway or OpenRouter by setting AI_GATEWAY_API_KEY or OPENROUTER_API_KEY. For Cloudflare AI Gateway, you need additional Cloudflare-specific environment variables.
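Before the first run, it can help to check which of these variables are actually set. The sketch below is a hypothetical preflight helper, not something Passmark ships: the environment variable names come from the text above, but the detectRouting function, the Routing labels, and the precedence order (gateway keys before direct provider keys) are assumptions. Cloudflare is left out because its variables are not listed here.

```typescript
// Hypothetical preflight helper; the precedence order is an assumption.
type Env = Record<string, string | undefined>;
type Routing = "vercel-gateway" | "openrouter" | "direct" | "incomplete";

function detectRouting(env: Env): Routing {
  if (env["AI_GATEWAY_API_KEY"]) return "vercel-gateway";
  if (env["OPENROUTER_API_KEY"]) return "openrouter";
  // Direct mode needs both provider keys for multi-model consensus.
  if (env["ANTHROPIC_API_KEY"] && env["GOOGLE_GENERATIVE_AI_API_KEY"]) return "direct";
  return "incomplete";
}
```

In a Node project you would call it as detectRouting(process.env) and fail fast on "incomplete" instead of waiting for the first AI call to error out.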

First, install dotenv:

npm install dotenv

Then make sure your Playwright config loads the .env file by adding this near the top of playwright.config.ts:

import dotenv from 'dotenv';
import path from 'path';

dotenv.config({ path: path.resolve(__dirname, '.env') });

Here’s an example test snippet from tests/example.spec.ts demonstrating natural language steps:

import { test, expect } from "@playwright/test";
import { runSteps } from "passmark";

test.use({
  headless: !!process.env.CI,
});

test("Shopping cart tests", async ({ page }) => {
  test.setTimeout(60_000); // increase timeout for AI execution
  await runSteps({
    page,
    userFlow: "Add product to cart",
    steps: [
      { description: "Navigate to https://demo.vercel.store" },
      { description: "Click Acme Circles T-Shirt" },
      { description: "Select color", data: "Red" },
      { description: "Add to cart" },
      { description: "Verify cart contains 1 item" },
    ],
  });
});

This shows how you can write tests in plain English and let the AI drive the browser interactions.

Verdict

Passmark brings a fresh approach to end-to-end browser testing by replacing brittle selectors with AI-driven natural language steps. Its multi-model consensus and Redis-backed caching with auto-healing give it a solid foundation for reliable test runs.

It’s particularly relevant for teams that struggle with flaky E2E tests due to UI changes and want to reduce the maintenance burden. The TypeScript + Playwright integration fits well into modern front-end testing stacks.

However, it comes with tradeoffs. Managing multiple API keys and the costs associated with AI calls require consideration. Debugging AI-generated steps can also feel less transparent initially.

If you’re curious about AI-driven testing and willing to invest in configuring AI providers and gateways, Passmark offers a compelling solution to make browser regression tests more stable and maintainable.


→ GitHub Repo: bug0inc/passmark ⭐ 692 · TypeScript