Claw-Eval: a rigorous Python harness for trustworthy evaluation of LLM-powered autonomous agents

Claw-Eval tackles a common problem in evaluating autonomous agents powered by large language models (LLMs): how do you reliably measure their capabilities without mistaking lucky runs for consistent performance? This Python harness introduces a strict, multi-trial pass criterion called Pass^3, requiring an agent to succeed on three independent attempts per task. This raises the bar for benchmarking agents, focusing on consistent, trustworthy results rather than one-off successes.

what claw-eval does and how it structures evaluation

At its core, Claw-Eval is an evaluation framework designed to rigorously test autonomous agents driven by LLMs. It comes with a benchmark suite of 300 human-verified tasks spread across nine categories, including general tasks, multimodal challenges, and multi-turn interactions. These tasks are annotated with 2,159 detailed rubrics (about seven per task) that define what counts as success, so grading checks specific criteria rather than a single right-or-wrong answer.

The evaluation harness runs each task three times independently (N=3) to enforce the Pass^3 metric: an agent must pass all three trials to be credited with success. This approach helps weed out lucky passes that might occur in benchmarks relying on a single run.
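The all-trials-must-pass rule can be sketched in a few lines of Python. This is an illustrative sketch, not Claw-Eval's own code: the function name pass_k, the task ids, and the result layout (task id mapped to a list of per-trial booleans) are assumptions made for the example.

```python
from typing import Dict, List


def pass_k(trial_results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials all passed (hypothetical helper)."""
    if not trial_results:
        return 0.0
    passed = 0
    for task_id, outcomes in trial_results.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A task is credited only if every one of its k trials succeeded.
        if all(outcomes[:k]):
            passed += 1
    return passed / len(trial_results)


# Hypothetical outcomes for three tasks, three trials each.
results = {
    "task-a": [True, True, True],    # credited
    "task-b": [True, False, True],   # one failure, so not credited
    "task-c": [False, False, False], # not credited
}
print(pass_k(results))
```

Note that task-b would look like a success under a single-run benchmark that happened to observe its first trial; under the stricter rule it does not count.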

The evaluation covers three dimensions:

Completion: Did the agent successfully complete the task?
Safety: Did the agent avoid unsafe or undesirable behaviors?
Robustness: How reliably does the agent handle variations and edge cases?

Instead of just checking the final output, Claw-Eval audits the full trajectory of agent actions in a sandboxed environment, allowing for detailed analysis of behavior over time. The sandbox isolation ensures tasks run securely without affecting the host environment.
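To make the difference between output checking and trajectory auditing concrete, here is a minimal sketch. Everything in it is hypothetical: the Step record, the tool names, and the deny-list are invented for illustration, and the real harness grades against its rubrics rather than a substring list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    tool: str      # hypothetical tool name, e.g. "shell"
    argument: str  # what the agent passed to the tool


# Hypothetical deny-list, used here only to keep the example small.
FORBIDDEN_SUBSTRINGS = ("rm -rf", "curl http://")


def audit_trajectory(steps: List[Step]) -> List[int]:
    """Return the indices of steps whose argument contains a forbidden pattern."""
    return [
        i
        for i, step in enumerate(steps)
        if any(bad in step.argument for bad in FORBIDDEN_SUBSTRINGS)
    ]


trajectory = [
    Step("shell", "ls /workspace"),
    Step("shell", "rm -rf /workspace/data"),  # unsafe intermediate action
    Step("shell", "echo done"),
]
final_output = "done"  # looks fine when inspected on its own

flagged = audit_trajectory(trajectory)
print(final_output, flagged)
```

A check on final_output alone would accept this run, while the audit flags the second step.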

Grading uses LLM-as-judge models. Gemini 3 Flash grades general and multimodal tasks, while Claude Opus 4.6 handles multi-turn tasks, where it fills both the grading role and the user-agent role (the simulated user side of the conversation). Assigning judges by task type is a pragmatic choice: grading stays automated, and multi-turn tasks get a model that can also play the user.
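The judge assignment amounts to a lookup by task type. The sketch below assumes hypothetical category keys ("general", "multimodal", "multi_turn") and uses the model names as plain labels; it does not reflect how Claw-Eval's configuration is actually structured.

```python
# Hypothetical category keys; the model names are labels taken from the text.
JUDGE_BY_CATEGORY = {
    "general": "Gemini 3 Flash",
    "multimodal": "Gemini 3 Flash",
    "multi_turn": "Claude Opus 4.6",
}


def select_judge(category: str) -> str:
    """Return the judge label for a task category, failing loudly if unknown."""
    try:
        return JUDGE_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"no judge configured for category {category!r}") from None


print(select_judge("multi_turn"))
```

Failing on an unknown category, rather than falling back to a default judge, avoids silently grading a task with the wrong model.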

The project supports parallel batch evaluation, optimizing throughput when running large-scale experiments. The benchmark and related fixtures, including large video files for multimodal tasks, are hosted externally on Hugging Face due to GitHub size limits.
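One way to picture parallel batch evaluation is a worker pool that fans out every (task, trial) pair. This is a sketch under stated assumptions, not the project's scheduler: run_batch, fake_trial, and the task ids are hypothetical, and a thread pool is simply a convenient stand-in for whatever the harness uses.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, Dict, List, Sequence


def run_batch(
    task_ids: Sequence[str],
    run_trial: Callable[[str, int], bool],
    trials: int = 3,
    parallel: int = 16,
) -> Dict[str, List[bool]]:
    """Run every (task, trial) pair on a thread pool and group outcomes by task."""
    jobs = list(product(task_ids, range(trials)))
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map returns results in the same order as the submitted jobs.
        outcomes = list(pool.map(lambda job: run_trial(*job), jobs))
    results: Dict[str, List[bool]] = {task_id: [] for task_id in task_ids}
    for (task_id, _trial), ok in zip(jobs, outcomes):
        results[task_id].append(ok)
    return results


def fake_trial(task_id: str, trial_index: int) -> bool:
    """Stand-in for a real sandboxed run: task-b fails its second trial."""
    return not (task_id == "task-b" and trial_index == 1)


print(run_batch(["task-a", "task-b"], fake_trial))
```

Treating each trial as its own job, rather than each task, lets the pool stay busy even when a few tasks run much longer than the rest.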

A public leaderboard at claw-eval.github.io showcases current agent rankings, promoting transparency and community engagement.

what distinguishes claw-eval: pass^3 and sandboxed trajectory auditing

The standout feature of Claw-Eval is the Pass^3 metric. Unlike many benchmarks that count a task as passed after a single successful attempt, Pass^3 requires success on all three independent trials. This design explicitly targets the problem of "lucky runs", where an agent happens to succeed once on a task it usually fails.
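A quick calculation shows why three trials filter out luck. The sketch assumes each trial succeeds independently with the same probability p, which is a simplification; the function name is hypothetical.

```python
def pass_all_probability(p: float, k: int = 3) -> float:
    """Probability that k independent trials all succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return p ** k


for p in (0.5, 0.8, 0.95):
    print(f"single-trial success {p:.2f} -> all-3 success {pass_all_probability(p):.3f}")
```

Under this assumption, an agent that solves a task on half of its attempts clears all three trials only about one time in eight, while a dependable agent loses comparatively little.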

This methodology pushes agent developers to focus on reliability and consistency, which are crucial for deploying agents in real-world scenarios where one-off successes are insufficient.

Another significant strength is the evaluation of full trajectory data rather than just surface-level outputs. By auditing every step an agent takes within the sandbox, Claw-Eval can detect subtle failures or unsafe behaviors that might be invisible in a simple output comparison.

The use of sandbox isolation is also a practical safety measure, ensuring that tasks involving external interactions or resource-heavy operations do not interfere with the evaluation environment or host system.

The LLM-based graders are assigned by task type: Gemini 3 Flash for general and multimodal tasks, Claude Opus 4.6 for multi-turn tasks. Matching the judge to the task type is intended to balance automation with grading accuracy.

Tradeoffs include the computational overhead of running multiple trials per task and the complexity of setting up sandbox environments and API keys. However, these are necessary for the level of rigor Claw-Eval aims to provide.
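The overhead is easy to estimate. Assuming a full pass over the 300-task suite with three trials each and 16 parallel workers (the values used elsewhere in this article), a back-of-the-envelope sketch with hypothetical helper names looks like this:

```python
import math


def total_runs(num_tasks: int, trials: int) -> int:
    """Number of sandboxed agent runs for one full evaluation."""
    return num_tasks * trials


def scheduling_waves(runs: int, parallel: int) -> int:
    """Lower bound on sequential rounds if every run took the same time."""
    return math.ceil(runs / parallel)


runs = total_runs(300, 3)
print(runs, scheduling_waves(runs, 16))
```

That is 900 agent runs per evaluated model, before counting the judge calls. Real wall-clock time depends on how long individual runs take, which this estimate ignores.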

The codebase, written in Python, is designed for parallel execution with configurable parameters for trials and parallelism, making it suitable for both research experiments and benchmarking pipelines.

quick start with claw-eval

The project recommends using uv for managing a Python virtual environment with Python 3.11. After installing dependencies, environment variables for API keys (OpenRouter and SERP) must be set to enable access to external LLMs and web search APIs.

A shell script (scripts/test_sandbox.sh) is provided to prepare the sandbox environment and verify setup.

Here is the quick start snippet exactly as provided:

pip install uv
uv venv --python 3.11
source .venv/bin/activate

Set your API keys and prepare the environment:

export OPENROUTER_API_KEY=sk-or-...
export SERP_DEV_KEY=... # needed for tasks that use real web search; an API key is available from https://www.novada.com
bash scripts/test_sandbox.sh
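Before launching a long run, it can help to confirm that the keys are actually present in the environment. The following pre-flight check is a hypothetical helper, not part of Claw-Eval; only the two variable names come from the commands above, and treating SERP_DEV_KEY as optional follows the note that it is needed only for web-search tasks.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_KEYS = ("OPENROUTER_API_KEY",)
OPTIONAL_KEYS = ("SERP_DEV_KEY",)  # only needed for tasks that use real web search


def missing_keys(env: Mapping[str, str], names: Iterable[str]) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


missing_required = missing_keys(os.environ, REQUIRED_KEYS)
missing_optional = missing_keys(os.environ, OPTIONAL_KEYS)

if missing_required:
    print("Missing required keys:", ", ".join(missing_required))
if missing_optional:
    print("Missing optional keys:", ", ".join(missing_optional))
if not missing_required and not missing_optional:
    print("All API keys are set.")
```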

To run a batch evaluation with the sandbox enabled, three trials per task, and 16 parallel jobs:

claw-eval batch --config model_configs/claude_opus_46.yaml --sandbox --trials 3 --parallel 16
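When the evaluation is driven from a script, the same command can be assembled programmatically. The wrapper below is a hypothetical convenience that uses only the flags shown in the command above; it builds and prints the command line and leaves the actual launch commented out.

```python
import shlex
from typing import List


def build_batch_command(
    config_path: str,
    trials: int = 3,
    parallel: int = 16,
    sandbox: bool = True,
) -> List[str]:
    """Assemble the argv list for a batch run (hypothetical wrapper)."""
    cmd = ["claw-eval", "batch", "--config", config_path]
    if sandbox:
        cmd.append("--sandbox")
    cmd += ["--trials", str(trials), "--parallel", str(parallel)]
    return cmd


cmd = build_batch_command("model_configs/claude_opus_46.yaml")
print(shlex.join(cmd))

# To launch the run for real (requires claw-eval to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing an argv list instead of a shell string avoids quoting problems if a config path ever contains spaces.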

Note that video fixtures for video-related tasks are stored externally on Hugging Face due to file size constraints.

verdict

Claw-Eval is a solid choice if you need a trustworthy, rigorous evaluation harness for LLM-powered autonomous agents. Its Pass^3 metric and multi-dimensional grading approach address common pitfalls in agent benchmarking, focusing on reliability and safety rather than superficial success.

The tradeoff is that setting up and running Claw-Eval requires some preparation: API keys, sandbox environments, and computational resources to handle multiple trials and parallel evaluations. It’s less suited for quick experiments or lightweight comparisons.

For researchers, developers, and teams serious about benchmarking and improving agent robustness, Claw-Eval offers a comprehensive, well-documented platform with a transparent public leaderboard. It’s a practical tool to push the field toward more consistent and trustworthy AI agents.

For casual users or those new to agent evaluation, the complexity might be a barrier, but the clear quick start instructions and modular design make it approachable with some investment.

Overall, Claw-Eval’s methodology and tooling fill a meaningful gap in autonomous agent benchmarking, focusing on consistent, safe, and robust performance.


→ GitHub Repo: claw-eval/claw-eval ⭐ 596 · Python

Noureddine RAMDI / Claw-Eval: a rigorous Python harness for trustworthy evaluation of LLM-powered autonomous agents
