Claw-Eval offers a Python-based evaluation harness for autonomous LLM agents, featuring 300 tasks and a strict Pass^3 metric for reliable, multi-dimensional benchmarking.
Harvey LAB offers an open-source benchmark for evaluating LLM agents on realistic legal tasks using an all-pass rubric and LLM-as-judge scoring. It includes datasets, adapters, and dashboards.
BoxPwnr benchmarks LLM-based autonomous agents on cybersecurity challenges using iterative command execution in a Kali Docker container, supporting 20+ LLMs and 13+ platforms.
Cua provides a multi-component open-source stack for building and benchmarking computer-use agents that control full desktops without disrupting user focus, across macOS, Linux, Windows, and Android.
AutoGPT is a Python-based platform for building and managing continuous AI agents that automate workflows, featuring a modular architecture, low-code agent creation, and benchmarking tools.
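
The Claw-Eval entry mentions a strict Pass^3 metric without defining it. A minimal sketch follows, assuming the usual pass^k reading: a task counts as passed only if all three independent trials succeed. The function name `pass_hat_k`, the task IDs, and the trial outcomes are hypothetical illustrations, not Claw-Eval's actual API or data.

```python
from typing import Mapping, Sequence


def pass_hat_k(results: Mapping[str, Sequence[bool]], k: int = 3) -> float:
    """Return the fraction of tasks whose first k trials all succeeded.

    Assumption: this is the common pass^k interpretation, not a
    definition confirmed by the Claw-Eval project itself.
    """
    if not results:
        raise ValueError("no tasks given")
    passed = 0
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial is enough to fail the whole task.
        if all(trials[:k]):
            passed += 1
    return passed / len(results)


# Hypothetical trial outcomes for four tasks, three trials each.
example = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [True, True, True],
    "task-d": [False, False, False],
}
score = pass_hat_k(example)
print(f"Pass^3 = {score:.2f}")
```

Under this reading, a task that succeeds in two of three trials contributes nothing to the score, which is what makes the metric stricter than an average pass rate over trials.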