<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Benchmarking on Noureddine RAMDI</title><link>https://ramdi.fr/tags/benchmarking/</link><description>Recent content in Benchmarking on Noureddine RAMDI</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 23 May 2026 20:41:27 +0000</lastBuildDate><atom:link href="https://ramdi.fr/tags/benchmarking/index.xml" rel="self" type="application/rss+xml"/><item><title>Claw-Eval: a rigorous Python harness for trustworthy evaluation of LLM-powered autonomous agents</title><link>https://ramdi.fr/github-stars/claw-eval-a-rigorous-python-harness-for-trustworthy-evaluation-of-llm-powered-autonomous-agents/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/claw-eval-a-rigorous-python-harness-for-trustworthy-evaluation-of-llm-powered-autonomous-agents/</guid><description>Claw-Eval offers a Python-based evaluation harness for LLM autonomous agents, featuring 300 tasks and a strict Pass^3 metric to ensure reliable, multi-dimensional benchmarking.</description></item><item><title>Harvey LAB: Benchmarking legal LLM agents with realistic tasks and automated scoring</title><link>https://ramdi.fr/github-stars/harvey-lab-benchmarking-legal-llm-agents-with-realistic-tasks-and-automated-scoring/</link><pubDate>Sat, 23 May 2026 20:41:14 +0000</pubDate><guid>https://ramdi.fr/github-stars/harvey-lab-benchmarking-legal-llm-agents-with-realistic-tasks-and-automated-scoring/</guid><description>Harvey LAB offers an open-source benchmark for evaluating LLM agents on realistic legal tasks using an all-pass rubric and LLM-as-judge scoring. 
It includes datasets, adapters, and dashboards.</description></item><item><title>BoxPwnr: benchmarking autonomous LLM agents on cybersecurity challenges with iterative command execution</title><link>https://ramdi.fr/github-stars/boxpwnr-benchmarking-autonomous-llm-agents-on-cybersecurity-challenges-with-iterative-command-execution/</link><pubDate>Mon, 04 May 2026 10:23:01 +0000</pubDate><guid>https://ramdi.fr/github-stars/boxpwnr-benchmarking-autonomous-llm-agents-on-cybersecurity-challenges-with-iterative-command-execution/</guid><description>BoxPwnr benchmarks LLM-based autonomous agents on cybersecurity challenges using iterative command execution in a Kali Docker container, supporting 20+ LLMs and 13+ platforms.</description></item><item><title>Cua: A unified stack for background desktop automation agents across macOS, Linux, Windows, and Android</title><link>https://ramdi.fr/github-stars/cua-a-unified-stack-for-background-desktop-automation-agents-across-macos-linux-windows-and-android/</link><pubDate>Sun, 26 Apr 2026 23:47:28 +0000</pubDate><guid>https://ramdi.fr/github-stars/cua-a-unified-stack-for-background-desktop-automation-agents-across-macos-linux-windows-and-android/</guid><description>Cua provides a multi-component open-source stack for building and benchmarking computer-use agents that control full desktops without disrupting user focus, across macOS, Linux, Windows, and Android.</description></item><item><title>AutoGPT: A modular platform for continuous AI agents and workflow automation</title><link>https://ramdi.fr/github-stars/autogpt-a-modular-platform-for-continuous-ai-agents-and-workflow-automation/</link><pubDate>Sun, 26 Apr 2026 17:51:11 +0000</pubDate><guid>https://ramdi.fr/github-stars/autogpt-a-modular-platform-for-continuous-ai-agents-and-workflow-automation/</guid><description>AutoGPT is a Python-based platform for building and managing continuous AI agents that automate workflows, featuring a modular architecture, low-code agent 
creation, and benchmarking tools.</description></item></channel></rss>