Benchmarking on Noureddine RAMDI

https://ramdi.fr/tags/benchmarking/
Recent content in Benchmarking on Noureddine RAMDI
Generator: Hugo
Language: en
Last updated: Sat, 23 May 2026 20:41:27 +0000

Claw-Eval: a rigorous Python harness for trustworthy evaluation of LLM-powered autonomous agents
https://ramdi.fr/github-stars/claw-eval-a-rigorous-python-harness-for-trustworthy-evaluation-of-llm-powered-autonomous-agents/
Sat, 23 May 2026 20:41:14 +0000
Claw-Eval offers a Python-based evaluation harness for LLM autonomous agents, featuring 300 tasks and a strict Pass^3 metric to ensure reliable, multi-dimensional benchmarking.

Harvey LAB: Benchmarking legal LLM agents with realistic tasks and automated scoring
https://ramdi.fr/github-stars/harvey-lab-benchmarking-legal-llm-agents-with-realistic-tasks-and-automated-scoring/
Sat, 23 May 2026 20:41:14 +0000
Harvey LAB offers an open-source benchmark for evaluating LLM agents on realistic legal tasks using an all-pass rubric and LLM-as-judge scoring. It includes datasets, adapters, and dashboards.

BoxPwnr: benchmarking autonomous LLM agents on cybersecurity challenges with iterative command execution
https://ramdi.fr/github-stars/boxpwnr-benchmarking-autonomous-llm-agents-on-cybersecurity-challenges-with-iterative-command-execution/
Mon, 04 May 2026 10:23:01 +0000
BoxPwnr benchmarks LLM-based autonomous agents on cybersecurity challenges using iterative command execution in a Kali Docker container, supporting 20+ LLM models and 13+ platforms.

Cua: A unified stack for background desktop automation agents across macOS, Linux, Windows, and Android
https://ramdi.fr/github-stars/cua-a-unified-stack-for-background-desktop-automation-agents-across-macos-linux-windows-and-android/
Sun, 26 Apr 2026 23:47:28 +0000
Cua provides a multi-component open-source stack for building and benchmarking computer-use agents that control full desktops without disrupting user focus, across macOS, Linux, Windows, and Android.

AutoGPT: A modular platform for continuous AI agents and workflow automation
https://ramdi.fr/github-stars/autogpt-a-modular-platform-for-continuous-ai-agents-and-workflow-automation/
Sun, 26 Apr 2026 17:51:11 +0000
AutoGPT is a Python-based platform for building and managing continuous AI agents that automate workflows, featuring a modular architecture, low-code agent creation, and benchmarking tools.