BoxPwnr: benchmarking autonomous LLM agents on cybersecurity challenges with iterative command execution

BoxPwnr benchmarks autonomous large language model (LLM) agents by putting them through their paces on real-world cybersecurity challenges. What sets it apart is the core iterative execution loop where the agent generates commands, runs them inside a Kali Linux Docker container, and feeds the output back to the LLM. This loop continues until the agent extracts the target flag, simulating an autonomous penetration tester working inside a sandboxed environment.

benchmarking LLM-based agents on cybersecurity platforms

At its core, BoxPwnr is a modular Python framework designed to evaluate how well LLM-driven agents perform on a variety of cybersecurity challenge platforms. It currently supports over 13 platforms, including popular Capture The Flag (CTF) environments such as HackTheBox (HTB), TryHackMe, PortSwigger Labs, and others.

The architecture centers around multiple solver implementations — named single_loop, claude_code, hacksynth, and external — which represent different strategies for how the LLM interacts with the environment. Each solver runs an iterative loop that involves:

Receiving system prompts describing the current state or instructions.
Suggesting shell or tool commands to execute on the target system.
Executing these commands within a Kali Linux Docker container sandbox.
Capturing the command output and feeding it back to the LLM to inform the next step.

This cycle repeats until the agent successfully finds the flag or exhausts its attempts. The Docker container provides an isolated, reproducible environment mimicking a pentesting setup, which safeguards the host system while allowing realistic interaction.

BoxPwnr tracks detailed metrics including token usage, cost estimates based on the LLM model, and full conversation traces. These traces can be reviewed later using an integrated web replay viewer, offering transparency into the decision-making process of the agent.

The framework supports over 20 LLM models spanning Claude, OpenAI, DeepSeek, Grok, Gemini, and various API gateways like OpenRouter, Z.AI, and Kilo. This broad model support allows users to benchmark and compare performance across state-of-the-art LLMs.

iterative autonomous pentesting agents with modular solver architecture

What distinguishes BoxPwnr is its focus on the iterative command-execution loop as the heart of an autonomous pentesting agent. The solvers serve as interchangeable modules that define the agent’s behavior and reasoning style. For example, the claude_code solver leverages Claude’s code generation capabilities to formulate commands, while the external solver can integrate with third-party tools or run inside the Docker container for VPN-required targets.

The design tradeoff here is between realism and complexity. By using a Docker container running Kali Linux, BoxPwnr ensures a safe, consistent environment for execution. However, this adds overhead and requires Docker setup, which can be a barrier for quick experimentation.

The code quality reflects a pragmatic balance: the framework is modular and extensible, with clear separation between solver strategies, environment management, and result tracking. The presence of thousands of benchmark traces — 7,757 in total — across various platforms demonstrates extensive validation and a commitment to empirical rigor.

Completion rates vary significantly depending on platform difficulty:

HackTheBox Starting Point: 25/25 solved
HTB Labs: 268/526 solved
PortSwigger Labs: 163/270 solved
picoCTF Challenges: 502/503 solved
CyberGym Vulnerability Tasks: 1/1507 solved

These results reflect both the promise and current limitations of LLM agents in complex cybersecurity tasks.

quick start with boxpwnr

Getting started with BoxPwnr involves a few setup steps outlined in the README. The commands below are taken verbatim to ensure accuracy:

### Prerequisites

1. Clone the repository with submodules
  git clone --recurse-submodules https://github.com/0ca/BoxPwnr
  cd BoxPwnr

  # Install uv if you haven't already
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Sync dependencies (creates .venv)
  uv sync

Docker must be installed and running on your system. Installation instructions are available at https://docs.docker.com/get-docker/.

Once set up, you can run an example targeting a HackTheBox platform room named “meow” using the free Cline model:

uv run boxpwnr --platform htb --target meow --model cline/minimax/minimax-m2.5

For targets requiring VPN access, BoxPwnr supports running an external solver inside its Docker container to maintain network isolation.

verdict: practical benchmarking tool for LLM pentesting agents

BoxPwnr is a solid framework for researchers and practitioners interested in the autonomous capabilities of LLMs in cybersecurity contexts. Its modular architecture and iterative command execution loop simulate real-world pentesting workflows more faithfully than many black-box benchmarks.

The broad platform support and extensive trace collection provide valuable data for understanding LLM strengths and weaknesses. However, the setup complexity and reliance on Docker containers mean it’s best suited for technically comfortable users rather than casual experimenters.

LLM performance on these challenges remains uneven, particularly on complex or realistic tasks, but BoxPwnr offers a transparent, extensible base to push the boundaries of autonomous cybersecurity agents.

For anyone exploring autonomous LLM agents beyond text-only tasks, especially in security, BoxPwnr is worth a close look — with the caveat that it demands some infrastructure setup and patience to run large-scale benchmarks.

→ GitHub Repo: 0ca/BoxPwnr ⭐ 386 · Python

Noureddine RAMDI / BoxPwnr: benchmarking autonomous LLM agents on cybersecurity challenges with iterative command execution

benchmarking LLM-based agents on cybersecurity platforms

iterative autonomous pentesting agents with modular solver architecture

quick start with boxpwnr

verdict: practical benchmarking tool for LLM pentesting agents