Noureddine RAMDI / MAGI: A structured multi-LLM debate system with iterative critique and voting

Created Mon, 04 May 2026 10:23:02 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

fshiori/magi

MAGI tackles the challenge of improving large language model (LLM) outputs not by bigger models alone but by coordinating multiple smaller models in a debate-style ensemble.

What MAGI is and how it orchestrates LLM collaboration

MAGI is a Python-based command-line tool and system that runs an Iterative Consensus Ensemble (ICE) protocol among three distinct LLMs: Xiaomi MiMo-v2-pro, MiniMax M2.7, and DeepSeek V3.2. Instead of simple majority voting on independent outputs, MAGI engages these models in multiple rounds of answering, critiquing each other’s reasoning, and potentially revising their positions before a final vote. This structured disagreement engine produces a detailed Decision Dossier including the ruling, confidence scores, minority reports, and a full trace of the interaction.

Under the hood, the system uses an adaptive protocol selection that can escalate from vote to critique or further rounds depending on consensus levels. It also gracefully degrades in the face of node timeouts or rate limits with exponential backoff, ensuring fault tolerance.

The architecture supports persona presets so that the models can adopt domain-specific perspectives or roles, enhancing the diversity and relevance of arguments. A NERV-themed dashboard visualizes the debates and outcomes, improving developer experience and traceability.

MAGI’s design is notable because it achieves state-of-the-art (SOTA) performance on the challenging MMLU Hell Mode benchmark, hitting 83.3% accuracy. This matches the single-shot accuracy of a much stronger model, Claude Sonnet 4.6, but using three cheaper models coordinated in multi-round iterative consensus.

Why MAGI’s iterative consensus ensemble stands out

What distinguishes MAGI is how it implements multi-round debate with mind-change tracking and adaptive escalation, rather than a one-shot ensemble vote. The ICE protocol allows each model to see others’ answers, critique reasoning, and revise its stance across rounds. This iterative feedback loop is closer to human debate and improves output quality.

The codebase is in Python and surprisingly clean given the complexity of orchestrating multiple LLMs with timeouts, retries, and different interaction protocols. The CLI tool supports features like code review, answer judging, benchmarking, and replaying full debate traces.

The tradeoff is clear: coordinating multiple models with multiple rounds increases latency and resource use compared to single-shot inference. However, this pays off in matching the accuracy of a much stronger single model at lower cost per model. The system also builds in fault tolerance with graceful degradation—timeouts default to 60 seconds per node, and exponential backoff handles rate limits.

Persona presets are an interesting touch, enabling domain specialization or role-playing that can diversify perspectives and critiques. This adds complexity but also flexibility.

The open-source code also contains a NERV-themed dashboard for visualizing debates, which improves DX and helps users understand the complex decision process and minority opinions.

Quick start: installing and running MAGI

To install MAGI from PyPI:

pip install magi-system

Or to install from source:

git clone https://github.com/fshiori/magi.git
cd magi
uv venv && uv pip install -e ".[dev]"

This CLI tool then supports commands for running debates, reviewing code, and benchmarking. The README provides more detail on usage patterns.

Verdict: who should explore MAGI and its limitations

MAGI is a compelling system for researchers and engineers exploring structured multi-agent collaboration among LLMs. Its iterative critique and voting protocol is worth understanding for those building ensemble methods beyond naive voting or cascading chains.

The tradeoff is increased complexity, latency, and resource usage due to multi-round interactions and managing multiple models. It’s not a drop-in replacement for single-model inference in production but an experimental platform to push ensemble accuracy with budget models.

The fault tolerance and adaptive escalation protocols demonstrate solid engineering for real-world API constraints and rate limits.

If you’re interested in multi-agent LLM workflows, debate protocols, or benchmarking ensemble accuracy, MAGI is worth a look. For straightforward applications needing fast outputs, sticking to a single strong model might be simpler.

Overall, MAGI’s code is surprisingly clean and thoughtfully designed considering the challenges of orchestrating multiple LLMs with iterative consensus. It offers concrete insights into how structured disagreement and multi-round critique can improve performance without resorting to expensive giant models.


→ GitHub Repo: fshiori/magi ⭐ 111 · Python