OptiLLM: transparent inference-time scaling for improved LLM reasoning

OptiLLM tackles a common challenge in working with large language models (LLMs): how to improve reasoning accuracy without retraining or fine-tuning the underlying models. Instead of costly model updates, OptiLLM acts as a transparent inference proxy that layers a suite of advanced reasoning optimizations on top of any LLM API call. The key innovation is how it orchestrates multiple calls behind the scenes using techniques like Mixture of Agents (MoA), Monte Carlo Tree Search (MCTS), plan search, and multi-agent cross-verification (MARS) to boost accuracy 2-10x on benchmarks involving math, coding, and logical reasoning.

what optillm does and how its architecture works

OptiLLM is a Python-based middleware proxy compatible with the OpenAI API. It intercepts inference calls and transparently dispatches multiple calls across different agents or reasoning strategies to aggregate better, verified responses. The user simply prefixes the model name with a technique slug (e.g., moa-gpt-4o-mini) to enable a specific optimization. This design means no client code changes beyond model naming; the proxy handles the orchestration.

Under the hood, OptiLLM supports over 100 models across providers like OpenAI, Anthropic, Google, and Cerebras through LiteLLM integration. It ships as a pip-installable package and Docker images, including variants for full inference, proxy-only, and offline use with pre-downloaded models.

The architecture is production-minded, featuring SSL configuration and a plugin system for privacy and memory extensions. The routing model is simple and prefix-based, minimizing DX friction while enabling sophisticated inference-time compute scaling. This approach effectively democratizes frontier-tier reasoning capabilities, for example enabling a MOA-wrapped GPT-4o-mini to match vanilla GPT-4 performance on the Arena-Hard-Auto benchmark.

technical strengths and design tradeoffs

What distinguishes OptiLLM is its layering of 20+ reasoning techniques at inference time without model retraining. Techniques include:

Mixture of Agents (MoA): orchestrates multiple specialized LLM agents working collaboratively.
Monte Carlo Tree Search (MCTS): explores reasoning paths with tree-based search algorithms.
Multi-agent cross-verification (MARS): agents check each other’s outputs to improve reliability.
Plan search: generates and evaluates plans to improve code and logical reasoning.

The proxy multiplexes multiple inference calls concurrently, critiques and verifies outputs, then aggregates higher-quality responses. This trades increased inference latency and compute cost for significantly better reasoning accuracy.

The codebase is Python-based, leveraging LiteLLM for model integration. It balances modularity with performance, evident from the plugin system and SSL support. The prefix routing model is elegant in its simplicity, requiring zero client code changes beyond model name modification, which is a clear win for developer experience.

Benchmarks show notable improvements. For example, MARS increased AIME 2025 scores by +30 points with Gemini 2.5 Flash Lite, and MOA-wrapped GPT-4o-mini matched vanilla GPT-4 on Arena-Hard-Auto. These results demonstrate that inference-time compute scaling can effectively raise the performance of smaller, cheaper models to match or exceed much larger baselines.

The tradeoff is obvious: the approach increases inference calls and therefore latency and compute requirements. This means it’s less suited for latency-sensitive or cost-constrained scenarios. Additionally, the complexity of orchestrating multiple agents and techniques may increase operational overhead.

quick start

Getting started with OptiLLM is straightforward, with installation options via pip or Docker. Here’s how you can get the proxy running quickly:

Using pip

pip install optillm
optillm
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

Using docker

docker pull ghcr.io/algorithmicsuperintelligence/optillm:latest
docker run -p 8000:8000 ghcr.io/algorithmicsuperintelligence/optillm:latest
2024-10-22 07:45:05,612 - INFO - Loaded plugin: privacy
2024-10-22 07:45:06,293 - INFO - Loaded plugin: memory
2024-10-22 07:45:06,293 - INFO - Starting server with approach: auto

OptiLLM offers several Docker image variants:

Full image (latest): includes all dependencies for local inference and plugins
Proxy-only (latest-proxy): lightweight image without local inference capabilities
Offline (latest-offline): self-contained image with pre-downloaded models (e.g., spaCy) for fully offline operation

Once running, you use OptiLLM by prefixing your model name with the technique slug in your API calls, such as moa-gpt-4o-mini. The proxy transparently handles the rest.

verdict

OptiLLM offers a practical and developer-friendly way to boost LLM reasoning accuracy without retraining, by transparently orchestrating multiple inference techniques at runtime. Its prefix-based routing model is elegant and keeps client changes minimal, which is rare in multi-agent or inference orchestration systems.

The tradeoff is added latency and compute cost, so it’s best suited for scenarios where accuracy on complex reasoning tasks outweighs inference speed or cost constraints. The support for 100+ models and production features like SSL and plugins make it a solid choice for teams experimenting with inference-time compute scaling.

If you want to punch above your LLM’s weight class on math, coding, or logic tasks without investing in model training, OptiLLM is worth exploring. Just be ready to manage the complexity of multi-agent orchestration and increased inference workload.

vLLM: Efficient large language model serving with paged attention and continuous batching — vLLM is a Python library for high-throughput LLM inference using paged attention and continuous batching. It supports qu
LiteRT-LM: Google’s C++ library for efficient edge language model inference — LiteRT-LM is a Google AI Edge C++ library for performant language model inference on edge devices with multi-language AP
Navigating free-tier LLM APIs with the awesome-free-llm-apis catalog — A curated catalog of free-tier LLM APIs compatible with OpenAI SDK, detailing rate limits, model specs, and providers to
OpenResearcher: An open-source 30B LLM for long-horizon deep research — OpenResearcher is a fully open 30B agentic LLM designed for deep research tasks, featuring a 96K-turn dataset and a self
A hands-on course for mastering large language models: fine-tuning, quantization, and tooling — Explore a comprehensive LLM course with practical notebooks on fine-tuning (QLoRA, DPO), quantization (GPTQ), and tools

→ GitHub Repo: codelion/optillm ⭐ 3,987 · Python

Noureddine RAMDI / OptiLLM: transparent inference-time scaling for improved LLM reasoning