Noureddine RAMDI / ReasoningBank: Experience-Driven Memory as a New Scaling Dimension for AI Agents

Created Sat, 23 May 2026 20:41:14 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

google-research/reasoning-bank

ReasoningBank flips the usual AI scaling narrative by treating memory not just as a passive store but as an active, bidirectional scaling dimension. Instead of relying solely on bigger models or more compute, it leverages accumulated reasoning traces from both successful and failed agent trajectories. This experience-driven memory enables AI agents to self-evolve, improving performance by learning from the full spectrum of past attempts.

What ReasoningBank does and its architecture

ReasoningBank is a research project from Google Research that introduces a novel memory mechanism for AI agents. It stores detailed reasoning traces from agent trajectories, including those that led to success and those that failed. This comprehensive memory allows agents to learn from both positive and negative experiences, effectively using their accumulated reasoning as a new axis of scaling beyond model size and compute resources.
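
To make the idea concrete, here is a minimal sketch of a store that keeps traces from both successful and failed attempts and retrieves the most relevant ones for a new task. The `MemoryItem` and `MemoryBank` names, the field layout, and the word-overlap scoring are all assumptions made for illustration; they are not taken from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One stored reasoning trace (hypothetical schema, not the repo's)."""
    task: str
    reasoning: str
    succeeded: bool


@dataclass
class MemoryBank:
    items: list = field(default_factory=list)

    def add(self, task: str, reasoning: str, succeeded: bool) -> None:
        # Failures are stored alongside successes rather than discarded.
        self.items.append(MemoryItem(task, reasoning, succeeded))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Toy relevance score: number of lowercase words shared with the query.
        query_words = set(query.lower().split())

        def overlap(item: MemoryItem) -> int:
            return len(query_words & set(item.task.lower().split()))

        ranked = sorted(self.items, key=overlap, reverse=True)
        return [item for item in ranked[:k] if overlap(item) > 0]


bank = MemoryBank()
bank.add("fix failing unit test in parser", "Reproduce the failure first, then patch.", True)
bank.add("fix import error in parser", "Edited the wrong module; check the traceback path.", False)
bank.add("book a flight on a travel site", "Use the search form before filtering.", True)

hits = bank.retrieve("fix parser crash")
for item in hits:
    label = "success" if item.succeeded else "failure"
    print(f"[{label}] {item.reasoning}")
```

A real system would likely score relevance with embeddings rather than word overlap; the point here is only that a failed trace is a first-class entry that can be surfaced next to a successful one.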

The project implements this concept across two distinct agent benchmarks:

  • SWE-Bench: Focused on software engineering tasks, it builds on top of mini-swe-agent to test and evolve agents in coding and software problem-solving scenarios.

  • WebArena: A web browsing environment built on browsergym, where agents interact with web interfaces and learn from their navigation and task performance.

ReasoningBank supports multiple large language model families, including GPT (e.g., gpt-3.5-turbo, gpt-4, gpt-4o), Gemini (e.g., gemini-2.5-flash, gemini-2.5-pro), and Claude (claude-3-7-sonnet@20250219). GPT models are configured with an OpenAI API key, while Gemini and Claude are accessed through Vertex AI on Google Cloud.

The codebase also includes patched versions of upstream dependencies that fix evaluation bugs, which suggests the authors paid attention to evaluation correctness. The project was presented at ICLR 2026.

Technical strengths and design tradeoffs

What sets ReasoningBank apart is its treatment of memory as a scaling dimension that works bidirectionally. Traditional approaches often focus on increasing model size or compute at test time, or on using memory to store only successful trajectories. ReasoningBank instead stores reasoning traces from both successes and failures, giving the agent a richer experience base to draw upon.

This approach introduces several interesting technical strengths:

  • Memory-aware test-time scaling: Beyond just bigger models or more compute, the project proposes scaling by increasing the amount and quality of experience-driven memory. This is an orthogonal scaling axis that can improve agent performance without necessarily increasing model size.

  • Bidirectional memory scaling synergy: By incorporating failed trajectories alongside successes, the memory content becomes more informative, enabling the agent to avoid past mistakes and refine its reasoning strategies.

  • Support for diverse benchmarks and models: The implementation across SWE-Bench and WebArena shows flexibility in task domains (coding and web interaction). Supporting multiple LLM families (GPT, Gemini, Claude) and Vertex AI integration offers practical versatility.

  • Patch management of dependencies: Vendoring patched dependencies to fix evaluation bugs reflects a pragmatic approach to code reliability and reproducible research.
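
The first two points can be sketched as a loop in which every attempt, pass or fail, is written back to memory and later attempts read from it. This is a rough illustration under stated assumptions: `solve_with_memory`, the stub agent, and the stub judge are hypothetical, and the exact-match lookup stands in for whatever retrieval the project actually uses.

```python
from typing import Callable

# Each memory entry: {"task": str, "lesson": str, "succeeded": bool} (assumed layout).
Memory = list


def solve_with_memory(
    task: str,
    memory: Memory,
    agent: Callable[[str, list], str],
    judge: Callable[[str, str], bool],
    attempts: int = 1,
) -> bool:
    """Run up to `attempts` tries; every try, pass or fail, is written back."""
    for _ in range(attempts):
        hints = [m["lesson"] for m in memory if m["task"] == task]
        answer = agent(task, hints)
        ok = judge(task, answer)
        lesson = f"'{answer}' worked" if ok else f"avoid '{answer}'"
        memory.append({"task": task, "lesson": lesson, "succeeded": ok})
        if ok:
            return True
    return False


# Stub agent: tries candidates in order, skipping ones that memory warns against.
def stub_agent(task: str, hints: list) -> str:
    for candidate in ["click submit", "fill form then submit"]:
        if f"avoid '{candidate}'" not in hints:
            return candidate
    return "give up"


# Stub judge: stands in for an evaluator that knows the correct behavior.
def stub_judge(task: str, answer: str) -> bool:
    return answer == "fill form then submit"


memory: Memory = []
first = solve_with_memory("checkout", memory, stub_agent, stub_judge)
second = solve_with_memory("checkout", memory, stub_agent, stub_judge)
print(first, second, [m["succeeded"] for m in memory])
```

In this toy run the first attempt fails and leaves an "avoid" lesson behind, and the second attempt succeeds because that lesson steers the agent away from the earlier mistake, with no change to the underlying model.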

Tradeoffs and limitations:

  • Complexity of memory management: Because traces from both successes and failures are kept, the store grows with every trajectory, so storage and retrieval need careful management to avoid performance bottlenecks.

  • Dependency on external LLM APIs: The need to configure environment variables for OpenAI and Google Cloud credentials adds setup complexity.

  • Domain specificity: While SWE-Bench and WebArena cover software engineering and web browsing, applying the approach to other domains may require adaptation.

  • Evaluation overhead: The patched dependencies suggest that evaluation correctness is non-trivial and demands careful handling.
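
One simple way to keep the memory-management cost in check is to cap both how many traces are kept and how much text is handed back per query. The sketch below is an assumption about how that could be done, not a description of the repository; `BoundedMemory` and its eviction policy are hypothetical.

```python
from collections import deque


class BoundedMemory:
    """Keeps only the most recent `capacity` traces (hypothetical policy)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops the oldest entry automatically when full.
        self._traces = deque(maxlen=capacity)

    def add(self, trace: str) -> None:
        self._traces.append(trace)

    def within_budget(self, max_chars: int) -> list:
        # Walk newest-first and stop before the character budget is exceeded.
        selected = []
        used = 0
        for trace in reversed(self._traces):
            if used + len(trace) > max_chars:
                break
            selected.append(trace)
            used += len(trace)
        return selected


store = BoundedMemory(capacity=3)
for i in range(5):
    store.add(f"trace-{i}")

recent = store.within_budget(max_chars=14)
print(recent)
```

Recency-based eviction is the crudest option; it is shown only because it bounds both storage and lookup cost with a few lines of standard-library code.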

Overall, the codebase is surprisingly clean for a cutting-edge research project, balancing experimental design with practical considerations.

Quick start

The repo provides clear setup instructions, especially for configuring the supported LLMs. Here’s the essential setup copied verbatim:

# Install required packages
pip install -r requirements.txt

# For GPT models
export OPENAI_API_KEY="your-openai-api-key"

# For Gemini and Claude on Vertex AI
# Install Google Cloud CLI and authenticate
gcloud auth application-default login

# Set project and location environment variables
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_CLOUD_LOCATION="your-region"
export GOOGLE_GENAI_USE_VERTEXAI="True"
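
Before launching a run, it can help to check that the variables from the commands above are actually set. The following is a small preflight sketch; the variable names come from the setup snippet, while the `REQUIRED` grouping and the `missing_vars` helper are hypothetical and not part of the repository.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the setup commands; the grouping is illustrative.
REQUIRED = {
    "openai": ["OPENAI_API_KEY"],
    "vertexai": [
        "GOOGLE_CLOUD_PROJECT",
        "GOOGLE_CLOUD_LOCATION",
        "GOOGLE_GENAI_USE_VERTEXAI",
    ],
}


def missing_vars(backend: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variables that are unset or empty for a backend."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[backend] if not env.get(name)]


for backend in REQUIRED:
    missing = missing_vars(backend)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{backend}: {status}")
```

This only checks environment variables; it does not verify that `gcloud auth application-default login` has been run or that the credentials are valid.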

For the WebArena benchmark, there is a Docker-based environment setup:

  • Follow browsergym installation instructions from its official docs.
  • Download and configure the Docker environment for WebArena, adjusting the website addresses in the scripts as required.

Directory structure highlights for WebArena:

  • WebArena/agents/: implementations of web agents integrating with browsergym
  • WebArena/autoeval/: LLM-as-a-judge modules for trajectory correctness
  • WebArena/config_files/: data processing for tasks
  • WebArena/prompt/: reusable instructions and prompts
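
The LLM-as-a-judge idea behind the autoeval directory can be illustrated with a short sketch: format the task and the agent's actions into a prompt, ask a model for a verdict, and parse the answer. The prompt wording, the `judge_trajectory` function, and the `fake_llm` stand-in are assumptions for illustration and do not reproduce the repository's modules.

```python
from typing import Callable


def judge_trajectory(task: str, actions: list, llm: Callable[[str], str]) -> bool:
    """Ask a model whether the trajectory completed the task (illustrative prompt)."""
    steps = "\n".join(f"{i}. {action}" for i, action in enumerate(actions, start=1))
    prompt = (
        f"Task: {task}\n"
        f"Actions taken:\n{steps}\n"
        "Did the agent complete the task? Answer SUCCESS or FAILURE."
    )
    verdict = llm(prompt).strip().upper()
    # Anything other than an explicit SUCCESS is treated as a failure.
    return verdict.startswith("SUCCESS")


# Stand-in for a real model call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "success" if "click 'Place order'" in prompt else "failure"


complete = judge_trajectory("Buy the item", ["open cart", "click 'Place order'"], fake_llm)
incomplete = judge_trajectory("Buy the item", ["open cart"], fake_llm)
print(complete, incomplete)
```

Passing the model call in as a plain callable keeps the judging logic testable without network access; swapping in a real client is left to the caller.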

Data preprocessing requires downloading raw test files and placing them in config_files.

This quick start reflects a hands-on approach, expecting users to have some familiarity with Docker, environment variables, and API keys.

Verdict

ReasoningBank offers an intriguing perspective on scaling AI agent performance by treating experience-driven memory as a first-class scaling dimension. Its memory mechanism, which draws on reasoning traces from both successes and failures, is a notable departure from approaches that focus solely on model size or compute.

It’s relevant for researchers and practitioners working on AI agents, especially those interested in software engineering automation or web interaction agents. The multi-model support and cloud integration broaden its applicability.

That said, the project carries the typical tradeoffs of research code: setup complexity, reliance on external APIs, and domain-specific benchmarks. Its memory management approach could introduce overhead in large-scale or real-time applications.

For anyone building or experimenting with reasoning-capable AI agents, ReasoningBank is worth exploring as a fresh approach to memory and scaling. It’s not a plug-and-play solution, but it provides solid foundations and tools for studying how accumulated experience can shape smarter agents beyond brute-force model scaling.


→ GitHub Repo: google-research/reasoning-bank ⭐ 357 · Python