Noureddine RAMDI / DATAGEN: a LangGraph multi-agent framework for automated data analysis workflows

Created Mon, 04 May 2026 10:23:01 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

starpig1129/DATAGEN

DATAGEN tackles a complex problem in AI research automation: coordinating multiple specialized agents to run end-to-end data analysis workflows. It uses LangGraph-based multi-agent orchestration to manage hypothesis generation, human validation, data processing, visualization, literature search, report writing, and quality review in a single system. The standout feature is its progressive disclosure configuration that optimizes the use of LLM context windows across multiple providers.

LangGraph orchestration for multi-agent research workflows

DATAGEN is a Python 3.10+ framework built on LangChain and LangGraph that orchestrates eight distinct AI agents, each focused on a specific research task: hypothesis, process, visualization, code, search, report, quality review, and note-taking. These agents operate together in a state graph model, where the workflow progresses through different states reflecting the research lifecycle.

The architecture uses a state graph to synchronize and manage workflow transitions, allowing for iterative loops such as human-in-the-loop validation and quality review cycles. This graph-based coordination enables complex dependencies between agents without hardcoding control flow.
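To make the idea of graph-based loops concrete, here is a minimal sketch in plain Python. It does not use the LangGraph API or DATAGEN's actual code; the node names, the `run` helper, and the approval rule (accept the second draft) are all assumptions chosen for illustration. Nodes are functions that return an updated state, and each edge is a function that picks the next node from that state.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def hypothesis(state: State) -> State:
    # Stand-in for an LLM call: each visit produces a new numbered draft.
    drafts = state.get("drafts", 0) + 1
    return {**state, "drafts": drafts, "hypothesis": f"draft {drafts}"}


def human_review(state: State) -> State:
    # Hypothetical approval rule: the reviewer accepts the second draft.
    return {**state, "approved": state["drafts"] >= 2}


def process(state: State) -> State:
    return {**state, "result": f"analysis of {state['hypothesis']}"}


def route_after_review(state: State) -> str:
    # Conditional edge: loop back to the hypothesis node until approved.
    return "process" if state["approved"] else "hypothesis"


NODES: Dict[str, Callable[[State], State]] = {
    "hypothesis": hypothesis,
    "human_review": human_review,
    "process": process,
}

# Each edge inspects the state and names the next node ("END" stops the run).
EDGES: Dict[str, Callable[[State], str]] = {
    "hypothesis": lambda s: "human_review",
    "human_review": route_after_review,
    "process": lambda s: "END",
}


def run(state: State, entry: str = "hypothesis", max_steps: int = 20) -> State:
    current = entry
    trace: List[str] = []
    for _ in range(max_steps):  # guard against endless review loops
        if current == "END":
            break
        trace.append(current)
        state = NODES[current](state)
        current = EDGES[current](state)
    return {**state, "trace": trace}


final = run({})
print(final["trace"])
```

The control flow lives in the edge table rather than in the node functions, so adding a quality review cycle would mean adding one node and one routing function, not rewriting the existing ones.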

Under the hood, the system integrates with multiple LLM providers including OpenAI, Anthropic, Google, Ollama, and Groq. It also uses MCP (Model Context Protocol) servers to enable agents to access external tools like filesystem operations, GitHub, and web search, enhancing the capabilities of each specialized agent.

The note-taking agent preserves context over long research sessions, which helps maintain continuity and prevents loss of information across agent interactions.

Progressive disclosure configuration optimizes multi-provider LLM orchestration

What distinguishes DATAGEN is its progressive disclosure architecture for agent configuration. Instead of loading all agent skills upfront, it uses a three-level skill loading mechanism:

  • Basic skills loaded first with minimal context
  • Intermediate skills loaded subsequently with more detailed context
  • Advanced skills loaded last with full context

This approach conserves valuable LLM context window space and manages computational resources efficiently. Configuration and routing of models and skills are controlled via YAML files that support multi-LLM provider routing. This makes it easy to swap or combine different LLM providers for specific agents or skills.
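One possible reading of the three-level mechanism can be sketched as follows. This is not DATAGEN's loader: the `Skill` class, its field names, and the `build_context` helper are hypothetical, and the sketch assumes each skill carries three increasingly detailed descriptions, of which only the requested depth is placed in the prompt.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Skill:
    name: str
    summary: str    # level 1: minimal context
    details: str    # level 2: more detailed context
    full_text: str  # level 3: full context


def render(skill: Skill, level: int) -> str:
    # Include only as much of the skill as the requested level allows.
    parts = [f"{skill.name}: {skill.summary}"]
    if level >= 2:
        parts.append(skill.details)
    if level >= 3:
        parts.append(skill.full_text)
    return "\n".join(parts)


def build_context(skills: List[Skill], levels: Dict[str, int]) -> str:
    # Skills without an explicit level stay at level 1.
    return "\n\n".join(render(s, levels.get(s.name, 1)) for s in skills)


skills = [
    Skill("plotting", "Draw charts.", "Supports line and bar charts.",
          "Full instructions for axis labels, legends, and export."),
    Skill("cleaning", "Tidy tabular data.", "Handles missing values.",
          "Full instructions for type coercion and deduplication."),
]

minimal = build_context(skills, {})
focused = build_context(skills, {"plotting": 3})
print(len(minimal), len(focused))
```

In this sketch only the skill an agent is actively using gets expanded to level 3, while the rest stay as one-line summaries, which is where the context window savings come from.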

The codebase reflects careful attention to modularity and extensibility. The YAML-driven routing abstracts provider-specific details from the core orchestration logic, improving maintainability.
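As an illustration of how such routing can stay out of the orchestration code, the sketch below resolves a provider per agent from a configuration dictionary, shaped the way a parsed YAML file might look. The schema, the key names, and the model names are placeholders invented for this example; they are not DATAGEN's actual configuration format.

```python
from typing import Any, Dict

# Hypothetical routing table, as it might look after parsing a YAML file.
ROUTING: Dict[str, Any] = {
    "default": {"provider": "openai", "model": "example-default-model"},
    "agents": {
        "report": {"provider": "anthropic", "model": "example-writing-model"},
        "code": {"provider": "ollama", "model": "example-local-model"},
    },
}


def resolve(agent: str, routing: Dict[str, Any] = ROUTING) -> Dict[str, str]:
    # Start from the default entry, then apply any per-agent override.
    merged = dict(routing["default"])
    merged.update(routing.get("agents", {}).get(agent, {}))
    return merged


print(resolve("report"))
print(resolve("search"))  # no override, so the default applies
```

Because callers only ever ask `resolve` for an agent name, swapping a provider for one agent is a configuration change rather than a code change.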

The tradeoff here is complexity: managing multi-level skills and multiple LLM providers requires a solid understanding of the architecture and configuration files. Progressive disclosure also adds some overhead in skill switching and state management.

Quick start

System Requirements

  • Python 3.10 or higher

Installation

  1. Clone the repository:
       git clone https://github.com/starpig1129/DATAGEN.git
  2. Create and activate a Conda virtual environment:
       conda create -n datagen python=3.10
       conda activate datagen
  3. Install dependencies:
       pip install -r requirements.txt
  4. Set up environment variables: rename the file ".env Example" to ".env" and fill in all the values.

This setup gets you the environment ready for running DATAGEN. From there, you can explore the YAML configuration files that define agent routing, skill loading, and LLM provider settings.
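Since the installation steps ask you to fill in every value in .env, a quick check for blank entries before the first run can save a confusing failure later. The helper below is a small sketch and not part of DATAGEN; the key names in the sample are made up, and the parser only handles simple KEY=VALUE lines.

```python
from typing import List


def missing_values(env_text: str) -> List[str]:
    """Return the keys in a .env-style text whose values are empty."""
    missing = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat empty quotes ('' or "") the same as no value at all.
        if not value.strip().strip("'\""):
            missing.append(key.strip())
    return missing


# Hypothetical keys, used only to exercise the function.
sample = '# example\nEXAMPLE_API_KEY=\nEXAMPLE_WORKDIR=./data\nEXAMPLE_TOKEN=""\n'
print(missing_values(sample))
```

To run it against a real file, pass in the file contents, for example `missing_values(open(".env").read())`.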

Verdict

DATAGEN is a solid choice for researchers and developers looking to automate complex data analysis workflows with multi-agent AI orchestration. Its strength lies in the progressive disclosure pattern, which keeps context window usage in check while still supporting multiple LLM providers.

However, it is not a plug-and-play solution: users should be comfortable with YAML configuration, Python 3.10+, and the conceptual overhead of state graph orchestration. The system is well-suited for projects that demand iterative research workflows with human validation and rich external tool integration.

If your work involves automating research pipelines with multiple AI agents and you want fine-grained control over LLM provider routing and context management, DATAGEN provides a flexible, well-structured foundation to build on. The tradeoff is the increased complexity and configuration effort compared to simpler single-agent frameworks.


→ GitHub Repo: starpig1129/DATAGEN ⭐ 1,719 · Python