DATAGEN: a LangGraph multi-agent framework for automated data analysis workflows

DATAGEN tackles a complex problem in AI research automation: coordinating multiple specialized agents to run end-to-end data analysis workflows. It uses LangGraph-based multi-agent orchestration to manage hypothesis generation, human validation, data processing, visualization, literature search, report writing, and quality review in a single system. Its standout feature is a progressive disclosure configuration that loads agent skills in stages to conserve LLM context window space, combined with routing across multiple LLM providers.

LangGraph orchestration for multi-agent research workflows

DATAGEN is a Python 3.10+ framework built on LangChain and LangGraph that orchestrates eight distinct AI agents, each focused on a specific research task: hypothesis, process, visualization, code, search, report, quality review, and note-taking. These agents operate together in a state graph model, where the workflow progresses through different states reflecting the research lifecycle.

The architecture uses a state graph to synchronize and manage workflow transitions, allowing for iterative loops such as human-in-the-loop validation and quality review cycles. This graph-based coordination enables complex dependencies between agents without hardcoding control flow.
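To make the loop structure concrete, here is a minimal plain-Python sketch of a state graph with conditional edges. It deliberately does not use LangGraph's API, and the node names, the `attempts` and `approved` state keys, and the routing rules are all assumptions made for illustration rather than DATAGEN's actual graph.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Node = Callable[[State], State]
Router = Callable[[State], str]

END = "__end__"  # sentinel that stops the run


def hypothesis(state: State) -> State:
    # Hypothetical node: draft (or redraft) a hypothesis.
    attempts = int(state.get("attempts", 0)) + 1
    return {**state, "attempts": attempts, "hypothesis": f"draft {attempts}"}


def human_review(state: State) -> State:
    # Stand-in for a human decision: reject the first draft, accept the second.
    return {**state, "approved": int(state["attempts"]) >= 2}


def report(state: State) -> State:
    return {**state, "report": f"report on {state['hypothesis']}"}


NODES: Dict[str, Node] = {
    "hypothesis": hypothesis,
    "human_review": human_review,
    "report": report,
}

# Each router inspects the current state and names the next node.
# The loop back to "hypothesis" lives here, not inside any node.
EDGES: Dict[str, Router] = {
    "hypothesis": lambda s: "human_review",
    "human_review": lambda s: "report" if s["approved"] else "hypothesis",
    "report": lambda s: END,
}


def run(entry: str, state: State, max_steps: int = 20) -> State:
    current = entry
    trace: List[str] = []
    for _ in range(max_steps):  # guard against endless review loops
        if current == END:
            break
        trace.append(current)
        state = NODES[current](state)
        current = EDGES[current](state)
    return {**state, "trace": trace}


final = run("hypothesis", {})
print(final["trace"])
```

The point of the sketch is that the review loop is expressed as data (the edge table) rather than as nested conditionals inside the agents, which is the property the graph-based design is after.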

Under the hood, the system integrates with multiple LLM providers including OpenAI, Anthropic, Google, Ollama, and Groq. It also uses MCP (Model Context Protocol) servers to enable agents to access external tools like filesystem operations, GitHub, and web search, enhancing the capabilities of each specialized agent.

The note-taking agent preserves context over long research sessions, which helps maintain continuity and prevents loss of information across agent interactions.

Progressive disclosure configuration optimizes multi-provider LLM orchestration

What distinguishes DATAGEN is its progressive disclosure architecture for agent configuration. Instead of loading all agent skills upfront, it uses a three-level skill loading mechanism:

Basic skills loaded first with minimal context
Intermediate skills loaded subsequently with more detailed context
Advanced skills loaded last with full context

Because detailed skill context is loaded only when a task calls for it, this approach conserves LLM context window space and the compute spent processing it. Models and skills are configured and routed through YAML files that support multiple LLM providers, which makes it easy to swap or combine providers for specific agents or skills.
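The three-level idea can be sketched as lazy, cumulative loading. This is a rough illustration only: the `Skill` class, the level names as dictionary keys, and the example skill text are hypothetical, and how DATAGEN decides when to move to a deeper level is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LEVELS = ("basic", "intermediate", "advanced")


@dataclass
class Skill:
    name: str
    # One loader per level, so text is only produced when it is requested.
    loaders: Dict[str, Callable[[], str]]
    _cache: Dict[str, str] = field(default_factory=dict, init=False, repr=False)

    def context(self, level: str) -> str:
        """Return the text for every level up to and including `level`."""
        wanted = LEVELS[: LEVELS.index(level) + 1]
        for lvl in wanted:
            if lvl not in self._cache:
                self._cache[lvl] = self.loaders[lvl]()
        return "\n".join(self._cache[lvl] for lvl in wanted)

    @property
    def loaded_levels(self) -> List[str]:
        return [lvl for lvl in LEVELS if lvl in self._cache]


# Hypothetical skill with three levels of detail.
plotting = Skill(
    name="plotting",
    loaders={
        "basic": lambda: "plotting: draw charts from tabular data",
        "intermediate": lambda: "plotting: supports line, bar, and scatter charts",
        "advanced": lambda: "plotting: full parameter reference",
    },
)

short = plotting.context("basic")       # only the summary enters the prompt
after_basic = plotting.loaded_levels    # deeper levels are still untouched
full = plotting.context("advanced")     # escalate when the task requires it
print(after_basic, plotting.loaded_levels)
```

Under this scheme an agent that never needs the advanced material never spends context on it, at the cost of tracking which levels have already been loaded.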

The codebase reflects careful attention to modularity and extensibility. The YAML-driven routing abstracts provider-specific details from the core orchestration logic, improving maintainability.
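Per-agent provider routing can be pictured as a lookup with a default fallback. In the sketch below the dictionary stands in for a parsed YAML file; its keys, the `resolve` helper, and the model names are placeholders invented for illustration and do not reflect DATAGEN's real schema.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Stand-in for the result of parsing a YAML routing file.
# Every key and model name here is a hypothetical placeholder.
CONFIG: Dict[str, Any] = {
    "default": {"provider": "openai", "model": "example-default-model"},
    "agents": {
        "code": {"provider": "anthropic", "model": "example-code-model"},
        "search": {"provider": "groq", "model": "example-fast-model"},
    },
}


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


def resolve(agent: str, config: Dict[str, Any] = CONFIG) -> Route:
    """Pick the route for an agent, falling back to the default entry."""
    override = config.get("agents", {}).get(agent, {})
    merged = {**config["default"], **override}
    return Route(provider=merged["provider"], model=merged["model"])


print(resolve("code"))    # uses the agent-specific entry
print(resolve("report"))  # no entry, so the default applies
```

Because the orchestration code only ever sees a `Route`, swapping the provider for one agent is a configuration change rather than a code change.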

The tradeoff here is complexity: managing multi-level skills and multiple LLM providers requires a solid understanding of the architecture and configuration files. The progressive disclosure also means some overhead in skill switching and state management.

Quick start

System Requirements

Python 3.10 or higher

Installation

Clone the repository:

git clone https://github.com/starpig1129/DATAGEN.git

Create and activate a Conda virtual environment:

conda create -n datagen python=3.10
conda activate datagen

Install dependencies:

pip install -r requirements.txt

Set up environment variables: rename the .env Example file to .env and fill in all the values.

This setup prepares the environment for running DATAGEN. From there, you can explore the YAML configuration files that define agent routing, skill loading, and LLM provider settings.
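Before the first run, it can help to confirm that no value in .env was left blank. The following is a small standalone sketch, not part of DATAGEN; the key names in the sample are hypothetical, and the parser handles only simple KEY=VALUE lines.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_values(text: str) -> List[str]:
    """Return the keys whose value is empty."""
    return [key for key, value in parse_env(text).items() if not value]


# Hypothetical keys, for illustration only.
sample = """\
# example entries
EXAMPLE_API_KEY=abc123
EXAMPLE_WORKDIR=
"""

print(missing_values(sample))
```

To check a real file, pass its contents instead of the sample, for example `missing_values(open(".env").read())`.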

Verdict

DATAGEN is a solid choice for researchers and developers looking to automate complex data analysis workflows with multi-agent AI orchestration. Its strength lies in the progressive disclosure pattern that balances context window usage and multi-provider support.

However, it is not a plug-and-play solution: users should be comfortable with YAML configuration, Python 3.10+, and the conceptual overhead of state graph orchestration. The system is well-suited for projects that demand iterative research workflows with human validation and rich external tool integration.

If your work involves automating research pipelines with multiple AI agents and you want fine-grained control over LLM provider routing and context management, DATAGEN provides a flexible, well-structured foundation to build on. The tradeoff is the increased complexity and configuration effort compared to simpler single-agent frameworks.

→ GitHub Repo: starpig1129/DATAGEN ⭐ 1,719 · Python

Noureddine RAMDI / DATAGEN: a LangGraph multi-agent framework for automated data analysis workflows