WhyHow Knowledge Graph Studio: building RAG-native knowledge graphs with MongoDB and OpenAI

WhyHow Knowledge Graph Studio takes a different approach to building knowledge graphs for retrieval-augmented generation (RAG) workflows. Instead of relying on a traditional graph database like Neo4j, it stores triple-based graphs in MongoDB, a NoSQL document database, and pairs them with OpenAI embeddings so that unstructured and structured data can be queried together. This architecture offers flexibility and scalability for AI-driven knowledge management, but it also involves tradeoffs worth understanding.

what WhyHow Knowledge Graph Studio does and how it works

This open-source Python platform is designed specifically for creating and managing knowledge graphs optimized for RAG workflows. The core concept is representing knowledge as triples—head, relation, and tail—linked to chunks of source data for provenance. This design supports flexible schema-less graphs that can incorporate both structured entities and unstructured text.
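To make the triple-plus-provenance idea concrete, here is a minimal sketch in plain Python. The class and field names (`Node`, `Triple`, `Chunk`, `chunk_ids`) are illustrative assumptions, not the data models the repo actually defines.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    label: str  # entity type, e.g. "Person" (hypothetical labels)


@dataclass
class Chunk:
    chunk_id: str
    text: str  # the piece of source data the triple was extracted from


@dataclass
class Triple:
    head: Node
    relation: str
    tail: Node
    # Provenance: ids of the chunks that support this triple (assumed field name).
    chunk_ids: list[str] = field(default_factory=list)


def provenance(triple, chunks_by_id):
    """Return the source text of every known chunk that supports a triple."""
    return [chunks_by_id[cid].text for cid in triple.chunk_ids if cid in chunks_by_id]


chunks = {"c1": Chunk("c1", "Ada Lovelace worked with Charles Babbage.")}
triple = Triple(
    head=Node("Ada Lovelace", "Person"),
    relation="worked_with",
    tail=Node("Charles Babbage", "Person"),
    chunk_ids=["c1"],
)
print(provenance(triple, chunks))
```

Keeping chunk ids on the triple, rather than copying text into it, lets one chunk back many triples and lets a retrieved triple be traced back to its source passage.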

Under the hood, WhyHow uses MongoDB as the primary storage layer. MongoDB’s document model fits well with the chunk-based storage of text and triples, providing scalable and flexible data management. Although MongoDB is not a native graph database, the repo implements graph construction and traversal logic at the application level, allowing custom graph schema definitions and rule-based entity resolution.
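Application-level traversal can be pictured as follows: triple documents are read from the store, turned into an adjacency map, and walked in code. This is a sketch of the general technique (breadth-first search over triples), assuming a simple `head`/`relation`/`tail` document shape; it is not the repo's traversal code.

```python
from collections import defaultdict, deque

# Triple documents as they might come back from a document store (assumed shape).
triple_docs = [
    {"head": "WhyHow", "relation": "uses", "tail": "MongoDB"},
    {"head": "MongoDB", "relation": "stores", "tail": "Triple"},
    {"head": "Triple", "relation": "links_to", "tail": "Chunk"},
]


def build_adjacency(docs):
    """Group outgoing (relation, tail) edges by head node."""
    adjacency = defaultdict(list)
    for doc in docs:
        adjacency[doc["head"]].append((doc["relation"], doc["tail"]))
    return adjacency


def neighbors_within(adjacency, start, max_hops):
    """Breadth-first search: map each reachable node to its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for _relation, tail in adjacency.get(node, []):
            if tail not in distances:
                distances[tail] = distances[node] + 1
                queue.append(tail)
    return distances


adjacency = build_adjacency(triple_docs)
print(neighbors_within(adjacency, "WhyHow", max_hops=2))
```

Every hop here is a dictionary lookup in application memory (or, against a live database, another query), which is the cost a native graph database avoids with index-free adjacency.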

The system exposes an API-first architecture, backed by a Python SDK, enabling programmatic creation and management of workspaces, chunks, triples, and graphs. Integration with OpenAI’s API provides embeddings for vector similarity search and generative AI tasks, facilitating retrieval and augmentation in AI pipelines.

The platform also supports command-line tooling for administrative tasks like setting up collections and users, and can be deployed via Docker containers, making it suitable for development and production environments.

the architecture and technical considerations behind WhyHow

What distinguishes WhyHow is the choice of MongoDB as a flexible NoSQL backend over more specialized graph databases. This is a clear tradeoff: MongoDB offers scalability and ease of use with flexible document schemas but does not provide native graph query optimizations like index-free adjacency or built-in graph algorithms.

The triple-based construction pattern (head-relation-tail) is a classic graph representation but here it’s enhanced by associating triples with chunks of source data. These chunks are stored as documents in MongoDB and enriched with OpenAI embeddings, which enables semantic search and similarity queries alongside traditional graph queries.
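The combination of semantic search and graph lookup can be sketched like this: rank chunks by cosine similarity to a query vector, then return the triples those chunks support. The three-dimensional vectors and the `embedding` and `chunk_ids` field names are toy assumptions; real OpenAI embeddings have far more dimensions, and the repo's actual query path may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_docs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, d["embedding"]), d["_id"]) for d in chunk_docs]
    scored.sort(reverse=True)
    return [chunk_id for _score, chunk_id in scored[:k]]


def triples_for_chunks(triple_docs, chunk_ids):
    """Return the triples supported by at least one of the given chunks."""
    wanted = set(chunk_ids)
    return [t for t in triple_docs if wanted & set(t["chunk_ids"])]


# Toy data: hand-written vectors stand in for stored embeddings.
chunk_docs = [
    {"_id": "c1", "embedding": [1.0, 0.0, 0.0]},
    {"_id": "c2", "embedding": [0.0, 1.0, 0.0]},
    {"_id": "c3", "embedding": [0.9, 0.1, 0.0]},
]
triple_docs = [
    {"head": "A", "relation": "r1", "tail": "B", "chunk_ids": ["c1"]},
    {"head": "C", "relation": "r2", "tail": "D", "chunk_ids": ["c2"]},
]

hits = top_k_chunks([1.0, 0.0, 0.0], chunk_docs, k=2)
print(hits, triples_for_chunks(triple_docs, hits))
```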

Rule-based entity resolution is another key feature—this helps reconcile different mentions of the same entity across documents and chunks, which is essential in noisy or heterogeneous data environments. The modular architecture separates concerns clearly: data ingestion, embedding generation, entity resolution, and graph construction.
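As an illustration of what rule-based resolution can look like, the sketch below applies two hypothetical rules: normalize a mention (lowercase, strip punctuation, drop corporate suffixes), then map known aliases to a canonical name. The rule set, the `ALIASES` table, and the function names are assumptions made for this example; the article does not show the rule format WhyHow uses.

```python
import re

# Hypothetical rules: an explicit alias table plus suffixes to ignore.
ALIASES = {"big blue": "ibm"}
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc"}


def canonical_name(mention):
    """Reduce a raw entity mention to a canonical key."""
    words = re.sub(r"[^\w\s]", "", mention.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()  # "ibm corp" -> "ibm"
    name = " ".join(words)
    return ALIASES.get(name, name)


def resolve_triples(triples):
    """Rewrite (head, relation, tail) tuples to canonical names and drop duplicates."""
    resolved = {
        (canonical_name(head), relation, canonical_name(tail))
        for head, relation, tail in triples
    }
    return sorted(resolved)


raw = [
    ("IBM Corp.", "acquired", "Red Hat, Inc."),
    ("Big Blue", "acquired", "Red Hat"),
]
print(resolve_triples(raw))
```

Both raw triples collapse into a single edge once their mentions resolve to the same canonical names, which is the effect entity resolution is meant to have on a noisy graph.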

The API-first design and Python SDK give a clean developer experience. The codebase appears well-structured with clear modules for CLI commands, API handlers, and data models. Docker support and CLI admin scripts contribute to ease of deployment and maintenance.

However, there are limitations to this approach. Because MongoDB is not a native graph database, graph queries run in application logic and are unlikely to match the performance of a native graph engine. Complex multi-hop traversals may become bottlenecks at scale. The platform also currently targets MongoDB only, though the project plans to become database-agnostic, which would broaden its applicability.

quick start with WhyHow Knowledge Graph Studio

Installation

This client requires Python version 3.10 or higher.

To install the package, first clone the repo and install it with pip:

$ git clone git@github.com:whyhow-ai/knowledge-graph-studio.git
$ cd knowledge-graph-studio
$ pip install .

If you are a developer you probably want to use an editable install. Additionally, you need to install development and documentation dependencies.

$ pip install -e ".[dev,docs]"

Quickstart

1. Pre-requisites

To follow this quickstart with the WhyHow API, you will need the following:

- OpenAI API key
- MongoDB account
  - You must create a project and cluster in MongoDB Atlas (dedicated M10+ recommended for best performance)

2. Configuration

Environment Variables

Copy the .env.sample file to .env and update the values for your environment. To get started with this version, you need to provide values for MongoDB and OpenAI.

$ cp .env.sample .env

To get started, you must configure, at minimum, the following environment variables:

WHYHOW__EMBEDDING__OPENAI__API_KEY=<your openai api key>
WHYHOW__GENERATIVE__OPENAI__API_KEY=<your openai api key - can be the same>
WHYHOW__MONGODB__USERNAME=<your Atlas database username>
WHYHOW__MONGODB__PASSWORD=<your Atlas database password>
WHYHOW__MONGODB__DATABASE_NAME=main
WHYHOW__MONGODB__HOST=<your Atlas host, e.g. 'xxx.xxx.mongodb.net'>
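The double-underscore naming suggests nested settings (`WHYHOW__MONGODB__HOST` reads as `mongodb.host`). The sketch below shows how such variables could be folded into a nested dictionary and turned into an Atlas-style `mongodb+srv://` connection string. The helper names `load_settings` and `mongodb_uri` are hypothetical, and how the application itself parses these variables or builds its URI is an assumption, not something the article documents.

```python
from urllib.parse import quote_plus


def load_settings(environ, prefix="WHYHOW__", delimiter="__"):
    """Fold PREFIX__A__B=value entries into {"a": {"b": value}}."""
    settings = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        path = key[len(prefix):].lower().split(delimiter)
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return settings


def mongodb_uri(mongo):
    """Build an SRV connection string; credentials are percent-encoded."""
    user = quote_plus(mongo["username"])
    password = quote_plus(mongo["password"])
    return f"mongodb+srv://{user}:{password}@{mongo['host']}/{mongo['database_name']}"


# Placeholder values only; in practice pass os.environ instead of this dict.
example_env = {
    "WHYHOW__EMBEDDING__OPENAI__API_KEY": "sk-placeholder",
    "WHYHOW__MONGODB__USERNAME": "kg_user",
    "WHYHOW__MONGODB__PASSWORD": "p@ss/word",
    "WHYHOW__MONGODB__DATABASE_NAME": "main",
    "WHYHOW__MONGODB__HOST": "xxx.xxx.mongodb.net",
    "PATH": "/usr/bin",
}

settings = load_settings(example_env)
print(mongodb_uri(settings["mongodb"]))
```

Percent-encoding the username and password matters because characters such as `@` or `/` in a password would otherwise be read as URI delimiters.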

Create Collections

Once you have configured your environment variables, you must create the database, collections, and indexes in your Atlas cluster. To simplify this, the repo includes a CLI script in src/whyhow_api/cli/. To set this up, run the following:

$ cd src/whyhow_api/cli/
$ python admin.py setup-collections --config-file collection_index_config.json

This script will create 11 collections: chunk, document, graph, node, query, rule, schema, task, triple, user, and workspace. To verify, browse your collections in your MongoDB Atlas dashboard.
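If you prefer to check from code rather than the dashboard, a small helper can compare the collections that exist against the eleven listed above. This is a sketch: `missing_collections` is a hypothetical helper, and the commented usage assumes `pymongo` is installed and that `uri` holds your own connection string.

```python
EXPECTED_COLLECTIONS = {
    "chunk", "document", "graph", "node", "query", "rule",
    "schema", "task", "triple", "user", "workspace",
}


def missing_collections(existing_names):
    """Return the expected collection names that are not present, sorted."""
    return sorted(EXPECTED_COLLECTIONS - set(existing_names))


# Against a live cluster (assumes pymongo and a valid connection string `uri`):
#   from pymongo import MongoClient
#   names = MongoClient(uri)["main"].list_collection_names()
#   print(missing_collections(names))

print(missing_collections(["chunk", "graph"]))
```

An empty list means all eleven collections exist; index creation would still need to be checked separately.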

verdict

WhyHow Knowledge Graph Studio offers a practical, flexible platform for developers building RAG-native knowledge graphs that combine unstructured and structured data using OpenAI embeddings. Its MongoDB backend is a deliberate tradeoff favoring schema flexibility and scalability over native graph query performance.

For teams comfortable with Python and MongoDB who want an API-first approach and integration with OpenAI, this repo provides a solid foundation. However, if your use case demands complex graph analytics or very high-performance graph traversals, a native graph database might be more appropriate.

The code quality and modular design make it a promising starting point for experimentation and extending knowledge graph capabilities in AI pipelines. The detailed quickstart also lowers the barrier to entry, making it accessible to practitioners looking to integrate knowledge graphs into RAG workflows.

→ GitHub Repo: whyhow-ai/knowledge-graph-studio ⭐ 918 · Python

Noureddine RAMDI / WhyHow Knowledge Graph Studio: building RAG-native knowledge graphs with MongoDB and OpenAI