Noureddine RAMDI / Mapping the open source data engineering landscape: a curated catalog of storage engines and databases

Created Mon, 04 May 2026 10:23:01 +0000 Modified Sat, 23 May 2026 20:41:27 +0000

pracdata/awesome-open-source-data-engineering

The fragmentation in data engineering storage is striking: the landscape today features over 15 categories of databases, each making a different trade-off between consistency, latency, and scale. From embedded key-value stores to distributed SQL engines, streaming databases, and real-time OLAP systems, the choice of storage technology profoundly shapes your platform architecture.

What the awesome-open-source-data-engineering repo catalogs

This repository is a curated catalog of open source data engineering tools organized by architectural layer and data model. It doesn’t provide a single product or library but instead serves as a comprehensive reference for practitioners and architects navigating the complex data storage ecosystem.

The catalog covers the entire analytics stack, starting with embedded storage engines like RocksDB and BadgerDB, which provide low-level key-value storage often embedded in applications or edge systems. It moves up to distributed SQL databases such as CockroachDB and YugabyteDB that offer horizontal scalability and strong consistency for transactional workloads.

Next, the list includes streaming databases like RisingWave and Materialize, designed for real-time data processing and continuous queries on streaming data. At the analytical end of the stack, it covers real-time OLAP engines such as ClickHouse, StarRocks, and Apache Doris, optimized for low-latency analytical queries over large volumes of data.

Beyond these core categories, the catalog also tracks ecosystem churn by marking inactive or archived projects, which is crucial for understanding the stability and maturity of various tools. It highlights emerging trends such as Postgres analytics extensions (e.g., pg_duckdb), LLM-optimized graph databases like FalkorDB, and cloud-native time-series systems like GreptimeDB and HoraeDB.

With 551 stars, the repo is primarily a landscape reference, useful for architecture decisions, technology radar exercises, and keeping up with the evolving data engineering stack.

What makes this catalog technically valuable

The strength of this repo lies in its breadth and structured organization rather than in executable code. It offers a panoramic view of the data engineering storage layer, highlighting the fragmentation and specialization that have taken place over the past decade.

The catalog makes the trade-offs between database categories easy to compare: embedded key-value stores optimize for local, low-latency storage with minimal dependencies; distributed SQL engines emphasize consistency and scale; streaming databases target low-latency continuous queries; and OLAP engines focus on high-throughput analytical workloads.

By marking inactive or archived projects, the catalog also provides insight into where the ecosystem is stable and where it is volatile. For example, document stores and graph databases show significant churn, indicating shifting research and commercial interest.
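As a rough illustration of how churn could be read off the status markers, the sketch below computes the share of inactive entries per category. The `Entry` type, the `churn_by_category` helper, and all sample entries are hypothetical placeholders; they are not taken from the catalog, and the repo does not ship any such script.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    category: str
    active: bool  # False stands in for an "inactive or archived" marker


def churn_by_category(entries):
    """Return the fraction of entries per category marked inactive or archived."""
    total = Counter(e.category for e in entries)
    inactive = Counter(e.category for e in entries if not e.active)
    # Counter returns 0 for categories with no inactive entries.
    return {category: inactive[category] / count for category, count in total.items()}


# Placeholder data for illustration only; not real catalog entries.
entries = [
    Entry("doc-store-a", "document store", True),
    Entry("doc-store-b", "document store", False),
    Entry("doc-store-c", "document store", False),
    Entry("olap-a", "real-time OLAP", True),
    Entry("olap-b", "real-time OLAP", True),
]

for category, share in sorted(churn_by_category(entries).items()):
    print(f"{category}: {share:.0%} inactive")
```

A high inactive share in a category suggests more caution before committing to a project in it, though the ratio alone says nothing about why the projects stalled.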

The inclusion of recent trends like LLM-optimized databases shows that the catalog tracks how emerging AI workloads influence data infrastructure needs.

While the repo itself is not a library or tool to deploy, its value is in surfacing the right technology for the right use case, helping teams avoid the trap of one-size-fits-all solutions and understand the implications of choosing a particular storage engine.

Explore the project

Since this repo is a catalog rather than a software package, there are no installation commands or quickstart scripts. Instead, the best way to leverage it is to explore its organized lists and documentation.

The repo’s README is the entry point, categorizing tools by architectural layer and data model. Each section links to individual projects, often with status markers indicating activity or archival state. This helps prioritize which tools are actively maintained.

If you’re architecting a data platform, start by identifying your workload type (transactional, streaming, or analytical) and then explore the corresponding categories in the catalog. For example, if you need a distributed SQL layer, look into CockroachDB, YugabyteDB, and similar entries. For continuous queries over streams, look at RisingWave or Materialize; for real-time analytics, check out ClickHouse, StarRocks, or Apache Doris.
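The workload-first lookup described above can be sketched as a small table. This is a minimal illustration that uses only the projects named in this article; the `SHORTLIST` table and `shortlist` function are hypothetical names, and the catalog itself lists many more entries per category.

```python
from __future__ import annotations

# Hypothetical shortlist: workload type -> (catalog category, example projects).
# Only projects mentioned in this article are included.
SHORTLIST: dict[str, tuple[str, list[str]]] = {
    "transactional": ("distributed SQL", ["CockroachDB", "YugabyteDB"]),
    "streaming": ("streaming database", ["RisingWave", "Materialize"]),
    "analytical": ("real-time OLAP", ["ClickHouse", "StarRocks", "Apache Doris"]),
}


def shortlist(workload: str) -> tuple[str, list[str]]:
    """Return (catalog category, example projects) for a workload type."""
    key = workload.strip().lower()
    if key not in SHORTLIST:
        raise ValueError(
            f"unknown workload {workload!r}; expected one of {sorted(SHORTLIST)}"
        )
    return SHORTLIST[key]


category, projects = shortlist("Streaming")
print(f"{category}: {', '.join(projects)}")
```

A shortlist like this is only a starting point: each candidate still needs to be checked against its status marker in the catalog and evaluated on your own workload.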

The repository also serves as a technology radar: by tracking trends and new entrants, it helps teams stay informed about emerging directions such as cloud-native time-series databases or AI-optimized data stores.

Verdict

This curated catalog is a solid resource for data engineers, architects, and platform owners who want a broad and organized overview of open source data storage technologies. It’s not a plug-and-play tool but a reference to inform technology choices and understand the landscape’s complexity.

Its main limitation is that it offers no benchmarks or integration guidance; readers have to dig into the individual projects themselves. Still, the repo’s structured approach saves time and reduces the noise around data engineering tooling.

If you’re making architecture decisions or tracking ecosystem trends in data engineering, this repo is worth bookmarking. It helps frame the tradeoffs between consistency, latency, scale, and workload specialization that define modern data platform design.


→ GitHub Repo: pracdata/awesome-open-source-data-engineering ⭐ 551