Hivemind: decentralized peer-to-peer deep learning with PyTorch

Hivemind rethinks distributed deep learning by ditching the traditional master-worker setup in favor of a fully decentralized peer-to-peer architecture. Instead of relying on a single coordinator node, Hivemind uses a Distributed Hash Table (DHT) to connect participants across the internet. This design enables a more fault-tolerant, scalable, and flexible approach to collaborative model training on unreliable, volunteer hardware.

What Hivemind does and how it works

Hivemind is a PyTorch library for decentralized deep learning across nodes connected over the internet. Where standard distributed training frameworks depend on a master-worker or parameter-server architecture, Hivemind coordinates its peers through a Distributed Hash Table (DHT) alone.

The DHT acts as the foundational layer for peer discovery, routing, and decentralized coordination. This means there is no central point of failure or synchronization bottleneck. Nodes can join and leave dynamically without halting training.
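To make the idea of coordination without a central node concrete, here is a minimal in-memory sketch of a DHT. It is not Hivemind's API: the ToyDHT class, the peer names, the key, and the address are all hypothetical, and the SHA-1 IDs with an XOR distance metric are an assumption borrowed from classic Kademlia-style designs. Each key is replicated on the few peers whose IDs are closest to it, so a lookup still succeeds after one of those peers disappears.

```python
import hashlib


def node_id(name: str) -> int:
    """Map a peer name or a key to a 160-bit integer ID (assumption: SHA-1 IDs)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class ToyDHT:
    """In-memory stand-in for a DHT: each key lives on the `replicas` closest peers."""

    def __init__(self, replicas: int = 3):
        self.replicas = replicas
        self.peers = {}  # peer name -> that peer's local key/value storage

    def join(self, name: str) -> None:
        self.peers[name] = {}

    def leave(self, name: str) -> None:
        # The peer vanishes together with everything it stored.
        self.peers.pop(name, None)

    def closest(self, key: str):
        # XOR distance between IDs decides which peers are responsible for a key.
        target = node_id(key)
        return sorted(self.peers, key=lambda p: node_id(p) ^ target)[: self.replicas]

    def store(self, key: str, value) -> None:
        for peer in self.closest(key):
            self.peers[peer][key] = value

    def get(self, key: str):
        for peer in self.closest(key):
            if key in self.peers[peer]:
                return self.peers[peer][key]
        return None


dht = ToyDHT(replicas=3)
for i in range(10):
    dht.join(f"peer-{i}")

dht.store("expert.0.address", "10.0.0.7:31337")
holders = dht.closest("expert.0.address")
dht.leave(holders[0])  # one of the three replica holders drops out
print(dht.get("expert.0.address"))  # the remaining replicas still answer
```

A real DHT also has to route requests over the network, refresh replicas, and expire stale entries; the sketch only shows why losing a single peer does not lose the record.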

Two features make this workable on unreliable hardware. Fault-tolerant backpropagation lets forward and backward passes succeed even if some nodes are unresponsive or take too long to respond. Decentralized parameter averaging, based on the Moshpit SGD algorithm, iteratively aggregates updates from multiple workers without synchronizing across the entire network, which avoids the global barriers common in synchronous SGD.
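The averaging idea can be illustrated with a small simulation. This is a simplified sketch of grid-based group averaging in the spirit of Moshpit SGD, not the library's implementation: the grid_average function, the grid shape, and the scalar "parameters" are hypothetical. Peers sit on a virtual grid, and in each round a peer averages only with the peers that share all of its coordinates except one. On a complete grid, every peer holds the global mean after one round per axis, although no round involved the whole network.

```python
from itertools import product


def grid_average(values: dict, dims: tuple) -> dict:
    """Average along one grid axis per round; peers only talk to their current group."""
    values = dict(values)
    for axis in range(len(dims)):
        groups = {}
        for coord in values:
            # Peers that agree on every coordinate except `axis` form one group.
            group_key = coord[:axis] + coord[axis + 1:]
            groups.setdefault(group_key, []).append(coord)
        for members in groups.values():
            mean = sum(values[c] for c in members) / len(members)
            for c in members:
                values[c] = mean
    return values


dims = (3, 4)  # 12 peers on a 3 x 4 grid
peers = {
    coord: float(i)
    for i, coord in enumerate(product(*(range(d) for d in dims)))
}
global_mean = sum(peers.values()) / len(peers)

averaged = grid_average(peers, dims)
print(global_mean, set(averaged.values()))
```

If a peer is missing, its groups simply get smaller and the rounds still complete. In that case the result is in general only an approximation of the global mean, which repeated rounds can then refine.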

Hivemind also implements Decentralized Mixture-of-Experts (MoE), in which parts of a network's layers are distributed across participants. This lets a group collaboratively train models that exceed the memory or compute capacity of any individual node.
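How a mixture-of-experts layer can tolerate unresponsive peers is easier to see in code. The following is a toy sketch under stated assumptions, not Hivemind's MoE layer: call_expert, moe_forward, the expert names, and the gate scores are hypothetical, experts are plain local functions, and a failure is simulated by raising TimeoutError. The layer picks the top-k experts by gate score, drops those that fail, and renormalizes the softmax weights over the experts that did respond.

```python
import math


def call_expert(expert, x):
    """Hypothetical remote call: returns the expert's output, or None if it failed."""
    try:
        return expert(x)
    except TimeoutError:
        return None


def moe_forward(x, experts, gate_scores, k=2):
    # Pick the k experts with the highest gate score for this input.
    chosen = sorted(gate_scores, key=gate_scores.get, reverse=True)[:k]
    outputs = {name: call_expert(experts[name], x) for name in chosen}
    alive = {name: out for name, out in outputs.items() if out is not None}
    if not alive:
        raise RuntimeError("no selected expert responded")
    # Softmax over the responders only, so the weights still sum to 1.
    top = max(gate_scores[name] for name in alive)
    exp = {name: math.exp(gate_scores[name] - top) for name in alive}
    total = sum(exp.values())
    return sum(exp[name] / total * alive[name] for name in alive)


def slow_expert(x):
    raise TimeoutError("peer did not answer in time")


experts = {"double": lambda x: 2 * x, "square": lambda x: x * x, "slow": slow_expert}
scores = {"double": 1.0, "square": 0.5, "slow": 3.0}

# "slow" has the highest score but fails, so "double" carries the full weight.
print(moe_forward(3.0, experts, scores, k=2))  # 6.0
```

The same renormalization applies in the backward pass: gradients flow only through the experts that answered, so a slow peer costs some model capacity for that step rather than stalling it.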

Under the hood, the networking stack relies on go-libp2p-daemon, a standalone daemon built on the Go implementation of libp2p, which handles peer-to-peer communication and routing.

Hivemind underpins Petals, a decentralized platform for inference and fine-tuning of 100B+ parameter language models, and was used to collaboratively pretrain sahajBERT, a community-driven ALBERT model for Bengali. It was also featured in a NeurIPS 2021 demonstration of collaborative transformer training.

Technical strengths and design tradeoffs

The core technical strength of Hivemind lies in its decentralized architecture based on the DHT, which fundamentally changes how distributed training is coordinated. This approach offers several advantages:

Fault tolerance: Training can proceed uninterrupted despite nodes dropping out or slowing down, thanks to the peer-to-peer design and the Moshpit SGD algorithm.
Scalability: Without a central coordinator, the system can scale to many participants without creating synchronization bottlenecks.
Flexibility: Nodes can join or leave at any time, making it suited for volunteer or unreliable hardware scenarios.
Support for large models: Decentralized Mixture-of-Experts allows splitting models across participants, enabling training beyond single-node memory limits.

The tradeoffs are increased complexity in the networking layer and the difficulty of debugging and monitoring a fully decentralized system. The reliance on go-libp2p-daemon also ties the system to an external binary dependency, which may complicate deployment.

Platform support is also uneven: Linux is the primary and best-tested platform, macOS is only partially supported, and Windows support is experimental and goes through WSL. That limits immediate accessibility for some users.

Code quality is strong, with comprehensive testing and a modular design that separates networking, training algorithms, and compression techniques. The optional blockwise 8-bit compression from bitsandbytes reduces the volume of data transferred between peers, a practical optimization when bandwidth is limited.
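To show why compressing per block helps, here is a pure-Python sketch of blockwise 8-bit quantization. It is an illustration under assumptions, not the bitsandbytes algorithm: it uses a simple linear absmax scheme, and quantize_blockwise, dequantize_blockwise, the block size, and the sample values are hypothetical. Each block stores one float scale plus one signed 8-bit code per value, so a few large values in one block do not wipe out the precision of small values in another.

```python
def quantize_blockwise(values, block_size=4):
    """Return (codes, scales): signed 8-bit codes plus one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Map the largest magnitude in the block to 127; avoid dividing by zero.
        scale = absmax / 127 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Invert quantize_blockwise up to rounding error."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]


# The first block holds small values, the second block much larger ones.
grads = [0.01, -0.02, 0.015, 0.0, 5.0, -3.0, 1.0, 0.5]
codes, scales = quantize_blockwise(grads)
restored = dequantize_blockwise(codes, scales)

for original, approx in zip(grads, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

With a single scale for the whole list, every value in the first block would round to zero, because the scale would be set by the 5.0 in the second block. Per-block scales keep the rounding error of each value within half of its own block's step size.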

Quick start

The project provides straightforward installation options:

Installation with pip

pip install hivemind

For blockwise 8-bit compression support:

pip install "hivemind[bitsandbytes]"

Installation from source

git clone https://github.com/learning-at-home/hivemind.git
cd hivemind
pip install .

For development and testing:

pip install ".[dev]"
pytest tests/

If you encounter compatibility issues with the precompiled go-libp2p-daemon, you can rebuild it locally (requires Go 1.20+):

HIVEMIND_BUILDGO=1 pip install .
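After any of the install routes above, one way to confirm that the package landed in the active environment is to query its metadata from Python. This sketch uses only the standard library; the helper name installed_version is hypothetical, and it checks only that a distribution named hivemind is installed, not that the bundled daemon actually runs on your platform.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("hivemind")
if version:
    print(f"hivemind {version} is installed")
else:
    print("hivemind is not installed in this environment")
```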

System requirements

Linux (recommended, Ubuntu 18.04+ 64-bit preferred)
Partial macOS support (Docker recommended if issues arise)
Windows 10+: experimental support via WSL (WSL can be configured to use a GPU)

Verdict

Hivemind is a solid choice for researchers and developers interested in decentralized, peer-to-peer deep learning training. Its architecture is particularly well suited for collaborative training over unreliable or volunteer hardware, where traditional centralized coordination would be a bottleneck or single point of failure.

The tradeoff is that it requires some comfort with peer-to-peer networking concepts and potentially more complex deployment setups, especially on non-Linux platforms. If your use case involves large-scale model training with many nodes distributed globally, or you want to experiment with decentralized Mixture-of-Experts, Hivemind offers capabilities that typical distributed training frameworks do not.

For those focused on more traditional or tightly controlled cluster environments, the added complexity might not be worth it. But for open, fault-tolerant, and scalable distributed training over the internet, Hivemind is a practical and innovative tool worth understanding.

The codebase is clean and well-tested, and the provided installation and build options make getting started straightforward on supported platforms.

→ GitHub Repo: learning-at-home/hivemind ⭐ 2,458 · Python

Noureddine RAMDI / Hivemind: decentralized peer-to-peer deep learning with PyTorch