DeepEP: Optimizing communication for large Mixture-of-Experts models with CUDA kernels

DeepEP tackles one of the thornier bottlenecks in distributed training and inference of large Mixture-of-Experts (MoE) models: efficient communication between experts on GPUs. The library focuses on expert parallelism (EP), where the experts are spread across devices and tokens must be exchanged in all-to-all patterns that can saturate GPU interconnects and network links. What makes DeepEP worth a look is its highly optimized CUDA kernels for dispatching and combining MoE tokens, and its careful use of NVLink for intranode and RDMA for internode communication, all designed to maximize bandwidth and minimize latency.

what DeepEP does and its architecture

DeepEP is a communication library built specifically for MoE and expert parallelism scenarios in large language models. Its core functionality revolves around efficient all-to-all GPU kernels that perform dispatch and combine operations. These operations are critical for MoE models, where input tokens are routed to different expert networks distributed across multiple GPUs.
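
To make the dispatch and combine steps concrete, here is a minimal pure-Python sketch of the data movement they describe. It is an illustration only, not DeepEP's API: the function names, the toy experts, and the routing weights are all assumptions made for this example, and real kernels move tensors between GPUs rather than Python lists.

```python
from collections import defaultdict

def dispatch(tokens, topk_experts, experts_per_rank):
    """Group (token_index, expert_id, value) entries by the rank owning each expert."""
    per_rank = defaultdict(list)
    for idx, (value, experts) in enumerate(zip(tokens, topk_experts)):
        for expert in experts:
            rank = expert // experts_per_rank  # assumed layout: contiguous experts per rank
            per_rank[rank].append((idx, expert, value))
    return dict(per_rank)

def combine(expert_outputs, weights, num_tokens):
    """Sum the weighted expert outputs back into the original token order."""
    out = [0.0] * num_tokens
    for idx, expert, value in expert_outputs:
        out[idx] += weights[idx][expert] * value
    return out

# Hypothetical inputs: 3 scalar "tokens", 4 experts spread over 2 ranks, top-2 routing.
tokens = [1.0, 2.0, 3.0]
topk_experts = [[0, 3], [1, 2], [0, 1]]
weights = [{0: 0.5, 3: 0.5}, {1: 0.25, 2: 0.75}, {0: 1.0, 1: 0.0}]

buckets = dispatch(tokens, topk_experts, experts_per_rank=2)

# Toy experts: expert e multiplies its input by (e + 1).
processed = [
    (idx, expert, value * (expert + 1))
    for entries in buckets.values()
    for idx, expert, value in entries
]
result = combine(processed, weights, num_tokens=len(tokens))
print(result)
```

Dispatch is the "scatter" half of the all-to-all and combine is the "gather" half; the cost the library targets is the movement of the `buckets` contents between devices.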

The library supports both high-throughput scenarios, like pre-filling tokens in training or batch inference, and low-latency scenarios essential for autoregressive decoding. It achieves this through specialized CUDA kernels tailored for different EP configurations.

Under the hood, DeepEP uses NVLink for intranode communication, reaching roughly 153-158 GB/s for dispatch and combine operations across 8 GPUs. For internode communication, the library uses RDMA networks and sustains about 43 GB/s in a 16-way expert-parallel setup. These numbers come from measurements documented in the README, which reports a performance improvement of about 30% as of mid-2025.
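
As a rough feel for what these bandwidth figures mean, the sketch below estimates the ideal transfer time of a payload at the quoted NVLink and RDMA rates. The 64 MiB payload is a hypothetical value chosen for illustration, and the estimate assumes the quoted figures are sustained rates and ignores latency and protocol overhead.

```python
def transfer_time_us(payload_bytes, bandwidth_gb_per_s):
    """Ideal time in microseconds to move payload_bytes (1 GB = 1e9 bytes)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e6

payload = 64 * 1024 * 1024  # hypothetical: 64 MiB of token data

nvlink_us = transfer_time_us(payload, 153)  # intranode figure quoted above
rdma_us = transfer_time_us(payload, 43)     # internode figure quoted above

print(f"NVLink: {nvlink_us:.0f} us, RDMA: {rdma_us:.0f} us, "
      f"ratio: {rdma_us / nvlink_us:.2f}x")
```

The ratio is simply 153/43, about 3.6x, which is why traffic that crosses nodes tends to dominate the communication budget.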

The codebase is primarily CUDA, with Python bindings allowing easy integration in PyTorch projects (requires PyTorch 2.1+). It assumes modern GPU architectures — Ampere (SM80), Hopper (SM90), or newer — and CUDA 11.0+ or 12.3+ depending on the GPU generation.

what makes DeepEP technically interesting

The standout technical aspect of DeepEP lies in its low-level CUDA kernel design for the all-to-all communication pattern of MoE expert parallelism. These kernels implement asymmetric-domain bandwidth forwarding: data is forwarded between communication domains with very different bandwidths, for example between the NVLink domain inside a node and the RDMA domain between nodes, so that the slower link carries no more traffic than necessary.
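
One way to see why forwarding across domains helps is to count messages on the slower link. The sketch below is a counting model only, not DeepEP's kernel logic: it assumes a token bound for several GPUs on the same remote node can cross RDMA once and then be fanned out over NVLink inside that node. The function name and the routing table are hypothetical.

```python
def count_rdma_sends(dest_gpus_per_token, gpus_per_node, src_node):
    """Return (direct, forwarded) RDMA message counts for a batch of tokens.

    direct:    one RDMA send per remote destination GPU
    forwarded: one RDMA send per remote destination node, then NVLink fan-out
    """
    direct = 0
    forwarded = 0
    for dest_gpus in dest_gpus_per_token:
        remote = [g for g in dest_gpus if g // gpus_per_node != src_node]
        direct += len(remote)
        forwarded += len({g // gpus_per_node for g in remote})
    return direct, forwarded

# Hypothetical routing from a GPU on node 0, with 8 GPUs per node.
routing = [
    [8, 9, 10],   # three destinations, all on node 1
    [1, 12, 13],  # one local GPU, two on node 1
    [16, 17, 8],  # two on node 2, one on node 1
]
direct, forwarded = count_rdma_sends(routing, gpus_per_node=8, src_node=0)
print(direct, forwarded)
```

In this toy batch the RDMA message count drops from 8 to 4, with the remaining fan-out carried by the faster intranode link.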

Another innovative feature is hook-based communication-computation overlap. Instead of occupying Streaming Multiprocessor (SM) resources during communication, DeepEP exposes hooks that let the caller run computation while the transfer is in flight and complete the receive later. This keeps SMs free for compute and reduces communication stalls in distributed model execution.
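
The control flow of a hook-based overlap can be sketched in plain Python. This is a simulation of the pattern under stated assumptions, not DeepEP's interface: `start_receive` and the log are hypothetical names, and the "transfer" is just a list copy deferred until the hook is called.

```python
def start_receive(incoming, log):
    """Issue a simulated transfer and return a hook that completes it later."""
    log.append("issue")
    buffer = []

    def hook():
        # In a real system the data would arrive in the background and this
        # call would only finalize the receive; here we copy it on demand.
        buffer.extend(incoming)
        log.append("receive-complete")
        return buffer

    return hook

log = []
hook = start_receive([1, 2, 3], log)

# Independent computation runs between issuing the transfer and completing it.
log.append("compute")
overlapped = sum(range(10))

received = hook()
print(log, overlapped, received)
```

The point of the pattern is the ordering: the caller decides when to pay for completion, so unrelated work can fill the gap.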

The library also supports low-precision FP8 operations, reflecting the trend towards reduced-precision compute to save memory and bandwidth without sacrificing model quality.
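
The bandwidth effect of FP8 is easy to quantify with simple arithmetic. The sketch below compares raw payload sizes per dtype; the token count and hidden size are hypothetical values, and the estimate ignores any scaling factors or metadata that a real FP8 format would send alongside the data.

```python
BYTES_PER_ELEMENT = {"fp8": 1, "bf16": 2, "fp32": 4}

def payload_bytes(num_tokens, hidden_size, dtype):
    """Raw bytes needed to send num_tokens vectors of hidden_size elements."""
    return num_tokens * hidden_size * BYTES_PER_ELEMENT[dtype]

num_tokens, hidden_size = 4096, 7168  # hypothetical batch and model width

fp8_bytes = payload_bytes(num_tokens, hidden_size, "fp8")
bf16_bytes = payload_bytes(num_tokens, hidden_size, "bf16")

print(f"FP8: {fp8_bytes / 2**20:.0f} MiB, BF16: {bf16_bytes / 2**20:.0f} MiB")
```

Halving the bytes per element halves the data each all-to-all has to move, which matters most on the slower internode link.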

From a code quality perspective, the CUDA kernels are specialized for various expert parallelism sizes and communication scenarios, showing a deep understanding of GPU architecture and communication bottlenecks. There are tradeoffs. Internode communication depends on NVSHMEM, which requires careful installation and environment setup, and the focus on specific GPU architectures limits portability to older or non-NVIDIA hardware.

Performance benchmarks are impressive and concrete. For example, intranode dispatch achieves 153 GB/s on 8 GPUs connected by NVLink, and the low-latency kernels reach a dispatch latency of 77 microseconds at 98 GB/s of RDMA bandwidth. These figures suggest that DeepEP is pushing the limits of what current hardware can deliver for MoE communication.

quick start with DeepEP

Requirements

Ampere (SM80), Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
Python 3.8 and above
CUDA version
- CUDA 11.0 and above for SM80 GPUs
- CUDA 12.3 and above for SM90 GPUs
PyTorch 2.1 and above
NVLink for intranode communication
RDMA network for internode communication
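
The version requirements above can be expressed as a small check. This is a sketch with hypothetical helper names, not part of DeepEP; it only encodes the minimums listed here, treats every compute capability between 8.0 and 9.0 like SM80 (a simplification), and does not verify NVLink or RDMA availability. Newer architectures may need newer toolkits than the minimum shown.

```python
def min_cuda_for(sm):
    """Minimum CUDA (major, minor) listed above for a compute capability tuple."""
    if sm >= (9, 0):
        return (12, 3)
    if sm >= (8, 0):
        return (11, 0)
    return None  # older than Ampere: not listed as supported

def meets_requirements(sm, cuda, python, torch):
    """Check version tuples against the minimums in the requirements list."""
    required_cuda = min_cuda_for(sm)
    return (
        required_cuda is not None
        and cuda >= required_cuda
        and python >= (3, 8)
        and torch >= (2, 1)
    )

# Hypothetical environments.
print(meets_requirements(sm=(9, 0), cuda=(12, 4), python=(3, 10), torch=(2, 3)))
print(meets_requirements(sm=(9, 0), cuda=(12, 1), python=(3, 10), torch=(2, 3)))
```

In a PyTorch environment the inputs could come from `torch.cuda.get_device_capability()`, `torch.version.cuda`, and `sys.version_info`.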

Download and install NVSHMEM dependency

DeepEP also depends on NVSHMEM. Refer to the project's NVSHMEM Installation Guide for instructions.

Installation

NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install

Installation environment variables

NVSHMEM_DIR: the path to the NVSHMEM directory; if not specified, all internode and low-latency features are disabled
DISABLE_SM90_FEATURES: 0 or 1, whether to disable SM90 features; disabling is required for non-SM90 devices or CUDA 11
TORCH_CUDA_ARCH_LIST: the list of target architectures, e.g. TORCH_CUDA_ARCH_LIST="9.0"
DISABLE_AGGRESSIVE_PTX_INSTRS: 0 or 1, whether to disable aggressive load/store instructions; see the README section on undefined-behavior PTX usage for details

Then, import deep_ep in your Python project, and enjoy!
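
For scripted builds, the environment variables above can be assembled in Python before invoking the install command. The helper below is a hypothetical convenience, not something DeepEP ships; it uses only the variable names listed above, and it assumes SM90 features should be disabled exactly when the target is not an SM90 device.

```python
import os

def build_install_env(nvshmem_dir=None, sm90=True, arch_list=None, base_env=None):
    """Return an environment dict for `python setup.py install`."""
    env = dict(os.environ if base_env is None else base_env)
    if nvshmem_dir:
        # Leaving this unset disables internode and low-latency features.
        env["NVSHMEM_DIR"] = nvshmem_dir
    # Assumption: disable SM90 features when not building for an SM90 device.
    env["DISABLE_SM90_FEATURES"] = "0" if sm90 else "1"
    if arch_list:
        env["TORCH_CUDA_ARCH_LIST"] = arch_list
    return env

# base_env={} keeps the printed result small; omit it to inherit os.environ.
env = build_install_env(
    "/path/to/installed/nvshmem", sm90=True, arch_list="9.0", base_env={}
)
command = ["python", "setup.py", "install"]
print(env, command)

# To actually build, run from the DeepEP checkout with the inherited environment:
# import subprocess
# subprocess.run(command, env=build_install_env("/path/to/installed/nvshmem"), check=True)
```

Keeping the environment in one place makes it easier to reproduce a build across nodes with different GPU generations.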

verdict

DeepEP is a specialized but well-engineered tool for anyone working on scaling MoE models with expert parallelism across multiple GPUs and nodes. Its efficient CUDA kernels and advanced communication strategies deliver bandwidth and latency close to hardware limits, making it a solid choice for production systems pushing the boundaries of distributed AI training and inference.

However, its scope is narrow — it’s not a general-purpose communication library and depends heavily on specific hardware and software stacks (NVLink, RDMA, NVSHMEM). Installation and setup can be non-trivial due to these dependencies.

If you’re building or researching large-scale MoE models and need expert parallelism communication optimized for modern NVIDIA GPUs, DeepEP provides a valuable foundation. For more general or cross-vendor needs, other solutions might be more suitable. The code is surprisingly clean for such a low-level CUDA library, and the performance gains justify the complexity for the right use case.


→ GitHub Repo: deepseek-ai/DeepEP ⭐ 9,288 · Cuda

Noureddine RAMDI / DeepEP: Optimizing communication for large Mixture-of-Experts models with CUDA kernels