A production guide to multi-node Mixture-of-Experts (MoE) inference in vLLM: how to model communication versus compute, choose DeepEP low-latency (LL) or high-throughput (HT) kernels by measurement, and tune Dual-Batch Overlap (DBO) and Expert Parallel Load Balancing (EPLB) for maximum throughput.
For ML systems engineers: a practical systems model for when MoE serving scales, when it stalls, and which knobs recover line-rate throughput.
At DeepSeek‑class scale, serving throughput is usually limited by token movement, not tensor-core math. Every Mixture-of-Experts (MoE) layer triggers dispatch/combine traffic through an all-to-all variable-size collective (all‑to‑allv) across the Expert Parallelism (EP) group, repeated at high frequency.
In production the pattern is counterintuitive: graphics processing unit (GPU) count goes up, utilization can go down, network interface controller (NIC) pressure spikes, and tokens/sec can flatten or regress. The root cause is that widening Expert Parallelism (EP) shrinks per-peer payloads, re-exposing fixed communication latency.
At Impala AI, we tune vLLM in exactly this regime. This post distills the model and operating playbook we use to predict behavior, choose DeepEP kernel mode, and map vLLM knobs to measurable outcomes.
Wide‑EP throughput is set by per-peer payload size, not the headline global batch: if widening EP shrinks per-peer messages below the latency floor, adding GPUs stops paying off.
Working terms
- Wide‑EP (Expert Parallelism) — Expert Parallelism across multiple nodes (typically 16–64+ GPUs), where each rank owns a subset of experts and tokens are routed via an all-to-all variable-size collective (all‑to‑allv).
- Dual‑Batch Overlap (DBO) — A double-buffering technique that overlaps communication with compute so one hides the other.
- DeepEP — An open-source CUDA kernel library by DeepSeek (GitHub) providing optimized dispatch/combine kernels for Mixture-of-Experts all‑to‑allv.
- Low-latency (LL) / High-throughput (HT) — DeepEP's two kernel families: low-latency minimizes the fixed overhead $L$; high-throughput maximizes the effective bandwidth $BW_{\text{eff}}$.
- Locality domain — The largest group of GPUs connected by a fast dedicated fabric (e.g., 8 GPUs via NVLink or xGMI within a node, or 72 GPUs in an NVL72 rack). Traffic inside is "cheap"; traffic outside hits the scale-out network.
- $\gamma$ (straggler factor) — Ratio of max to mean per-GPU token load. When $\gamma > 1$, some GPUs are overloaded and the entire EP group waits. EPLB drives $\gamma \to 1$.
- Expert Parallel Load Balancing (EPLB) / Linear-Programming-Based Load Balancer (LPLB) — Expert load-balancing strategies. EPLB replicates hot experts periodically; LPLB solves a per-batch linear program to optimally redistribute tokens across replicas.
- Remote Direct Memory Access (RDMA), network interface controller (NIC), High-Bandwidth Memory (HBM), and key-value (KV) cache are used with their standard systems meanings in this post.
A toy but faithful model for one practical question: why does wide expert parallelism become more attractive as you push toward huge batches, and what exactly does DBO buy you?
Every MoE layer pays dispatch all‑to‑all, then expert compute, then combine all‑to‑all. Wide‑EP spreads experts across many GPUs, which improves capacity and can raise throughput, but it also amplifies collective communication pressure.
DBO does not reduce total work. It changes the schedule so communication can run under compute. When compute is large enough, the exposed critical path approaches $\max(t_{\text{comm}}, t_{\text{compute}})$ instead of $t_{\text{comm}} + t_{\text{compute}}$.
Wide‑EP increases communication volume, but huge batches amortize the fixed costs. DBO helps only when compute can cover communication.
DeepSeek‑V3 and DeepSeek‑R1 represent a new class of “hyper-scale” sparse models. Their specs are daunting:
First, there is the sheer size. In FP8, the model weights alone consume ~700GB. A standard 8xH100 node typically offers 640GB of HBM. You physically cannot fit the model parameters on a single node, let alone leave room for the KV cache. For long‑context R1 reasoning traces, the KV cache alone can exceed the size of the weights.
To serve this, you must scale out to multiple nodes. This forces you into Wide-EP: splitting the experts across a multi-node cluster (e.g., 16 or 32 GPUs) connected by Ethernet or InfiniBand, rather than just NVLink.
This scale-out creates a distributed systems nightmare. MoE models are sparse by design, meaning they have low arithmetic intensity (compute-per-byte) compared to dense models. To keep powerful tensor cores (H100, MI300X, or newer) from idling, you must compensate for this sparsity by processing massive batch sizes—often tens of thousands of concurrent tokens.
This transforms inference from a simple ML task into a complex orchestration challenge.
The cost you pay for this scale is MoE routing: dispatch and combine become high-fanout, all-to-all communication patterns that saturate your network fabric.
Now that we know why Wide‑EP is unavoidable, let’s frame the operational constraints before we quantify the model.
If we only cared about compute, scaling would be easy: add GPUs and increase throughput. Wide‑EP breaks that simplicity because tokens must move.
A few things make this hard in practice:
- The collective is all-to-all variable-size (all‑to‑allv), not all-reduce: every rank sends different amounts to every other rank; routing skew turns "average" into "tail."
- It repeats at high frequency: every MoE layer pays dispatch and combine on every forward pass.
- It is a synchronization barrier: the whole EP group waits for the slowest rank.

We'll now build a model that is simple enough to fit in your head, but rich enough to predict these behaviors.
We’ll use a minimal set of symbols and tie them to what you control in an offline run.
The roofline below shows how the two regimes (latency-bound small messages, bandwidth-bound large ones) play out as message size changes:
As you scale out to more GPUs (increasing the EP width $P$), each rank's share of any fixed batch shrinks, and so does the per-peer payload.

The Trap: If you keep your global batch size $B$ fixed while widening EP, per-peer messages shrink until the fixed latency term dominates; utilization drops and tokens/sec can regress even as you add hardware.

The Solution: To make Wide-EP work, you must increase the global batch size $B$ roughly in proportion to the EP width, keeping per-peer payloads large enough to stay bandwidth-bound.
With the trap identified, let’s quantify the dispatch/combine payload that drives it.
To understand why the network chokes, we need to look at the exact volume of data hitting the wire.
In a Wide-EP setup, every forward pass involves two massive shuffles per layer:

- Dispatch: every token's hidden state is sent to the ranks hosting its top-$k$ experts.
- Combine: the expert outputs are sent back to the token's origin rank and reduced.
For one direction (dispatch or combine), the average payload size per rank is determined by the total volume of routed tokens divided by the number of participants:

$$V_{\text{rank}} = \frac{B \cdot k \cdot d \cdot s}{P}$$

Let's unpack why these terms multiply:

- $B$: global batch size in tokens; every token gets routed.
- $k$: experts per token (top-$k$); each token's hidden state is shipped $k$ times.
- $d$: hidden dimension; each shipment is a full hidden-state vector.
- $s$: bytes per element (1 for FP8, 2 for BF16).
- $P$: EP ranks; the total volume is spread across all participants.
The formula above gives you the idealized average. Real-world serving is messier: routing skew concentrates traffic on the ranks hosting hot experts, and dispatch/combine are rarely perfectly symmetric, so treat it as a lower bound.
Let’s plug in real numbers for a DeepSeek-V3/R1 deployment to see the magnitude of the problem.
Why this matters: 115 MB might sound manageable. But remember:

- That is one direction for one MoE layer; dispatch + combine doubles it.
- It repeats for every MoE layer (~61 in DeepSeek-V3/R1) on every forward pass.
If you are running at 10 steps/second, your NICs are pushing 140 GB/s. This is why we say Wide-EP is network-bound: you are essentially trying to stream the entire active state of the model through the interconnect continuously.
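The arithmetic above can be sketched in a few lines. All the constants here (batch, top-$k$, hidden size, EP width, step rate) are illustrative assumptions in the DeepSeek-V3 ballpark, not measurements:

```python
# Back-of-envelope dispatch/combine traffic for a DeepSeek-V3-class model.
# All constants are illustrative assumptions, not measurements.
B = 32_768         # global batch size (tokens in flight)
k = 8              # experts activated per token (top-k)
d = 7_168          # hidden dimension
s = 1              # bytes per element (FP8 activations)
P = 16             # EP ranks (2 nodes x 8 GPUs)
layers = 61        # MoE layers traversed per forward pass
steps_per_s = 10   # decode steps per second

v_rank = B * k * d * s / P            # per-rank payload, one direction, one layer
per_step = v_rank * 2 * layers        # dispatch + combine, every MoE layer
nic_rate = per_step * steps_per_s     # sustained bytes/s through each NIC

print(f"per-rank payload: {v_rank / 1e6:.0f} MB/layer/direction")
print(f"traffic per step: {per_step / 1e9:.1f} GB")
print(f"sustained NIC rate: {nic_rate / 1e9:.0f} GB/s")
```

With these assumptions the payload lands at ~117 MB and the NIC rate at ~143 GB/s, consistent with the ~115 MB and ~140 GB/s figures above up to rounding.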
Now that we know what we are moving (the payload), we need to know how long it takes. We need a model that links batch size, cluster size, and hardware specs to end-to-end latency.
For the full dispatch+combine round-trip, we use a standard latency-bandwidth model (Hockney):

$$t_{\text{comm}} \approx 2 \cdot \left(L + \frac{V_{\text{rank}}}{BW_{\text{eff}}}\right)$$

This assumes dispatch and combine are roughly symmetric in cost; in practice skew can make combine slightly cheaper (fewer tokens return to hot ranks). The factor of 2 is a conservative upper bound.
Why does this equation matter so much in practice? Because the latency term $L$ is paid per message regardless of size: as $P$ grows and $V_{\text{rank}}$ shrinks, the $2L$ floor dominates and a faster fabric stops helping.
While the network moves data, the GPUs must process it. The time to compute the experts is:

$$t_{\text{compute}} \approx \gamma \cdot \frac{B \cdot k}{P} \cdot c_{\text{tok}}$$

where $B \cdot k / P$ is the average number of routed tokens per rank, $c_{\text{tok}}$ is the grouped-GEMM time per routed token, and $\gamma$ is the straggler factor.
This leads to the fundamental tension of Wide-EP serving.
Widening EP (increasing $P$) divides both the compute and the per-rank payload by $P$, but the fixed latency term $2L$ does not shrink. Compute time falls linearly while communication time approaches a constant floor.

If $t_{\text{comm}} > t_{\text{compute}}$, adding GPUs adds capacity you cannot use: the EP group spends its time waiting on the network, not the tensor cores.
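Here is a minimal sketch of both sides of the race. The values of `L`, `BW`, `c_tok`, and `gamma` are assumed placeholders you would replace with profiled numbers:

```python
# Hockney comm model vs. expert compute for one MoE layer step.
# L, BW, c_tok, gamma are assumed placeholders; profile your own values.
def t_comm(B, P, k=8, d=7168, s=1, L=120e-6, BW=44e9):
    """Dispatch + combine: 2 * (fixed latency + per-rank payload / bandwidth)."""
    v_rank = B * k * d * s / P
    return 2 * (L + v_rank / BW)

def t_compute(B, P, k=8, c_tok=2.0e-6, gamma=1.1):
    """Grouped-GEMM time: straggler factor * routed tokens per rank * time/token."""
    return gamma * (B * k / P) * c_tok

B = 8192  # hold the global batch fixed and widen EP
for P in (8, 16, 32, 64):
    print(f"P={P:>2}  comm={t_comm(B, P) * 1e3:5.2f} ms  "
          f"compute={t_compute(B, P) * 1e3:5.2f} ms")
```

Compute falls linearly with $P$, while comm bottoms out at the $2L$ floor: the trap from the previous section in numeric form.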
The natural question: can we hide the communication behind the compute instead of just racing them? That’s exactly what DBO does.
Sequential execution is wasteful. In a naive implementation, GPUs sit idle while the network moves data, and then the network sits idle while the GPUs crunch numbers.
Dual-Batch Overlap (DBO) is the standard solution. It relies on a technique called double buffering (or “ping-pong” buffering) to parallelize work.
In a standard sequential flow, you have one buffer: the network fills it, then the GPU reads it. The resource not working is idle. With double buffering, we allocate two sets of communication buffers:

- Buffer A: the GPU computes experts for the current micro-batch.
- Buffer B: the network simultaneously dispatches/combines the next micro-batch.
When both are finished, they swap roles. This allows us to fetch the data for the next micro-batch while computing the current one.
Without overlap, you pay the sum of the latencies:

$$t_{\text{step}} = t_{\text{comm}} + t_{\text{compute}}$$
With ideal pipelining (steady state), the step time is determined by the slowest component:

$$t_{\text{step}} \approx \max(t_{\text{comm}}, t_{\text{compute}})$$
This implies a theoretical speedup limit of 2×, achieved only when $t_{\text{comm}} = t_{\text{compute}}$.
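The sequential-vs-pipelined schedule reduces to two one-liners; the timings below are illustrative:

```python
# Sequential vs. double-buffered (DBO) step time; timings are illustrative.
def step_sequential(t_comm, t_compute):
    # One buffer: the network and the GPU take turns.
    return t_comm + t_compute

def step_dbo(t_comm, t_compute):
    # Two buffers in steady state: the slower resource sets the cadence.
    return max(t_comm, t_compute)

t_comm, t_compute = 1.2e-3, 1.5e-3   # seconds, illustrative
speedup = step_sequential(t_comm, t_compute) / step_dbo(t_comm, t_compute)
print(f"speedup = {speedup:.2f}x")   # capped at 2x, hit only when balanced
```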
It also reveals the failure mode.
DBO is not magic. It relies on "compute cover": using the time spent calculating experts to hide the time spent moving tokens. This creates two distinct operating regimes:

- Compute-bound ($t_{\text{compute}} \ge t_{\text{comm}}$): communication is fully hidden; the step runs at compute speed.
- Comm-bound ($t_{\text{comm}} > t_{\text{compute}}$): even perfect overlap leaves the network exposed; grow the batch or improve bandwidth.
Tuning DBO: The Control Panel
- Enable it: `--enable-dbo` is essential for hiding inter-node latency, but it costs HBM (for double buffering).
- Protect small batches: Use `--dbo-prefill-token-threshold` (default 512) and `--dbo-decode-token-threshold` (default 32) to disable overlap when the batch is too small to justify the pipeline overhead. Increase these if you see regressions at low concurrency.
- Balance compute vs. comm: `VLLM_DBO_COMM_SMS` (default 20) reserves GPU SMs to drive the network.
  - Increase if: network transfers are jittery or lagging.
  - Decrease if: expert kernels become the bottleneck and end-to-end throughput drops even when overlap is active. On an H100 (132 SMs), 20 SMs is ~15% of the total; on other GPUs, adjust proportionally.
Modeling the SM split explicitly
Let $S$ be total SMs, $S_{\text{comm}}$ be `VLLM_DBO_COMM_SMS`, and $\rho = S_{\text{comm}}/S$. A practical first-order model is:

$$t_{\text{compute}}(B,\rho) \approx \gamma \cdot \frac{B \cdot k}{P} \cdot \frac{c_{\text{tok},0}}{1-\rho}$$

where $c_{\text{tok},0}$ is time/token with no SM reservation. The communication side also depends on $\rho$ because extra comm SMs improve progress:

$$t_{\text{comm}}(B,\rho) \approx 2 \cdot \left(L(\rho) + \frac{V_{\text{rank}}}{BW_{\text{eff}}(\rho)}\right)$$

So `VLLM_DBO_COMM_SMS` is a true trade-off knob: it can reduce $t_{\text{comm}}$ while increasing $t_{\text{compute}}$.
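A toy version of this SM-split model. The shapes of $L(\rho)$ and $BW_{\text{eff}}(\rho)$ below are assumptions; the real curves must be profiled:

```python
# Toy SM-split model: reserving a fraction rho of SMs for comm slows compute
# by 1/(1-rho) but improves comm progress. The L(rho) and BW(rho) shapes
# below are assumptions; the real curves must be profiled.
S = 132  # SMs on an H100

def t_compute(B, rho, P=16, k=8, c_tok0=1.8e-6, gamma=1.1):
    return gamma * (B * k / P) * c_tok0 / (1 - rho)

def t_comm(B, rho, P=16, k=8, d=7168, s=1):
    L = 150e-6 * (1 - 0.5 * rho)            # assumed: more comm SMs trim latency
    BW = 44e9 * min(1.0, rho / 0.10)        # assumed: BW saturates past ~10% of SMs
    return 2 * (L + (B * k * d * s / P) / BW)

B = 16_384
for sms in (8, 20, 40):                     # candidate VLLM_DBO_COMM_SMS values
    rho = sms / S
    step = max(t_comm(B, rho), t_compute(B, rho))
    print(f"comm SMs={sms:>2}  step={step * 1e3:.2f} ms")
```

In this (compute-bound) toy regime, extra comm SMs only hurt: the "decrease if expert kernels are the bottleneck" rule in numbers.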
We can now turn the intuition from Section 5 into a concrete inequality. Plugging the SM-aware model into the overlap condition $t_{\text{compute}} \ge t_{\text{comm}}$ and solving for $B$ gives a minimum batch size.
The key inequality: minimum batch for overlap
$$B \gtrsim \frac{2 \cdot (P/k) \cdot L(\rho)}{\gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} - \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}}$$

Scale $B$ with $P$, or adding GPUs makes you slower.
Feasibility guard: The denominator can be zero or negative. If the per-token compute cover $\gamma \cdot c_{\text{tok},0}/(1-\rho)$ is smaller than the per-token wire cost $2 d s / BW_{\text{eff}}(\rho)$, no batch size achieves overlap; the fix is structural: raise $BW_{\text{eff}}$ (switch to HT, a better fabric) or cut bytes per token ($s$, e.g., FP8 dispatch).
When the denominator is positive, the numerator still scales linearly with $P$: every doubling of EP width doubles the minimum batch required for overlap.
This is the second "key takeaway": As you add GPUs to your cluster, you must increase your batch size just to maintain the same level of overlap efficiency. If you scale $P$ without scaling $B$, overlap efficiency erodes until communication is fully exposed.
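The inequality is easy to operationalize as a sizing helper. Every constant here is a placeholder to be replaced with values profiled on your own cluster:

```python
# Sizing helper for the minimum-batch inequality. Every constant is a
# placeholder to be replaced with profiled values from your cluster.
def b_min(P, k=8, d=7168, s=1, L=120e-6, BW=44e9,
          c_tok0=1.53e-6, gamma=1.1, rho=0.15):
    per_token_compute = gamma * c_tok0 / (1 - rho)   # compute cover per token
    per_token_wire = 2 * d * s / BW                  # wire cost per token
    denom = per_token_compute - per_token_wire
    if denom <= 0:
        return float("inf")  # infeasible: no batch size achieves overlap
    return 2 * (P / k) * L / denom

for P in (16, 32, 64):
    print(f"P={P:>2}  B_min ~ {b_min(P):,.0f} tokens")  # grows linearly in P
```

Doubling $P$ exactly doubles the threshold, and with a slow enough fabric (small `BW`) the helper returns infinity: the feasibility guard above.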
Single-node setups (NVLink/xGMI only) have much better $L$ and $BW_{\text{eff}}$, so the minimum overlap batch is an order of magnitude smaller than in multi-node deployments.
A practical keep/remove rule for single-node: keep DBO only if profiling shows exposed communication; over NVLink/xGMI, $t_{\text{comm}}$ is often already negligible, and the double-buffer HBM is better spent on KV cache.
Let's put concrete numbers on this for a standard 2-node cluster (16 GPUs total), serving DeepSeek-V3 ($d = 7168$, $k = 8$, FP8 activations).

When you span 2 nodes, your effective bandwidth ($BW_{\text{eff}}$) for any traffic that leaves the node drops from NVLink-class (hundreds of GB/s) to NIC-class (tens of GB/s).
| Hardware Stack | Interconnect | Per-GPU Link Speed | Peak BW (theoretical) | Realized $BW_{\text{eff}}$ | Min Batch $B$ (tokens) |
|---|---|---|---|---|---|
| NVIDIA H100/H200 | InfiniBand NDR / CX7 | 400 Gbps (per‑port, raw) | 50 GB/s | ~36‑41 GB/s | ~72k‑96k |
| AMD MI300X | Broadcom Thor2 / RoCE | 400 Gbps | 50 GB/s | ~32‑38 GB/s | ~80k‑112k |
| AWS P5 (H100) | EFA v2 | 400 Gbps (per‑port, raw) | 50 GB/s | ~26‑34 GB/s | ~96k‑160k |
| NVIDIA B200 | ConnectX‑8 | 800 Gbps (per‑port, raw) | 100 GB/s | ~70‑82 GB/s | ~24k‑40k |
| GB200 NVL72 | NVLink Switch | N/A (In-Rack) | 900 GB/s | ~650‑800 GB/s | ~2k‑8k |
All link speeds are raw signaling rates; effective throughput after encoding/protocol overhead is lower. Realized $BW_{\text{eff}}$ values are typical for MoE all-to-allv traffic on each fabric, and the Min Batch column applies the overlap inequality to them; treat both as estimates to validate on your own cluster.
The Cliff: Inside a single node, the fast fabric (NVLink at ~450 GB/s/GPU, or xGMI at ~300+ GB/s) gives you abundant bandwidth. The moment you add a second node, your effective bandwidth drops to ~40 GB/s (400G IB/RoCE).
This is a 7–10× drop in bandwidth. To maintain the overlap inequality, your minimum batch size must grow by roughly the same factor the moment you cross the node boundary.
The widget below visualizes exactly this dynamic.
For readability, the widget uses an equivalent effective-parameter form of the same model.
Pro tip: on the “Step Time vs Batch Size” chart, focus on where the compute curve sits relative to comm. If compute is below comm, overlap will not save you.
DBO tells you when overlap helps. The next question is which kernel to use for the communication itself.
In Section 5, we modeled communication time as $t_{\text{comm}} = 2 \cdot (L + V_{\text{rank}}/BW_{\text{eff}})$. DeepEP ships two kernel families, each attacking a different term:
LL (Low-Latency): Attacks the fixed-latency term $L$. It issues direct RDMA writes with minimal staging and handshaking, so small payloads get on the wire almost immediately.

HT (High-Throughput): Attacks the bandwidth term. It coalesces tokens into large messages and routes them hierarchically (NVLink inside the node, RDMA across nodes), pushing $BW_{\text{eff}}$ toward line rate.
Why do we model two different bandwidths? Because $BW_{\text{eff}}$ is not a hardware constant; it depends on the traffic shape each kernel family produces on the same NIC:
| Mode | Realized BW (est. on 400G) | Why? |
|---|---|---|
| LL (Direct RDMA) | ~20–30 GB/s (Fragmented) | Many small, non-contiguous messages. Congestion from random access. |
| HT (Hierarchical) | ~42–48 GB/s (Peak-ish) | Few large, coalesced messages. Topology-aware routing minimizes hops. |
Estimates based on typical 400Gbps RDMA efficiency (40–60% for fragmented traffic vs 85%+ for large contiguous bursts). See DeepEP benchmarks for detailed performance characterization of LL vs HT kernels.
This is not a philosophical choice; it is an arithmetic one. You should switch from LL to HT exactly when the bandwidth gains outweigh the setup costs.
vLLM Configuration: Select your kernel backend using `--all2all-backend`.

- Use `deepep_low_latency` for the LL kernel (latency-optimized).
- Use `deepep_high_throughput` for the HT kernel (bandwidth-optimized).

Advanced Tuning Guide:
`VLLM_DEEPEP_BUFFER_SIZE_MB` (default 1024): Controls RDMA buffer size.

- Increase when: you see buffer-overflow errors or run extremely large batches/hidden dims.
- Decrease when: you hit OOM and need to reclaim HBM for KV cache (especially at small batches).

`VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE`:

- Set to 1 for large NVLink domains (e.g., NVL72): treats the entire fabric as one node so HT uses NVLink instead of RDMA.
- Keep at 0 for multi-node clusters: essential for standard H100/MI300X clusters where nodes are connected by IB/RoCE.

`VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL`:

- Set to 1 for multi-node NVLink domains (LL mode): allows the Low-Latency kernel to write directly over cross-node NVLink.
If we model each mode as $t(V) = L_m + V/BW_m$ and set the two equal:

$$L_{\text{LL}} + \frac{V}{BW_{\text{LL}}} = L_{\text{HT}} + \frac{V}{BW_{\text{HT}}}$$

Solving yields the crossover payload:

$$V^{*} = \frac{L_{\text{HT}} - L_{\text{LL}}}{\frac{1}{BW_{\text{LL}}} - \frac{1}{BW_{\text{HT}}}}$$

Note: because LL has lower latency but lower bandwidth than HT ($L_{\text{LL}} < L_{\text{HT}}$ and $BW_{\text{LL}} < BW_{\text{HT}}$), $V^{*}$ is positive and finite: below it LL wins, above it HT wins.
In Impala’s offline huge‑batch setting, we usually operate above this crossover—so HT is often the right starting point. The value of the crossover framing is that you can measure it and justify that choice on your own hardware/topology.
Suppose you measured the following on a 2-node H100 cluster ($P = 16$, $k = 8$, $d = 7168$, FP8, global batch $B = 8192$):
| Parameter | LL kernel | HT kernel |
|---|---|---|
| Fixed latency $L$ | 35 µs | 120 µs |
| Realized $BW_{\text{eff}}$ | 22 GB/s | 44 GB/s |
And from compute profiling: an effective compute cost $\gamma \cdot c_{\text{tok},0}/(1-\rho) \approx 1.98$ µs per routed token.
Step A — Kernel crossover ($V^{*}$): plugging in the measurements, $V^{*} = \frac{(120 - 35)\ \mu s}{(1/22 - 1/44)\ (\text{GB/s})^{-1}} \approx 3.7\ \text{MB}$.

Per-rank payload at $B = 8192$: $V_{\text{rank}} = \frac{8192 \cdot 8 \cdot 7168 \cdot 1}{16} \approx 29.4\ \text{MB}$, an order of magnitude above the crossover, so HT wins.
Step B — DBO feasibility check: compare the per-token compute cover (1.98 µs) against the per-token wire cost, $2 d s / BW_{\text{eff}} = 2 \cdot 7168 \cdot 1 / 44\ \text{GB/s} \approx 0.33\ \mu s$.

Denominator is positive (1.98 > 0.33), so DBO can work. Minimum batch: $B_{\min} = \frac{2 \cdot (16/8) \cdot 120\ \mu s}{1.65\ \mu s} \approx 290$ tokens.
Our batch of 8192 is well above the threshold. Enable DBO with HT kernels.
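The worked example's arithmetic can be checked in a few lines. The latencies and bandwidths are the quoted measurements; the 1.98 µs/token compute cost is taken from the text:

```python
# Checking the worked example. Latencies and bandwidths are the quoted
# measurements; the 1.98 us/token compute cost is taken from the text.
L_LL, L_HT = 35e-6, 120e-6      # fixed latency per direction (s)
BW_LL, BW_HT = 22e9, 44e9       # realized bandwidth (bytes/s)
P, k, d, s, B = 16, 8, 7168, 1, 8192

# Step A: crossover payload where HT's bandwidth beats LL's latency edge.
v_star = (L_HT - L_LL) / (1 / BW_LL - 1 / BW_HT)
v_rank = B * k * d * s / P
print(f"crossover ~ {v_star / 1e6:.1f} MB, payload ~ {v_rank / 1e6:.1f} MB")

# Step B: DBO feasibility with the HT numbers.
per_token_compute = 1.98e-6            # gamma * c_tok0 / (1 - rho), from text
per_token_wire = 2 * d * s / BW_HT     # ~0.33 us/token
b_min = 2 * (P / k) * L_HT / (per_token_compute - per_token_wire)
print(f"B_min ~ {b_min:.0f} tokens; batch {B} clears it easily")
```

The payload (~29 MB) sits far above the ~3.7 MB crossover, so HT is right, and 8192 tokens dwarfs the ~290-token overlap threshold.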
The theory is clear, but the actual crossover point depends on your specific hardware and workload. Use the interactive tool below to explore the trade-off space.
What to look for:
- Use the Batch tokens slider to see exactly where the Low-Latency (LL) and High-Throughput (HT) curves intersect. This is your decision boundary.
- Use Animate to see how LL minimizes the fixed handshake overhead, while HT compresses the bulk transfer time via bandwidth efficiency.

Kernel choice is only half the story: the same model can land in very different comm/compute regimes depending on hardware locality. Before tuning knobs, anchor on where your interconnect cliff appears.
There are two deployment questions hiding inside "which GPU is faster?": how large is the locality domain (scale-up), and how fast is the fabric once you leave it (scale-out)?
Key idea: Wide‑EP economics are set by how much dispatch/combine stays inside the “cheap” locality domain versus how much spills into the “expensive” scale‑out network.
| Stack | Locality domain (cliff boundary) | Scale‑up anchor | Scale‑out anchor | Sources |
|---|---|---|---|---|
| NVIDIA H100/H200 | 8 GPUs / HGX node | NVLink: 900 GB/s bidirectional per GPU | ConnectX‑7: up to 400 Gb/s | H100/H200, Hopper, CX7 |
| NVIDIA GB200 NVL72 | 72 GPUs / rack domain | NVLink Switch: 130 TB/s rack, 1.8 TB/s bidirectional GPU↔GPU (HGX B200) | ConnectX‑8: up to 800 Gb/s | GB200 NVL72, HGX, CX8 |
| AMD MI300X | 8 GPUs / platform | 896 GB/s aggregate bidirectional P2P (spec); ~315–336 GB/s aggregated unidirectional measured xGMI | 400 Gb/s class NIC deployments are common | MI300 platform, ROCm xGMI |
| Public MoE internode reference | Multi‑node DeepEP benchmark | BW_eff (normal kernels): ~43–58 GB/s | BW_eff (low-latency kernels): ~39–46 GB/s | DeepEP README |
For offline serving, cost scales with delivered tokens, not raw FLOPS.
In a standard Wide‑EP deployment, each GPU owns a fixed subset of the 256 logical experts. The router selects experts per token, but not uniformly: some experts are “hot” (popular across the workload) while others are cold. The GPU hosting a cluster of hot experts becomes the straggler—and because all‑to‑allv is a synchronization barrier, the entire EP group waits for it.
Recall the straggler factor from Section 5: $\gamma = \max_i \text{load}_i \,/\, \text{mean}_i \text{load}_i$.

Typical values range from 1.05 (well-balanced) to 1.3+ (heavy skew). At $\gamma = 1.3$, every step takes 30% longer than the balanced ideal, roughly a 23% throughput loss across the entire EP group.
Expert Parallel Load Balancing (EPLB) breaks the 1:1 mapping between logical and physical experts. Instead of each GPU owning exactly one copy of its assigned experts, EPLB allocates redundant physical slots and fills them with replicas of the hottest experts.
If expert #42 gets 3× the average token load, EPLB gives it 3 physical copies spread across different GPUs. Each copy handles ~1/3 of the tokens routed to that expert. The result: the max‑load GPU drops closer to the mean.
EPLB runs periodically during inference (e.g., every 3000 steps) and follows a three-step hierarchical algorithm: pack expert groups onto nodes to balance inter-node load, replicate the heaviest experts into the redundant slots, and assign physical experts to GPUs to equalize per-rank load.
Between rebalancing steps, vLLM tracks per‑expert token counts over a sliding window (default 1000 steps) to detect shifts in routing distribution.
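The core replication idea can be sketched with a greedy heuristic. This is an illustration of the principle, not the exact vLLM/DeepSeek algorithm:

```python
# Greedy sketch of EPLB's replication idea: keep giving the hottest
# per-replica expert another copy until the redundant slots run out.
# Illustration only, not the exact vLLM/DeepSeek algorithm.
def replicate(loads, redundant_slots):
    replicas = [1] * len(loads)
    for _ in range(redundant_slots):
        hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hot] += 1
    return replicas

loads = [300, 100, 100, 60, 40]          # tokens routed per logical expert
replicas = replicate(loads, 3)           # 3 redundant physical slots
per_slot = [l / r for l, r in zip(loads, replicas)]
mean_slot = sum(loads) / (len(loads) + 3)
gamma = max(per_slot) / mean_slot
print(replicas, f"gamma ~ {gamma:.2f}")  # vs ~2.5 with no replication
```

The hot expert ends up with several copies, and the max per-slot load falls toward the mean.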
EPLB's effect is straightforward: it drives $\gamma$ toward 1, so every step of the entire EP group gets faster.

A reduction from $\gamma = 1.3$ to $\gamma = 1.05$ cuts step time by roughly 19%: essentially free throughput, paid for in HBM.
Each redundant expert costs memory. For DeepSeek-V3 with 61 MoE layers, a redundant slot must hold a copy of the expert's FFN weights in every MoE layer.

In practice, this works out to roughly ~2.4 GB per redundant expert slot per EP rank. Setting `num_redundant_experts` higher buys a lower $\gamma$ at the price of HBM that would otherwise serve the KV cache.
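A back-of-envelope check of that figure, using assumed DeepSeek-V3 expert dimensions (7168 hidden, 2048 FFN intermediate, three weight matrices, FP8); it lands in the same ballpark as the ~2.4 GB quoted above:

```python
# Back-of-envelope weight cost of one redundant expert slot (assumed
# DeepSeek-V3 dims: 7168 hidden, 2048 FFN intermediate, 3 matrices, FP8).
hidden, inter, n_mats, bytes_per = 7168, 2048, 3, 1
per_layer = hidden * inter * n_mats * bytes_per   # one expert, one layer
moe_layers = 61
per_slot = per_layer * moe_layers                 # a slot spans every MoE layer
print(f"{per_layer / 1e6:.0f} MB per layer, {per_slot / 1e9:.2f} GB per slot")
```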
vLLM Configuration:
--enable-eplb --eplb-config '{"num_redundant_experts": 32, "window_size": 1000, "step_interval": 3000}'
- `num_redundant_experts`: How many extra physical expert slots to allocate. Start with 10–15% of your logical expert count; increase if $\gamma > 1.15$ persists.
- `window_size`: Steps of token counts to track before rebalancing. Shorter windows adapt faster but may oscillate.
- `step_interval`: How often to run the rebalancing algorithm. Lower values react faster to distribution shifts but add overhead from weight transfers.
- `use_async`: Set to `true` for non-blocking weight rearrangement (recommended for production).
Use the widget below to build intuition for how EPLB works:
EPLB is static: it rebalances every few thousand steps based on historical averages. Between rebalancing events, the mapping is frozen. If a particular batch has an unusual routing pattern—which is normal, since routing varies stochastically—EPLB cannot react. The straggler still pays.
LPLB (Linear‑Programming‑Based Load Balancer) extends EPLB with dynamic, per‑batch token redistribution. It keeps the same redundant expert topology, but instead of splitting tokens uniformly across replicas, it solves an optimization problem on every forward pass to find the best assignment.
Redundant experts create edges in a bipartite graph between GPUs. If GPU $i$ holds a replica of an expert whose primary copy lives on GPU $j$, tokens bound for that expert can flow along the edge between $j$ and $i$.
The choice of which edges exist—the topology—is a design parameter. LPLB supports several structured topologies (cube, hypercube, torus) that map onto physical NVLink/NVSwitch connectivity so that redirected tokens travel on the fast fabric.
For each batch, LPLB observes the actual per-GPU load and solves a small linear program over the replica edges:

$$\min_{x \ge 0}\; t \quad \text{s.t.} \quad \text{load}_i - \sum_{j} x_{ij} + \sum_{j} x_{ji} \le t \quad \forall i$$

where:

- $\text{load}_i$ is the token count routed to GPU $i$ in this batch,
- $x_{ij}$ is the number of tokens redirected from GPU $i$ to a replica on GPU $j$ (nonzero only where a replica edge exists),
- $t$ is the bottleneck load being minimized.

This is a standard minimax linear program. Minimizing $t$ equalizes the post-redirection loads as far as the replica topology allows.
LPLB solves this LP on‑GPU using a single‑SM Interior Point Method (IPM) backed by cuSolverDx/cuBLASDx, achieving ~100 µs solve time for intra‑node configurations.
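To make the minimax objective concrete, here is a brute-forced toy instance (real LPLB solves the LP on-GPU; the topology here is an assumed toy example):

```python
# Toy instance of LPLB's minimax objective, solved by brute force (real
# LPLB solves the LP on-GPU). Assumed topology: GPU0's hot expert has
# replicas on GPU1 and GPU2, so tokens may be redirected 0->1 and 0->2.
loads = [500, 200, 140]                  # tokens routed to each GPU this batch
best = (max(loads), 0, 0)                # (bottleneck, x1, x2): no redirection
for x1 in range(0, loads[0] + 1, 10):
    for x2 in range(0, loads[0] - x1 + 1, 10):
        bottleneck = max(loads[0] - x1 - x2, loads[1] + x1, loads[2] + x2)
        best = min(best, (bottleneck, x1, x2))

bottleneck, x1, x2 = best
print(f"redirect {x1} -> GPU1, {x2} -> GPU2; max load {bottleneck}")
```

The optimum equalizes all three GPUs at 280 tokens, exactly the mean; a fixed uniform split generally would not hit that for an arbitrary batch.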
| | EPLB (static) | LPLB (dynamic) |
|---|---|---|
| When it rebalances | Every ~3000 steps | Every batch |
| What it optimizes | Historical average load | Actual per-batch load |
| How it assigns tokens | Uniform split across replicas ($1/r$ each for $r$ copies) | Optimal LP solution (minimax) |
| Adaptiveness | Cannot react to batch‑to‑batch variance | Tracks instantaneous routing fluctuations |
| Weight movement | Requires all‑to‑all weight transfer | No weight movement; only token redirection |
The key insight: EPLB’s uniform split is optimal on average but suboptimal for any specific batch. LPLB finds the per‑batch optimum, which matters most under high routing variance (small batches, bursty workloads, or models with sharp expert specialization).
LPLB is not a strict upgrade—it has its own failure modes:
Solver overhead (~100 µs): For small batches where the entire MoE layer step takes under 1 ms, the LP solve time is non‑negligible. The optimization must pay for itself by saving more compute than it costs.
Token count ≠ compute time: LPLB minimizes max token count, but grouped GEMM execution time is a non‑linear function of batch size (due to padding, tiling, and kernel launch granularity). Perfectly equal token counts can still produce unequal compute times.
Extreme global imbalance: LPLB constrains each redundant expert to map to exactly one original expert. Under extreme skew, EPLB can assign multiple redundant slots to the same hot expert, effectively creating 3x or 4x replicas. LPLB’s topology constrains it to at most one edge per redundant expert, limiting its rebalancing capacity.
Topology mismatch: The fixed graph topology (cube, torus) must align with the physical interconnect. If the topology is poorly chosen, redirected tokens may cross slow links—converting a compute imbalance into a communication penalty.
Research stage: LPLB is an early research project from DeepSeek. Performance improvements are still under evaluation and it is not yet integrated into vLLM.
Practical guidance: Start with EPLB for production workloads. If profiling shows that $\gamma$ varies significantly across batches (i.e., the per-batch $\gamma$ is much higher than the average $\gamma$), LPLB's per-batch optimization can close the remaining gap, especially for training workloads where small-batch variance is high.
DeepEP is CUDA-centric, but the systems strategy (hierarchical all‑to‑allv, coalescing, and overlap) is universal.
| Concept | NVIDIA | AMD |
|---|---|---|
| GPU kernel runtime | CUDA | HIP |
| Collective library | NCCL | RCCL |
| In‑node fabric | NVLink | xGMI |
| Scale‑out transport | GPUDirect RDMA + IB/RoCE | ROCm peer memory + IB/RoCE |
| Practical HT trick | hierarchy/coalescing | hierarchy/coalescing |
GPU‑driven communication is the right direction for fine‑grained all‑to‑allv, but DeepEP couples the GPU and NIC through NVIDIA‑specific plumbing (e.g., NVSHMEM/IBGDA).
Projects like UCCL‑EP are demonstrating that portability doesn’t have to cost performance. In fact, they are showing state-of-the-art (SOTA) results—beating DeepEP on GH200 and saturating AWS EFA—by fundamentally rethinking the control plane.
The core idea is control-plane / data-plane separation: GPU kernels stay in charge of the data path (payloads and transfer descriptors), while a host-side proxy drives the control path through whatever NIC interface the platform provides (IB verbs, EFA SRD, RoCE). Swapping fabrics then means swapping the proxy, not rewriting the kernels.
| Platform | Hardware | Baseline | Reported UCCL‑EP delta |
|---|---|---|---|
| NVIDIA | H100 + InfiniBand | DeepEP (native) | Parity |
| AMD | MI300X + Broadcom | AMD Primus / Megatron-LM | +45% training throughput |
| AWS | H100 + EFA (SRD) | Existing EP on EFA | +40% SGLang throughput |
On GH200 specifically, coherent CPU↔GPU memory further reduces proxy overhead, so this split architecture can preserve flexibility without paying a large latency tax.
If you keep one operational view from this post, use this:
| What you tune | vLLM Argument / Env Var | Moves which term | Why it matters |
|---|---|---|---|
| Expert Parallelism | `--enable-expert-parallel` | $P$ | Enables Wide-EP. Without this, MoE layers use Tensor Parallelism (TP). |
| Token budget | `--max-num-batched-tokens` | $B$ | Larger $B$ amortizes fixed latency and buys compute cover for overlap. |
| Chunked prefill | `--enable-chunked-prefill` | shape of $B$ | Controls per-forward granularity; trades launch overhead for steadier batches. |
| Concurrency | `--max-num-seqs` | effective $B$ | Higher concurrency keeps per-GPU work above the overlap threshold. |
| Kernel mode | `--all2all-backend` | $L$ vs. $BW_{\text{eff}}$ | LL below the crossover payload, HT above it (Section 8). |
| Overlap (DBO) | `--enable-dbo` | exposed $t_{\text{comm}}$ | Hides comm behind compute. Tune with `VLLM_DBO_COMM_SMS` and the token thresholds. |
| DeepEP Buffers | `VLLM_DEEPEP_BUFFER_SIZE_MB` | memory | Adjusts reserved HBM for RDMA buffers (competes with KV cache). |
| Load balancing | `--enable-eplb` | $\gamma$ | Use only when routing skew is the bottleneck (Section 9). |
This is the shortest path from model terms to production knobs.
Exploring the frontiers of efficient AI infrastructure.