A production guide to multi-node Mixture-of-Experts (MoE) inference in vLLM: how to model communication versus compute, choose DeepEP low-latency (LL) or high-throughput (HT) kernels by measurement, and tune Dual-Batch Overlap (DBO) and Expert Parallel Load Balancing (EPLB) for maximum throughput.
For ML systems engineers: a practical systems model for when MoE serving scales, when it stalls, and which knobs recover line-rate throughput.
At DeepSeek‑class scale, serving throughput is usually limited by token movement, not tensor-core math. Every Mixture-of-Experts (MoE) layer triggers dispatch/combine traffic through an all-to-all variable-size collective (all‑to‑allv) across the Expert Parallelism (EP) group, repeated at high frequency.
In production the pattern is counterintuitive: graphics processing unit (GPU) count goes up, utilization can go down, network interface controller (NIC) pressure spikes, and tokens/sec can flatten or regress. The root cause is that widening Expert Parallelism (EP) shrinks per-peer payloads, re-exposing fixed communication latency.
At Impala AI, we tune vLLM in exactly this regime. This post distills the model and operating playbook we use to predict behavior, choose DeepEP kernel mode, and map vLLM knobs to measurable outcomes.
Wide‑EP throughput is set by per-peer payload size, not the headline global batch: if widening EP shrinks per-peer messages below the latency floor, adding GPUs stops paying off.
Working terms
- Wide‑EP (Expert Parallelism) — Expert Parallelism across multiple nodes (typically 16–64+ GPUs), where each rank owns a subset of experts and tokens are routed via an all-to-all variable-size collective (all‑to‑allv).
- Dual‑Batch Overlap (DBO) — A double-buffering technique that overlaps communication with compute so one hides the other.
- DeepEP — An open-source CUDA kernel library by DeepSeek (GitHub) providing optimized dispatch/combine kernels for Mixture-of-Experts all‑to‑allv.
- Low-latency (LL) / High-throughput (HT) — DeepEP's two kernel families: low-latency minimizes the fixed overhead $L$; high-throughput maximizes the effective bandwidth $BW_{\text{eff}}$.
- Locality domain — The largest group of GPUs connected by a fast dedicated fabric (e.g., 8 GPUs via NVLink or xGMI within a node, or 72 GPUs in an NVL72 rack). Traffic inside is "cheap"; traffic outside hits the scale-out network.
- $\gamma$ (straggler factor) — Ratio of max to mean per-GPU token load. When $\gamma > 1$, some GPUs are overloaded and the entire EP group waits. EPLB drives $\gamma \to 1$.
- Expert Parallel Load Balancing (EPLB) / Linear-Programming-Based Load Balancer (LPLB) — Expert load-balancing strategies. EPLB replicates hot experts periodically; LPLB solves a per-batch linear program to optimally redistribute tokens across replicas.
- Remote Direct Memory Access (RDMA), network interface controller (NIC), High-Bandwidth Memory (HBM), and key-value (KV) cache are used with their standard systems meanings in this post.
A toy but faithful model for one practical question: why does wide expert parallelism become more attractive as you push toward huge batches, and what exactly does DBO buy you?
Every MoE layer pays dispatch all‑to‑all, then expert compute, then combine all‑to‑all. Wide‑EP spreads experts across many GPUs, which improves capacity and can raise throughput, but it also amplifies collective communication pressure.
DBO does not reduce total work. It changes the schedule so communication can run under compute. When compute is large enough, the exposed critical path approaches $\max(t_{\text{comm}}, t_{\text{compute}})$ instead of $t_{\text{comm}} + t_{\text{compute}}$.
Wide‑EP increases communication volume, but huge batches amortize the fixed costs. DBO helps only when compute can cover communication.
DeepSeek‑V3 and DeepSeek‑R1 represent a new class of “hyper-scale” sparse models. Their specs are daunting:
First, there is the sheer size. In FP8, the model weights alone consume ~700GB. A standard 8xH100 node typically offers 640GB of HBM. You physically cannot fit the model parameters on a single node, let alone leave room for the KV cache. For long‑context R1 reasoning traces, the KV cache alone can exceed the size of the weights.
To serve this, you must scale out to multiple nodes. This forces you into Wide-EP: splitting the experts across a multi-node cluster (e.g., 16 or 32 GPUs) connected by Ethernet or InfiniBand, rather than just NVLink.
This scale-out creates a distributed systems nightmare. MoE models are sparse by design, meaning they have low arithmetic intensity (compute-per-byte) compared to dense models. To keep powerful tensor cores (H100, MI300X, or newer) from idling, you must compensate for this sparsity by processing massive batch sizes—often tens of thousands of concurrent tokens.
This transforms inference from a simple ML task into a complex orchestration challenge.
The cost you pay for this scale is MoE routing: dispatch and combine become high-fanout, all-to-all communication patterns that saturate your network fabric.
Now that we know why Wide‑EP is unavoidable, let’s frame the operational constraints before we quantify the model.
If we only cared about compute, scaling would be easy: add GPUs and increase throughput. Wide‑EP breaks that simplicity because tokens must move.
A few things make this hard in practice:
- The collective is all-to-all variable-size (all‑to‑allv), not all-reduce: every rank sends different amounts to every other rank; routing skew turns "average" into "tail."
- It repeats at high frequency: every MoE layer pays dispatch and combine on every forward pass.
- It is a synchronization barrier: the whole EP group waits for the slowest rank.

We'll now build a model that is simple enough to fit in your head, but rich enough to predict these behaviors.
We’ll use a minimal set of symbols and tie them to what you control in an offline run.
The roofline below shows how the two regimes (latency-bound small messages, bandwidth-bound large ones) play out as message size changes:
As you scale out to more GPUs (increasing the EP width $P$), each rank's share of any fixed batch shrinks, and so does the per-peer payload.

The Trap: If you keep your global batch size $B$ fixed while widening EP, per-peer messages shrink until the fixed latency term dominates; utilization drops and tokens/sec can regress even as you add hardware.

The Solution: To make Wide-EP work, you must increase the global batch size $B$ roughly in proportion to the EP width, keeping per-peer payloads large enough to stay bandwidth-bound.
With the trap identified, let’s quantify the dispatch/combine payload that drives it.
To understand why the network chokes, we need to look at the exact volume of data hitting the wire.
In a Wide-EP setup, every forward pass involves two massive shuffles per layer:

- Dispatch: every token's hidden state is sent to the ranks hosting its top-$k$ experts.
- Combine: the expert outputs are sent back to the token's origin rank and reduced.
For one direction (dispatch or combine), the average payload size per rank is determined by the total volume of routed tokens divided by the number of participants:

$$V_{\text{rank}} = \frac{B \cdot k \cdot d \cdot s}{P}$$

Let's unpack why these terms multiply:

- $B$: global batch size in tokens; every token gets routed.
- $k$: experts per token (top-$k$); each token's hidden state is shipped $k$ times.
- $d$: hidden dimension; each shipment is a full hidden-state vector.
- $s$: bytes per element (1 for FP8, 2 for BF16).
- $P$: EP ranks; the total volume is spread across all participants.
The formula above gives you the idealized average. Real-world serving is messier: routing skew concentrates traffic on the ranks hosting hot experts, and dispatch/combine are rarely perfectly symmetric, so treat it as a lower bound.
Let’s plug in real numbers for a DeepSeek-V3/R1 deployment to see the magnitude of the problem.
Why this matters: 115 MB might sound manageable. But remember:

- That is one direction for one MoE layer; dispatch + combine doubles it.
- It repeats for every MoE layer (~61 in DeepSeek-V3/R1) on every forward pass.
If you are running at 10 steps/second, your NICs are pushing 140 GB/s. This is why we say Wide-EP is network-bound: you are essentially trying to stream the entire active state of the model through the interconnect continuously.
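The arithmetic above can be sketched in a few lines. All the constants here (batch, top-$k$, hidden size, EP width, step rate) are illustrative assumptions in the DeepSeek-V3 ballpark, not measurements:

```python
# Back-of-envelope dispatch/combine traffic for a DeepSeek-V3-class model.
# All constants are illustrative assumptions, not measurements.
B = 32_768         # global batch size (tokens in flight)
k = 8              # experts activated per token (top-k)
d = 7_168          # hidden dimension
s = 1              # bytes per element (FP8 activations)
P = 16             # EP ranks (2 nodes x 8 GPUs)
layers = 61        # MoE layers traversed per forward pass
steps_per_s = 10   # decode steps per second

v_rank = B * k * d * s / P            # per-rank payload, one direction, one layer
per_step = v_rank * 2 * layers        # dispatch + combine, every MoE layer
nic_rate = per_step * steps_per_s     # sustained bytes/s through each NIC

print(f"per-rank payload: {v_rank / 1e6:.0f} MB/layer/direction")
print(f"traffic per step: {per_step / 1e9:.1f} GB")
print(f"sustained NIC rate: {nic_rate / 1e9:.0f} GB/s")
```

With these assumptions the payload lands at ~117 MB and the NIC rate at ~143 GB/s, consistent with the ~115 MB and ~140 GB/s figures above up to rounding.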
Now that we know what we are moving (the payload), we need to know how long it takes. We need a model that links batch size, cluster size, and hardware specs to end-to-end latency.
For the full dispatch+combine round-trip, we use a standard latency-bandwidth model (Hockney):

$$t_{\text{comm}} \approx 2 \cdot \left(L + \frac{V_{\text{rank}}}{BW_{\text{eff}}}\right)$$

This assumes dispatch and combine are roughly symmetric in cost; in practice skew can make combine slightly cheaper (fewer tokens return to hot ranks). The factor of 2 is a conservative upper bound.
Why does this equation matter so much in practice? Because the latency term $L$ is paid per message regardless of size: as $P$ grows and $V_{\text{rank}}$ shrinks, the $2L$ floor dominates and a faster fabric stops helping.
While the network moves data, the GPUs must process it. The time to compute the experts is:

$$t_{\text{compute}} \approx \gamma \cdot \frac{B \cdot k}{P} \cdot c_{\text{tok}}$$

where $B \cdot k / P$ is the average number of routed tokens per rank, $c_{\text{tok}}$ is the grouped-GEMM time per routed token, and $\gamma$ is the straggler factor.
This leads to the fundamental tension of Wide-EP serving.
Widening EP (increasing $P$) divides both the compute and the per-rank payload by $P$, but the fixed latency term $2L$ does not shrink. Compute time falls linearly while communication time approaches a constant floor.

If $t_{\text{comm}} > t_{\text{compute}}$, adding GPUs adds capacity you cannot use: the EP group spends its time waiting on the network, not the tensor cores.
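Here is a minimal sketch of both sides of the race. The values of `L`, `BW`, `c_tok`, and `gamma` are assumed placeholders you would replace with profiled numbers:

```python
# Hockney comm model vs. expert compute for one MoE layer step.
# L, BW, c_tok, gamma are assumed placeholders; profile your own values.
def t_comm(B, P, k=8, d=7168, s=1, L=120e-6, BW=44e9):
    """Dispatch + combine: 2 * (fixed latency + per-rank payload / bandwidth)."""
    v_rank = B * k * d * s / P
    return 2 * (L + v_rank / BW)

def t_compute(B, P, k=8, c_tok=2.0e-6, gamma=1.1):
    """Grouped-GEMM time: straggler factor * routed tokens per rank * time/token."""
    return gamma * (B * k / P) * c_tok

B = 8192  # hold the global batch fixed and widen EP
for P in (8, 16, 32, 64):
    print(f"P={P:>2}  comm={t_comm(B, P) * 1e3:5.2f} ms  "
          f"compute={t_compute(B, P) * 1e3:5.2f} ms")
```

Compute falls linearly with $P$, while comm bottoms out at the $2L$ floor: the trap from the previous section in numeric form.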
The natural question: can we hide the communication behind the compute instead of just racing them? That’s exactly what DBO does.
Sequential execution is wasteful. In a naive implementation, GPUs sit idle while the network moves data, and then the network sits idle while the GPUs crunch numbers.
Dual-Batch Overlap (DBO) is the standard solution. It relies on a technique called double buffering (or “ping-pong” buffering) to parallelize work.
In a standard sequential flow, you have one buffer: the network fills it, then the GPU reads it. The resource not working is idle. With double buffering, we allocate two sets of communication buffers:

- Buffer A: the GPU computes experts for the current micro-batch.
- Buffer B: the network simultaneously dispatches/combines the next micro-batch.
When both are finished, they swap roles. This allows us to fetch the data for the next micro-batch while computing the current one.
Without overlap, you pay the sum of the latencies:

$$t_{\text{step}} = t_{\text{comm}} + t_{\text{compute}}$$
With ideal pipelining (steady state), the step time is determined by the slowest component:

$$t_{\text{step}} \approx \max(t_{\text{comm}}, t_{\text{compute}})$$
This implies a theoretical speedup limit of 2×, achieved only when $t_{\text{comm}} = t_{\text{compute}}$.
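The sequential-vs-pipelined schedule reduces to two one-liners; the timings below are illustrative:

```python
# Sequential vs. double-buffered (DBO) step time; timings are illustrative.
def step_sequential(t_comm, t_compute):
    # One buffer: the network and the GPU take turns.
    return t_comm + t_compute

def step_dbo(t_comm, t_compute):
    # Two buffers in steady state: the slower resource sets the cadence.
    return max(t_comm, t_compute)

t_comm, t_compute = 1.2e-3, 1.5e-3   # seconds, illustrative
speedup = step_sequential(t_comm, t_compute) / step_dbo(t_comm, t_compute)
print(f"speedup = {speedup:.2f}x")   # capped at 2x, hit only when balanced
```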
It also reveals the failure mode.
DBO is not magic. It relies on "compute cover": using the time spent calculating experts to hide the time spent moving tokens. This creates two distinct operating regimes:

- Compute-bound ($t_{\text{compute}} \ge t_{\text{comm}}$): communication is fully hidden; the step runs at compute speed.
- Comm-bound ($t_{\text{comm}} > t_{\text{compute}}$): even perfect overlap leaves the network exposed; grow the batch or improve bandwidth.
Tuning DBO: The Control Panel
- Enable it: `--enable-dbo` is essential for hiding inter-node latency, but it costs HBM (for double buffering).
- Protect small batches: Use `--dbo-prefill-token-threshold` (default 512) and `--dbo-decode-token-threshold` (default 32) to disable overlap when the batch is too small to justify the pipeline overhead. Increase these if you see regressions at low concurrency.
- Balance compute vs. comm: `VLLM_DBO_COMM_SMS` (default 20) reserves GPU SMs to drive the network.
  - Increase if: network transfers are jittery or lagging.
  - Decrease if: expert kernels become the bottleneck and end-to-end throughput drops even when overlap is active. On an H100 (132 SMs), 20 SMs is ~15% of the total; on other GPUs, adjust proportionally.
Modeling the SM split explicitly
Let $S$ be total SMs, $S_{\text{comm}}$ be `VLLM_DBO_COMM_SMS`, and $\rho = S_{\text{comm}}/S$. A practical first-order model is:

$$t_{\text{compute}}(B,\rho) \approx \gamma \cdot \frac{B \cdot k}{P} \cdot \frac{c_{\text{tok},0}}{1-\rho}$$

where $c_{\text{tok},0}$ is time/token with no SM reservation. The communication side also depends on $\rho$ because extra comm SMs improve progress:

$$t_{\text{comm}}(B,\rho) \approx 2 \cdot \left(L(\rho) + \frac{V_{\text{rank}}}{BW_{\text{eff}}(\rho)}\right)$$

So `VLLM_DBO_COMM_SMS` is a true trade-off knob: it can reduce $t_{\text{comm}}$ while increasing $t_{\text{compute}}$.
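A toy version of this SM-split model. The shapes of $L(\rho)$ and $BW_{\text{eff}}(\rho)$ below are assumptions; the real curves must be profiled:

```python
# Toy SM-split model: reserving a fraction rho of SMs for comm slows compute
# by 1/(1-rho) but improves comm progress. The L(rho) and BW(rho) shapes
# below are assumptions; the real curves must be profiled.
S = 132  # SMs on an H100

def t_compute(B, rho, P=16, k=8, c_tok0=1.8e-6, gamma=1.1):
    return gamma * (B * k / P) * c_tok0 / (1 - rho)

def t_comm(B, rho, P=16, k=8, d=7168, s=1):
    L = 150e-6 * (1 - 0.5 * rho)            # assumed: more comm SMs trim latency
    BW = 44e9 * min(1.0, rho / 0.10)        # assumed: BW saturates past ~10% of SMs
    return 2 * (L + (B * k * d * s / P) / BW)

B = 16_384
for sms in (8, 20, 40):                     # candidate VLLM_DBO_COMM_SMS values
    rho = sms / S
    step = max(t_comm(B, rho), t_compute(B, rho))
    print(f"comm SMs={sms:>2}  step={step * 1e3:.2f} ms")
```

In this (compute-bound) toy regime, extra comm SMs only hurt: the "decrease if expert kernels are the bottleneck" rule in numbers.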
We can now turn the intuition from Section 5 into a concrete inequality. Plugging the SM-aware model into the overlap condition $t_{\text{compute}} \ge t_{\text{comm}}$ and solving for $B$ gives a minimum batch size.
The key inequality: minimum batch for overlap
$$B \gtrsim \frac{2 \cdot (P/k) \cdot L(\rho)}{\gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} - \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}}$$

Scale $B$ with $P$, or adding GPUs makes you slower.
Feasibility guard: The denominator can be zero or negative. If the per-token compute cover $\gamma \cdot c_{\text{tok},0}/(1-\rho)$ is smaller than the per-token wire cost $2 d s / BW_{\text{eff}}(\rho)$, no batch size achieves overlap; the fix is structural: raise $BW_{\text{eff}}$ (switch to HT, a better fabric) or cut bytes per token ($s$, e.g., FP8 dispatch).
When the denominator is positive, the numerator still scales linearly with $P$: every doubling of EP width doubles the minimum batch required for overlap.
This is the second "key takeaway": As you add GPUs to your cluster, you must increase your batch size just to maintain the same level of overlap efficiency. If you scale $P$ without scaling $B$, overlap efficiency erodes until communication is fully exposed.
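The inequality is easy to operationalize as a sizing helper. Every constant here is a placeholder to be replaced with values profiled on your own cluster:

```python
# Sizing helper for the minimum-batch inequality. Every constant is a
# placeholder to be replaced with profiled values from your cluster.
def b_min(P, k=8, d=7168, s=1, L=120e-6, BW=44e9,
          c_tok0=1.53e-6, gamma=1.1, rho=0.15):
    per_token_compute = gamma * c_tok0 / (1 - rho)   # compute cover per token
    per_token_wire = 2 * d * s / BW                  # wire cost per token
    denom = per_token_compute - per_token_wire
    if denom <= 0:
        return float("inf")  # infeasible: no batch size achieves overlap
    return 2 * (P / k) * L / denom

for P in (16, 32, 64):
    print(f"P={P:>2}  B_min ~ {b_min(P):,.0f} tokens")  # grows linearly in P
```

Doubling $P$ exactly doubles the threshold, and with a slow enough fabric (small `BW`) the helper returns infinity: the feasibility guard above.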
Single-node setups (NVLink/xGMI only) have much better $L$ and $BW_{\text{eff}}$, so the minimum overlap batch is an order of magnitude smaller than in multi-node deployments.
A practical keep/remove rule for single-node: keep DBO only if profiling shows exposed communication; over NVLink/xGMI, $t_{\text{comm}}$ is often already negligible, and the double-buffer HBM is better spent on KV cache.
Let's put concrete numbers on this for a standard 2-node cluster (16 GPUs total), serving DeepSeek-V3 ($d = 7168$, $k = 8$, FP8 activations).

When you span 2 nodes, your effective bandwidth ($BW_{\text{eff}}$) for any traffic that leaves the node drops from NVLink-class (hundreds of GB/s) to NIC-class (tens of GB/s).
| Hardware Stack | Interconnect | Per-GPU Link Speed | Peak BW (theoretical) | Realized $BW_{\text{eff}}$ | Min Batch $B$ (tokens) |
|---|---|---|---|---|---|
| NVIDIA H100/H200 | InfiniBand NDR / CX7 | 400 Gbps (per‑port, raw) | 50 GB/s | ~36‑41 GB/s | ~72k‑96k |
| AMD MI300X | Broadcom Thor2 / RoCE | 400 Gbps | 50 GB/s | ~32‑38 GB/s | ~80k‑112k |
| AWS P5 (H100) | EFA v2 | 400 Gbps (per‑port, raw) | 50 GB/s | ~26‑34 GB/s | ~96k‑160k |
| NVIDIA B200 | ConnectX‑8 | 800 Gbps (per‑port, raw) | 100 GB/s | ~70‑82 GB/s | ~24k‑40k |
| GB200 NVL72 | NVLink Switch | N/A (In-Rack) | 900 GB/s | ~650‑800 GB/s | ~2k‑8k |
All link speeds are raw signaling rates; effective throughput after encoding/protocol overhead is lower. Realized $BW_{\text{eff}}$ values are typical for MoE all-to-allv traffic on each fabric, and the Min Batch column applies the overlap inequality to them; treat both as estimates to validate on your own cluster.
The Cliff: Inside a single node, the fast fabric (NVLink at ~450 GB/s/GPU, or xGMI at ~300+ GB/s) gives you abundant bandwidth. The moment you add a second node, your effective bandwidth drops to ~40 GB/s (400G IB/RoCE).
This is a 7–10× drop in bandwidth. To maintain the overlap inequality, your minimum batch size must grow by roughly the same factor the moment you cross the node boundary.
The widget below visualizes exactly this dynamic.
For readability, the widget uses an equivalent effective-parameter form of the same model.
Pro tip: on the “Step Time vs Batch Size” chart, focus on where the compute curve sits relative to comm. If compute is below comm, overlap will not save you.
DBO tells you when overlap helps. The next question is which kernel to use for the communication itself.
In Section 5, we modeled communication time as $t_{\text{comm}} = 2 \cdot (L + V_{\text{rank}}/BW_{\text{eff}})$. DeepEP ships two kernel families, each attacking a different term:
LL (Low-Latency): Attacks the fixed-latency term $L$. It issues direct RDMA writes with minimal staging and handshaking, so small payloads get on the wire almost immediately.

HT (High-Throughput): Attacks the bandwidth term. It coalesces tokens into large messages and routes them hierarchically (NVLink inside the node, RDMA across nodes), pushing $BW_{\text{eff}}$ toward line rate.
Why do we model two different bandwidths? Because $BW_{\text{eff}}$ is not a hardware constant; it depends on the traffic shape each kernel family produces on the same NIC:
| Mode | Realized BW (est. on 400G) | Why? |
|---|---|---|
| LL (Direct RDMA) | ~20–30 GB/s (Fragmented) | Many small, non-contiguous messages. Congestion from random access. |
| HT (Hierarchical) | ~42–48 GB/s (Peak-ish) | Few large, coalesced messages. Topology-aware routing minimizes hops. |
Estimates based on typical 400Gbps RDMA efficiency (40–60% for fragmented traffic vs 85%+ for large contiguous bursts). See DeepEP benchmarks for detailed performance characterization of LL vs HT kernels.
This is not a philosophical choice; it is an arithmetic one. You should switch from LL to HT exactly when the bandwidth gains outweigh the setup costs.
vLLM Configuration: Select your kernel backend using `--all2all-backend`.

- Use `deepep_low_latency` for the LL kernel (latency-optimized).
- Use `deepep_high_throughput` for the HT kernel (bandwidth-optimized).

Advanced Tuning Guide:
`VLLM_DEEPEP_BUFFER_SIZE_MB` (default 1024): Controls RDMA buffer size.

- Increase when: you see buffer-overflow errors or run extremely large batches/hidden dims.
- Decrease when: you hit OOM and need to reclaim HBM for KV cache (especially at small batches).

`VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE`:

- Set to 1 for large NVLink domains (e.g., NVL72): treats the entire fabric as one node so HT uses NVLink instead of RDMA.
- Keep at 0 for multi-node clusters: essential for standard H100/MI300X clusters where nodes are connected by IB/RoCE.

`VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL`:

- Set to 1 for multi-node NVLink domains (LL mode): allows the Low-Latency kernel to write directly over cross-node NVLink.
If we model each mode as $t(V) = L_m + V/BW_m$ and set the two equal:

$$L_{\text{LL}} + \frac{V}{BW_{\text{LL}}} = L_{\text{HT}} + \frac{V}{BW_{\text{HT}}}$$

Solving yields the crossover payload:

$$V^{*} = \frac{L_{\text{HT}} - L_{\text{LL}}}{\frac{1}{BW_{\text{LL}}} - \frac{1}{BW_{\text{HT}}}}$$

Note: because LL has lower latency but lower bandwidth than HT ($L_{\text{LL}} < L_{\text{HT}}$ and $BW_{\text{LL}} < BW_{\text{HT}}$), $V^{*}$ is positive and finite: below it LL wins, above it HT wins.
In Impala’s offline huge‑batch setting, we usually operate above this crossover—so HT is often the right starting point. The value of the crossover framing is that you can measure it and justify that choice on your own hardware/topology.
Suppose you measured the following on a 2-node H100 cluster ($P = 16$, $k = 8$, $d = 7168$, FP8, global batch $B = 8192$):
| Parameter | LL kernel | HT kernel |
|---|---|---|
| Fixed latency $L$ | 35 µs | 120 µs |
| Realized $BW_{\text{eff}}$ | 22 GB/s | 44 GB/s |
And from compute profiling: an effective compute cost $\gamma \cdot c_{\text{tok},0}/(1-\rho) \approx 1.98$ µs per routed token.
Step A — Kernel crossover ($V^{*}$): plugging in the measurements, $V^{*} = \frac{(120 - 35)\ \mu s}{(1/22 - 1/44)\ (\text{GB/s})^{-1}} \approx 3.7\ \text{MB}$.

Per-rank payload at $B = 8192$: $V_{\text{rank}} = \frac{8192 \cdot 8 \cdot 7168 \cdot 1}{16} \approx 29.4\ \text{MB}$, an order of magnitude above the crossover, so HT wins.
Step B — DBO feasibility check: compare the per-token compute cover (1.98 µs) against the per-token wire cost, $2 d s / BW_{\text{eff}} = 2 \cdot 7168 \cdot 1 / 44\ \text{GB/s} \approx 0.33\ \mu s$.

Denominator is positive (1.98 > 0.33), so DBO can work. Minimum batch: $B_{\min} = \frac{2 \cdot (16/8) \cdot 120\ \mu s}{1.65\ \mu s} \approx 290$ tokens.
Our batch of 8192 is well above the threshold. Enable DBO with HT kernels.
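The worked example's arithmetic can be checked in a few lines. The latencies and bandwidths are the quoted measurements; the 1.98 µs/token compute cost is taken from the text:

```python
# Checking the worked example. Latencies and bandwidths are the quoted
# measurements; the 1.98 us/token compute cost is taken from the text.
L_LL, L_HT = 35e-6, 120e-6      # fixed latency per direction (s)
BW_LL, BW_HT = 22e9, 44e9       # realized bandwidth (bytes/s)
P, k, d, s, B = 16, 8, 7168, 1, 8192

# Step A: crossover payload where HT's bandwidth beats LL's latency edge.
v_star = (L_HT - L_LL) / (1 / BW_LL - 1 / BW_HT)
v_rank = B * k * d * s / P
print(f"crossover ~ {v_star / 1e6:.1f} MB, payload ~ {v_rank / 1e6:.1f} MB")

# Step B: DBO feasibility with the HT numbers.
per_token_compute = 1.98e-6            # gamma * c_tok0 / (1 - rho), from text
per_token_wire = 2 * d * s / BW_HT     # ~0.33 us/token
b_min = 2 * (P / k) * L_HT / (per_token_compute - per_token_wire)
print(f"B_min ~ {b_min:.0f} tokens; batch {B} clears it easily")
```

The payload (~29 MB) sits far above the ~3.7 MB crossover, so HT is right, and 8192 tokens dwarfs the ~290-token overlap threshold.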
The theory is clear, but the actual crossover point depends on your specific hardware and workload. Use the interactive tool below to explore the trade-off space.
What to look for:
- Use the Batch tokens slider to see exactly where the Low-Latency (LL) and High-Throughput (HT) curves intersect. This is your decision boundary.
- Use Animate to see how LL minimizes the fixed handshake overhead, while HT compresses the bulk transfer time via bandwidth efficiency.

Kernel choice is only half the story: the same model can land in very different comm/compute regimes depending on hardware locality. Before tuning knobs, anchor on where your interconnect cliff appears.
There are two deployment questions hiding inside "which GPU is faster?": how large is the locality domain (scale-up), and how fast is the fabric once you leave it (scale-out)?
Key idea: Wide‑EP economics are set by how much dispatch/combine stays inside the “cheap” locality domain versus how much spills into the “expensive” scale‑out network.
| Stack | Locality domain (cliff boundary) | Scale‑up anchor | Scale‑out anchor | Sources |
|---|---|---|---|---|
| NVIDIA H100/H200 | 8 GPUs / HGX node | NVLink: 900 GB/s bidirectional per GPU | ConnectX‑7: up to 400 Gb/s | H100/H200, Hopper, CX7 |
| NVIDIA GB200 NVL72 | 72 GPUs / rack domain | NVLink Switch: 130 TB/s rack, 1.8 TB/s bidirectional GPU↔GPU (HGX B200) | ConnectX‑8: up to 800 Gb/s | GB200 NVL72, HGX, CX8 |
| AMD MI300X | 8 GPUs / platform | 896 GB/s aggregate bidirectional P2P (spec); ~315–336 GB/s aggregated unidirectional measured xGMI | 400 Gb/s class NIC deployments are common | MI300 platform, ROCm xGMI |
| Public MoE internode reference | Multi‑node DeepEP benchmark | BW_eff (normal kernels): ~43–58 GB/s | BW_eff (low-latency kernels): ~39–46 GB/s | DeepEP README |
For offline serving, cost scales with delivered tokens, not raw FLOPS.
In a standard Wide‑EP deployment, each GPU owns a fixed subset of the 256 logical experts. The router selects experts per token, but not uniformly: some experts are “hot” (popular across the workload) while others are cold. The GPU hosting a cluster of hot experts becomes the straggler—and because all‑to‑allv is a synchronization barrier, the entire EP group waits for it.
Recall the straggler factor from Section 5: $\gamma = \max_i \text{load}_i \,/\, \text{mean}_i \text{load}_i$.

Typical values range from 1.05 (well-balanced) to 1.3+ (heavy skew). At $\gamma = 1.3$, every step takes 30% longer than the balanced ideal, roughly a 23% throughput loss across the entire EP group.
Expert Parallel Load Balancing (EPLB) breaks the 1:1 mapping between logical and physical experts. Instead of each GPU owning exactly one copy of its assigned experts, EPLB allocates redundant physical slots and fills them with replicas of the hottest experts.
If expert #42 gets 3× the average token load, EPLB gives it 3 physical copies spread across different GPUs. Each copy handles ~1/3 of the tokens routed to that expert. The result: the max‑load GPU drops closer to the mean.
EPLB runs periodically during inference (e.g., every 3000 steps) and follows a three-step hierarchical algorithm: pack expert groups onto nodes to balance inter-node load, replicate the heaviest experts into the redundant slots, and assign physical experts to GPUs to equalize per-rank load.
Between rebalancing steps, vLLM tracks per‑expert token counts over a sliding window (default 1000 steps) to detect shifts in routing distribution.
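The core replication idea can be sketched with a greedy heuristic. This is an illustration of the principle, not the exact vLLM/DeepSeek algorithm:

```python
# Greedy sketch of EPLB's replication idea: keep giving the hottest
# per-replica expert another copy until the redundant slots run out.
# Illustration only, not the exact vLLM/DeepSeek algorithm.
def replicate(loads, redundant_slots):
    replicas = [1] * len(loads)
    for _ in range(redundant_slots):
        hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hot] += 1
    return replicas

loads = [300, 100, 100, 60, 40]          # tokens routed per logical expert
replicas = replicate(loads, 3)           # 3 redundant physical slots
per_slot = [l / r for l, r in zip(loads, replicas)]
mean_slot = sum(loads) / (len(loads) + 3)
gamma = max(per_slot) / mean_slot
print(replicas, f"gamma ~ {gamma:.2f}")  # vs ~2.5 with no replication
```

The hot expert ends up with several copies, and the max per-slot load falls toward the mean.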
EPLB's effect is straightforward: it drives $\gamma$ toward 1, so every step of the entire EP group gets faster.

A reduction from $\gamma = 1.3$ to $\gamma = 1.05$ cuts step time by roughly 19%: essentially free throughput, paid for in HBM.
Each redundant expert costs memory. For DeepSeek-V3 with 61 MoE layers, a redundant slot must hold a copy of the expert's FFN weights in every MoE layer.

In practice, this works out to roughly ~2.4 GB per redundant expert slot per EP rank. Setting `num_redundant_experts` higher buys a lower $\gamma$ at the price of HBM that would otherwise serve the KV cache.
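A back-of-envelope check of that figure, using assumed DeepSeek-V3 expert dimensions (7168 hidden, 2048 FFN intermediate, three weight matrices, FP8); it lands in the same ballpark as the ~2.4 GB quoted above:

```python
# Back-of-envelope weight cost of one redundant expert slot (assumed
# DeepSeek-V3 dims: 7168 hidden, 2048 FFN intermediate, 3 matrices, FP8).
hidden, inter, n_mats, bytes_per = 7168, 2048, 3, 1
per_layer = hidden * inter * n_mats * bytes_per   # one expert, one layer
moe_layers = 61
per_slot = per_layer * moe_layers                 # a slot spans every MoE layer
print(f"{per_layer / 1e6:.0f} MB per layer, {per_slot / 1e9:.2f} GB per slot")
```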
vLLM Configuration:
--enable-eplb --eplb-config '{"num_redundant_experts": 32, "window_size": 1000, "step_interval": 3000}'
- `num_redundant_experts`: How many extra physical expert slots to allocate. Start with 10–15% of your logical expert count; increase if $\gamma > 1.15$ persists.
- `window_size`: Steps of token counts to track before rebalancing. Shorter windows adapt faster but may oscillate.
- `step_interval`: How often to run the rebalancing algorithm. Lower values react faster to distribution shifts but add overhead from weight transfers.
- `use_async`: Set to `true` for non-blocking weight rearrangement (recommended for production).
Use the widget below to build intuition for how EPLB works:
EPLB is static: it rebalances every few thousand steps based on historical averages. Between rebalancing events, the mapping is frozen. If a particular batch has an unusual routing pattern—which is normal, since routing varies stochastically—EPLB cannot react. The straggler still pays.
LPLB (Linear‑Programming‑Based Load Balancer) extends EPLB with dynamic, per‑batch token redistribution. It keeps the same redundant expert topology, but instead of splitting tokens uniformly across replicas, it solves an optimization problem on every forward pass to find the best assignment.
Redundant experts create edges in a bipartite graph between GPUs. If GPU $i$ holds a replica of an expert whose primary copy lives on GPU $j$, tokens bound for that expert can flow along the edge between $j$ and $i$.
The choice of which edges exist—the topology—is a design parameter. LPLB supports several structured topologies (cube, hypercube, torus) that map onto physical NVLink/NVSwitch connectivity so that redirected tokens travel on the fast fabric.
For each batch, LPLB observes the actual per-GPU load and solves a small linear program over the replica edges:

$$\min_{x \ge 0}\; t \quad \text{s.t.} \quad \text{load}_i - \sum_{j} x_{ij} + \sum_{j} x_{ji} \le t \quad \forall i$$

where:

- $\text{load}_i$ is the token count routed to GPU $i$ in this batch,
- $x_{ij}$ is the number of tokens redirected from GPU $i$ to a replica on GPU $j$ (nonzero only where a replica edge exists),
- $t$ is the bottleneck load being minimized.

This is a standard minimax linear program. Minimizing $t$ equalizes the post-redirection loads as far as the replica topology allows.
LPLB solves this LP on‑GPU using a single‑SM Interior Point Method (IPM) backed by cuSolverDx/cuBLASDx, achieving ~100 µs solve time for intra‑node configurations.
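To make the minimax objective concrete, here is a brute-forced toy instance (real LPLB solves the LP on-GPU; the topology here is an assumed toy example):

```python
# Toy instance of LPLB's minimax objective, solved by brute force (real
# LPLB solves the LP on-GPU). Assumed topology: GPU0's hot expert has
# replicas on GPU1 and GPU2, so tokens may be redirected 0->1 and 0->2.
loads = [500, 200, 140]                  # tokens routed to each GPU this batch
best = (max(loads), 0, 0)                # (bottleneck, x1, x2): no redirection
for x1 in range(0, loads[0] + 1, 10):
    for x2 in range(0, loads[0] - x1 + 1, 10):
        bottleneck = max(loads[0] - x1 - x2, loads[1] + x1, loads[2] + x2)
        best = min(best, (bottleneck, x1, x2))

bottleneck, x1, x2 = best
print(f"redirect {x1} -> GPU1, {x2} -> GPU2; max load {bottleneck}")
```

The optimum equalizes all three GPUs at 280 tokens, exactly the mean; a fixed uniform split generally would not hit that for an arbitrary batch.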
| | EPLB (static) | LPLB (dynamic) |
|---|---|---|
| When it rebalances | Every ~3000 steps | Every batch |
| What it optimizes | Historical average load | Actual per-batch load |
| How it assigns tokens | Uniform split across replicas ($1/r$ each for $r$ copies) | Optimal LP solution (minimax) |
| Adaptiveness | Cannot react to batch‑to‑batch variance | Tracks instantaneous routing fluctuations |
| Weight movement | Requires all‑to‑all weight transfer | No weight movement; only token redirection |
The key insight: EPLB’s uniform split is optimal on average but suboptimal for any specific batch. LPLB finds the per‑batch optimum, which matters most under high routing variance (small batches, bursty workloads, or models with sharp expert specialization).
LPLB is not a strict upgrade—it has its own failure modes:
Solver overhead (~100 µs): For small batches where the entire MoE layer step takes under 1 ms, the LP solve time is non‑negligible. The optimization must pay for itself by saving more compute than it costs.
Token count ≠ compute time: LPLB minimizes max token count, but grouped GEMM execution time is a non‑linear function of batch size (due to padding, tiling, and kernel launch granularity). Perfectly equal token counts can still produce unequal compute times.
Extreme global imbalance: LPLB constrains each redundant expert to map to exactly one original expert. Under extreme skew, EPLB can assign multiple redundant slots to the same hot expert, effectively creating 3x or 4x replicas. LPLB’s topology constrains it to at most one edge per redundant expert, limiting its rebalancing capacity.
Topology mismatch: The fixed graph topology (cube, torus) must align with the physical interconnect. If the topology is poorly chosen, redirected tokens may cross slow links—converting a compute imbalance into a communication penalty.
Research stage: LPLB is an early research project from DeepSeek. Performance improvements are still under evaluation and it is not yet integrated into vLLM.
Practical guidance: Start with EPLB for production workloads. If profiling shows that $\gamma$ varies significantly across batches (i.e., the per-batch $\gamma$ is much higher than the average $\gamma$), LPLB's per-batch optimization can close the remaining gap, especially for training workloads where small-batch variance is high.
DeepEP is CUDA-centric, but the systems strategy (hierarchical all‑to‑allv, coalescing, and overlap) is universal.
| Concept | NVIDIA | AMD |
|---|---|---|
| GPU kernel runtime | CUDA | HIP |
| Collective library | NCCL | RCCL |
| In‑node fabric | NVLink | xGMI |
| Scale‑out transport | GPUDirect RDMA + IB/RoCE | ROCm peer memory + IB/RoCE |
| Practical HT trick | hierarchy/coalescing | hierarchy/coalescing |
GPU‑driven communication is the right direction for fine‑grained all‑to‑allv, but DeepEP couples the GPU and NIC through NVIDIA‑specific plumbing (e.g., NVSHMEM/IBGDA).
Projects like UCCL‑EP are demonstrating that portability doesn’t have to cost performance. In fact, they are showing state-of-the-art (SOTA) results—beating DeepEP on GH200 and saturating AWS EFA—by fundamentally rethinking the control plane.
The core idea is control-plane / data-plane separation: GPU kernels stay in charge of the data path (payloads and transfer descriptors), while a host-side proxy drives the control path through whatever NIC interface the platform provides (IB verbs, EFA SRD, RoCE). Swapping fabrics then means swapping the proxy, not rewriting the kernels.
| Platform | Hardware | Baseline | Reported UCCL‑EP delta |
|---|---|---|---|
| NVIDIA | H100 + InfiniBand | DeepEP (native) | Parity |
| AMD | MI300X + Broadcom | AMD Primus / Megatron-LM | +45% training throughput |
| AWS | H100 + EFA (SRD) | Existing EP on EFA | +40% SGLang throughput |
On GH200 specifically, coherent CPU↔GPU memory further reduces proxy overhead, so this split architecture can preserve flexibility without paying a large latency tax.
If you keep one operational view from this post, use this:
| What you tune | vLLM Argument / Env Var | Moves which term | Why it matters |
|---|---|---|---|
| Expert Parallelism | `--enable-expert-parallel` | $P$ | Enables Wide-EP. Without this, MoE layers use Tensor Parallelism (TP). |
| Token budget | `--max-num-batched-tokens` | $B$ | Larger $B$ amortizes fixed latency and buys compute cover for overlap. |
| Chunked prefill | `--enable-chunked-prefill` | shape of $B$ | Controls per-forward granularity; trades launch overhead for steadier batches. |
| Concurrency | `--max-num-seqs` | effective $B$ | Higher concurrency keeps per-GPU work above the overlap threshold. |
| Kernel mode | `--all2all-backend` | $L$ vs. $BW_{\text{eff}}$ | LL below the crossover payload, HT above it (Section 8). |
| Overlap (DBO) | `--enable-dbo` | exposed $t_{\text{comm}}$ | Hides comm behind compute. Tune with `VLLM_DBO_COMM_SMS` and the token thresholds. |
| DeepEP Buffers | `VLLM_DEEPEP_BUFFER_SIZE_MB` | memory | Adjusts reserved HBM for RDMA buffers (competes with KV cache). |
| Load balancing | `--enable-eplb` | $\gamma$ | Use only when routing skew is the bottleneck (Section 9). |
This is the shortest path from model terms to production knobs.
Exploring the frontiers of efficient AI infrastructure.