Wide‑EP Mixture-of-Experts (MoE) Serving: Dispatch/Combine, Dual-Batch Overlap (DBO), and Real Scaling Limits

A production guide to multi-node Mixture-of-Experts (MoE) inference in vLLM: how to model communication versus compute, how to choose DeepEP low-latency (LL) versus high-throughput (HT) kernels by measurement, and how to tune Dual-Batch Overlap (DBO) and Expert Parallel Load Balancing (EPLB) for maximum throughput.

Why Wide‑EP Scaling Breaks (and How to Fix It)

For ML systems engineers: a practical systems model for when MoE serving scales, when it stalls, and which knobs recover line-rate throughput.

At DeepSeek‑class scale, serving throughput is usually limited by token movement, not tensor-core math. Every Mixture-of-Experts (MoE) layer triggers dispatch/combine traffic through an all-to-all variable-size collective (all‑to‑allv) across the Expert Parallelism (EP) group, repeated at high frequency.

In production the pattern is counterintuitive: graphics processing unit (GPU) count goes up, utilization can go down, network interface controller (NIC) pressure spikes, and tokens/sec can flatten or regress. The root cause is that widening Expert Parallelism (EP) shrinks per-peer payloads, re-exposing fixed communication latency.

At Impala AI, we tune vLLM in exactly this regime. This post distills the model and operating playbook we use to predict behavior, choose DeepEP kernel mode, and map vLLM knobs to measurable outcomes.

TL;DR

  1. Wide‑EP throughput is set by per-peer payload size, not headline global batch. If B/P gets too small, fixed latency dominates and scaling can invert.
  2. Dual-Batch Overlap (DBO) helps only when t_{\text{compute}} \ge t_{\text{comm}}; otherwise communication is exposed on the critical path.
  3. DeepEP low-latency (LL) vs high-throughput (HT) is a measurable crossover: fit (L, BW_{\text{eff}}) on your topology and choose by payload range.
  4. When routing skew makes the straggler factor \gamma the bottleneck, Expert Parallel Load Balancing (EPLB) replicates hot experts to rebalance load across GPUs; the Linear-Programming-Based Load Balancer (LPLB) goes further with per-batch linear-program optimization.
  5. The systems strategy — hierarchical all‑to‑allv, overlap, and load balancing — is hardware-agnostic; projects like UCCL‑EP are proving it works across NVIDIA, AMD, and cloud fabrics.


Operational intuition

A toy but faithful model for one practical question: why does wide expert parallelism become more attractive as you push toward huge batches, and what exactly does DBO buy you?

Every MoE layer pays dispatch all‑to‑all, then expert compute, then combine all‑to‑all. Wide‑EP spreads experts across many GPUs, which improves capacity and can raise throughput, but it also amplifies collective communication pressure.

DBO does not reduce total work. It changes the schedule so communication can run under compute. When compute is large enough, the exposed critical path approaches max(comm, compute) instead of comm + compute.

Quick mental model

Wide‑EP increases communication volume, but huge batches amortize the fixed costs. DBO helps only when compute can cover communication.


1) Why V3/R1 force Wide‑EP

DeepSeek‑V3 and DeepSeek‑R1 represent a new class of “hyper-scale” sparse models. Their specs are daunting: roughly 671B total parameters (~700 GB of weights in FP8), 61 transformer layers, 256 routed experts per MoE layer with top‑8 routing (k = 8), and hidden dimension d = 7168.

Why single‑node deployment fails

First, there is the sheer size. In FP8, the model weights alone consume ~700GB. A standard 8xH100 node typically offers 640GB of HBM. You physically cannot fit the model parameters on a single node, let alone leave room for the KV cache. For long‑context R1 reasoning traces, the KV cache alone can exceed the size of the weights.

To serve this, you must scale out to multiple nodes. This forces you into Wide-EP: splitting the experts across a multi-node cluster (e.g., 16 or 32 GPUs) connected by Ethernet or InfiniBand, rather than just NVLink.

Keeping tensor cores busy

This scale-out creates a distributed systems nightmare. MoE models are sparse by design, meaning they have low arithmetic intensity (compute-per-byte) compared to dense models. To keep powerful tensor cores (H100, MI300X, or newer) from idling, you must compensate for this sparsity by processing massive batch sizes—often tens of thousands of concurrent tokens.

This transforms inference from a simple ML task into a complex orchestration challenge:

  1. Massive State: You must manage terabytes of active KV cache distributed across the cluster.
  2. High-Frequency Sync: With 61 MoE layers, every single token generation step requires 122 global synchronization points (one dispatch and one combine per layer).
  3. Straggler Sensitivity: At this scale, a single slow GPU or a congested network link stalls the entire EP group.

The cost you pay for this scale is MoE routing: dispatch and combine become high-fanout, all-to-all communication patterns that saturate your network fabric.


Now that we know why Wide‑EP is unavoidable, let’s frame the operational constraints before we quantify the model.

2) The challenges of multi‑node MoE serving

If we only cared about compute, scaling would be easy: add GPUs and increase throughput. Wide‑EP breaks that simplicity because tokens must move.

A few things make this hard in practice:

  1. Every MoE layer issues an all‑to‑allv whose per-peer messages shrink as EP widens, re-exposing fixed latency.
  2. all‑to‑allv is a synchronization barrier, so routing skew turns one hot rank into a group-wide stall.
  3. Communication buffers (DeepEP/RDMA) compete with the KV cache for HBM.
  4. Overlap only works when compute is large enough to cover communication, which couples batch size to cluster size.

We’ll now build a model that is simple enough to fit in your head, but rich enough to predict these behaviors.


3) Problem setting and notation

We’ll use a minimal set of symbols and tie them to what you control in an offline run:

  1. B — global batch size in tokens per forward pass (tuned via --max-num-batched-tokens and concurrency).
  2. P — number of Expert Parallelism (EP) ranks (GPUs) in the group.
  3. k — experts selected per token (top‑k fan-out; 8 for DeepSeek‑V3/R1).
  4. d — hidden dimension (7168 for DeepSeek‑V3/R1); s — bytes per element (1 for FP8).
  5. L — fixed per-collective latency; BW_{\text{eff}} — effective per-rank bandwidth.
  6. \gamma — straggler factor (max/mean per-rank load); c_{\text{tok}} — expert compute time per routed token.

The roofline below shows how these two regimes play out as message size changes:

The “Wide-EP Trap”

As you scale out to more GPUs (increasing P), you might expect performance to improve linearly. However, Wide-EP introduces a subtle “slicing” penalty:

  1. Latency‑bound regime: When messages are small (e.g., < 64KB), the fixed overhead L (kernel launch, PCIe traversal, RDMA handshake) dominates the total time. In this regime, your expensive 400Gbps links are effectively idle most of the time, achieving < 10% of their peak bandwidth.
  2. Bandwidth‑bound regime: Only when messages are large enough (e.g., > 500KB) does the transfer time V/BW_{\text{eff}} dominate. This is where you get the throughput you paid for.

The Trap: If you keep your global batch size B constant while increasing P, the per-peer message size (\approx B/P) shrinks. You inadvertently push your workload from the efficient bandwidth-bound regime into the inefficient latency-bound regime.

The Solution: To make Wide-EP work, you must increase the global batch size B proportionally with P to keep per-peer messages large. This is why “huge batch” offline inference is critical for multi‑node MoE.
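The slicing penalty is easy to see numerically. Here is a minimal sketch (illustrative, not vLLM code) of how per-peer payload shrinks as you widen EP at a fixed global batch, using the per-peer payload formula developed in Section 4 with DeepSeek‑V3‑class shape constants (d = 7168, k = 8, FP8 so s = 1):

```python
# Per-peer payload vs. EP width at a fixed global batch (illustrative sketch).
def per_rank_payload_bytes(B, P, k=8, d=7168, s=1):
    """Average dispatch payload per rank: (B/P) tokens * k fan-out * d * s."""
    return (B / P) * k * d * s

B = 32_768  # fixed global batch (tokens)
for P in (8, 16, 32, 64):
    mb = per_rank_payload_bytes(B, P) / 1e6
    print(f"P={P:3d}  per-rank payload ≈ {mb:6.1f} MB")
```

At P = 8 each peer moves ~235 MB (bandwidth-bound); at P = 64 it is down to ~29 MB, and with smaller batches you quickly fall into the latency-bound regime.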


With the trap identified, let’s quantify the dispatch/combine payload that drives it.

4) Dispatch/Combine payload: the thing your NIC must move

To understand why the network chokes, we need to look at the exact volume of data hitting the wire.

In a Wide-EP setup, every forward pass involves two massive shuffles per layer:

  1. Dispatch: Sending tokens from their source GPU to the GPU hosting their selected experts.
  2. Combine: Sending the expert results back to the source GPU.

Anatomy of the payload

For one direction (dispatch or combine), the average payload size per rank is determined by the total volume of routed tokens divided by the number of participants.

V_{\text{rank}} \approx \underbrace{\frac{B}{P}}_{\text{tokens/rank}} \cdot \underbrace{k}_{\text{fan-out}} \cdot \underbrace{d}_{\text{hidden dim}} \cdot \underbrace{s}_{\text{precision}}

Let’s unpack why these terms multiply:

  1. B/P — with uniform routing, each rank sources (and sinks) its share of the global batch.
  2. k — every token is replicated to its top‑k experts, multiplying traffic by the fan-out.
  3. d — each routed copy carries a full hidden-state vector.
  4. s — bytes per element (1 for FP8 activations).

Reality Check: It’s never just the average

The formula above gives you the idealized average. Real-world serving is messier:

  1. All-to-Allv (The Skew Problem): Experts aren’t chosen uniformly. Some experts are “hot,” meaning the rank hosting them receives far more data than the average. Since all_to_all is a synchronization barrier, the entire group waits for the busiest rank. Your effective payload is defined by the maximum load, not the average.
  2. Metadata Tax: You aren’t just moving raw activation tensors. You must also transmit token indices (to track request IDs), expert scales (for the weighted sum), and padding (for kernel alignment). In practice, model this as a tax (1+\epsilon) on top of your raw payload.

Example: The 140 GB/s Hose

Let’s plug in real numbers for a DeepSeek-V3/R1 deployment to see the magnitude of the problem.

V_{\text{rank}} \approx \frac{128,000}{64} \cdot 8 \cdot 7168 \cdot 1 \approx 115\ \text{MB}

Why this matters: 115 MB might sound manageable. But remember:

  1. This happens twice per layer (dispatch + combine).
  2. There are 61 layers.
  3. That’s 115 \text{ MB} \times 2 \times 61 \approx 14 \text{ GB} of network traffic per forward pass.

If you are running at 10 steps/second, your NICs are pushing 140 GB/s. This is why we say Wide-EP is network-bound: you are essentially trying to stream the entire active state of the model through the interconnect continuously.
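The arithmetic above is worth scripting so you can plug in your own shape and step rate. A hedged back-of-envelope check with the same numbers (B = 128k tokens, P = 64, DeepSeek‑V3 shape, 61 layers, 10 steps/s):

```python
# Back-of-envelope network traffic for Wide-EP dispatch/combine.
B, P, k, d, s = 128_000, 64, 8, 7168, 1
layers, steps_per_s = 61, 10

v_rank = (B / P) * k * d * s        # one direction (dispatch or combine), per rank
per_pass = v_rank * 2 * layers      # dispatch + combine, across all MoE layers

print(f"V_rank    ≈ {v_rank / 1e6:.0f} MB")                      # ~115 MB
print(f"per pass  ≈ {per_pass / 1e9:.1f} GB")                    # ~14 GB
print(f"sustained ≈ {per_pass * steps_per_s / 1e9:.0f} GB/s")    # ~140 GB/s
```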


5) A time model that’s simple enough to tune

Now that we know what we are moving (the payload), we need to know how long it takes. We need a model that links batch size, cluster size, and hardware specs to end-to-end latency.

The Communication Tax

For the full dispatch+combine round-trip, we use a standard latency-bandwidth model (Hockney). This assumes dispatch and combine are roughly symmetric in cost; in practice skew can make combine slightly cheaper (fewer tokens return to hot ranks). The factor of 2 is a conservative upper bound.

t_{\text{comm}}(B) \approx 2 \cdot \left( L + \frac{V_{\text{rank}}}{BW_{\text{eff}}} \right)

Why does this equation matter so much in practice? Because its two terms fail differently: the fixed cost L does not shrink when you add GPUs, while the transfer term shrinks with per-rank payload. Widening EP erodes the bandwidth term and leaves the latency term exposed.

The Compute & The Straggler

While the network moves data, the GPUs must process it. The time to compute the experts is:

t_{\text{compute}}(B) \approx \underbrace{\gamma}_{\text{straggler}} \cdot \underbrace{\frac{B \cdot k}{P}}_{\text{work/rank}} \cdot \underbrace{c_{\text{tok}}}_{\text{time/token}}

The tension: comm vs. compute

This leads to the fundamental tension of Wide-EP serving.

Widening EP (increasing P) reduces compute per rank. As you add GPUs, t_{\text{compute}} shrinks because the work is split more ways. But t_{\text{comm}} often grows or stays flat (due to fixed L).

If t_{\text{compute}} becomes smaller than t_{\text{comm}}, you can no longer “hide” the communication behind the compute. Tensor cores sit idle, waiting for the network. This is why you must increase B (batch size) to keep the compute term large enough to cover the communication term.
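The two-term model is small enough to keep in a scratch script. A minimal sketch, with parameter defaults taken from the measured 2‑node H100 / HT-kernel numbers used in the worked example of Section 7 (treat them as placeholders for your own measurements):

```python
# Two-term time model: communication (Hockney) vs. expert compute.
def t_comm(B, P, k=8, d=7168, s=1, L=120e-6, bw=44e9):
    """Dispatch + combine round trip: 2 * (fixed latency + payload / bandwidth)."""
    v_rank = (B / P) * k * d * s
    return 2 * (L + v_rank / bw)

def t_compute(B, P, k=8, c_tok=1.8e-6, gamma=1.1):
    """Expert compute per rank, inflated by the straggler factor gamma."""
    return gamma * (B * k / P) * c_tok

B, P = 8192, 16
print(f"t_comm    ≈ {t_comm(B, P) * 1e3:.2f} ms")     # ~1.57 ms
print(f"t_compute ≈ {t_compute(B, P) * 1e3:.2f} ms")  # ~8.11 ms: compute can cover comm
```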


The natural question: can we hide the communication behind the compute instead of just racing them? That’s exactly what DBO does.

6) DBO: The art of hiding the wire

Sequential execution is wasteful. In a naive implementation, GPUs sit idle while the network moves data, and then the network sits idle while the GPUs crunch numbers.

Dual-Batch Overlap (DBO) is the standard solution. It relies on a technique called double buffering (or “ping-pong” buffering) to parallelize work.

In a standard sequential flow, you have one buffer: the network fills it, then the GPU reads it. The resource not working is idle. With double buffering, we allocate two sets of communication buffers:

  1. Compute Phase: The GPU processes tokens from Buffer A (which arrived in the previous step).
  2. Comm Phase: Simultaneously, the NIC streams the next set of tokens into Buffer B.

When both are finished, they swap roles. This allows us to fetch the data for the next micro-batch while computing the current one.

How overlap works

Without overlap, you pay the sum of the latencies:

t_{\text{seq}} \approx t_{\text{comm}} + t_{\text{compute}}

With ideal pipelining (steady state), the step time is determined by the slowest component:

t_{\text{dbo}} \approx \max\left(t_{\text{comm}},\ t_{\text{compute}}\right)

This implies a theoretical speedup limit of 2\times (when t_{\text{comm}} = t_{\text{compute}}). This is a steady‑state result: the first and last micro‑batch pay full fill/drain overhead. The max model applies when per‑step time dominates the pipeline startup cost—typically when you process many micro‑batches (large B). For small batches, add ~1 extra step of latency for fill/drain.
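Where the max(...) comes from is easiest to see with a toy two-stage pipeline. The sketch below (arbitrary units, constant per-micro-batch times) shows the total time converging to (m − 1) steps of max(comm, compute) plus one fill and one drain:

```python
# Toy double-buffered pipeline: comm for micro-batch i+1 overlaps compute for i.
def pipeline_time(m, t_comm, t_compute):
    """Total time for m micro-batches: fill + drain + (m-1) overlapped steps."""
    return t_comm + t_compute + (m - 1) * max(t_comm, t_compute)

c, g = 1.0, 1.5   # arbitrary units: comm = 1.0, compute = 1.5 per micro-batch
for m in (1, 2, 8, 64):
    seq = m * (c + g)                  # fully sequential baseline
    dbo = pipeline_time(m, c, g)
    print(f"m={m:3d}  sequential={seq:6.1f}  overlapped={dbo:6.1f}  "
          f"speedup={seq / dbo:.2f}x")
```

As m grows, the speedup approaches (c + g) / max(c, g); with c = g that limit is the theoretical 2×.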

It also reveals the failure mode.

The “Exposed Wire” Problem

DBO is not magic. It relies on “compute cover”—using the time spent calculating experts to hide the time spent moving tokens. This creates two distinct operating regimes:

  1. Compute‑dominant (t_{\text{compute}} \ge t_{\text{comm}}): The GPU takes longer to compute than the network takes to send. Network latency is effectively zero (hidden). Throughput is compute-bound.
  2. Comm‑dominant (t_{\text{comm}} > t_{\text{compute}}): The network is too slow. The GPU finishes its work and then waits. Throughput is capped by the wire.

Tuning DBO: The Control Panel

In vLLM, the two primary knobs are --enable-dbo (turns overlap on) and the environment variable VLLM_DBO_COMM_SMS (how many streaming multiprocessors are reserved for communication kernels).

Modeling the SM split explicitly

Let S be total SMs, S_{\text{comm}} be VLLM_DBO_COMM_SMS, and \rho = S_{\text{comm}}/S. A practical first-order model is:

t_{\text{compute}}(B,\rho) \approx \gamma \cdot \frac{B\cdot k}{P} \cdot \frac{c_{\text{tok},0}}{1-\rho}

where c_{\text{tok},0} is time/token with no SM reservation. The communication side also depends on \rho because extra comm SMs improve progress:

t_{\text{comm}}(B,\rho) \approx 2 \cdot \left(L(\rho) + \frac{V_{\text{rank}}}{BW_{\text{eff}}(\rho)}\right)

So VLLM_DBO_COMM_SMS is a true trade-off knob: it can reduce t_{\text{comm}} while increasing t_{\text{compute}}.

Quantifying the overlap condition

We can now turn the intuition from Section 5 into a concrete inequality. Plugging the SM-aware model into t_{\text{compute}}(B,\rho) \ge t_{\text{comm}}(B,\rho) gives the minimum batch size required to maintain overlap:

The key inequality: minimum batch for overlap

B \gtrsim \frac{2 \cdot (P/k) \cdot L(\rho)}{\gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} - \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}}

Scale B with P, or adding GPUs makes you slower.

Feasibility guard: The denominator can be zero or negative. If \gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} \le \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}, per‑token communication remains too large relative to per‑token compute, and full cover is impossible at any batch size. In that regime, you need better comm (L(\rho)\downarrow, BW_{\text{eff}}(\rho)\uparrow), faster expert kernels, or a smaller routed payload.

When the denominator is positive, the numerator still scales linearly with P. For fixed B, pick \rho by minimizing t_{\text{dbo}}(B,\rho)=\max\left(t_{\text{comm}}(B,\rho),t_{\text{compute}}(B,\rho)\right); the best operating point is usually near the balance line t_{\text{comm}} \approx t_{\text{compute}}.

This is the second “key takeaway”: As you add GPUs to your cluster, you must increase your batch size just to maintain the same level of overlap efficiency. If you scale P without scaling B, you strip away the compute cover and expose the raw network latency.
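The inequality, including the feasibility guard, is straightforward to evaluate. A sketch, with defaults mirroring the Section 7 worked example (which folds the SM reservation into the measured c_tok, so rho = 0 reproduces it):

```python
# Minimum batch size for DBO overlap, with feasibility guard.
def min_batch_for_overlap(P, k, L, bw, c_tok0, gamma=1.1, rho=0.0, d=7168, s=1):
    per_tok_compute = gamma * c_tok0 / (1 - rho)   # compute time per routed token
    per_tok_comm = 2 * d * s / bw                  # comm time per routed token
    denom = per_tok_compute - per_tok_comm
    if denom <= 0:
        return None  # full cover is impossible at any batch size
    return 2 * (P / k) * L / denom

b = min_batch_for_overlap(P=16, k=8, L=120e-6, bw=44e9, c_tok0=1.8e-6)
print(f"B_min ≈ {b:.0f} tokens")   # ~290 tokens for the Section 7 numbers
```

Note how the numerator scales with P: doubling the EP group (at fixed shape and kernels) roughly doubles the batch needed to keep communication hidden.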

Single-node DBO: when it is actually worth it

Single-node setups (NVLink/xGMI only) have much better L and BW_{\text{eff}} than multi-node RDMA, so DBO is not automatically a win. The best way to think about it: on the fast fabric, t_{\text{comm}} is often already small, while reserving SMs for communication (the \rho tax) makes every compute step slower, so overlap must recover more than the reservation costs.

A practical keep/remove rule for single-node:

  1. Measure step-time decomposition with DBO off.
  2. Enable DBO and sweep VLLM_DBO_COMM_SMS in a small range.
  3. Keep DBO only if end-to-end throughput and tail latency improve at your target batch/concurrency, not just synthetic peak.

Real-world numbers: The 2-Node Cliff

Let’s put concrete numbers on this for a standard 2-node cluster (16 GPUs total), serving DeepSeek-V3 (d=7168, k=8) in FP8.

When you span 2 nodes, your effective bandwidth (BW_{\text{eff}}) is capped by the inter-node interconnect. Here is how common hardware stacks compare:

| Hardware Stack | Interconnect | Per-GPU Link Speed | Peak BW (BW_{\text{peak}}) | Realized BW_{\text{eff}} (est.) | Min Batch B for DBO (operational target) |
|---|---|---|---|---|---|
| NVIDIA H100/H200 | InfiniBand NDR / CX7 | 400 Gbps (per‑port, raw) | 50 GB/s | ~36–41 GB/s | ~72k–96k |
| AMD MI300X | Broadcom Thor2 / RoCE | 400 Gbps | 50 GB/s | ~32–38 GB/s | ~80k–112k |
| AWS P5 (H100) | EFA v2 | 400 Gbps (per‑port, raw) | 50 GB/s | ~26–34 GB/s | ~96k–160k |
| NVIDIA B200 | ConnectX‑8 | 800 Gbps (per‑port, raw) | 100 GB/s | ~70–82 GB/s | ~24k–40k |
| GB200 NVL72 | NVLink Switch | N/A (in-rack) | 900 GB/s | ~650–800 GB/s | ~2k–8k |

All link speeds are raw signaling rates; effective throughput after encoding/protocol overhead is lower. Realized BW_{\text{eff}} values and batch targets above are operational ranges (not theoretical minima), assuming DeepSeek‑class shape (d=7168, k=8, P=16), moderate routing skew (\gamma \approx 1.05\text{–}1.15), and DBO SM reservation in the common range (\rho \approx 0.10\text{–}0.20, e.g., VLLM_DBO_COMM_SMS ≈ 12–28 on H100).

The Cliff: Inside a single node, the fast fabric (NVLink at ~450 GB/s/GPU, or xGMI at ~300+ GB/s) gives you abundant bandwidth. The moment you add a second node, your effective bandwidth drops to ~40 GB/s (400G IB/RoCE).

This is a 7–10× drop in bandwidth. To maintain the inequality t_{\text{compute}} \ge t_{\text{comm}}, you must increase your batch size B by roughly the same factor (or more, due to latency L) when you cross the node boundary. This cliff is the central challenge of multi‑node MoE serving, regardless of GPU vendor.

Use the widget to build intuition

The widget below visualizes exactly this dynamic.

For readability, the widget uses an equivalent effective-parameter form: L_{\text{eff}}, BW_{\text{eff}}, c_{\text{tok,eff}}. This means SM partition effects (e.g., VLLM_DBO_COMM_SMS, or \rho) are folded into the slider values rather than exposed as a separate control.

Pro tip: on the “Step Time vs Batch Size” chart, focus on where the compute curve sits relative to comm. If compute is below comm, overlap will not save you.


DBO tells you when overlap helps. The next question is which kernel to use for the communication itself.

7) DeepEP LL vs HT: a crossover, not a religion

In Section 5, we modeled communication time as t_{\text{comm}} \approx 2 \cdot (L + V/BW_{\text{eff}}). DeepEP—an open‑source CUDA kernel library by DeepSeek for MoE dispatch/combine—gives you a direct way to attack these two terms separately. It exposes two families of dispatch/combine kernels, and the right choice depends entirely on which side of the “Wide-EP Trap” (Section 3) you are currently stuck in.

The two modes

  1. LL (Low-Latency): Attacks L.

    • Mechanism: Uses pre-allocated buffers and simplified signaling (often just RDMA writes with immediate data) to shave off every microsecond of handshake overhead.
    • Time: t_{\text{LL}} \approx \underbrace{(t_{\text{kernel}} + L_{\text{RDMA}})}_{\text{Low fixed cost}} + \underbrace{\frac{V}{BW_{\text{frag}}}}_{\text{Slow transfer}} (Direct RDMA suffers from fragmentation/congestion).
    • Best for: Small batches, high expert parallelism (where B/P is small), or latency-sensitive decode steps.
    • The Cost: Memory. LL often requires O(P) registered buffer space per rank. For our P=64 example, maintaining dedicated buffers for every peer (to handle skew) can consume ~1–2 GB of HBM. On memory‑constrained nodes (Section 1), this fights directly against your KV cache.
  2. HT (High-Throughput): Attacks BW_{\text{eff}}.

    • Mechanism: Uses hierarchical collectives. Instead of every GPU talking to every other GPU (naive all-to-all), GPUs within a node first gather their data via NVLink, then a designated “leader” performs the heavy RDMA transfer to other nodes, followed by a local scatter.
    • Time: t_{\text{HT}} \approx \underbrace{(3t_{\text{kernel}} + 2L_{\text{NVL}} + L_{\text{RDMA}})}_{\text{High fixed cost}} + \underbrace{\frac{V}{BW_{\text{peak}}}}_{\text{Fast transfer}} (Hierarchical aggregation enables peak RDMA bandwidth).
    • Best for: Large “offline” batches where payload size V is huge. This is essential for crossing the “2-Node Cliff” (Section 6) efficiently.
    • The Cost: Latency. The extra gather/scatter steps add a fixed setup cost. However, it is memory-efficient: it only buffers the aggregated payload (~230 MB with double buffering). If your payload is small, HT is actually slower than LL.

Real-world intuition: Fragmented vs Peak Bandwidth

Why do we model two different bandwidths? Because all_to_allv efficiency depends heavily on message size and topology.

| Mode | Realized BW (est. on 400G) | Why? |
|---|---|---|
| LL (Direct RDMA) | ~20–30 GB/s (fragmented) | Many small, non-contiguous messages. Congestion from random access. |
| HT (Hierarchical) | ~42–48 GB/s (near peak) | Few large, coalesced messages. Topology-aware routing minimizes hops. |

Estimates based on typical 400Gbps RDMA efficiency (40–60% for fragmented traffic vs 85%+ for large contiguous bursts). See DeepEP benchmarks for detailed performance characterization of LL vs HT kernels.

The Crossover Point

This is not a philosophical choice; it is an arithmetic one. You should switch from LL to HT exactly when the bandwidth gains outweigh the setup costs.

vLLM Configuration: Select your kernel backend using --all2all-backend.

Advanced Tuning Guide:

If we model each as (L, BW), choose HT when:

L_{\text{HT}} + \frac{V}{BW_{\text{HT}}} < L_{\text{LL}} + \frac{V}{BW_{\text{LL}}}

Solving yields the crossover payload:

V^{*} = \frac{L_{\text{LL}}-L_{\text{HT}}}{\frac{1}{BW_{\text{HT}}}-\frac{1}{BW_{\text{LL}}}}

Note: because LL has lower latency but lower bandwidth than HT, we have L_{\text{LL}} < L_{\text{HT}} and BW_{\text{LL}} < BW_{\text{HT}}. Both numerator and denominator are negative, so V^{*} is positive. Below V^{*} LL wins (lower fixed cost dominates); above it HT wins (higher bandwidth dominates).

In Impala’s offline huge‑batch setting, we usually operate above this crossover—so HT is often the right starting point. The value of the crossover framing is that you can measure it and justify that choice on your own hardware/topology.
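The crossover arithmetic is worth scripting against your own fitted (L, BW) pairs. A sketch, using the illustrative 2‑node H100 fits from the worked example below (not vendor specs):

```python
# Crossover payload between LL and HT kernels from fitted (L, BW) pairs.
def crossover_payload(L_ll, bw_ll, L_ht, bw_ht):
    """Payload V* where L_LL + V/BW_LL == L_HT + V/BW_HT."""
    return (L_ll - L_ht) / (1 / bw_ht - 1 / bw_ll)

v_star = crossover_payload(L_ll=35e-6, bw_ll=22e9, L_ht=120e-6, bw_ht=44e9)
print(f"V* ≈ {v_star / 1e6:.1f} MB  (below: LL wins; above: HT wins)")
```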

Real-world worked example: 16‑GPU H100 cluster serving DeepSeek‑V3

Suppose you measured the following on a 2‑node H100 cluster (P = 16):

| Parameter | LL kernel | HT kernel |
|---|---|---|
| L | 35 µs | 120 µs |
| BW_{\text{eff}} | 22 GB/s | 44 GB/s |

And from compute profiling: c_{\text{tok}} = 1.8 µs/token, \gamma = 1.1.

Step A — Kernel crossover (V^*):

V^{*} = \frac{L_{\text{LL}} - L_{\text{HT}}}{\frac{1}{BW_{\text{HT}}} - \frac{1}{BW_{\text{LL}}}} = \frac{35\mu s - 120\mu s}{\frac{1}{44\ \text{GB/s}} - \frac{1}{22\ \text{GB/s}}} = \frac{-85\mu s}{-22.7\ \text{ps/B}} \approx 3.7\ \text{MB}

Per‑rank payload at B = 8192: V = \frac{8192}{16} \cdot 8 \cdot 7168 \cdot 1 \approx 29\ \text{MB} \gg V^*. Use HT.

Step B — DBO feasibility check:

\gamma \cdot c_{\text{tok}} = 1.1 \times 1.8\ \mu s = 1.98\ \mu s/\text{tok}

\frac{2 \cdot d \cdot s}{BW_{\text{eff}}} = \frac{2 \times 7168 \times 1}{44 \times 10^9} \approx 0.33\ \mu s/\text{tok}

Denominator is positive (1.98 > 0.33), so DBO can work. Minimum batch:

B \gtrsim \frac{2 \cdot (16/8) \cdot 120\ \mu s}{(1.98 - 0.33)\ \mu s/\text{tok}} \approx \frac{480\ \mu s}{1.65\ \mu s/\text{tok}} \approx 291\ \text{tokens}

Our batch of 8192 is well above the threshold. Enable DBO with HT kernels.

Interactive Analysis: Finding Your Crossover

The theory is clear, but the actual crossover point depends on your specific hardware and workload. Use the interactive tool below to explore the trade-off space.

What to look for:


Kernel choice is only half the story: the same model can land in very different comm/compute regimes depending on hardware locality. Before tuning knobs, anchor on where your interconnect cliff appears.

8) Hardware stacks: where the cliff falls

There are two deployment questions hiding inside “which GPU is faster?”:

  1. What is the largest locality domain where GPU↔GPU traffic stays on a dedicated fabric (and never touches the NIC)?
  2. What happens to your all‑to‑allv once it spills outside that domain (into RDMA + switches + congestion)?

Key idea: Wide‑EP economics are set by how much dispatch/combine stays inside the “cheap” locality domain versus how much spills into the “expensive” scale‑out network.

Minimal reference table (numbers + sources)

| Stack | Locality domain (cliff boundary) | Scale‑up anchor | Scale‑out anchor | Sources |
|---|---|---|---|---|
| NVIDIA H100/H200 | 8 GPUs / HGX node | NVLink: 900 GB/s bidirectional per GPU | ConnectX‑7: up to 400 Gb/s | H100/H200, Hopper, CX7 |
| NVIDIA GB200 NVL72 | 72 GPUs / rack domain | NVLink Switch: 130 TB/s rack, 1.8 TB/s bidirectional GPU↔GPU (HGX B200) | ConnectX‑8: up to 800 Gb/s | GB200 NVL72, HGX, CX8 |
| AMD MI300X | 8 GPUs / platform | 896 GB/s aggregate bidirectional P2P (spec); ~315–336 GB/s aggregated unidirectional measured xGMI | 400 Gb/s class NIC deployments are common | MI300 platform, ROCm xGMI |
| Public MoE internode reference | Multi‑node DeepEP benchmark | BW_eff (normal kernels): ~43–58 GB/s | BW_eff (low-latency kernels): ~39–46 GB/s | DeepEP README |

How to read these numbers in our model

What this means for deployment

For offline serving, cost scales with delivered tokens, not raw FLOPS.


9) Failure modes (what actually breaks in production)

EPLB: taming the straggler factor

In a standard Wide‑EP deployment, each GPU owns a fixed subset of the 256 logical experts. The router selects experts per token, but not uniformly: some experts are “hot” (popular across the workload) while others are cold. The GPU hosting a cluster of hot experts becomes the straggler—and because all‑to‑allv is a synchronization barrier, the entire EP group waits for it.

Recall the straggler factor from Section 5:

\gamma = \frac{\max_r\ \text{tokens}_r}{\text{mean}_r\ \text{tokens}_r}

Typical values range from 1.05 (well‑balanced) to 1.3+ (heavy skew). At \gamma = 1.3, 30% of your GPU compute is wasted waiting for the busiest rank.
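Measuring \gamma is a one-liner once you have per-rank routed-token counts (e.g., from vLLM's expert-load tracking window described below, or your own profiling hooks):

```python
# Straggler factor: max per-rank load divided by mean per-rank load.
def straggler_factor(tokens_per_rank):
    return max(tokens_per_rank) / (sum(tokens_per_rank) / len(tokens_per_rank))

balanced = [1000, 1020, 990, 1010]
skewed = [1600, 800, 900, 700]
print(f"gamma (balanced) = {straggler_factor(balanced):.2f}")  # ~1.01
print(f"gamma (skewed)   = {straggler_factor(skewed):.2f}")    # 1.60
```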

Quick symbol map: \alpha vs \gamma

The idea: redundant experts

Expert Parallel Load Balancing (EPLB) breaks the 1:1 mapping between logical and physical experts. Instead of N logical experts mapped to N physical slots, EPLB allocates N + R physical slots where R extra copies (redundant experts) are given to the hottest experts.

If expert #42 gets 3× the average token load, EPLB gives it 3 physical copies spread across different GPUs. Each copy handles ~1/3 of the tokens routed to that expert. The result: the max‑load GPU drops closer to the mean.

How it works (high level)

EPLB runs periodically during inference (e.g., every 3000 steps) and follows a three‑step hierarchical algorithm:

  1. Pack expert groups to nodes — topology‑aware assignment that respects locality‑domain boundaries (NVLink, xGMI), keeping intra‑node traffic on the fast fabric.
  2. Replicate hot experts — within each node, use a greedy strategy: iteratively assign the next redundant slot to whichever expert has the highest \text{load} / \text{replica\_count}.
  3. Pack physical experts to GPUs — balanced bin‑packing that minimizes max GPU load across all ranks.

Between rebalancing steps, vLLM tracks per‑expert token counts over a sliding window (default 1000 steps) to detect shifts in routing distribution.
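Step 2 (greedy replication) can be sketched in a few lines. This is an illustration of the load/replica_count heuristic described above, not vLLM's actual EPLB implementation; assign_redundant_slots is a hypothetical helper name:

```python
import heapq

def assign_redundant_slots(expert_loads, num_redundant):
    """Greedily give each redundant slot to the expert with the highest
    load-per-replica, so per-replica load flattens toward the mean."""
    counts = [1] * len(expert_loads)
    # Max-heap keyed on load per replica (negated for heapq's min-heap).
    heap = [(-load, idx) for idx, load in enumerate(expert_loads)]
    heapq.heapify(heap)
    for _ in range(num_redundant):
        _, idx = heapq.heappop(heap)
        counts[idx] += 1
        heapq.heappush(heap, (-expert_loads[idx] / counts[idx], idx))
    return counts

loads = [300, 100, 100, 100]   # expert 0 is 3x hotter than the rest
print(assign_redundant_slots(loads, num_redundant=2))  # [3, 1, 1, 1]
```

With two redundant slots, the hot expert ends up with 3 replicas, and every replica carries ~100 tokens: the max-load rank drops to the mean.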

Connecting back to the model

EPLB’s effect is straightforward: it drives \gamma \to 1.0. Plugging a lower \gamma into the compute model:

t_{\text{compute}}(B) = \gamma \cdot \frac{B \cdot k}{P} \cdot c_{\text{tok}}

A reduction from \gamma = 1.3 to \gamma = 1.05 cuts t_{\text{compute}} by ~19%. More importantly, it relaxes the DBO overlap condition from Section 6—the minimum batch B required for overlap shrinks because the compute term is larger relative to comm.

Trade‑offs: memory vs. balance

Each redundant expert costs memory. For DeepSeek‑V3 with 61 MoE layers:

\text{Extra HBM} \approx R \times 61 \times \text{bytes\_per\_expert} \div P

In practice, this works out to roughly ~2.4 GB per redundant expert summed across all 61 layers. Setting R = 32 adds ~77 GB across a 64‑GPU cluster (about 1.2 GB per rank), which is non‑trivial, but often a good trade when the alternative is 30% idle GPU time from straggler effects.
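A quick consistency check of that formula. The per-expert weight size is an assumption here (~40 MB per expert per layer in FP8 for DeepSeek‑V3‑class shapes); plug in your model's actual expert size:

```python
# Rough HBM cost of EPLB redundant experts, per rank.
def eplb_extra_hbm_per_rank(R, layers=61, bytes_per_expert=40e6, P=64):
    """Extra HBM per EP rank: R redundant slots * layers * expert size / P."""
    return R * layers * bytes_per_expert / P

per_rank = eplb_extra_hbm_per_rank(R=32)
print(f"extra HBM per rank ≈ {per_rank / 1e9:.1f} GB; "
      f"cluster total ≈ {per_rank * 64 / 1e9:.0f} GB")
```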

vLLM Configuration:

--enable-eplb --eplb-config '{"num_redundant_experts": 32, "window_size": 1000, "step_interval": 3000}'

Interactive exploration

Use the widget below to build intuition for how EPLB works:

Beyond EPLB: per‑batch optimal balancing with LPLB

EPLB is static: it rebalances every few thousand steps based on historical averages. Between rebalancing events, the mapping is frozen. If a particular batch has an unusual routing pattern—which is normal, since routing varies stochastically—EPLB cannot react. The straggler still pays.

LPLB (Linear‑Programming‑Based Load Balancer) extends EPLB with dynamic, per‑batch token redistribution. It keeps the same redundant expert topology, but instead of splitting tokens uniformly across replicas, it solves an optimization problem on every forward pass to find the best assignment.

The graph structure

Redundant experts create edges in a bipartite graph between GPUs. If GPU g_i hosts the original expert and GPU g_j hosts its replica, there is a directed edge e = (g_i, g_j) along which tokens can be redirected.

The choice of which edges exist—the topology—is a design parameter. LPLB supports several structured topologies (cube, hypercube, torus) that map onto physical NVLink/NVSwitch connectivity so that redirected tokens travel on the fast fabric.

The LP formulation

For each batch, LPLB observes the actual per‑GPU load w_g (total tokens routed to experts on GPU g) and solves:

\begin{alignedat}{3} \min_{f, z}\quad & z \\ \text{subject to}\quad & L_g = w_g - \sum_{e \in \text{out}(g)} f_e + \sum_{e \in \text{in}(g)} f_e && \forall\, g \\ & L_g \le z && \forall\, g \\ & 0 \le f_e \le c_e && \forall\, e \in E \\ & z \ge 0 \end{alignedat}

At the optimum, z = \max_g L_g.

where:

  1. w_g — tokens initially routed to the experts hosted on GPU g.
  2. f_e — tokens redirected along edge e (from an original expert to its replica).
  3. L_g — load on GPU g after redistribution.
  4. c_e — capacity of edge e.
  5. z — the maximum per‑GPU load being minimized.

This is a standard minimax linear program. Minimizing z is equivalent to minimizing \gamma, since \gamma = z / \text{mean}(L) and the mean is fixed (total tokens don’t change, they just move between GPUs).

LPLB solves this LP on‑GPU using a single‑SM Interior Point Method (IPM) backed by cuSolverDx/cuBLASDx, achieving ~100 µs solve time for intra‑node configurations.
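For intuition, the minimax problem collapses to a closed form on a single edge (GPU i hosts the original expert, GPU j its replica); the real solver handles general graphs with the on-GPU interior point method above. A toy sketch:

```python
# LPLB minimax on one edge: redirect f tokens from i to j, capped by capacity.
def rebalance_one_edge(w_i, w_j, capacity):
    """Minimize max(load_i, load_j): equalize if capacity allows, else saturate."""
    f = max(0.0, min((w_i - w_j) / 2, capacity))
    return w_i - f, w_j + f

print(rebalance_one_edge(120, 80, capacity=30))  # (100.0, 100.0): f=20, balanced
print(rebalance_one_edge(120, 80, capacity=10))  # (110.0, 90.0): capacity-limited
```

The capacity-limited case illustrates failure mode 3 below: when edges saturate, LPLB's rebalancing capacity runs out.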

Why LPLB improves on EPLB

|  | EPLB (static) | LPLB (dynamic) |
|---|---|---|
| When it rebalances | Every ~3000 steps | Every batch |
| What it optimizes | Historical average load | Actual per-batch load |
| How it assigns tokens | Uniform split across replicas (\text{load}/\text{count}) | Optimal LP solution (minimax) |
| Adaptiveness | Cannot react to batch‑to‑batch variance | Tracks instantaneous routing fluctuations |
| Weight movement | Requires all‑to‑all weight transfer | No weight movement; only token redirection |

The key insight: EPLB’s uniform split is optimal on average but suboptimal for any specific batch. LPLB finds the per‑batch optimum, which matters most under high routing variance (small batches, bursty workloads, or models with sharp expert specialization).

When LPLB fails

LPLB is not a strict upgrade—it has its own failure modes:

  1. Solver overhead (~100 µs): For small batches where the entire MoE layer step takes under 1 ms, the LP solve time is non‑negligible. The optimization must pay for itself by saving more compute than it costs.

  2. Token count ≠ compute time: LPLB minimizes max token count, but grouped GEMM execution time is a non‑linear function of batch size (due to padding, tiling, and kernel launch granularity). Perfectly equal token counts can still produce unequal compute times.

  3. Extreme global imbalance: LPLB constrains each redundant expert to map to exactly one original expert. Under extreme skew, EPLB can assign multiple redundant slots to the same hot expert, effectively creating 3x or 4x replicas. LPLB’s topology constrains it to at most one edge per redundant expert, limiting its rebalancing capacity.

  4. Topology mismatch: The fixed graph topology (cube, torus) must align with the physical interconnect. If the topology is poorly chosen, redirected tokens may cross slow links—converting a compute imbalance into a communication penalty.

  5. Research stage: LPLB is an early research project from DeepSeek. Performance improvements are still under evaluation and it is not yet integrated into vLLM.

Practical guidance: Start with EPLB for production workloads. If profiling shows that \gamma varies significantly across batches (i.e., the per‑batch \gamma is much higher than the average \gamma), LPLB’s per‑batch optimization can close the remaining gap—especially for training workloads where small‑batch variance is high.

10) Software Stack: Porting the Strategy (CUDA vs ROCm)

DeepEP is CUDA-centric, but the systems strategy (hierarchical all‑to‑allv, coalescing, and overlap) is universal.

Stack equivalence table

| Concept | NVIDIA | AMD |
|---|---|---|
| GPU kernel runtime | CUDA | HIP |
| Collective library | NCCL | RCCL |
| In‑node fabric | NVLink | xGMI |
| Scale‑out transport | GPUDirect RDMA + IB/RoCE | ROCm peer memory + IB/RoCE |
| Practical HT trick | hierarchy/coalescing | hierarchy/coalescing |

The Portability Challenge

GPU‑driven communication is the right direction for fine‑grained all‑to‑allv, but DeepEP couples the GPU and NIC through NVIDIA‑specific plumbing (e.g., NVSHMEM/IBGDA).

Projects like UCCL‑EP are demonstrating that portability doesn’t have to cost performance. In fact, they are showing state-of-the-art (SOTA) results—beating DeepEP on GH200 and saturating AWS EFA—by fundamentally rethinking the control plane.

Algorithmic intuition (without vendor lock-in)

The core idea is control-plane / data-plane separation:

  1. GPU kernels do what they are best at: packing tokens, running expert compute, and issuing tiny transfer descriptors.
  2. A CPU proxy does what CPUs are best at: queue management, flow control, packet reordering, and transport-specific decisions.
  3. The NIC is driven by standard host-side verbs from the proxy, instead of direct GPU MMIO/NIC control logic.
  4. GPU compute continues while communication progress happens asynchronously through the proxy path.
  5. Completion events update buffer state so the next dispatch/combine wave can proceed without stalls.

Why this helps across NVIDIA, AMD, and cloud fabrics

Public results snapshot (from UCCL‑EP report)

| Platform | Hardware | Baseline | Reported UCCL‑EP delta |
|---|---|---|---|
| NVIDIA | H100 + InfiniBand | DeepEP (native) | Parity |
| AMD | MI300X + Broadcom | AMD Primus / Megatron-LM | +45% training throughput |
| AWS | H100 + EFA (SRD) | Existing EP on EFA | +40% SGLang throughput |

On GH200 specifically, coherent CPU↔GPU memory further reduces proxy overhead, so this split architecture can preserve flexibility without paying a large latency tax.


11) Summary: vLLM knobs and decision flow

If you keep one operational view from this post, use this:

  1. Size your effective batch so t_{\text{compute}} can cover t_{\text{comm}} (Sections 5–6).
  2. Pick LL vs HT by measured crossover, not preference (Section 7).
  3. Place EP to maximize traffic inside fast locality domains and avoid the inter-node cliff (Section 8).
  4. Validate with profiler signals, then move to load balancing only when \gamma remains high (Section 9).

| What you tune | vLLM Argument / Env Var | Moves which term | Why it matters |
|---|---|---|---|
| Expert Parallelism | --enable-expert-parallel (-ep) | P | Enables Wide-EP. Without this, MoE layers use Tensor Parallelism (TP). |
| Token budget | --max-num-batched-tokens | B | Larger B amortizes L and can make DBO effective. |
| Chunked prefill | --enable-chunked-prefill | shape of B | Controls per‑forward granularity; trades launch overhead for steadier batches. |
| Concurrency | --max-num-seqs | effective B | Higher concurrency keeps per‑GPU work above the overlap threshold. |
| Kernel mode | --all2all-backend | L vs BW_{\text{eff}} | deepep_low_latency vs deepep_high_throughput. |
| Overlap (DBO) | --enable-dbo | t_{\text{step}} | Hides comm behind compute. Tune with VLLM_DBO_COMM_SMS. |
| DeepEP Buffers | VLLM_DEEPEP_BUFFER_SIZE_MB | Memory | Adjusts reserved HBM for RDMA buffers (competes with KV cache). |
| Load balancing | --enable-eplb + --eplb-config | \gamma | Use only when routing skew is the bottleneck (Section 9). |

This is the shortest path from model terms to production knobs.


References

Impala AI

Exploring the frontiers of efficient AI infrastructure.