Part 2 of 3: convert the model into tuning decisions—Dual-Batch Overlap (DBO), DeepEP low-latency (LL) vs high-throughput (HT) crossover, and where hardware locality boundaries create throughput cliffs.
Part 2 turns the Part 1 model into a concrete performance playbook: overlap[^1], kernel mode selection[^2], and topology-aware deployment choices.
Sequential execution is wasteful. In a naive implementation, GPUs sit idle while the network moves data, and then the network sits idle while the GPUs crunch numbers.
Dual-Batch Overlap (DBO) is the standard solution. It relies on a technique called double buffering (or “ping-pong” buffering) to parallelize work.
In a standard sequential flow, you have one buffer: the network fills it, then the GPU reads it. Whichever resource is not working sits idle. With double buffering, we allocate two sets of communication buffers:

- Buffer A: the GPU computes on the micro-batch it already holds.
- Buffer B: the network fills it with the next micro-batch's tokens.

When both are finished, they swap roles. This allows us to fetch the data for the next micro-batch while computing the current one.
Without overlap, you pay the sum of the latencies:

$$t_{\text{step}} = t_{\text{compute}} + t_{\text{comm}}$$

With ideal pipelining (steady state), the step time is determined by the slowest component:

$$t_{\text{step}} \approx \max(t_{\text{compute}},\ t_{\text{comm}})$$

This implies a theoretical speedup limit of 2×, reached only when $t_{\text{compute}} = t_{\text{comm}}$. It also reveals the failure mode: once $t_{\text{comm}} > t_{\text{compute}}$, the wire sets the step time and no amount of overlap can hide it.
DBO is not magic. It relies on “compute cover”—using the time spent calculating experts to hide the time spent moving tokens. This creates two distinct operating regimes:

- Compute-rich: $t_{\text{compute}} \ge t_{\text{comm}}$. Communication is fully hidden and overlap is nearly free.
- Comm-bound: $t_{\text{comm}} > t_{\text{compute}}$. The network dominates the step, and DBO's extra buffers cost HBM without buying throughput.
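The sum-vs-max model and its two regimes can be sketched in a few lines (illustrative times in arbitrary units):

```python
def step_time(t_compute: float, t_comm: float, overlap: bool) -> float:
    """Naive flow pays the sum; ideal pipelining pays only the max."""
    return max(t_compute, t_comm) if overlap else t_compute + t_comm

# Compute-rich regime: the wire is fully hidden behind expert math.
assert step_time(10.0, 4.0, overlap=True) == 10.0
# Comm-bound regime: overlap still helps, but the network sets the floor.
assert step_time(3.0, 8.0, overlap=True) == 8.0
# The 2x speedup cap is reached exactly when the two sides match.
assert step_time(5.0, 5.0, overlap=False) / step_time(5.0, 5.0, overlap=True) == 2.0
```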
Tuning DBO: The Control Panel
- Enable it: `--enable-dbo` is essential for hiding inter-node latency, but it costs HBM[^4] (for double buffering).
- Protect small batches: Use `--dbo-prefill-token-threshold` (default 512) and `--dbo-decode-token-threshold` (default 32) to disable overlap when the batch is too small to justify the pipeline overhead. Increase these if you see regressions at low concurrency.
- Balance compute vs. comm: `VLLM_DBO_COMM_SMS` (default 20) reserves GPU SMs to drive the network.
  - Increase if: network transfers are jittery or lagging.
  - Decrease if: expert kernels become the bottleneck and end-to-end throughput drops even when overlap is active. On an H100 (132 SMs), 20 SMs is ~15% of the total; on other GPUs, adjust proportionally.
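Putting these knobs together, a launch on a jittery multi-node fabric might look like this (a sketch: the model name is a placeholder, and the threshold values simply restate the defaults above):

```shell
# Reserve a few extra SMs for comm progress on a jittery fabric (default 20).
export VLLM_DBO_COMM_SMS=24

vllm serve <your-moe-model> \
  --enable-dbo \
  --dbo-prefill-token-threshold 512 \
  --dbo-decode-token-threshold 32
```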
Modeling the SM split explicitly
Let $S$ be total SMs, $S_{\text{comm}}$ be `VLLM_DBO_COMM_SMS`, and $\rho = S_{\text{comm}}/S$. A practical first-order model is:

$$t_{\text{compute}}(B,\rho) \approx \gamma \cdot \frac{B\cdot k}{P} \cdot \frac{c_{\text{tok},0}}{1-\rho}$$

where $c_{\text{tok},0}$ is the time per token with no SM reservation. The communication side also depends on $\rho$, because extra comm SMs improve progress:

$$t_{\text{comm}}(B,\rho) \approx 2 \cdot \left(L(\rho) + \frac{V_{\text{rank}}}{BW_{\text{eff}}(\rho)}\right)$$

So `VLLM_DBO_COMM_SMS` is a true trade-off knob: it can reduce $t_{\text{comm}}$ while increasing $t_{\text{compute}}$.
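As code, the two cost models are a few lines (a sketch: $\gamma$, $c_{\text{tok},0}$, $L$, and $BW_{\text{eff}}$ must come from profiling your own hardware):

```python
def t_compute(B, k, P, gamma, c_tok0, rho):
    """Expert compute time: B*k/P token-expert pairs per rank,
    stretched by 1/(1-rho) when a fraction rho of SMs drives the network."""
    return gamma * (B * k / P) * c_tok0 / (1.0 - rho)

def t_comm(L, V_rank, bw_eff):
    """Dispatch + combine: two traversals, each latency + payload/bandwidth."""
    return 2.0 * (L + V_rank / bw_eff)

# Reserving half the SMs (rho = 0.5) doubles the compute term.
assert t_compute(16, 2, 4, gamma=1.0, c_tok0=1.0, rho=0.5) == 16.0
assert t_comm(L=0.0, V_rank=10.0, bw_eff=2.0) == 10.0
```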
We can now turn the intuition from Part 1 into a concrete inequality. Plugging the SM-aware model into the overlap condition $t_{\text{comm}}(B,\rho) \le t_{\text{compute}}(B,\rho)$, with per-rank payload $V_{\text{rank}} = B \cdot k \cdot d \cdot s / P$, and solving for $B$:
The key inequality: minimum batch for overlap
$$B \gtrsim \frac{2 \cdot (P/k) \cdot L(\rho)}{\gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} - \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}}$$

Scale $B$ with $P$, or adding GPUs makes you slower.
Feasibility guard: The denominator can be zero or negative. If $\gamma \cdot \frac{c_{\text{tok},0}}{1-\rho} \le \frac{2 \cdot d \cdot s}{BW_{\text{eff}}(\rho)}$, the per-token compute time can never cover the per-token transfer time, and no finite batch achieves full overlap.

When the denominator is positive, the numerator still scales linearly with $P$: every rank you add raises the minimum batch needed to hide the same fixed latency.
This is the second “key takeaway”: As you add GPUs to your cluster, you must increase your batch size just to maintain the same level of overlap efficiency. If you scale $P$ while holding $B$ fixed, overlap degrades and eventually fails entirely.
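The inequality can be checked numerically. A sketch (the numeric values mirror the 2-node H100 worked example later in this part; $d = 7168$ and FP8 dispatch, $s = 1$ byte, are our assumptions):

```python
def min_batch_for_overlap(P, k, L, gamma, c_tok0, rho, d, s, bw_eff):
    """Smallest B satisfying t_comm <= t_compute; None if infeasible."""
    per_token_compute = gamma * c_tok0 / (1.0 - rho)  # compute cover per token
    per_token_wire = 2.0 * d * s / bw_eff             # dispatch+combine per token
    denom = per_token_compute - per_token_wire
    if denom <= 0:
        return None  # feasibility guard: no batch size achieves overlap
    return 2.0 * (P / k) * L / denom

# Doubling the rank count doubles the minimum batch, all else equal.
b16 = min_batch_for_overlap(P=16, k=8, L=120e-6, gamma=1.0, c_tok0=1.98e-6,
                            rho=0.0, d=7168, s=1, bw_eff=44e9)
b32 = min_batch_for_overlap(P=32, k=8, L=120e-6, gamma=1.0, c_tok0=1.98e-6,
                            rho=0.0, d=7168, s=1, bw_eff=44e9)
assert b32 == 2 * b16  # numerator scales linearly with P
```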
Single-node setups (NVLink/xGMI only) have much better $BW_{\text{eff}}$ and far lower $L$, so their minimum overlap batch is small enough that DBO is rarely the binding constraint.
A practical keep/remove rule for single-node: profile one step with DBO on and off; keep it only if communication is a measurable slice of step time, otherwise drop it and reclaim the duplicated buffers' HBM.
Let’s put concrete numbers on this for a standard 2-node cluster (16 GPUs total), serving DeepSeek-V3 (hidden dim $d = 7168$, top-$k = 8$).

When you span 2 nodes, your effective bandwidth ($BW_{\text{eff}}$) is set by the NIC rather than the intra-node fabric, and the minimum overlap batch jumps accordingly:
| Hardware Stack | Interconnect | Per-GPU Link Speed | Peak BW (per GPU) | Realized $BW_{\text{eff}}$ | Min Batch $B$ (tokens, est.) |
|---|---|---|---|---|---|
| NVIDIA H100/H200 | InfiniBand NDR / CX7 | 400 Gbps (per‑port, raw) | 50 GB/s | ~36‑41 GB/s | ~72k‑96k |
| AMD MI300X | Broadcom Thor2 / RoCE | 400 Gbps | 50 GB/s | ~32‑38 GB/s | ~80k‑112k |
| AWS P5 (H100) | EFA v2 | 400 Gbps (per‑port, raw) | 50 GB/s | ~26‑34 GB/s | ~96k‑160k |
| NVIDIA B200 | ConnectX‑8 | 800 Gbps (per‑port, raw) | 100 GB/s | ~70‑82 GB/s | ~24k‑40k |
| GB200 NVL72 | NVLink Switch | N/A (In-Rack) | 900 GB/s | ~650‑800 GB/s | ~2k‑8k |
All link speeds are raw signaling rates; effective throughput after encoding/protocol overhead is lower. Realized $BW_{\text{eff}}$ and minimum-batch figures are estimates for all-to-all traffic patterns; measure on your own fabric before committing to a deployment shape.
The Cliff: Inside a single node, the fast fabric (NVLink at ~450 GB/s/GPU, or xGMI at ~300+ GB/s) gives you abundant bandwidth. The moment you add a second node, your effective bandwidth drops to ~40 GB/s (400G IB/RoCE).
This is a 7–10× drop in bandwidth. To maintain the inequality $t_{\text{comm}} \le t_{\text{compute}}$, the minimum batch must grow by roughly the same factor.
The widget below visualizes exactly this dynamic.
For readability, the widget uses an equivalent effective-parameter form of the same inequality.
Pro tip: on the “Step Time vs Batch Size” chart, focus on where the compute curve sits relative to comm. If compute is below comm, overlap will not save you.
DBO tells you when overlap helps. The next question is which kernel to use for the communication itself.
In Part 1, we modeled communication time as $t_{\text{comm}} = 2 \cdot (L + V_{\text{rank}}/BW_{\text{eff}})$: a fixed latency term plus a payload term. The two DeepEP kernel families attack different terms:

- Low-latency kernel (LL): Attacks $L$. It issues direct RDMA writes with minimal setup, shrinking the fixed cost per transfer at the price of fragmented traffic and lower realized bandwidth.
- High-throughput kernel (HT): Attacks $BW_{\text{eff}}$. It coalesces tokens into a few large, topology-aware transfers, paying more setup latency to move bulk payloads near line rate.
Why do we model two different bandwidths? Because the all-to-all variable-size collective (all-to-allv[^7]) realizes very different fractions of line rate depending on how the payload is packed:
| Mode | Realized BW (est. on 400G) | Why? |
|---|---|---|
| LL (Direct RDMA) | ~20–30 GB/s (Fragmented) | Many small, non-contiguous messages. Congestion from random access. |
| HT (Hierarchical) | ~42–48 GB/s (Peak-ish) | Few large, coalesced messages. Topology-aware routing minimizes hops. |
Estimates based on typical 400Gbps RDMA efficiency (40–60% for fragmented traffic vs 85%+ for large contiguous bursts). See DeepEP benchmarks for detailed performance characterization of LL vs HT kernels.
This is not a philosophical choice; it is an arithmetic one. You should switch from LL to HT exactly when the bandwidth gains outweigh the setup costs.
vLLM Configuration: Select your kernel backend using `--all2all-backend`.

- Use `deepep_low_latency` for the LL kernel (latency-optimized).
- Use `deepep_high_throughput` for the HT kernel (bandwidth-optimized).

Advanced Tuning Guide:
`VLLM_DEEPEP_BUFFER_SIZE_MB` (default 1024): Controls RDMA buffer size.
- Increase when: You see buffer overflow errors or run extremely large batches/hidden dims.
- Decrease when: You hit OOM and need to reclaim HBM for KV cache (especially at small batches).
`VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE`:
- Set to 1 for large NVLink domains (e.g., NVL72): Treats the entire fabric as one node to use NVLink instead of RDMA.
- Keep at 0 for multi‑node clusters: Essential for standard H100/MI300X clusters where nodes are connected by IB/RoCE.
`VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL`:
- Set to 1 for multi‑node NVLink domains (LL mode): Allows the Low-Latency kernel to write directly over cross-node NVLink.
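Putting the DeepEP knobs together for an NVL72-class deployment might look like this (illustrative sketch: the model name is a placeholder, and on a standard multi-node H100/MI300X cluster you would leave the intra-node flag at its default of 0):

```shell
# Large NVLink domain (e.g., GB200 NVL72): keep HT traffic on NVLink.
export VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE=1
# Grow the RDMA buffer if large batches overflow the 1024 MB default.
export VLLM_DEEPEP_BUFFER_SIZE_MB=2048

vllm serve <your-moe-model> --all2all-backend deepep_high_throughput
```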
If we model each kernel as $t = L + V/BW$, with kernel-specific constants $(L_{\text{LL}}, BW_{\text{LL}})$ and $(L_{\text{HT}}, BW_{\text{HT}})$, the two cost lines cross exactly once.

Solving $L_{\text{LL}} + V/BW_{\text{LL}} = L_{\text{HT}} + V/BW_{\text{HT}}$ yields the crossover payload:

$$V^{*} = \frac{L_{\text{HT}} - L_{\text{LL}}}{\frac{1}{BW_{\text{LL}}} - \frac{1}{BW_{\text{HT}}}}$$

Note: because LL has lower latency but lower bandwidth than HT, we have $L_{\text{LL}} < L_{\text{HT}}$ and $BW_{\text{LL}} < BW_{\text{HT}}$, so both numerator and denominator are positive and $V^{*}$ is well-defined: below it LL wins, above it HT wins.
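The crossover is a one-liner worth keeping in a notebook (the example numbers anticipate the 2-node H100 measurements in the next section):

```python
def crossover_payload(L_ll, bw_ll, L_ht, bw_ht):
    """Payload V* (bytes) at which LL and HT cost the same:
    L_ll + V/bw_ll == L_ht + V/bw_ht. Below V* use LL; above it use HT."""
    return (L_ht - L_ll) / (1.0 / bw_ll - 1.0 / bw_ht)

# LL: 35 us at 22 GB/s. HT: 120 us at 44 GB/s.
v_star = crossover_payload(35e-6, 22e9, 120e-6, 44e9)
assert 3.7e6 < v_star < 3.8e6  # crossover near 3.74 MB per rank
```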
In Impala’s offline huge‑batch setting, we usually operate above this crossover—so HT is often the right starting point. The value of the crossover framing is that you can measure it and justify that choice on your own hardware/topology.
Suppose you measured the following on a 2‑node H100 cluster (16 GPUs, $P = 16$):
| Parameter | LL kernel | HT kernel |
|---|---|---|
| Latency $L$ | 35 µs | 120 µs |
| Realized $BW_{\text{eff}}$ | 22 GB/s | 44 GB/s |
And from compute profiling: an effective compute cost of $\gamma \cdot c_{\text{tok},0}/(1-\rho) \approx 1.98$ µs per token-expert, with FP8 dispatch ($s = 1$ byte) and $d = 7168$.
Step A — Kernel crossover ($V^{*}$): $V^{*} = (120 - 35)\,\mu s \,/\, (1/22 - 1/44)\,(\text{GB/s})^{-1} \approx 3.74$ MB.

Per‑rank payload at $B = 8192$: $V_{\text{rank}} = B \cdot k \cdot d \cdot s / P = 8192 \cdot 8 \cdot 7168 / 16 \approx 29.4$ MB, roughly 8× past the crossover, so HT wins comfortably.

Step B — DBO feasibility check: the per-token wire cost is $2 \cdot d \cdot s / BW_{\text{eff}} = 2 \cdot 7168 / 44\,\text{GB/s} \approx 0.33$ µs.

Denominator is positive (1.98 > 0.33), so DBO can work. Minimum batch: $B \gtrsim 2 \cdot (16/8) \cdot 120\,\mu s / 1.65\,\mu s \approx 290$ tokens.

Our batch of 8192 is well above the threshold. Enable DBO with HT kernels.
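The whole worked example fits in a dozen lines (latencies and bandwidths are the measured values above; $d = 7168$ and FP8 dispatch, $s = 1$ byte, are our assumptions):

```python
# Measured on the 2-node H100 cluster:
L_ll, bw_ll = 35e-6, 22e9      # LL kernel
L_ht, bw_ht = 120e-6, 44e9     # HT kernel

P, k, B = 16, 8, 8192
d, s = 7168, 1                 # hidden dim; FP8 dispatch => 1 byte (assumed)

# Step A: per-rank payload vs. LL/HT crossover.
V_rank = B * k * d * s / P                          # ~29.4 MB
V_star = (L_ht - L_ll) / (1 / bw_ll - 1 / bw_ht)    # ~3.74 MB
assert V_rank > V_star                              # well past crossover: pick HT

# Step B: DBO feasibility on the HT kernel.
per_tok_compute = 1.98e-6                           # from compute profiling
per_tok_wire = 2 * d * s / bw_ht                    # ~0.33 us
B_min = 2 * (P / k) * L_ht / (per_tok_compute - per_tok_wire)
assert per_tok_compute > per_tok_wire and B > B_min  # ~290 tokens << 8192
```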
The theory is clear, but the actual crossover point depends on your specific hardware and workload. Use the interactive tool below to explore the trade-off space.
What to look for:
- Batch tokens slider: drag it to see exactly where the Low-Latency (LL) and High-Throughput (HT) curves intersect. This is your decision boundary.
- Animate: watch how LL minimizes the fixed handshake overhead, while HT compresses the bulk transfer time via bandwidth efficiency.

Kernel choice is only half the story: the same model can land in very different comm/compute regimes depending on hardware locality. Before tuning knobs, anchor on where your interconnect cliff appears.
There are two deployment questions hiding inside “which GPU is faster?”: how large is the locality domain where dispatch/combine traffic stays cheap, and how steep is the penalty when it spills onto the scale‑out network?
Key idea: Wide‑EP economics are set by how much dispatch/combine stays inside the “cheap” locality domain versus how much spills into the “expensive” scale‑out network.
| Stack | Locality domain (cliff boundary) | Scale‑up anchor | Scale‑out anchor | Sources |
|---|---|---|---|---|
| NVIDIA H100/H200 | 8 GPUs / HGX node | NVLink: 900 GB/s bidirectional per GPU | ConnectX‑7: up to 400 Gb/s | H100/H200, Hopper, CX7 |
| NVIDIA GB200 NVL72 | 72 GPUs / rack domain | NVLink Switch: 130 TB/s rack, 1.8 TB/s bidirectional GPU↔GPU (HGX B200) | ConnectX‑8: up to 800 Gb/s | GB200 NVL72, HGX, CX8 |
| AMD MI300X | 8 GPUs / platform | 896 GB/s aggregate bidirectional P2P (spec); ~315–336 GB/s aggregated unidirectional measured xGMI | 400 Gb/s class NIC deployments are common | MI300 platform, ROCm xGMI |
| Public MoE internode reference | Multi‑node DeepEP benchmark | BW_eff (normal kernels): ~43–58 GB/s | BW_eff (low-latency kernels): ~39–46 GB/s | DeepEP README |
For offline serving, cost scales with delivered tokens, not raw FLOPS.
In Part 3, we focus on operational stability: failure modes, EPLB/LPLB load balancing, portability across stacks, and the final decision flow for production runbooks.
[^1]: Dual-Batch Overlap (DBO) overlaps communication with compute using double buffering. See Dual-Batch Overlap (DBO): The art of hiding the wire.
[^2]: Low-latency (LL) kernels reduce fixed overhead; high-throughput (HT) kernels increase realized bandwidth. See DeepEP low-latency (LL) vs high-throughput (HT).
[^3]: Network Interface Controller (NIC) is the hardware endpoint for network I/O. Overview: NIC.
[^4]: High-Bandwidth Memory (HBM) is on-package GPU memory used by model weights, caches, and communication buffers. Overview: HBM.
[^5]: Remote Direct Memory Access (RDMA) enables direct memory transfer across machines with low CPU overhead. Overview: RDMA.
[^6]: Key-value (KV) cache stores attention keys/values reused across decoding steps. Overview: Transformers cache explanation.
[^7]: All-to-all variable-size collective (all-to-allv) lets each rank send different payload sizes to peers. References: MPI Alltoallv, Collective operation.