Communication Roofline

Curves: inter-node (RDMA / RoCE) and intra-node (NVLink / xGMI).

Small messages are latency-bound (the rising slope of each curve); large messages are bandwidth-bound (the flat plateau). Wide-EP (wide expert parallelism) spreads each rank's tokens across more peers, so every per-peer message shrinks. Unless batches are very large, inter-node messages fall in the latency-bound regime, where most of the link bandwidth goes unused.
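To make the shrinking-message effect concrete, here is a minimal Python sketch. It assumes uniform token routing and uses hypothetical workload numbers (128 tokens per rank, top-8 routing, hidden size 7168, 2-byte activations); none of these come from the text. The function names `per_peer_bytes`, `link_utilization`, and `sweep` are illustrative only. Utilization is computed with the latency-bandwidth model t = L + V/BW and this section's inter-node parameters (L = 20 µs, BW = 25 GB/s).

```python
# Sketch: per-peer message size versus expert-parallel (EP) degree.
# All workload numbers below are hypothetical, chosen only for illustration.

def per_peer_bytes(tokens_per_rank, top_k, hidden, bytes_per_elem, ep_degree):
    """Approximate bytes one rank sends to each peer in a dispatch all-to-all.

    Assumes uniform routing: each token's top_k copies spread evenly over peers.
    """
    total = tokens_per_rank * top_k * hidden * bytes_per_elem
    return total / ep_degree


def link_utilization(v_bytes, latency_s=20e-6, bw_bytes_per_s=25e9):
    """Fraction of peak bandwidth achieved under t = L + V/BW."""
    t = latency_s + v_bytes / bw_bytes_per_s
    return (v_bytes / t) / bw_bytes_per_s


def sweep(ep_degrees=(8, 64, 256)):
    """Return (ep, per-peer bytes, utilization) for each EP degree."""
    rows = []
    for ep in ep_degrees:
        v = per_peer_bytes(128, 8, 7168, 2, ep)  # hypothetical workload
        rows.append((ep, v, link_utilization(v)))
    return rows


if __name__ == "__main__":
    for ep, v, util in sweep():
        print(f"EP={ep:4d}  per-peer={v / 1e3:8.1f} kB  utilization={util:.0%}")
```

Under these assumed numbers, the per-peer message drops from about 1.8 MB at EP=8 to about 57 kB at EP=256, and modeled inter-node utilization falls from roughly 79% to roughly 10%. Real routing is rarely uniform, so treat this as a trend and not as a prediction.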

Model: Hockney latency-bandwidth model, t = L + V/BW and BW_eff = V/t, where V is the message size, L the per-message latency, and BW the peak link bandwidth. Inter-node: L = 20 µs, BW = 25 GB/s (a 400 Gbps IB/RoCE link peaks at 50 GB/s; ~50% utilization is typical of fragmented all-to-allv traffic). Intra-node: L = 1 µs, BW = 400 GB/s (NVLink 4th-gen / xGMI aggregate). These values are representative of 8×H100 or 8×MI300X nodes with 400G NICs; actual curves shift with your topology.
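The model can be evaluated directly. The sketch below is one possible Python rendering of t = L + V/BW and BW_eff = V/t, using the representative parameters quoted above. The helper names and the `LINKS` table are illustrative. It also computes the half-bandwidth message size V_half = L × BW, the point where BW_eff reaches half of BW. That quantity follows from the formula but is not stated in the text.

```python
# Hockney latency-bandwidth model: t = L + V / BW, BW_eff = V / t.
# Parameters are the representative values from the text, not measurements.

LINKS = {
    # name: (latency in seconds, bandwidth in bytes/second)
    "inter-node": (20e-6, 25e9),
    "intra-node": (1e-6, 400e9),
}


def transfer_time(v_bytes, latency_s, bw_bytes_per_s):
    """Modeled time to send one message of v_bytes."""
    return latency_s + v_bytes / bw_bytes_per_s


def effective_bandwidth(v_bytes, latency_s, bw_bytes_per_s):
    """Achieved bandwidth V / t for one message of v_bytes."""
    return v_bytes / transfer_time(v_bytes, latency_s, bw_bytes_per_s)


def half_bandwidth_size(latency_s, bw_bytes_per_s):
    """Message size at which BW_eff equals BW / 2 (solve V / (L + V/BW) = BW/2)."""
    return latency_s * bw_bytes_per_s


if __name__ == "__main__":
    for name, (lat, bw) in LINKS.items():
        v_half = half_bandwidth_size(lat, bw)
        print(f"{name}: half-bandwidth size = {v_half / 1e3:.0f} kB")
        for v in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
            eff = effective_bandwidth(v, lat, bw)
            print(f"  V={v:>9d} B  BW_eff={eff / 1e9:7.2f} GB/s  ({eff / bw:.0%} of peak)")
```

With these parameters the half-bandwidth size is about 500 kB inter-node and about 400 kB intra-node; messages well below that size sit on the rising, latency-bound slope.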