Top: Without overlap, each step runs Dispatch → Compute → Combine sequentially. GPUs idle during network transfers; the network idles during compute.
Bottom: With DBO, double buffering lets Step N's compute overlap with Step N+1's communication.
Steady-state step time drops from `t_comm + t_compute` to `max(t_comm, t_compute)`.
The shaded idle blocks show wasted GPU time that DBO eliminates.
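The steady-state saving can be sketched with a tiny calculation. The timings below are hypothetical placeholders, not measured values from the figure:

```python
# Hypothetical per-step timings in milliseconds (illustrative only).
t_comm = 3.0     # Dispatch + Combine network time
t_compute = 5.0  # expert Compute time

# Without overlap: each step pays for communication and compute in sequence.
sequential = t_comm + t_compute

# With DBO double buffering: step N's compute hides step N+1's communication,
# so the steady-state step time is bounded by the slower of the two phases.
overlapped = max(t_comm, t_compute)

print(f"sequential: {sequential} ms, overlapped: {overlapped} ms")
```

Note that when `t_comm > t_compute`, overlap hides the compute instead, and the step becomes communication-bound; the idle time eliminated per step is `min(t_comm, t_compute)` in either case.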