LL wins for small batches (lower latency L); HT wins for large batches (higher effective bandwidth BW_eff).
Throughput: $R = B / t_{\text{step}}$ (tokens/sec)
Latency: $t_{\text{step}} \approx \max(t_{\text{compute}},\ t_{\text{comm}})$ (with DBO)
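
A minimal sketch of how these relations combine. It assumes a linear communication cost, t_comm = L + B * bytes_per_token / BW_eff, which the notes above do not state, and it assumes that without overlap the step time is the sum of compute and communication. The `Mode` type, the function names, and all numbers in the usage lines are hypothetical, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """Hypothetical description of a communication mode (e.g. LL or HT)."""
    latency_s: float   # fixed per-step latency L, in seconds
    bw_eff: float      # effective bandwidth BW_eff, in bytes/sec


def comm_time(mode: Mode, batch_tokens: int, bytes_per_token: float) -> float:
    # Assumed linear model: fixed latency plus payload / effective bandwidth.
    return mode.latency_s + batch_tokens * bytes_per_token / mode.bw_eff


def step_time(t_compute: float, t_comm: float, overlap: bool = True) -> float:
    # With overlap (DBO) the slower of the two dominates; the no-overlap
    # branch (plain sum) is an assumption added for comparison.
    return max(t_compute, t_comm) if overlap else t_compute + t_comm


def throughput(batch_tokens: int, t_step: float) -> float:
    # R = B / t_step, in tokens/sec.
    return batch_tokens / t_step


def crossover_batch(ll: Mode, ht: Mode, bytes_per_token: float) -> float:
    # Batch size where both modes have equal t_comm under the linear model:
    # L_ll + B*s/BW_ll = L_ht + B*s/BW_ht. Requires ht.bw_eff > ll.bw_eff.
    inv_bw_gap = 1.0 / ll.bw_eff - 1.0 / ht.bw_eff
    return (ht.latency_s - ll.latency_s) / (bytes_per_token * inv_bw_gap)


# Illustrative values only.
LL = Mode(latency_s=10e-6, bw_eff=20e9)
HT = Mode(latency_s=100e-6, bw_eff=50e9)
BYTES_PER_TOKEN = 7168 * 2  # hypothetical hidden size at 2 bytes per element

b_star = crossover_batch(LL, HT, BYTES_PER_TOKEN)
```

Under these made-up numbers, LL has the smaller t_comm below `b_star` tokens and HT has the smaller t_comm above it. Whether that changes t_step depends on whether communication or compute is the larger term in the max.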