
Wide-EP Mixture-of-Experts (MoE) Serving: Dispatch/Combine, Dual-Batch Overlap (DBO), and Real Scaling Limits

Impala Team

A production guide to multi-node Mixture-of-Experts (MoE) inference in vLLM: how to model the communication-versus-compute tradeoff, choose between DeepEP low-latency (LL) and high-throughput (HT) modes by measurement, and tune Dual-Batch Overlap (DBO) and Expert Parallel Load Balancing (EPLB) for maximum throughput.