AI Data Center Networking: How GPU Clusters Are Changing Network Design
A complete technical guide to GPU cluster topology, RoCE vs InfiniBand, Rail-Optimised fabrics, 400G/800G switching, lossless transport, and vendor landscape for AI infrastructure in 2025
By Route XP | Published: March 2026 | Updated: March 2026 | Data Center, Arista, Cisco
- Why AI Is Rewriting the Rules of Data Center Networking
- How GPUs Communicate: The Networking Demands of Distributed Training
- NVLink and NVSwitch: Inside the GPU-to-GPU Interconnect
- InfiniBand: The Legacy AI Network Standard
- RoCEv2: Ethernet Comes to AI Networking
- RoCEv2 vs InfiniBand: Full Technical Comparison
- AI Fabric Topology: Rail-Optimised Leaf-Spine Design
- Lossless Transport: PFC, ECN, and DCQCN Explained
- ECMP vs Adaptive Routing vs Spray: Solving the Elephant Flow Problem
- 400G and 800G Switching: The Hardware Behind AI Fabrics
- Optics and Cabling for AI Data Centers
- Vendor Landscape: Cisco, Arista, Nvidia, Broadcom, Juniper
- Storage Networking in AI: GPUDirect Storage and NFS over RDMA
- Design Best Practices and Common Pitfalls
- Summary and Architecture Selection Guide
- Frequently Asked Questions
1. Why AI Is Rewriting the Rules of Data Center Networking
For the first three decades of enterprise data center design, the network was an afterthought — a plumbing layer that moved packets between servers, storage, and the internet. Traffic patterns were largely north-south (client to server), bandwidth requirements were measured in gigabits per rack, and a well-tuned three-tier or leaf-spine fabric was more than sufficient.
The rise of large language models (LLMs), generative AI, and large-scale deep learning training has fundamentally invalidated every one of those assumptions. Training a frontier AI model like GPT-4, Llama 3, or Gemini Ultra requires tens of thousands of GPUs operating in tight synchrony, exchanging gradient tensors hundreds of times per second. The resulting traffic is overwhelmingly east-west, bandwidth-intensive, latency-sensitive, and exhibits traffic patterns unlike anything traditional data center networks were designed to handle.
The numbers are staggering. A single Nvidia DGX H100 server contains eight H100 GPUs, each connected to a 400G InfiniBand or Ethernet NIC — generating up to 3.2 Tbps of aggregate network bandwidth per server. A 1,000-GPU training cluster built from 125 such servers requires a non-blocking fabric capable of sustaining over 400 Tbps of all-to-all traffic with near-zero packet loss and microsecond latency variance.
This guide provides a rigorous technical deep-dive into how AI workloads are reshaping data center network design: from GPU communication primitives and RDMA transport protocols to rail-optimised topology, lossless Ethernet configuration, adaptive routing, and the 400G/800G switch silicon powering the world's largest AI fabrics.
2. How GPUs Communicate: The Networking Demands of Distributed Training
To design an AI fabric correctly, you must first understand why GPUs need to communicate at all, and what the communication patterns look like at the protocol level.
Distributed Training: Model Parallelism and Data Parallelism
Modern large AI model training uses three fundamental parallelism strategies, each generating distinct network traffic patterns:
- Data Parallelism (DP): The training dataset is split across GPU workers. Each GPU trains on a different data shard using an identical copy of the model. After each forward/backward pass, GPUs exchange gradient updates via AllReduce collective operations — a bandwidth-intensive all-to-all communication pattern. This is the most common form of parallelism and the dominant driver of network traffic in most AI clusters
- Tensor/Model Parallelism (TP/MP): Individual layers or tensors of the model are split across multiple GPUs. GPUs must exchange intermediate activation tensors during both forward and backward passes via AllGather and ReduceScatter collectives. Generates high-bandwidth, low-latency traffic between a small group of tightly coupled GPUs — often within the same server via NVLink
- Pipeline Parallelism (PP): Different layers of the model are assigned to different GPU stages in a pipeline. Each stage passes activations to the next stage via point-to-point Send/Recv operations. Generates more predictable, structured traffic flows between consecutive pipeline stages — often between servers in the same rack or adjacent racks
Collective Communication Operations
The NCCL (Nvidia Collective Communications Library) and its open-source equivalent RCCL (AMD) implement the following collective operations that drive network traffic in AI clusters:
| Operation | Description | Traffic Pattern | Bandwidth per Node |
|---|---|---|---|
| AllReduce | Each GPU contributes a tensor; result (sum/mean) returned to all GPUs | All-to-all (ReduceScatter + AllGather phases) | 2 × (N-1)/N × BW |
| AllGather | Each GPU contributes a shard; every GPU receives the full concatenated tensor | All-to-all gather (ring or tree) | (N-1)/N × BW |
| ReduceScatter | Reduces tensors from all GPUs; each GPU receives a unique shard of the result | All-to-all with aggregation | (N-1)/N × BW |
| AllToAll | Each GPU sends a unique shard to every other GPU (used in Mixture of Experts) | Full mesh point-to-point | (N-1)/N × BW |
| Broadcast | One GPU sends the same tensor to all other GPUs | One-to-many | BW (sender) |
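The bandwidth factors in the table above can be made concrete with a short sketch. The functions below compute per-GPU bytes on the wire for ring-based AllReduce and AllGather; the function names and the 1 GB example tensor are illustrative, not from any library:

```python
# Sketch: per-GPU wire traffic for ring-based collectives, matching the
# bandwidth factors in the table above. Names are illustrative.

def ring_allreduce_bytes(tensor_bytes: int, n_gpus: int) -> float:
    """ReduceScatter + AllGather phases: each GPU sends 2*(N-1)/N of the tensor."""
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes

def ring_allgather_bytes(tensor_bytes: int, n_gpus: int) -> float:
    """Each GPU sends (N-1)/N of the full concatenated tensor."""
    return (n_gpus - 1) / n_gpus * tensor_bytes

# Example: AllReduce of a 1 GB gradient tensor across 8 GPUs
gb = 1024**3
sent = ring_allreduce_bytes(gb, 8)
print(f"bytes sent per GPU: {sent / gb:.3f} GB")  # 2 * (7/8) = 1.750 GB
```

Note that the per-GPU traffic approaches 2× the tensor size as N grows — which is why AllReduce dominates fabric bandwidth planning in data-parallel training.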
3. NVLink and NVSwitch: Inside the GPU-to-GPU Interconnect
Before examining the external network fabric, it is essential to understand the intra-server interconnect layer — NVLink and NVSwitch — because this determines the boundary between what happens inside a server and what must traverse the network fabric.
NVLink: Direct GPU-to-GPU Bandwidth
NVLink is Nvidia's proprietary high-speed interconnect that allows GPUs within a server (or across a small group of servers via NVLink Switch Systems) to communicate at far higher bandwidth than PCIe. Each generation of NVLink roughly doubles the bandwidth of the previous:
| Generation | GPU | Per-Link BW (bidirectional) | Total NVLink BW (per GPU) | NVLink Count |
|---|---|---|---|---|
| NVLink 3.0 | A100 | 50 GB/s | 600 GB/s | 12 |
| NVLink 4.0 | H100 | 50 GB/s | 900 GB/s | 18 |
| NVLink 5.0 | B100 / B200 | 100 GB/s | 1,800 GB/s | 18 |
NVSwitch: Full Any-to-Any GPU Connectivity
NVSwitch is a dedicated high-bandwidth switching ASIC that enables full any-to-any NVLink connectivity among all GPUs within a server. In the DGX H100, four NVSwitch chips form a non-blocking switch fabric, giving every GPU 900 GB/s of NVLink bandwidth to every other GPU in the server simultaneously — a total of 7.2 TB/s of all-to-all bandwidth within a single 8-GPU server.
The NVLink Switch System (as deployed in rack-scale systems such as the GB200 NVL72) extends this intra-server fabric to a 72-GPU domain across multiple compute trays, creating what Nvidia calls an NVLink domain — a group of GPUs with full NVLink bandwidth to each other, even across physical trays. This dramatically changes the external network topology requirements: tensor parallelism can run entirely within the NVLink domain (no external network traffic), while data parallelism and pipeline parallelism still require the external Ethernet or InfiniBand fabric.
4. InfiniBand: The Legacy AI Network Standard
InfiniBand (IB) is a high-performance interconnect technology originally developed in the late 1990s by a consortium including Compaq, Dell, HP, IBM, Intel, and Sun. It became the dominant fabric for HPC (High Performance Computing) clusters and, by extension, early large-scale AI training infrastructure — most notably in Nvidia's DGX SuperPOD reference architectures.
InfiniBand Key Technical Characteristics
- Native RDMA: InfiniBand was designed from the ground up for Remote Direct Memory Access — the CPU is bypassed entirely for data transfers. Memory on one server is read/written directly by a remote GPU without involving the operating system on either end, eliminating CPU jitter and achieving sub-2 microsecond MPI latency
- Lossless by design: InfiniBand uses credit-based flow control at the link layer — a sender cannot transmit a packet unless the receiver has advertised sufficient buffer credits. This makes the fabric intrinsically lossless without requiring complex Ethernet mechanisms like PFC and ECN
- Dedicated subnet manager: InfiniBand requires a Subnet Manager (SM) to compute routing tables and manage the fabric. The SM is a centralised control plane — OpenSM (open-source) or vendor implementations. This adds operational complexity compared to standard IP routing
- Proprietary ecosystem: While InfiniBand adapters (HCAs) and switches are available from Nvidia (formerly Mellanox), the protocol is fundamentally proprietary — not interoperable with standard Ethernet management tooling, monitoring, or automation frameworks
| Generation | Abbreviation | Per-Port Speed (4x) | Latency (MPI) | Typical Use |
|---|---|---|---|---|
| Enhanced Data Rate | EDR | 100 Gb/s | ~0.5 µs | V100-era clusters |
| High Data Rate | HDR | 200 Gb/s | ~0.5 µs | A100 DGX SuperPOD |
| Next Data Rate | NDR | 400 Gb/s | ~0.5 µs | H100 DGX SuperPOD |
| Extended Data Rate | XDR | 800 Gb/s | <0.5 µs | B100/B200 next-gen clusters |
5. RoCEv2: Ethernet Comes to AI Networking
RDMA over Converged Ethernet version 2 (RoCEv2) is the technology that enables AI clusters to be built on standard Ethernet infrastructure while preserving the low-latency, CPU-bypass benefits of RDMA. RoCEv2 encapsulates the InfiniBand transport layer (IB BTH header) inside a standard UDP/IP packet, enabling it to be routed across standard IP networks — unlike RoCEv1, which was Layer 2 only.
How RoCEv2 Works
- Protocol stack: RoCEv2 frame = Ethernet header + IP header + UDP header (destination port 4791) + IB BTH (Base Transport Header) + Payload. The BTH carries the RDMA opcode (RDMA Write, RDMA Read, Send) and Packet Sequence Number (PSN) for ordering and loss detection
- Queue Pairs (QP): Communication in RDMA happens through Queue Pairs — a send queue and a receive queue. Applications post Work Requests (WRs) to the QP; the NIC (RNIC) processes them directly without OS involvement. This is the mechanism that achieves CPU bypass
- RDMA Verbs: The application interface to RDMA hardware. Key operations: RDMA Write (write to remote memory without receiver CPU involvement), RDMA Read (read from remote memory), and Send/Recv (two-sided — both sender and receiver CPUs involved)
- Go-Back-N (GBN) retransmission: RoCEv2 uses a Go-Back-N protocol for loss recovery. When a packet is lost, the sender retransmits from that sequence number onwards. Since GBN retransmits all subsequent packets on a loss event, even a single dropped packet causes significant throughput degradation — this is why lossless transport is mandatory
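The throughput cost of Go-Back-N can be illustrated with a back-of-envelope model (an approximation, not a protocol implementation; the window size is an assumed value):

```python
# Illustrative model: with Go-Back-N, each lost packet forces
# retransmission of roughly a full window of already-sent packets,
# so goodput collapses quickly as the loss rate rises.

def gbn_goodput(loss_rate: float, window_pkts: int) -> float:
    """Approximate fraction of transmitted packets that are useful.
    Each loss event wastes ~window_pkts retransmitted packets."""
    wasted_per_useful_pkt = loss_rate * window_pkts
    return 1.0 / (1.0 + wasted_per_useful_pkt)

# Even a 0.1% loss rate halves goodput with a 1,000-packet window
for loss in (0.0, 1e-4, 1e-3, 1e-2):
    print(f"loss={loss:g}  goodput ~ {gbn_goodput(loss, 1000):.1%}")
```

Under this model a 10⁻³ loss rate already cuts goodput to roughly 50% — the quantitative reason the article's later sections insist on PFC/ECN lossless transport rather than tolerating occasional drops.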
Why RoCEv2 Has Become the Dominant AI Fabric Protocol
The shift from InfiniBand to RoCEv2 Ethernet as the dominant AI fabric protocol for hyperscalers (Meta, Google, Microsoft Azure, Amazon) is driven by several powerful advantages:
- Ethernet ecosystem leverage: Standard Ethernet switch silicon (Broadcom Tomahawk 5, Cisco Silicon One, Nvidia Spectrum-4) can be used — enabling competitive multi-vendor procurement and commodity pricing
- IP routing: RoCEv2 routes over standard IP, enabling multi-tier Clos/spine-leaf topologies, ECMP, and BGP — none of which are natively available in InfiniBand fabrics
- Unified fabric: A single Ethernet fabric can carry both RDMA (RoCEv2) for GPU communication and standard TCP/IP for management, storage, and external traffic — eliminating the dual-fabric (IB + Ethernet) architecture
- Operational familiarity: Network engineers already know Ethernet. Operational tooling (Prometheus/Grafana monitoring, Ansible automation, standard sFlow telemetry) all work natively with Ethernet switches
6. RoCEv2 vs InfiniBand: Full Technical Comparison
| Attribute | InfiniBand (NDR/XDR) | RoCEv2 over Ethernet |
|---|---|---|
| MPI / NCCL latency | ~0.5 µs (industry-leading) | 1–3 µs (well-tuned fabric) |
| Lossless transport | Native (credit-based flow control) | Requires PFC + ECN/DCQCN configuration |
| Current max port speed | 800G (XDR) | 800G (IEEE 802.3df) |
| Routing | Proprietary (Subnet Manager, LID-based) | Standard IP/BGP/ECMP |
| Multi-vendor switches | Limited (primarily Nvidia Quantum) | Full ecosystem (Cisco, Arista, Juniper, Broadcom-based) |
| Switch cost | High (proprietary, limited competition) | Lower (commodity silicon, competitive market) |
| Adaptive routing | Native (SHIELD adaptive routing in Quantum-2) | Requires vendor-specific implementation (Arista ECMP+, Cisco DLBR) |
| Fabric scale | Up to ~10,000–50,000 ports per fabric | Effectively unlimited (routed IP) |
| Operational tooling | Proprietary (SM, ibdiagnet, Nvidia UFM) | Standard Ethernet tooling (Ansible, Prometheus, SNMP, gNMI) |
| Unified fabric (RDMA + TCP) | Separate IB + Ethernet fabrics typically required | Single fabric for RDMA and TCP/IP |
| Best for | HPC, tightly coupled MPI workloads, ultra-low latency requirements, Nvidia-only deployments | Hyperscale AI, multi-vendor fabrics, cost-sensitive at scale, unified infrastructure |
7. AI Fabric Topology: Rail-Optimised Leaf-Spine Design
The standard enterprise leaf-spine (Clos) topology is not optimal for AI workloads without modification. The Rail-Optimised (also called GPU-Rail) architecture has emerged as the reference design for AI clusters at hyperscale and is documented in Nvidia's DGX SuperPOD reference architecture as well as Meta's AI research cluster design papers.
Standard Leaf-Spine vs Rail-Optimised: The Core Difference
In a standard leaf-spine fabric, all hosts in a rack connect to a single ToR (Top of Rack) leaf switch, which connects to multiple spine switches. All inter-rack traffic is load-balanced across all spine paths. This works well for general-purpose east-west traffic where flows between any pair of servers are equally likely.
AI training traffic is not uniformly distributed. AllReduce operations in data-parallel training cause predictable, structured communication patterns — each GPU communicates intensely with specific peer GPUs (its "gradient reduction partners"). The Rail-Optimised fabric exploits this structure to maximise performance and minimise congestion.
Rail-Optimised Architecture: How It Works
In a Rail-Optimised fabric, each server has multiple NICs (e.g., 8 × 400G NICs in a DGX H100), and each NIC connects to a different leaf switch — not all to the same ToR switch. These per-NIC leaf switches are called Rails. All Rail-0 NICs across all servers connect to Rail-0 leaf switches; all Rail-1 NICs connect to Rail-1 leaf switches; and so on.
- Within-rail communication: GPU-0 on all servers communicates primarily through Rail-0 switches. This concentrates related AllReduce traffic within a single rail, preventing cross-rail congestion contamination
- Cross-rail communication: When AllToAll (e.g., Mixture of Experts routing) requires communication across rails, spine switches carry this traffic
- Congestion isolation: A congestion event on Rail-2 (e.g., a hot AllReduce spanning many servers) does not impact Rail-0, Rail-1, etc. — congestion is contained within rails
- Failure isolation: A failed Rail-2 leaf switch degrades only GPU-2 bandwidth; the other 7 GPUs in each server continue at full bandwidth
| Parameter | Value | Notes |
|---|---|---|
| Servers per Scalable Unit (SU) | 32 DGX H100 | 256 H100 GPUs per SU |
| NICs per server | 8 × ConnectX-7 400G | One NIC per GPU — each on a separate rail |
| Rails (leaf switches) | 8 leaf switches per SU | Each leaf handles Rail-N NICs from all 32 servers |
| Ports per leaf switch used for servers | 32 downlink × 400G | One port per server |
| Uplink ports per leaf switch | 32 uplink × 400G to spine | 1:1 oversubscription ratio (non-blocking within SU) |
| Aggregate fabric bandwidth (1 SU) | 8 × 32 × 400G = 102.4 Tbps | Bidirectional, fully non-blocking |
| Maximum SUs per multi-SU pod | Up to 8 SUs (2,048 GPUs) | Connected via dedicated spine tier |
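The SU-level numbers in the table above can be derived from first principles. The sketch below reproduces them from the server count, NIC count, and port speed (default values are the DGX H100 figures from the table; the function name is illustrative):

```python
# Sketch: derive the Scalable Unit (SU) fabric numbers in the table
# above from server count, NICs per server, and NIC speed.

def su_fabric(servers: int = 32, nics_per_server: int = 8,
              nic_gbps: int = 400) -> dict:
    leaves = nics_per_server                # one rail (leaf switch) per NIC index
    downlinks_per_leaf = servers            # one port per server on each rail
    uplinks_per_leaf = downlinks_per_leaf   # 1:1 — non-blocking within the SU
    agg_tbps = leaves * downlinks_per_leaf * nic_gbps / 1000
    return {"leaf_switches": leaves,
            "downlinks_per_leaf": downlinks_per_leaf,
            "uplinks_per_leaf": uplinks_per_leaf,
            "aggregate_tbps": agg_tbps}

su = su_fabric()
print(su["leaf_switches"], "rails,", su["aggregate_tbps"], "Tbps")  # 8 rails, 102.4 Tbps
```

The same function also sizes non-DGX rail fabrics — for example, `su_fabric(servers=16, nics_per_server=4, nic_gbps=200)` for a smaller cluster.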
Three-Tier Architecture for Very Large Clusters
For clusters beyond 2,048 GPUs, a three-tier architecture is used: GPU Rail Leaf → Aggregation / Pod Spine → Cluster Core Spine. Each tier adds a level of oversubscription (typically 2:1 at aggregation, 4:1 at core for inter-pod traffic), reflecting the observation that AllReduce communication within a training job is typically confined to a single pod, and inter-pod traffic is comparatively light (primarily pipeline parallelism and checkpoint I/O).
8. Lossless Transport: PFC, ECN, and DCQCN Explained
Making Ethernet lossless for RoCEv2 requires a carefully co-designed set of mechanisms operating at different layers of the network stack. The three key components are PFC (Priority Flow Control), ECN (Explicit Congestion Notification), and the DCQCN (Data Center Quantized Congestion Notification) algorithm.
PFC — Priority Flow Control (IEEE 802.1Qbb)
PFC is a link-layer pause mechanism that allows a switch to signal to its upstream neighbour to temporarily stop sending traffic on a specific 802.1p priority class — without pausing other priority classes. In AI fabrics, RoCEv2 traffic is assigned a specific DSCP/PCP value and mapped to a dedicated PFC-enabled priority class (typically Priority 3 or 4), while other traffic (management, storage) uses different priorities that are not pause-enabled.
- How it works: When a switch ingress queue for the RoCEv2 priority fills beyond a configured headroom threshold, it sends a PAUSE frame to the upstream device. The upstream device stops transmitting on that priority class until a RESUME frame is received
- Headroom calculation: Switch buffer headroom must account for the pipe delay (switch-to-switch RTT) — all packets in flight when PAUSE is sent must be absorbed. Headroom = wire delay × link speed. For a 400G link with 100ns pipe delay: 400 × 10⁹ × 100 × 10⁻⁹ = 40,000 bits = 5 KB per priority class per port
- PFC deadlock risk: In a poorly designed fabric, circular PFC dependencies can cause a deadlock — Node A pauses B, B pauses C, C pauses A — all three freeze. This is mitigated by using a single lossless priority class and ensuring no circular dependencies exist in the switch buffer allocation
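The headroom arithmetic above reduces to a one-line unit cancellation, sketched below. Note this reproduces only the simplified wire-delay term from the text; production headroom formulas typically also add terms for in-flight MTU-sized packets and PAUSE-frame transmission time:

```python
# Sketch of the PFC headroom calculation from the text:
# headroom (bits) = link speed (Gb/s) x round-trip delay (ns),
# since the 10^9 and 10^-9 factors cancel.

def pfc_headroom_bytes(link_gbps: float, rtt_ns: float) -> float:
    bits_in_flight = link_gbps * rtt_ns   # (Gb/s) x (ns) -> bits
    return bits_in_flight / 8

# 400G link, 100 ns switch-to-switch round trip
print(pfc_headroom_bytes(400, 100))  # 5000.0 bytes = 5 KB per priority per port
```

Scaling this to a 64-port switch with one lossless priority gives ~320 KB of dedicated headroom — which is why per-port headroom budgeting matters on shallow-buffer silicon.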
ECN — Explicit Congestion Notification (RFC 3168)
ECN is an IP-layer congestion signalling mechanism. When a switch queue depth exceeds a configured threshold (KMIN to KMAX), the switch marks packets with the CE (Congestion Experienced) bit in the IP header instead of dropping them. The receiver echoes this marking back to the sender, which then reduces its sending rate.
ECN is the preferred congestion signal in RoCEv2 fabrics because it acts earlier than PFC — reducing sending rates before queues fill to the point of triggering PAUSE frames. The design goal is: ECN manages congestion proactively; PFC is the last-resort backstop preventing packet loss.
DCQCN — The Congestion Control Algorithm
DCQCN (Data Center Quantized Congestion Notification), developed by Microsoft Research, is the end-to-end congestion control algorithm used with RoCEv2. It operates at three points:
- Switch (RED-like marking): Switch marks packets with ECN CE bit based on queue depth using a probabilistic marking function between KMIN and KMAX thresholds
- Receiver (CNP generation): On receiving a CE-marked packet, the receiver generates a CNP (Congestion Notification Packet) and sends it back to the sender. Rate: at most one CNP per microsecond per QP
- Sender (rate reduction and recovery): On receiving a CNP, the sender immediately reduces its rate by a multiplicative factor (default 0.5). Rate recovery follows an additive increase / multiplicative decrease (AIMD) model with a fast recovery phase using a timer
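The sender side of this loop can be sketched as a tiny state machine — a toy model of the behaviour described above, not the ConnectX firmware algorithm (the class name and constants are illustrative; real DCQCN also maintains an EWMA "alpha" congestion estimate):

```python
# Toy sketch of DCQCN sender behaviour: multiplicative decrease on CNP
# arrival, timer-driven recovery back toward the pre-cut rate.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float, decrease_factor: float = 0.5):
        self.rate = line_rate_gbps     # current sending rate (RC)
        self.target = line_rate_gbps   # rate before the last cut (RT)
        self.g = decrease_factor       # multiplicative-decrease factor

    def on_cnp(self) -> None:
        """Congestion Notification Packet received: cut rate multiplicatively."""
        self.target = self.rate
        self.rate *= self.g

    def on_recovery_timer(self) -> None:
        """Fast recovery: move halfway back toward the pre-cut rate."""
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(400.0)
s.on_cnp()                  # 400 -> 200 Gb/s
s.on_recovery_timer()       # 200 -> 300 Gb/s
print(round(s.rate, 1))     # 300.0
```

Each recovery step halves the remaining gap to the pre-congestion rate, which is what gives DCQCN its fast-but-cautious ramp-back after a congestion event.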
| Layer | Configuration Item | Recommended Value / Action |
|---|---|---|
| NIC / Host | RDMA mode | Enable RoCEv2 (routed mode); disable RoCEv1 |
| NIC / Host | DSCP marking | Mark RoCEv2 traffic DSCP 26 (CS3) or per vendor recommendation |
| Switch | PFC enablement | Enable PFC on RDMA priority class only (e.g., priority 3). Disable on all other priorities |
| Switch | ECN thresholds | KMIN = 300 KB, KMAX = 2 MB (tune per buffer capacity). Target: mark before queue exceeds 10% of buffer |
| Switch | Buffer allocation | Allocate dedicated shared buffer pool to RDMA priority; size headroom per port count × per-port headroom formula |
| Switch | DSCP-to-PFC mapping | Map RDMA DSCP value to PFC-enabled 802.1p priority class consistently across all switches |
| NIC / Host | CNP rate limiter | One CNP per 1 µs per QP (default Mellanox ConnectX-7) |
| Fabric | PFC deadlock prevention | Use single lossless priority class; verify no circular buffer dependencies in switch topology |
9. ECMP vs Adaptive Routing vs Spray: Solving the Elephant Flow Problem
Traffic load balancing is one of the most critical — and most nuanced — aspects of AI fabric design. AllReduce operations generate many simultaneous large flows (elephant flows) between the same source-destination pairs. Naive ECMP hashing can cause severe load imbalance, with some paths congested while adjacent paths sit idle.
Standard ECMP (Equal-Cost Multi-Path)
Standard ECMP hashes flows to paths using a tuple of source IP, destination IP, source port, destination port, and protocol. In AI workloads, AllReduce flows between the same GPU pairs are typically persistent, large flows. ECMP assigns each flow a fixed path based on the hash — and that path does not change even if it becomes congested. The result is hash polarisation: multiple large flows land on the same path while others remain lightly loaded.
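A few lines of Python make the polarisation effect tangible. The sketch below hashes a handful of made-up persistent flows onto four equal-cost paths using a static tuple hash (CRC32 stands in for a switch's hardware hash; the addresses are invented) — with few large flows, the resulting distribution is typically uneven:

```python
# Demonstration of ECMP hash polarisation: a static hash of the flow
# tuple pins each persistent elephant flow to one path, regardless of
# load. Flow tuples here are made up; CRC32 stands in for the ASIC hash.
import zlib
from collections import Counter

PATHS = 4
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791) for i in range(1, 9)]

def ecmp_path(src: str, dst: str, dport: int) -> int:
    """Static hash of the flow tuple -> fixed path index (never rebalanced)."""
    return zlib.crc32(f"{src}|{dst}|{dport}".encode()) % PATHS

load = Counter(ecmp_path(*f) for f in flows)
print(dict(load))  # flows-per-path; rarely the even {0:2, 1:2, 2:2, 3:2}
```

Because the hash is deterministic, a congested path stays congested for the lifetime of the flow — the motivation for the adaptive schemes described next.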
Adaptive Routing (Per-Packet / Per-Flowlet)
Adaptive routing dynamically selects the output port for each packet or flowlet based on real-time queue depth or credit availability, instead of a static hash. This moves traffic away from congested paths instantly, without waiting for TCP/DCQCN to react.
- Per-packet adaptive routing: Each packet independently selects the least-loaded path. Maximum load balancing efficiency, but can cause out-of-order packet delivery. Requires the receiving NIC to support out-of-order RoCEv2 reassembly (not all NICs do)
- Per-flowlet routing: Packets within a flow that arrive within a short time window (flowlet threshold) are sent on the same path. A new flowlet starts when there's a gap larger than the threshold — allowing path switching between bursts. Avoids reordering while still adapting to congestion
- NVIDIA Spectrum — SHIELD (Switch-based Hardware Intelligent Efficient Load Distribution): Nvidia's implementation in Spectrum-4 uses credit-based adaptive routing — switches exchange credit information about queue depths and route packets to the path with the most available credits. Achieves near-optimal load balance with minimal reordering
- Arista 7800R4 — Adaptive Load Balancing: Arista's implementation uses per-flowlet adaptive routing with configurable flowlet timeout. Integrates with ECMP and supports 400G/800G at line rate
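The flowlet idea above can be sketched in a few lines. This is a simplified model of the technique, not any vendor's implementation — the timeout value and the queue-length load metric are assumptions:

```python
# Sketch of per-flowlet path selection: packets separated by less than
# the flowlet timeout stay on the current path (no reordering); a larger
# inter-packet gap opens a new flowlet that may move to a better path.

FLOWLET_TIMEOUT_US = 50.0   # illustrative value

class FlowletRouter:
    def __init__(self, n_paths: int):
        self.load = [0] * n_paths   # toy congestion metric per path
        self.state = {}             # flow -> (last_seen_us, path)

    def route(self, flow: str, now_us: float) -> int:
        last, path = self.state.get(flow, (None, None))
        if last is None or now_us - last > FLOWLET_TIMEOUT_US:
            path = self.load.index(min(self.load))  # new flowlet: best path
        self.state[flow] = (now_us, path)
        self.load[path] += 1
        return path

r = FlowletRouter(n_paths=4)
p1 = r.route("flowA", 0.0)     # new flowlet
p2 = r.route("flowA", 10.0)    # gap < timeout: same path, in-order
p3 = r.route("flowA", 200.0)   # gap > timeout: free to switch paths
print(p1, p2, p3)              # -> 0 0 1
```

Because a path change only happens at a flowlet boundary, all packets that could arrive out of order relative to each other share one path — adaptivity without NIC-side reassembly.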
Packet Spraying
Packet spraying takes adaptive routing to its logical extreme: every packet is sent on a different path in a round-robin or randomised fashion, regardless of flow membership. This achieves near-perfect load balancing at the cost of guaranteed out-of-order delivery. Packet spraying requires NIC-level reassembly support (e.g., Nvidia ConnectX-7 with ROME or Lossless Spray mode). It is used in Nvidia's DGX SuperPOD reference architecture with NDR InfiniBand fabrics.
| Method | Load Balance Quality | Reordering Risk | Switch Requirement | NIC Requirement |
|---|---|---|---|---|
| Standard ECMP | Poor (hash polarisation) | None | All switches | All NICs |
| ECMP + ECMP-wide hashing | Moderate | None | All switches | All NICs |
| Flowlet Adaptive Routing | Good | Low (inter-flowlet only) | Vendor-specific (Arista, Cisco) | Standard RoCEv2 |
| Credit-Based Adaptive (SHIELD) | Excellent | Low–Moderate | Nvidia Spectrum only | ConnectX-7 (reorder support) |
| Packet Spray | Near-perfect | High (all packets) | Spray-capable (IB or Ethernet) | Out-of-order capable NIC required |
10. 400G and 800G Switching: The Hardware Behind AI Fabrics
The bandwidth demands of GPU clusters have driven the data center switching industry through an unprecedented acceleration. 400G ports — which were considered future-proofing for general enterprise networks just three years ago — are already the baseline minimum for AI leaf switches. 800G is the current frontier, and 1.6T is actively in development.
Key Switch Silicon for AI Fabrics
| ASIC | Vendor | Bandwidth | Max Ports | Key AI Feature |
|---|---|---|---|---|
| Tomahawk 5 (TH5) | Broadcom | 51.2 Tbps | 64× 800G or 128× 400G | Deep buffers, PFC/ECN, high-density 800G |
| Tofino 3 / P4 | Intel (Barefoot) | 12.8 Tbps | 32× 400G | P4 programmability for custom telemetry and load balancing |
| Cisco Silicon One G300 | Cisco | 102.4 Tbps | 64× 1.6T or 128× 800G | ICN intelligent collective networking, shared buffer, hardware telemetry |
| Spectrum-4 | Nvidia | 51.2 Tbps | 64× 800G | SHIELD adaptive routing, tight NVLink/RoCE integration |
| Jericho3-AI (J3-AI) | Broadcom | 57.6 Tbps | 72× 800G | Deep shared buffer (2 GB), hardware-offloaded AllReduce |
| Quantum-3 (XDR IB) | Nvidia | 57.6 Tbps | 64× 800G (XDR InfiniBand) | Native RDMA, SHARP in-network compute, SHIELD routing |
Buffer Size: A Critical Differentiator for AI Fabrics
Packet buffer size is arguably the most important switch hardware characteristic for AI workloads — more impactful than raw throughput for many deployments. The reason: AllReduce synchronisation barriers cause periodic incast events — moments where many-to-one flows converge simultaneously on a single egress port, creating transient bursts that exceed line rate for a short window. Shallow-buffered switches drop packets during these bursts, triggering PFC cascades and DCQCN throttling that can degrade cluster training throughput by 20–40%.
- Shallow buffer (~50 MB shared): Tomahawk series, Trident 4. Suitable for general enterprise east-west traffic. Can cause issues at high AI workload density without careful PFC/ECN tuning
- Deep buffer (~2–4 GB shared): Jericho3-AI, some Cisco platforms. Purpose-designed for absorbing AllReduce incast bursts. Significantly reduces PFC pause events
- Shared vs partition buffer: Shared buffer pools dynamically allocate memory across ports, making efficient use of available buffer. Partitioned (dedicated per-port) buffers waste capacity on lightly loaded ports. Shared is strongly preferred for AI fabrics
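The incast buffering requirement can be estimated with a simple fluid model — a back-of-envelope sketch, not a queueing analysis; the burst size and rank count are assumed values:

```python
# Back-of-envelope incast model: N senders burst simultaneously toward
# one egress port; the switch must buffer everything that arrives
# faster than the egress port can drain during the burst.

def incast_buffer_bytes(n_senders: int, burst_bytes: int,
                        ingress_gbps: float, egress_gbps: float) -> float:
    """Peak buffer occupancy for a synchronized burst (fluid model)."""
    burst_time_s = burst_bytes * 8 / (ingress_gbps * 1e9)
    arrived = n_senders * burst_bytes
    drained = egress_gbps * 1e9 / 8 * burst_time_s
    return max(0.0, arrived - drained)

# 32 ranks each bursting 1 MB into a single 400G egress port
need = incast_buffer_bytes(32, 1_000_000, 400, 400)
print(f"{need / 1e6:.0f} MB of buffer needed")  # 31 MB
```

A single synchronized 32-to-1 burst already consumes most of a ~50 MB shallow shared buffer, which is why deep-buffer ASICs (or aggressive early ECN marking) matter for AllReduce-heavy fabrics.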
11. Optics and Cabling for AI Data Centers
At 400G and 800G speeds, optical interconnect design becomes a critical success factor for AI fabric deployments. The wrong optics choice can constrain reach, increase power consumption, or create reliability issues at scale.
| Form Factor | Speed | Typical Reach | Power | Best Use |
|---|---|---|---|---|
| 400G QSFP-DD DR4 | 400G (4×100G PAM4) | 500m (SMF) | ~9W | ToR-to-server, ToR-to-spine within building |
| 400G QSFP-DD SR4 | 400G (4×100G PAM4) | 100m (MMF OM4) | ~6W | Short-reach within same row/aisle |
| 800G OSFP 2×FR4 | 800G (8×100G PAM4) | 2km (SMF) | ~14W | Spine-to-spine, inter-building |
| 800G LPO (Linear Pluggable Optic) | 800G | 100m (MMF) | ~7W (~50% vs retimed) | High-density AI leaf-to-server; power-constrained racks |
| 400G DAC (Direct Attach Copper) | 400G | 3–5m (passive) / 7m (active) | <1W (passive) | Intra-rack connections where reach permits |
| 800G AOC (Active Optical Cable) | 800G | Up to 30m | ~10W | Intra-row connections exceeding DAC reach |
12. Vendor Landscape: Cisco, Arista, Nvidia, Broadcom, Juniper
The AI networking market has attracted significant investment from all major networking vendors, with each taking a distinct architectural approach. Understanding these differences is essential for multi-year infrastructure planning.
| Vendor | Key Platform | Switch Silicon | AI Differentiator | Best Fit |
|---|---|---|---|---|
| Cisco | Nexus 9364-SG3, Cisco 8132 (IOS-XR) | Silicon One G300 (102.4 Tbps) | ICN shared buffer, path-based LB, hardware telemetry, SONiC + NX-OS + ACI, Nexus One AgenticOps | Enterprise AI DC, sovereign cloud, SP AI clusters; Cisco-invested organisations |
| Arista | 7800R4 Series, 7050X3 | Broadcom TH5, custom deep-buffer | EOS extensibility, deep telemetry (DANZ), adaptive LB, SONiC support, proven hyperscale deployments | Hyperscalers, financial services AI, brownfield data centers with EOS |
| Nvidia | Quantum-3 (IB), Spectrum-4 (Ethernet) | Quantum-3 XDR, Spectrum-4 (51.2 Tbps) | Tightest GPU integration; SHARP in-network compute; SHIELD adaptive routing; best single-vendor stack | Nvidia GPU-centric deployments; DGX SuperPOD; HPC + AI convergence |
| Juniper | QFX5220, PTX10008 | Broadcom TH5, custom Express silicon | Apstra IBN for intent-based automation, Paragon telemetry, strong service provider DNA | Telco AI, research networks, Juniper-invested SP and enterprise |
| Broadcom | TH5 (OEM via Arista, Cisco, others) | Tomahawk 5 (51.2 Tbps), Jericho3-AI | Most widely deployed AI fabric silicon; deep buffer J3-AI for AllReduce; hardware AllReduce offload | Whitebox / OCP deployments; cost-sensitive hyperscale build-outs |
13. Storage Networking in AI: GPUDirect Storage and NFS over RDMA
Storage networking is the second critical fabric in AI data centers, often receiving less attention than the compute fabric but equally important for training throughput. Training large models requires reading enormous datasets at high speed — keeping thousands of GPUs fed from a multi-hundred-terabyte dataset, particularly a multimodal one, can demand sustained storage throughput in the tens to hundreds of GB/s across the cluster.
GPUDirect Storage (GDS)
GPUDirect Storage is Nvidia's technology that enables data to flow directly from NVMe SSDs or network storage directly into GPU memory — bypassing the CPU and system RAM entirely. The traditional path for loading training data is: NVMe → PCIe → CPU DRAM → PCIe → GPU VRAM. GPUDirect Storage shortens this to: NVMe → PCIe switch → GPU VRAM (or NFS server → RDMA NIC → GPU VRAM for network storage).
- Local GDS: NVMe SSDs connect to the same PCIe domain as the GPU. The NVMe controller DMAs data directly into GPU VRAM via peer-to-peer PCIe. Achieves 12–25 GB/s per server with modern NVMe SSDs
- Network GDS (GPUDirect RDMA): NFS or Lustre server serves data over RDMA. The storage client on the GPU server uses RDMA to read data directly into GPU VRAM over the RDMA fabric. Requires RoCEv2 or InfiniBand storage fabric
- Requirements: GPU with PCIe P2P support (all H100/B100), NVMe with GDS support, Nvidia driver ≥ 460, cuFile library, CUDA 11.4+
Storage Fabric Architecture Options
| Architecture | Protocol | Throughput | GPU DMA Support | Best For |
|---|---|---|---|---|
| Local NVMe (in-server) | NVMe/PCIe | 12–25 GB/s per server | Yes (local GDS) | Highest throughput; limited capacity per server |
| All-Flash NFS over RDMA | NFS v4.1+ over RoCEv2 | 50–200 GB/s (cluster-wide) | Yes (network GDS) | Shared dataset access; large training sets |
| Lustre over InfiniBand | Lustre / LNET over IB | 100–500 GB/s (large clusters) | Yes (GDS via IB) | HPC + AI convergence; large supercomputing clusters |
| Object Storage (S3-compatible) | HTTP/S3 over TCP | Variable (multi-GB/s at scale) | No direct GPU DMA | Checkpoint storage, cold dataset retrieval, cloud AI training |
14. Design Best Practices and Common Pitfalls
Drawing from documented hyperscaler deployments, Nvidia reference architectures, and published academic research, the following best practices and anti-patterns are critical for successful AI fabric design.
✅ Best Practices
- Non-blocking leaf tier: The GPU-to-leaf switch tier must be strictly non-blocking (1:1 oversubscription). Any packet drops at this tier directly cause PFC cascades and DCQCN throttling across all GPU pairs sharing the leaf switch
- Separate compute and storage traffic: Run GPU-to-GPU (RoCEv2 RDMA) and GPU-to-storage (NFS over RDMA or TCP) on separate VLANs with separate DSCP markings and PFC priority classes, or ideally on physically separate network ports. Mixing these on the same interface can cause PFC head-of-line blocking
- Consistent MTU across the fabric: Use jumbo frames (MTU 9000) across all AI fabric interfaces — switches, NICs, and storage appliances. MTU mismatches are catastrophic for RDMA performance: RoCEv2 does not fragment, so oversized packets are simply dropped
- Deploy INT (In-band Network Telemetry): Real-time per-flow queue depth and latency telemetry is essential for diagnosing congestion in AI fabrics. Deploy INT or gRPC streaming telemetry on all switches. Polling-based SNMP is too slow to catch microsecond-scale congestion events
- GPU-job scheduling with topology awareness: Configure the job scheduler (SLURM, Kubernetes + GPU Operator) to be topology-aware — schedule training jobs on GPU nodes within the same rail/pod whenever possible. This minimises inter-pod traffic and reduces the blast radius of congestion
- Validate PFC configuration with traffic testing: Before production use, inject synthetic AllReduce traffic patterns (e.g., using ib_send_bw or perftest tools) to validate PFC triggering, ECN marking thresholds, and DCQCN rate reduction. A single misconfigured DSCP mapping can disable lossless transport
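The DSCP-consistency point above lends itself to automation. Below is a minimal sketch of a fabric-wide QoS audit, assuming device configurations have already been collected into a dict; the device names, field names, and values are hypothetical, not pulled from any vendor API:

```python
from collections import Counter

# Hypothetical per-device QoS inventory; names and values are illustrative.
devices = {
    "leaf1":     {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "leaf2":     {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "spine1":    {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "nic-gpu01": {"rdma_dscp": 24, "pfc_priority": 3, "mtu": 9000},  # drifted
}

def audit(devices):
    """Flag devices whose DSCP/PFC/MTU settings deviate from the fabric majority."""
    issues = []
    for key in ("rdma_dscp", "pfc_priority", "mtu"):
        expected, _ = Counter(d[key] for d in devices.values()).most_common(1)[0]
        for name, cfg in devices.items():
            if cfg[key] != expected:
                issues.append((name, key, cfg[key], expected))
    return issues

print(audit(devices))  # [('nic-gpu01', 'rdma_dscp', 24, 26)]
```

A single drifted NIC like the one flagged here is exactly the "misconfigured DSCP mapping" failure mode: the packet still arrives, but it lands in a lossy priority class and silently disables lossless transport for that host.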
🚫 Common Design Pitfalls
| Pitfall | Impact | Remedy |
|---|---|---|
| Enabling PFC globally on all priorities | PFC deadlock — fabric-wide pause cascade on congestion event | Enable PFC (no-drop) on the RDMA priority class only; leave all other traffic classes lossy (best-effort) |
| Using standard 1500B MTU | Reduced RDMA throughput from per-packet overhead; MTU mismatches drop packets, since RoCEv2 does not fragment | Configure MTU 9000 end-to-end on all AI fabric interfaces |
| All GPU NICs on a single ToR switch | Single point of failure; no rail isolation; 8× oversubscription on uplinks | Implement Rail-Optimised topology — one NIC per rail switch |
| Using standard ECMP for AllReduce flows | Hash polarisation — congested paths alongside empty paths; 30–50% throughput loss | Deploy adaptive routing or flowlet ECMP on all AI fabric switches |
| Mixing RoCEv2 and TCP on same DSCP/PFC class | TCP traffic causes PFC storms that degrade RDMA; TCP retransmit loops | Strict DSCP separation; RDMA on dedicated lossless priority; TCP on best-effort |
| Over-sized ECN thresholds (KMAX too high) | ECN marks too late — queues fill before rate reduction takes effect, triggering PFC | Tune KMIN/KMAX to mark at 5–10% queue occupancy; validate with traffic testing |
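The KMIN/KMAX guidance in the last row translates directly into byte values. A minimal sketch, assuming thresholds are expressed against total shared buffer; real ASICs apply per-queue dynamic buffer allocation, so vendor tuning guides take precedence over this arithmetic:

```python
def ecn_thresholds(shared_buffer_mb, kmin_pct=5, kmax_pct=10):
    """Return (KMIN, KMAX) in bytes so ECN marking begins at 5-10% occupancy."""
    buf_bytes = shared_buffer_mb * 1024 * 1024
    return int(buf_bytes * kmin_pct / 100), int(buf_bytes * kmax_pct / 100)

kmin, kmax = ecn_thresholds(50)  # e.g. a 50 MB shared-buffer leaf ASIC
print(kmin, kmax)  # 2621440 5242880
```

Marking this early gives DCQCN time to slow senders down while the queue is still mostly empty, so PFC remains the last-resort safety net rather than the primary congestion control.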
15. Summary and Architecture Selection Guide
AI data center networking represents the most demanding and fastest-evolving segment of network infrastructure today. The fundamental shift from traditional north-south enterprise traffic to synchronised, all-to-all GPU collective communications has invalidated many established design assumptions and created an entirely new set of requirements around lossless transport, adaptive routing, and ultra-high-bandwidth fabrics.
| Cluster Size | GPU Count | Recommended Fabric | Transport Protocol | Reference Vendor |
|---|---|---|---|---|
| Small AI / Dev cluster | 8–64 GPUs (1–8 servers) | Single-switch 400G leaf, or NVLink domain only | RoCEv2 or IB HDR | Any 400G switch |
| Mid AI cluster | 64–512 GPUs | Rail-Optimised 2-tier (leaf + spine), 400G | RoCEv2 (preferred) or IB NDR | Arista 7050X3, Cisco Nexus 9300 |
| Large enterprise AI | 512–4,096 GPUs | Rail-Optimised 2-tier, 400G/800G, adaptive routing | RoCEv2 with DCQCN | Cisco Nexus 9364-SG3, Arista 7800R4, Nvidia Spectrum-4 |
| Hyperscale AI cluster | 4,096–100,000+ GPUs | Rail-Optimised 3-tier, 800G, packet spray / adaptive routing | RoCEv2 (hyperscalers) or IB XDR (Nvidia) | Arista 7800R4, Cisco G300, Broadcom TH5/J3-AI whitebox |
| HPC + AI convergence | Any scale | InfiniBand fabric (SHARP, adaptive routing) | InfiniBand NDR/XDR | Nvidia Quantum-3, DGX SuperPOD reference |
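The table above reduces to a simple decision function. This is a sketch only — the thresholds mirror the table rows, and real selections also weigh vendor relationships, power, and facility constraints:

```python
def recommend_fabric(gpu_count):
    """Map cluster size to a fabric tier, mirroring the selection table above."""
    if gpu_count <= 64:
        return "single 400G leaf or NVLink domain; RoCEv2 or IB HDR"
    if gpu_count <= 512:
        return "Rail-Optimised 2-tier 400G; RoCEv2 preferred"
    if gpu_count <= 4096:
        return "Rail-Optimised 2-tier 400G/800G, adaptive routing; RoCEv2 with DCQCN"
    return "Rail-Optimised 3-tier 800G, packet spray; RoCEv2 or IB XDR"

print(recommend_fabric(1024))
# Rail-Optimised 2-tier 400G/800G, adaptive routing; RoCEv2 with DCQCN
```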
16. Frequently Asked Questions
Q: Can I build an AI cluster on a regular enterprise Ethernet fabric?
Technically yes — the packets will be delivered. However, without lossless transport configuration (PFC + ECN + DCQCN), MTU 9000, and a non-blocking topology, you will experience significant throughput degradation due to packet loss and RDMA retransmits. An enterprise fabric not optimised for RDMA can deliver 30–60% lower training throughput compared to a purpose-configured AI fabric. The required configuration changes (PFC, ECN, DSCP marking, jumbo frames) are achievable on most modern enterprise switches — but require careful planning.
Q: Should I choose InfiniBand or RoCEv2 for a new AI cluster?
For most new enterprise and cloud AI deployments in 2025, RoCEv2 over Ethernet is the recommended choice for clusters of 64+ GPUs. The multi-vendor Ethernet ecosystem provides significant cost advantages, IP routing allows larger fabric scales, and operational tooling is far more mature. InfiniBand remains the best choice for pure HPC environments where MPI latency is the primary driver, or for organisations purchasing Nvidia DGX SuperPOD turnkey systems.
Q: What is the recommended oversubscription ratio for an AI fabric?
The GPU-to-leaf tier should be non-blocking (1:1) — no oversubscription. The leaf-to-spine tier can be 1:1 for maximum performance, but 2:1 is acceptable for well-managed job scheduling that confines training runs within pods. Core/aggregation tiers connecting pods to each other can use 4:1, as inter-pod traffic is typically light (primarily pipeline parallelism, not AllReduce).
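These ratios are easy to compute per switch tier. A sketch with illustrative port counts:

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Downlink capacity divided by uplink capacity for one switch tier."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 64-port 400G leaf split 32 down / 32 up: non-blocking GPU-to-leaf tier
print(oversubscription(32, 400, 32, 400))  # 1.0
# Same leaf split 48 down / 16 up: 3:1, too aggressive for the GPU tier
print(oversubscription(48, 400, 16, 400))  # 3.0
```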
Q: What switch buffer depth do I need for an AI fabric?
Buffer requirements depend on the number of GPUs, AllReduce tree depth, and link speeds. As a rule of thumb: for a 32-server 400G leaf switch, a minimum of 50 MB shared buffer is required to absorb AllReduce incast events without PFC triggering. For 800G fabrics or clusters beyond 256 GPUs, 500 MB to 2+ GB of shared buffer is recommended. Deep-buffer ASICs (Jericho3-AI at 2 GB, select Cisco platforms) are specifically designed for these requirements.
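The rule of thumb follows from incast arithmetic: if every server behind a leaf bursts into one egress port simultaneously, the shared buffer must absorb the sum of the in-flight bursts until ECN/DCQCN slows the senders. A deliberately simplified model — the per-sender burst size is an assumed figure, not a measured one:

```python
def incast_buffer_mb(n_senders, burst_kb_per_sender):
    """Worst-case shared buffer (MB) if all senders burst into one egress at once."""
    return n_senders * burst_kb_per_sender / 1024

# 32 servers each bursting ~2 MB of AllReduce traffic at the same egress port:
print(incast_buffer_mb(32, 2048))  # 64.0
```

The 64 MB result is in line with the ~50 MB rule of thumb for a 32-server leaf; larger clusters and 800G links scale the burst sum up, which is what drives the deep-buffer ASIC requirement.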
Q: How does training job topology affect network design?
Training job topology (data parallelism, tensor parallelism, pipeline parallelism) directly determines traffic patterns. Tensor parallelism generates very high bandwidth between a small number of tightly coupled GPUs — ideally within the same NVLink domain or the same leaf switch. Data parallelism generates AllReduce traffic across all GPUs in the job — requires full bisection bandwidth at the leaf tier. Pipeline parallelism generates sequential flow between pipeline stages — benefits from topology-aware job placement ensuring consecutive stages are in the same rack or adjacent racks.
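The data-parallel AllReduce load can be quantified with the standard ring-AllReduce formula, in which each GPU transmits 2(N−1)/N times the gradient size per step (reduce-scatter plus all-gather phases). A sketch with an illustrative model size:

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes, n_gpus):
    """Bytes each GPU transmits per ring AllReduce: 2 * (N - 1) / N * gradient size."""
    return 2 * (n_gpus - 1) / n_gpus * gradient_bytes

# 70B-parameter model with fp16 gradients (~140 GB) across an 8-GPU group:
gb_sent = ring_allreduce_bytes_per_gpu(140e9, 8) / 1e9
print(round(gb_sent, 1))  # 245.0 (GB transmitted per GPU per synchronisation step)
```

Numbers of this magnitude, repeated every optimiser step, are why the data-parallel dimension demands full bisection bandwidth at the leaf tier.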
Q: Can I use AI switching hardware (e.g., Broadcom TH5) for general enterprise workloads as well?
Yes — Tomahawk 5 and similar ASICs are fully capable of running standard IP routing, spanning tree, and VXLAN for general enterprise workloads. Many organisations run a unified fabric where the same switches handle both AI/RDMA workloads (in the AI pod) and general-purpose east-west traffic (in adjacent clusters), with separate VRFs and DSCP markings separating the traffic classes. The primary trade-off is cost — AI-class switches with deep buffers cost 2–3× more than standard enterprise ToR switches.
Technical content based on Nvidia DGX SuperPOD reference architecture documentation, Meta AI Research Supercluster design papers, Microsoft Research DCQCN publication, Cisco Silicon One G300 product documentation, Arista 7800R4 series technical briefs, Broadcom Tomahawk 5 and Jericho3-AI data sheets, and IEEE 802.3df/bs standards. Latency and bandwidth figures are representative values based on published benchmarks. All specifications current as of March 2026.