AI Data Center Networking: How GPU Clusters Are Changing Network Design
A complete technical guide to GPU cluster topology, RoCE vs InfiniBand, Rail-Optimised fabrics, 400G/800G switching, lossless transport, and vendor landscape for AI infrastructure in 2025
By Route XP | Published: March 2026 | Updated: March 2026 | Data Center, Arista, Cisco
- Why AI Is Rewriting the Rules of Data Center Networking
- How GPUs Communicate: The Networking Demands of Distributed Training
- NVLink and NVSwitch: Inside the GPU-to-GPU Interconnect
- InfiniBand: The Legacy AI Network Standard
- RoCEv2: Ethernet Comes to AI Networking
- RoCEv2 vs InfiniBand: Full Technical Comparison
- AI Fabric Topology: Rail-Optimised Leaf-Spine Design
- Lossless Transport: PFC, ECN, and DCQCN Explained
- ECMP vs Adaptive Routing vs Spray: Solving the Elephant Flow Problem
- 400G and 800G Switching: The Hardware Behind AI Fabrics
- Optics and Cabling for AI Data Centers
- Vendor Landscape: Cisco, Arista, Nvidia, Broadcom, Juniper
- Storage Networking in AI: GPUDirect Storage and NFS over RDMA
- Design Best Practices and Common Pitfalls
- Summary and Architecture Selection Guide
- Frequently Asked Questions
1. Why AI Is Rewriting the Rules of Data Center Networking
For the first three decades of enterprise data center design, the network was an afterthought — a plumbing layer that moved packets between servers, storage, and the internet. Traffic patterns were largely north-south (client to server), bandwidth requirements were measured in gigabits per rack, and a well-tuned three-tier or leaf-spine fabric was more than sufficient.
The rise of large language models (LLMs), generative AI, and large-scale deep learning training has fundamentally invalidated every one of those assumptions. Training a frontier AI model like GPT-4, Llama 3, or Gemini Ultra requires tens of thousands of GPUs operating in tight synchrony, exchanging gradient tensors hundreds of times per second. The resulting traffic is overwhelmingly east-west, bandwidth-intensive, latency-sensitive, and exhibits traffic patterns unlike anything traditional data center networks were designed to handle.
The numbers are staggering. A single Nvidia DGX H100 server contains eight H100 GPUs, each connected to a 400G InfiniBand or Ethernet NIC — generating up to 3.2 Tbps of aggregate network bandwidth per server. A 1,000-GPU training cluster built from 125 such servers requires a non-blocking fabric capable of sustaining over 400 Tbps of all-to-all traffic with near-zero packet loss and microsecond latency variance.
This guide provides a rigorous technical deep-dive into how AI workloads are reshaping data center network design: from GPU communication primitives and RDMA transport protocols to rail-optimised topology, lossless Ethernet configuration, adaptive routing, and the 400G/800G switch silicon powering the world's largest AI fabrics.
2. How GPUs Communicate: The Networking Demands of Distributed Training
To design an AI fabric correctly, you must first understand why GPUs need to communicate at all, and what the communication patterns look like at the protocol level.
Distributed Training: Model Parallelism and Data Parallelism
Modern large AI model training uses three fundamental parallelism strategies, each generating distinct network traffic patterns:
- Data Parallelism (DP): The training dataset is split across GPU workers. Each GPU trains on a different data shard using an identical copy of the model. After each forward/backward pass, GPUs exchange gradient updates via AllReduce collective operations — a bandwidth-intensive all-to-all communication pattern. This is the most common form of parallelism and the dominant driver of network traffic in most AI clusters
- Tensor/Model Parallelism (TP/MP): Individual layers or tensors of the model are split across multiple GPUs. GPUs must exchange intermediate activation tensors during both forward and backward passes via AllGather and ReduceScatter collectives. Generates high-bandwidth, low-latency traffic between a small group of tightly coupled GPUs — often within the same server via NVLink
- Pipeline Parallelism (PP): Different layers of the model are assigned to different GPU stages in a pipeline. Each stage passes activations to the next stage via point-to-point Send/Recv operations. Generates more predictable, structured traffic flows between consecutive pipeline stages — often between servers in the same rack or adjacent racks
Collective Communication Operations
The NCCL (Nvidia Collective Communications Library) and its open-source equivalent RCCL (AMD) implement the following collective operations that drive network traffic in AI clusters:
| Operation | Description | Traffic Pattern | Bandwidth per Node |
|---|---|---|---|
| AllReduce | Each GPU contributes a tensor; result (sum/mean) returned to all GPUs | All-to-all (ReduceScatter + AllGather phases) | 2 × (N-1)/N × BW |
| AllGather | Each GPU contributes a shard; every GPU receives the full concatenated tensor | All-to-all gather (ring or tree) | (N-1)/N × BW |
| ReduceScatter | Reduces tensors from all GPUs; each GPU receives a unique shard of the result | All-to-all with aggregation | (N-1)/N × BW |
| AllToAll | Each GPU sends a unique shard to every other GPU (used in Mixture of Experts) | Full mesh point-to-point | (N-1)/N × BW |
| Broadcast | One GPU sends the same tensor to all other GPUs | One-to-many | BW (sender) |
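The bandwidth factors in the table above can be made concrete with a short sketch. The functions below compute per-GPU bytes on the wire for ring-based AllReduce and AllGather; the function names and the 1 GB example tensor are illustrative, not from any library:

```python
# Sketch: per-GPU wire traffic for ring-based collectives, matching the
# bandwidth factors in the table above. Names are illustrative.

def ring_allreduce_bytes(tensor_bytes: int, n_gpus: int) -> float:
    """ReduceScatter + AllGather phases: each GPU sends 2*(N-1)/N of the tensor."""
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes

def ring_allgather_bytes(tensor_bytes: int, n_gpus: int) -> float:
    """Each GPU sends (N-1)/N of the full concatenated tensor."""
    return (n_gpus - 1) / n_gpus * tensor_bytes

# Example: AllReduce of a 1 GB gradient tensor across 8 GPUs
gb = 1024**3
sent = ring_allreduce_bytes(gb, 8)
print(f"bytes sent per GPU: {sent / gb:.3f} GB")  # 2 * (7/8) = 1.750 GB
```

Note that the per-GPU traffic approaches 2× the tensor size as N grows — which is why AllReduce dominates fabric bandwidth planning in data-parallel training.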
3. NVLink and NVSwitch: Inside the GPU-to-GPU Interconnect
Before examining the external network fabric, it is essential to understand the intra-server interconnect layer — NVLink and NVSwitch — because this determines the boundary between what happens inside a server and what must traverse the network fabric.
NVLink: Direct GPU-to-GPU Bandwidth
NVLink is Nvidia's proprietary high-speed interconnect that allows GPUs within a server (or across a small group of servers via NVLink Switch Systems) to communicate at far higher bandwidth than PCIe. Each generation of NVLink roughly doubles the bandwidth of the previous:
| Generation | GPU | Per-Link BW (bidirectional) | Total NVLink BW (per GPU) | NVLink Count |
|---|---|---|---|---|
| NVLink 3.0 | A100 | 50 GB/s | 600 GB/s | 12 |
| NVLink 4.0 | H100 | 50 GB/s | 900 GB/s | 18 |
| NVLink 5.0 | B100 / B200 | 100 GB/s | 1,800 GB/s | 18 |
NVSwitch: Full Any-to-Any GPU Connectivity
NVSwitch is a dedicated high-bandwidth switching ASIC that enables full any-to-any NVLink connectivity among all GPUs within a server. In the DGX H100, four NVSwitch chips form a non-blocking switch fabric, giving every GPU 900 GB/s of NVLink bandwidth to every other GPU in the server simultaneously — a total of 7.2 TB/s of all-to-all bandwidth within a single 8-GPU server.
The NVLink Switch System (as deployed in rack-scale systems such as the GB200 NVL72) extends this intra-server fabric to a 72-GPU domain across multiple compute trays, creating what Nvidia calls an NVLink domain — a group of GPUs with full NVLink bandwidth to each other, even across physical trays. This dramatically changes the external network topology requirements: tensor parallelism can run entirely within the NVLink domain (no external network traffic), while data parallelism and pipeline parallelism still require the external Ethernet or InfiniBand fabric.
4. InfiniBand: The Legacy AI Network Standard
InfiniBand (IB) is a high-performance interconnect technology originally developed in the late 1990s by a consortium including Compaq, Dell, HP, IBM, Intel, and Sun. It became the dominant fabric for HPC (High Performance Computing) clusters and, by extension, early large-scale AI training infrastructure — most notably in Nvidia's DGX SuperPOD reference architectures.
InfiniBand Key Technical Characteristics
- Native RDMA: InfiniBand was designed from the ground up for Remote Direct Memory Access — the CPU is bypassed entirely for data transfers. Memory on one server is read/written directly by a remote GPU without involving the operating system on either end, eliminating CPU jitter and achieving sub-2 microsecond MPI latency
- Lossless by design: InfiniBand uses credit-based flow control at the link layer — a sender cannot transmit a packet unless the receiver has advertised sufficient buffer credits. This makes the fabric intrinsically lossless without requiring complex Ethernet mechanisms like PFC and ECN
- Dedicated subnet manager: InfiniBand requires a Subnet Manager (SM) to compute routing tables and manage the fabric. The SM is a centralised control plane — OpenSM (open-source) or vendor implementations. This adds operational complexity compared to standard IP routing
- Proprietary ecosystem: While InfiniBand adapters (HCAs) and switches are available from Nvidia (formerly Mellanox), the protocol is fundamentally proprietary — not interoperable with standard Ethernet management tooling, monitoring, or automation frameworks
| Generation | Abbreviation | Per-Port Speed (4x) | Latency (MPI) | Typical Use |
|---|---|---|---|---|
| Enhanced Data Rate | EDR | 100 Gb/s | ~0.5 µs | V100-era clusters |
| High Data Rate | HDR | 200 Gb/s | ~0.5 µs | A100 DGX SuperPOD |
| Next Data Rate | NDR | 400 Gb/s | ~0.5 µs | H100 DGX SuperPOD |
| Extended Data Rate | XDR | 800 Gb/s | <0.5 µs | B100/B200 next-gen clusters |
5. RoCEv2: Ethernet Comes to AI Networking
RDMA over Converged Ethernet version 2 (RoCEv2) is the technology that enables AI clusters to be built on standard Ethernet infrastructure while preserving the low-latency, CPU-bypass benefits of RDMA. RoCEv2 encapsulates the InfiniBand transport layer (IB BTH header) inside a standard UDP/IP packet, enabling it to be routed across standard IP networks — unlike RoCEv1, which was Layer 2 only.
How RoCEv2 Works
- Protocol stack: RoCEv2 frame = Ethernet header + IP header + UDP header (destination port 4791) + IB BTH (Base Transport Header) + Payload. The BTH carries the RDMA opcode (RDMA Write, RDMA Read, Send) and Packet Sequence Number (PSN) for ordering and loss detection
- Queue Pairs (QP): Communication in RDMA happens through Queue Pairs — a send queue and a receive queue. Applications post Work Requests (WRs) to the QP; the NIC (RNIC) processes them directly without OS involvement. This is the mechanism that achieves CPU bypass
- RDMA Verbs: The application interface to RDMA hardware. Key operations: RDMA Write (write to remote memory without receiver CPU involvement), RDMA Read (read from remote memory), and Send/Recv (two-sided — both sender and receiver CPUs involved)
- Go-Back-N (GBN) retransmission: RoCEv2 uses a Go-Back-N protocol for loss recovery. When a packet is lost, the sender retransmits from that sequence number onwards. Since GBN retransmits all subsequent packets on a loss event, even a single dropped packet causes significant throughput degradation — this is why lossless transport is mandatory
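The throughput cost of Go-Back-N can be illustrated with a back-of-envelope model (an approximation, not a protocol implementation; the window size is an assumed value):

```python
# Illustrative model: with Go-Back-N, each lost packet forces
# retransmission of roughly a full window of already-sent packets,
# so goodput collapses quickly as the loss rate rises.

def gbn_goodput(loss_rate: float, window_pkts: int) -> float:
    """Approximate fraction of transmitted packets that are useful.
    Each loss event wastes ~window_pkts retransmitted packets."""
    wasted_per_useful_pkt = loss_rate * window_pkts
    return 1.0 / (1.0 + wasted_per_useful_pkt)

# Even a 0.1% loss rate halves goodput with a 1,000-packet window
for loss in (0.0, 1e-4, 1e-3, 1e-2):
    print(f"loss={loss:g}  goodput ~ {gbn_goodput(loss, 1000):.1%}")
```

Under this model a 10⁻³ loss rate already cuts goodput to roughly 50% — the quantitative reason the article's later sections insist on PFC/ECN lossless transport rather than tolerating occasional drops.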
Why RoCEv2 Has Become the Dominant AI Fabric Protocol
The shift from InfiniBand to RoCEv2 Ethernet as the dominant AI fabric protocol for hyperscalers (Meta, Google, Microsoft Azure, Amazon) is driven by several powerful advantages:
- Ethernet ecosystem leverage: Standard Ethernet switch silicon (Broadcom Tomahawk 5, Cisco Silicon One, Nvidia Spectrum-4) can be used — enabling competitive multi-vendor procurement and commodity pricing
- IP routing: RoCEv2 routes over standard IP, enabling multi-tier Clos/spine-leaf topologies, ECMP, and BGP — none of which are natively available in InfiniBand fabrics
- Unified fabric: A single Ethernet fabric can carry both RDMA (RoCEv2) for GPU communication and standard TCP/IP for management, storage, and external traffic — eliminating the dual-fabric (IB + Ethernet) architecture
- Operational familiarity: Network engineers already know Ethernet. Operational tooling (Prometheus/Grafana monitoring, Ansible automation, standard sFlow telemetry) all work natively with Ethernet switches
6. RoCEv2 vs InfiniBand: Full Technical Comparison
| Attribute | InfiniBand (NDR/XDR) | RoCEv2 over Ethernet |
|---|---|---|
| MPI / NCCL latency | ~0.5 µs (industry-leading) | 1–3 µs (well-tuned fabric) |
| Lossless transport | Native (credit-based flow control) | Requires PFC + ECN/DCQCN configuration |
| Current max port speed | 800G (XDR) | 800G (IEEE 802.3df) |
| Routing | Proprietary (Subnet Manager, LID-based) | Standard IP/BGP/ECMP |
| Multi-vendor switches | Limited (primarily Nvidia Quantum) | Full ecosystem (Cisco, Arista, Juniper, Broadcom-based) |
| Switch cost | High (proprietary, limited competition) | Lower (commodity silicon, competitive market) |
| Adaptive routing | Native (SHIELD adaptive routing in Quantum-2) | Requires vendor-specific implementation (Arista ECMP+, Cisco DLBR) |
| Fabric scale | Up to ~10,000–50,000 ports per fabric | Effectively unlimited (routed IP) |
| Operational tooling | Proprietary (SM, ibdiagnet, Nvidia UFM) | Standard Ethernet tooling (Ansible, Prometheus, SNMP, gNMI) |
| Unified fabric (RDMA + TCP) | Separate IB + Ethernet fabrics typically required | Single fabric for RDMA and TCP/IP |
| Best for | HPC, tightly coupled MPI workloads, ultra-low latency requirements, Nvidia-only deployments | Hyperscale AI, multi-vendor fabrics, cost-sensitive at scale, unified infrastructure |
7. AI Fabric Topology: Rail-Optimised Leaf-Spine Design
The standard enterprise leaf-spine (Clos) topology is not optimal for AI workloads without modification. The Rail-Optimised (also called GPU-Rail) architecture has emerged as the reference design for AI clusters at hyperscale and is documented in Nvidia's DGX SuperPOD reference architecture as well as Meta's AI research cluster design papers.
Standard Leaf-Spine vs Rail-Optimised: The Core Difference
In a standard leaf-spine fabric, all hosts in a rack connect to a single ToR (Top of Rack) leaf switch, which connects to multiple spine switches. All inter-rack traffic is load-balanced across all spine paths. This works well for general-purpose east-west traffic where flows between any pair of servers are equally likely.
AI training traffic is not uniformly distributed. AllReduce operations in data-parallel training cause predictable, structured communication patterns — each GPU communicates intensely with specific peer GPUs (its "gradient reduction partners"). The Rail-Optimised fabric exploits this structure to maximise performance and minimise congestion.
Rail-Optimised Architecture: How It Works
In a Rail-Optimised fabric, each server has multiple NICs (e.g., 8 × 400G NICs in a DGX H100), and each NIC connects to a different leaf switch — not all to the same ToR switch. These per-NIC leaf switches are called Rails. All Rail-0 NICs across all servers connect to Rail-0 leaf switches; all Rail-1 NICs connect to Rail-1 leaf switches; and so on.
- Within-rail communication: GPU-0 on all servers communicates primarily through Rail-0 switches. This concentrates related AllReduce traffic within a single rail, preventing cross-rail congestion contamination
- Cross-rail communication: When AllToAll (e.g., Mixture of Experts routing) requires communication across rails, spine switches carry this traffic
- Congestion isolation: A congestion event on Rail-2 (e.g., a hot AllReduce spanning many servers) does not impact Rail-0, Rail-1, etc. — congestion is contained within rails
- Failure isolation: A failed Rail-2 leaf switch degrades only GPU-2 bandwidth; the other 7 GPUs in each server continue at full bandwidth
| Parameter | Value | Notes |
|---|---|---|
| Servers per Scalable Unit (SU) | 32 DGX H100 | 256 H100 GPUs per SU |
| NICs per server | 8 × ConnectX-7 400G | One NIC per GPU — each on a separate rail |
| Rails (leaf switches) | 8 leaf switches per SU | Each leaf handles Rail-N NICs from all 32 servers |
| Ports per leaf switch used for servers | 32 downlink × 400G | One port per server |
| Uplink ports per leaf switch | 32 uplink × 400G to spine | 1:1 oversubscription ratio (non-blocking within SU) |
| Aggregate fabric bandwidth (1 SU) | 8 × 32 × 400G = 102.4 Tbps | Bidirectional, fully non-blocking |
| Maximum SUs per multi-SU pod | Up to 8 SUs (2,048 GPUs) | Connected via dedicated spine tier |
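The SU-level numbers in the table above can be derived from first principles. The sketch below reproduces them from the server count, NIC count, and port speed (default values are the DGX H100 figures from the table; the function name is illustrative):

```python
# Sketch: derive the Scalable Unit (SU) fabric numbers in the table
# above from server count, NICs per server, and NIC speed.

def su_fabric(servers: int = 32, nics_per_server: int = 8,
              nic_gbps: int = 400) -> dict:
    leaves = nics_per_server                # one rail (leaf switch) per NIC index
    downlinks_per_leaf = servers            # one port per server on each rail
    uplinks_per_leaf = downlinks_per_leaf   # 1:1 — non-blocking within the SU
    agg_tbps = leaves * downlinks_per_leaf * nic_gbps / 1000
    return {"leaf_switches": leaves,
            "downlinks_per_leaf": downlinks_per_leaf,
            "uplinks_per_leaf": uplinks_per_leaf,
            "aggregate_tbps": agg_tbps}

su = su_fabric()
print(su["leaf_switches"], "rails,", su["aggregate_tbps"], "Tbps")  # 8 rails, 102.4 Tbps
```

The same function also sizes non-DGX rail fabrics — for example, `su_fabric(servers=16, nics_per_server=4, nic_gbps=200)` for a smaller cluster.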
Three-Tier Architecture for Very Large Clusters
For clusters beyond 2,048 GPUs, a three-tier architecture is used: GPU Rail Leaf → Aggregation / Pod Spine → Cluster Core Spine. Each tier adds a level of oversubscription (typically 2:1 at aggregation, 4:1 at core for inter-pod traffic), reflecting the observation that AllReduce communication within a training job is typically confined to a single pod, and inter-pod traffic is comparatively light (primarily pipeline parallelism and checkpoint I/O).
8. Lossless Transport: PFC, ECN, and DCQCN Explained
Making Ethernet lossless for RoCEv2 requires a carefully co-designed set of mechanisms operating at different layers of the network stack. The three key components are PFC (Priority Flow Control), ECN (Explicit Congestion Notification), and the DCQCN (Data Center Quantized Congestion Notification) algorithm.
PFC — Priority Flow Control (IEEE 802.1Qbb)
PFC is a link-layer pause mechanism that allows a switch to signal to its upstream neighbour to temporarily stop sending traffic on a specific 802.1p priority class — without pausing other priority classes. In AI fabrics, RoCEv2 traffic is assigned a specific DSCP/PCP value and mapped to a dedicated PFC-enabled priority class (typically Priority 3 or 4), while other traffic (management, storage) uses different priorities that are not pause-enabled.
- How it works: When a switch ingress queue for the RoCEv2 priority fills beyond a configured headroom threshold, it sends a PAUSE frame to the upstream device. The upstream device stops transmitting on that priority class until a RESUME frame is received
- Headroom calculation: Switch buffer headroom must account for the pipe delay (switch-to-switch RTT) — all packets in flight when PAUSE is sent must be absorbed. Headroom = wire delay × link speed. For a 400G link with 100ns pipe delay: 400 × 10⁹ × 100 × 10⁻⁹ = 40,000 bits = 5 KB per priority class per port
- PFC deadlock risk: In a poorly designed fabric, circular PFC dependencies can cause a deadlock — Node A pauses B, B pauses C, C pauses A — all three freeze. This is mitigated by using a single lossless priority class and ensuring no circular dependencies exist in the switch buffer allocation
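The headroom arithmetic above reduces to a one-line unit cancellation, sketched below. Note this reproduces only the simplified wire-delay term from the text; production headroom formulas typically also add terms for in-flight MTU-sized packets and PAUSE-frame transmission time:

```python
# Sketch of the PFC headroom calculation from the text:
# headroom (bits) = link speed (Gb/s) x round-trip delay (ns),
# since the 10^9 and 10^-9 factors cancel.

def pfc_headroom_bytes(link_gbps: float, rtt_ns: float) -> float:
    bits_in_flight = link_gbps * rtt_ns   # (Gb/s) x (ns) -> bits
    return bits_in_flight / 8

# 400G link, 100 ns switch-to-switch round trip
print(pfc_headroom_bytes(400, 100))  # 5000.0 bytes = 5 KB per priority per port
```

Scaling this to a 64-port switch with one lossless priority gives ~320 KB of dedicated headroom — which is why per-port headroom budgeting matters on shallow-buffer silicon.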
ECN — Explicit Congestion Notification (RFC 3168)
ECN is an IP-layer congestion signalling mechanism. When a switch queue depth exceeds a configured threshold (KMIN to KMAX), the switch marks packets with the CE (Congestion Experienced) bit in the IP header instead of dropping them. The receiver echoes this marking back to the sender, which then reduces its sending rate.
ECN is the preferred congestion signal in RoCEv2 fabrics because it acts earlier than PFC — reducing sending rates before queues fill to the point of triggering PAUSE frames. The design goal is: ECN manages congestion proactively; PFC is the last-resort backstop preventing packet loss.
DCQCN — The Congestion Control Algorithm
DCQCN (Data Center Quantized Congestion Notification), developed by Microsoft Research, is the end-to-end congestion control algorithm used with RoCEv2. It operates at three points:
- Switch (RED-like marking): Switch marks packets with ECN CE bit based on queue depth using a probabilistic marking function between KMIN and KMAX thresholds
- Receiver (CNP generation): On receiving a CE-marked packet, the receiver generates a CNP (Congestion Notification Packet) and sends it back to the sender. Rate: at most one CNP per microsecond per QP
- Sender (rate reduction and recovery): On receiving a CNP, the sender immediately reduces its rate by a multiplicative factor (default 0.5). Rate recovery follows an additive increase / multiplicative decrease (AIMD) model with a fast recovery phase using a timer
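The sender side of this loop can be sketched as a tiny state machine — a toy model of the behaviour described above, not the ConnectX firmware algorithm (the class name and constants are illustrative; real DCQCN also maintains an EWMA "alpha" congestion estimate):

```python
# Toy sketch of DCQCN sender behaviour: multiplicative decrease on CNP
# arrival, timer-driven recovery back toward the pre-cut rate.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float, decrease_factor: float = 0.5):
        self.rate = line_rate_gbps     # current sending rate (RC)
        self.target = line_rate_gbps   # rate before the last cut (RT)
        self.g = decrease_factor       # multiplicative-decrease factor

    def on_cnp(self) -> None:
        """Congestion Notification Packet received: cut rate multiplicatively."""
        self.target = self.rate
        self.rate *= self.g

    def on_recovery_timer(self) -> None:
        """Fast recovery: move halfway back toward the pre-cut rate."""
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(400.0)
s.on_cnp()                  # 400 -> 200 Gb/s
s.on_recovery_timer()       # 200 -> 300 Gb/s
print(round(s.rate, 1))     # 300.0
```

Each recovery step halves the remaining gap to the pre-congestion rate, which is what gives DCQCN its fast-but-cautious ramp-back after a congestion event.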
| Layer | Configuration Item | Recommended Value / Action |
|---|---|---|
| NIC / Host | RDMA mode | Enable RoCEv2 (routed mode); disable RoCEv1 |
| NIC / Host | DSCP marking | Mark RoCEv2 traffic DSCP 26 (CS3) or per vendor recommendation |
| Switch | PFC enablement | Enable PFC on RDMA priority class only (e.g., priority 3). Disable on all other priorities |
| Switch | ECN thresholds | KMIN = 300 KB, KMAX = 2 MB (tune per buffer capacity). Target: mark before queue exceeds 10% of buffer |
| Switch | Buffer allocation | Allocate dedicated shared buffer pool to RDMA priority; size headroom per port count × per-port headroom formula |
| Switch | DSCP-to-PFC mapping | Map RDMA DSCP value to PFC-enabled 802.1p priority class consistently across all switches |
| NIC / Host | CNP rate limiter | One CNP per 1 µs per QP (default Mellanox ConnectX-7) |
| Fabric | PFC deadlock prevention | Use single lossless priority class; verify no circular buffer dependencies in switch topology |
9. ECMP vs Adaptive Routing vs Spray: Solving the Elephant Flow Problem
Traffic load balancing is one of the most critical — and most nuanced — aspects of AI fabric design. AllReduce operations generate many simultaneous large flows (elephant flows) between the same source-destination pairs. Naive ECMP hashing can cause severe load imbalance, with some paths congested while adjacent paths sit idle.
Standard ECMP (Equal-Cost Multi-Path)
Standard ECMP hashes flows to paths using a tuple of source IP, destination IP, source port, destination port, and protocol. In AI workloads, AllReduce flows between the same GPU pairs are typically persistent, large flows. ECMP assigns each flow a fixed path based on the hash — and that path does not change even if it becomes congested. The result is hash polarisation: multiple large flows land on the same path while others remain lightly loaded.
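A few lines of Python make the polarisation effect tangible. The sketch below hashes a handful of made-up persistent flows onto four equal-cost paths using a static tuple hash (CRC32 stands in for a switch's hardware hash; the addresses are invented) — with few large flows, the resulting distribution is typically uneven:

```python
# Demonstration of ECMP hash polarisation: a static hash of the flow
# tuple pins each persistent elephant flow to one path, regardless of
# load. Flow tuples here are made up; CRC32 stands in for the ASIC hash.
import zlib
from collections import Counter

PATHS = 4
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791) for i in range(1, 9)]

def ecmp_path(src: str, dst: str, dport: int) -> int:
    """Static hash of the flow tuple -> fixed path index (never rebalanced)."""
    return zlib.crc32(f"{src}|{dst}|{dport}".encode()) % PATHS

load = Counter(ecmp_path(*f) for f in flows)
print(dict(load))  # flows-per-path; rarely the even {0:2, 1:2, 2:2, 3:2}
```

Because the hash is deterministic, a congested path stays congested for the lifetime of the flow — the motivation for the adaptive schemes described next.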
Adaptive Routing (Per-Packet / Per-Flowlet)
Adaptive routing dynamically selects the output port for each packet or flowlet based on real-time queue depth or credit availability, instead of a static hash. This moves traffic away from congested paths instantly, without waiting for TCP/DCQCN to react.
- Per-packet adaptive routing: Each packet independently selects the least-loaded path. Maximum load balancing efficiency, but can cause out-of-order packet delivery. Requires the receiving NIC to support out-of-order RoCEv2 reassembly (not all NICs do)
- Per-flowlet routing: Packets within a flow that arrive within a short time window (flowlet threshold) are sent on the same path. A new flowlet starts when there's a gap larger than the threshold — allowing path switching between bursts. Avoids reordering while still adapting to congestion
- NVIDIA Spectrum — SHIELD (Switch-based Hardware Intelligent Efficient Load Distribution): Nvidia's implementation in Spectrum-4 uses credit-based adaptive routing — switches exchange credit information about queue depths and route packets to the path with the most available credits. Achieves near-optimal load balance with minimal reordering
- Arista 7800R4 — Adaptive Load Balancing: Arista's implementation uses per-flowlet adaptive routing with configurable flowlet timeout. Integrates with ECMP and supports 400G/800G at line rate
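The flowlet idea above can be sketched in a few lines. This is a simplified model of the technique, not any vendor's implementation — the timeout value and the queue-length load metric are assumptions:

```python
# Sketch of per-flowlet path selection: packets separated by less than
# the flowlet timeout stay on the current path (no reordering); a larger
# inter-packet gap opens a new flowlet that may move to a better path.

FLOWLET_TIMEOUT_US = 50.0   # illustrative value

class FlowletRouter:
    def __init__(self, n_paths: int):
        self.load = [0] * n_paths   # toy congestion metric per path
        self.state = {}             # flow -> (last_seen_us, path)

    def route(self, flow: str, now_us: float) -> int:
        last, path = self.state.get(flow, (None, None))
        if last is None or now_us - last > FLOWLET_TIMEOUT_US:
            path = self.load.index(min(self.load))  # new flowlet: best path
        self.state[flow] = (now_us, path)
        self.load[path] += 1
        return path

r = FlowletRouter(n_paths=4)
p1 = r.route("flowA", 0.0)     # new flowlet
p2 = r.route("flowA", 10.0)    # gap < timeout: same path, in-order
p3 = r.route("flowA", 200.0)   # gap > timeout: free to switch paths
print(p1, p2, p3)              # -> 0 0 1
```

Because a path change only happens at a flowlet boundary, all packets that could arrive out of order relative to each other share one path — adaptivity without NIC-side reassembly.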
Packet Spraying
Packet spraying takes adaptive routing to its logical extreme: every packet is sent on a different path in a round-robin or randomised fashion, regardless of flow membership. This achieves near-perfect load balancing at the cost of guaranteed out-of-order delivery. Packet spraying requires NIC-level reassembly support (e.g., Nvidia ConnectX-7 with ROME or Lossless Spray mode). It is used in Nvidia's DGX SuperPOD reference architecture with NDR InfiniBand fabrics.
| Method | Load Balance Quality | Reordering Risk | Switch Requirement | NIC Requirement |
|---|---|---|---|---|
| Standard ECMP | Poor (hash polarisation) | None | All switches | All NICs |
| ECMP + ECMP-wide hashing | Moderate | None | All switches | All NICs |
| Flowlet Adaptive Routing | Good | Low (inter-flowlet only) | Vendor-specific (Arista, Cisco) | Standard RoCEv2 |
| Credit-Based Adaptive (SHIELD) | Excellent | Low–Moderate | Nvidia Spectrum only | ConnectX-7 (reorder support) |
| Packet Spray | Near-perfect | High (all packets) | Spray-capable (IB or Ethernet) | Out-of-order capable NIC required |
10. 400G and 800G Switching: The Hardware Behind AI Fabrics
The bandwidth demands of GPU clusters have driven the data center switching industry through an unprecedented acceleration. 400G ports — which were considered future-proofing for general enterprise networks just three years ago — are already the baseline minimum for AI leaf switches. 800G is the current frontier, and 1.6T is actively in development.
Key Switch Silicon for AI Fabrics
| ASIC | Vendor | Bandwidth | Max Ports | Key AI Feature |
|---|---|---|---|---|
| Tomahawk 5 (TH5) | Broadcom | 51.2 Tbps | 64× 800G or 128× 400G | Deep buffers, PFC/ECN, high-density 800G |
| Tofino 3 / P4 | Intel (Barefoot) | 12.8 Tbps | 32× 400G | P4 programmability for custom telemetry and load balancing |
| Cisco Silicon One G300 | Cisco | 102.4 Tbps | 64× 1.6T or 128× 800G | ICN intelligent collective networking, shared buffer, hardware telemetry |
| Spectrum-4 | Nvidia | 51.2 Tbps | 64× 800G | SHIELD adaptive routing, tight NVLink/RoCE integration |
| Jericho3-AI (J3-AI) | Broadcom | 57.6 Tbps | 72× 800G | Deep shared buffer (2 GB), hardware-offloaded AllReduce |
| Quantum-3 (XDR IB) | Nvidia | 57.6 Tbps | 64× 800G (XDR InfiniBand) | Native RDMA, SHARP in-network compute, SHIELD routing |
Buffer Size: A Critical Differentiator for AI Fabrics
Packet buffer size is arguably the most important switch hardware characteristic for AI workloads — more impactful than raw throughput for many deployments. The reason: AllReduce synchronisation barriers cause periodic incast events — moments where many-to-one flows converge simultaneously on a single egress port, creating transient bursts that exceed line rate for a short window. Shallow-buffered switches drop packets during these bursts, triggering PFC cascades and DCQCN throttling that can degrade cluster training throughput by 20–40%.
- Shallow buffer (~50 MB shared): Tomahawk series, Trident 4. Suitable for general enterprise east-west traffic. Can cause issues at high AI workload density without careful PFC/ECN tuning
- Deep buffer (~2–4 GB shared): Jericho3-AI, some Cisco platforms. Purpose-designed for absorbing AllReduce incast bursts. Significantly reduces PFC pause events
- Shared vs partition buffer: Shared buffer pools dynamically allocate memory across ports, making efficient use of available buffer. Partitioned (dedicated per-port) buffers waste capacity on lightly loaded ports. Shared is strongly preferred for AI fabrics
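The incast buffering requirement can be estimated with a simple fluid model — a back-of-envelope sketch, not a queueing analysis; the burst size and rank count are assumed values:

```python
# Back-of-envelope incast model: N senders burst simultaneously toward
# one egress port; the switch must buffer everything that arrives
# faster than the egress port can drain during the burst.

def incast_buffer_bytes(n_senders: int, burst_bytes: int,
                        ingress_gbps: float, egress_gbps: float) -> float:
    """Peak buffer occupancy for a synchronized burst (fluid model)."""
    burst_time_s = burst_bytes * 8 / (ingress_gbps * 1e9)
    arrived = n_senders * burst_bytes
    drained = egress_gbps * 1e9 / 8 * burst_time_s
    return max(0.0, arrived - drained)

# 32 ranks each bursting 1 MB into a single 400G egress port
need = incast_buffer_bytes(32, 1_000_000, 400, 400)
print(f"{need / 1e6:.0f} MB of buffer needed")  # 31 MB
```

A single synchronized 32-to-1 burst already consumes most of a ~50 MB shallow shared buffer, which is why deep-buffer ASICs (or aggressive early ECN marking) matter for AllReduce-heavy fabrics.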
11. Optics and Cabling for AI Data Centers
At 400G and 800G speeds, optical interconnect design becomes a critical success factor for AI fabric deployments. The wrong optics choice can constrain reach, increase power consumption, or create reliability issues at scale.
| Form Factor | Speed | Typical Reach | Power | Best Use |
|---|---|---|---|---|
| 400G QSFP-DD DR4 | 400G (4×100G PAM4) | 500m (SMF) | ~9W | ToR-to-server, ToR-to-spine within building |
| 400G QSFP-DD SR4 | 400G (4×100G PAM4) | 100m (MMF OM4) | ~6W | Short-reach within same row/aisle |
| 800G OSFP 2×FR4 | 800G (8×100G PAM4) | 2km (SMF) | ~14W | Spine-to-spine, inter-building |
| 800G LPO (Linear Pluggable Optic) | 800G | 100m (MMF) | ~7W (~50% vs retimed) | High-density AI leaf-to-server; power-constrained racks |
| 400G DAC (Direct Attach Copper) | 400G | 3–5m (passive) / 7m (active) | <1W (passive) | Intra-rack connections where reach permits |
| 800G AOC (Active Optical Cable) | 800G | Up to 30m | ~10W | Intra-row connections exceeding DAC reach |
12. Vendor Landscape: Cisco, Arista, Nvidia, Broadcom, Juniper
The AI networking market has attracted significant investment from all major networking vendors, with each taking a distinct architectural approach. Understanding these differences is essential for multi-year infrastructure planning.
| Vendor | Key Platform | Switch Silicon | AI Differentiator | Best Fit |
|---|---|---|---|---|
| Cisco | Nexus 9364-SG3, Cisco 8132 (IOS-XR) | Silicon One G300 (102.4 Tbps) | ICN shared buffer, path-based LB, hardware telemetry, SONiC + NX-OS + ACI, Nexus One AgenticOps | Enterprise AI DC, sovereign cloud, SP AI clusters; Cisco-invested organisations |
| Arista | 7800R4 Series, 7050X3 | Broadcom TH5, custom deep-buffer | EOS extensibility, deep telemetry (DANZ), adaptive LB, SONiC support, proven hyperscale deployments | Hyperscalers, financial services AI, brownfield data centers with EOS |
| Nvidia | Quantum-3 (IB), Spectrum-4 (Ethernet) | Quantum-3 XDR, Spectrum-4 (51.2 Tbps) | Tightest GPU integration; SHARP in-network compute; SHIELD adaptive routing; best single-vendor stack | Nvidia GPU-centric deployments; DGX SuperPOD; HPC + AI convergence |
| Juniper | QFX5220, PTX10008 | Broadcom TH5, custom Express silicon | Apstra IBN for intent-based automation, Paragon telemetry, strong service provider DNA | Telco AI, research networks, Juniper-invested SP and enterprise |
| Broadcom | TH5 (OEM via Arista, Cisco, others) | Tomahawk 5 (51.2 Tbps), Jericho3-AI | Most widely deployed AI fabric silicon; deep buffer J3-AI for AllReduce; hardware AllReduce offload | Whitebox / OCP deployments; cost-sensitive hyperscale build-outs |
13. Storage Networking in AI: GPUDirect Storage and NFS over RDMA
Storage networking is the second critical fabric in AI data centers, often receiving less attention than the compute fabric but equally important for training throughput. Training large models requires reading enormous datasets at high speed — keeping thousands of GPUs fed from a multi-hundred-terabyte dataset, particularly a multimodal one, can demand sustained storage throughput in the tens to hundreds of GB/s across the cluster.
GPUDirect Storage (GDS)
GPUDirect Storage is Nvidia's technology that enables data to flow directly from NVMe SSDs or network storage directly into GPU memory — bypassing the CPU and system RAM entirely. The traditional path for loading training data is: NVMe → PCIe → CPU DRAM → PCIe → GPU VRAM. GPUDirect Storage shortens this to: NVMe → PCIe switch → GPU VRAM (or NFS server → RDMA NIC → GPU VRAM for network storage).
- Local GDS: NVMe SSDs connect to the same PCIe domain as the GPU. The NVMe controller DMAs data directly into GPU VRAM via peer-to-peer PCIe. Achieves 12–25 GB/s per server with modern NVMe SSDs
- Network GDS (GPUDirect RDMA): NFS or Lustre server serves data over RDMA. The storage client on the GPU server uses RDMA to read data directly into GPU VRAM over the RDMA fabric. Requires RoCEv2 or InfiniBand storage fabric
- Requirements: GPU with PCIe P2P support (all H100/B100), NVMe with GDS support, Nvidia driver ≥ 460, cuFile library, CUDA 11.4+
Storage Fabric Architecture Options
| Architecture | Protocol | Throughput | GPU DMA Support | Best For |
|---|---|---|---|---|
| Local NVMe (in-server) | NVMe/PCIe | 12–25 GB/s per server | Yes (local GDS) | Highest throughput; limited capacity per server |
| All-Flash NFS over RDMA | NFS v4.1+ over RoCEv2 | 50–200 GB/s (cluster-wide) | Yes (network GDS) | Shared dataset access; large training sets |
| Lustre over InfiniBand | Lustre / LNET over IB | 100–500 GB/s (large clusters) | Yes (GDS via IB) | HPC + AI convergence; large supercomputing clusters |
| Object Storage (S3-compatible) | HTTP/S3 over TCP | Variable (multi-GB/s at scale) | No direct GPU DMA | Checkpoint storage, cold dataset retrieval, cloud AI training |
14. Design Best Practices and Common Pitfalls
Drawing from documented hyperscaler deployments, Nvidia reference architectures, and published academic research, the following best practices and anti-patterns are critical for successful AI fabric design.
✅ Best Practices
- Non-blocking leaf tier: The GPU-to-leaf switch tier must be strictly non-blocking (1:1 oversubscription). Any packet drops at this tier directly cause PFC cascades and DCQCN throttling across all GPU pairs sharing the leaf switch
- Separate compute and storage traffic: Run GPU-to-GPU (RoCEv2 RDMA) and GPU-to-storage (NFS over RDMA or TCP) on separate VLANs with separate DSCP markings and PFC priority classes, or ideally on physically separate network ports. Mixing these on the same interface can cause PFC head-of-line blocking
- Consistent MTU across the fabric: Use jumbo frames (MTU 9000) across all AI fabric interfaces — switches, NICs, and storage appliances. MTU mismatches are catastrophic for RDMA performance: RoCEv2 does not fragment, so oversized packets are simply dropped
- Deploy INT (In-band Network Telemetry): Real-time per-flow queue depth and latency telemetry is essential for diagnosing congestion in AI fabrics. Deploy INT or gRPC streaming telemetry on all switches. Polling-based SNMP is too slow to catch microsecond-scale congestion events
- GPU-job scheduling with topology awareness: Configure the job scheduler (SLURM, Kubernetes + GPU Operator) to be topology-aware — schedule training jobs on GPU nodes within the same rail/pod whenever possible. This minimises inter-pod traffic and reduces the blast radius of congestion
- Validate PFC configuration with traffic testing: Before production use, inject synthetic AllReduce traffic patterns (e.g., using ib_send_bw or perftest tools) to validate PFC triggering, ECN marking thresholds, and DCQCN rate reduction. A single misconfigured DSCP mapping can disable lossless transport
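The DSCP-consistency point above lends itself to automation. Below is a minimal sketch of a fabric-wide QoS audit, assuming device configurations have already been collected into a dict; the device names, field names, and values are hypothetical, not pulled from any vendor API:

```python
from collections import Counter

# Hypothetical per-device QoS inventory; names and values are illustrative.
devices = {
    "leaf1":     {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "leaf2":     {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "spine1":    {"rdma_dscp": 26, "pfc_priority": 3, "mtu": 9000},
    "nic-gpu01": {"rdma_dscp": 24, "pfc_priority": 3, "mtu": 9000},  # drifted
}

def audit(devices):
    """Flag devices whose DSCP/PFC/MTU settings deviate from the fabric majority."""
    issues = []
    for key in ("rdma_dscp", "pfc_priority", "mtu"):
        expected, _ = Counter(d[key] for d in devices.values()).most_common(1)[0]
        for name, cfg in devices.items():
            if cfg[key] != expected:
                issues.append((name, key, cfg[key], expected))
    return issues

print(audit(devices))  # [('nic-gpu01', 'rdma_dscp', 24, 26)]
```

A single drifted NIC like the one flagged here is exactly the "misconfigured DSCP mapping" failure mode: the packet still arrives, but it lands in a lossy priority class and silently disables lossless transport for that host.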
🚫 Common Design Pitfalls
| Pitfall | Impact | Remedy |
|---|---|---|
| Enabling PFC globally on all priorities | PFC deadlock — fabric-wide pause cascade on congestion event | Enable PFC (no-drop) on the RDMA priority class only; leave all other traffic classes lossy (best-effort) |
| Using standard 1500B MTU | Reduced RDMA throughput from per-packet overhead; MTU mismatches drop packets, since RoCEv2 does not fragment | Configure MTU 9000 end-to-end on all AI fabric interfaces |
| All GPU NICs on a single ToR switch | Single point of failure; no rail isolation; 8× oversubscription on uplinks | Implement Rail-Optimised topology — one NIC per rail switch |
| Using standard ECMP for AllReduce flows | Hash polarisation — congested paths alongside empty paths; 30–50% throughput loss | Deploy adaptive routing or flowlet ECMP on all AI fabric switches |
| Mixing RoCEv2 and TCP on same DSCP/PFC class | TCP traffic causes PFC storms that degrade RDMA; TCP retransmit loops | Strict DSCP separation; RDMA on dedicated lossless priority; TCP on best-effort |
| Over-sized ECN thresholds (KMAX too high) | ECN marks too late — queues fill before rate reduction takes effect, triggering PFC | Tune KMIN/KMAX to mark at 5–10% queue occupancy; validate with traffic testing |
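The KMIN/KMAX guidance in the last row translates directly into byte values. A minimal sketch, assuming thresholds are expressed against total shared buffer; real ASICs apply per-queue dynamic buffer allocation, so vendor tuning guides take precedence over this arithmetic:

```python
def ecn_thresholds(shared_buffer_mb, kmin_pct=5, kmax_pct=10):
    """Return (KMIN, KMAX) in bytes so ECN marking begins at 5-10% occupancy."""
    buf_bytes = shared_buffer_mb * 1024 * 1024
    return int(buf_bytes * kmin_pct / 100), int(buf_bytes * kmax_pct / 100)

kmin, kmax = ecn_thresholds(50)  # e.g. a 50 MB shared-buffer leaf ASIC
print(kmin, kmax)  # 2621440 5242880
```

Marking this early gives DCQCN time to slow senders down while the queue is still mostly empty, so PFC remains the last-resort safety net rather than the primary congestion control.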
15. Summary and Architecture Selection Guide
AI data center networking represents the most demanding and fastest-evolving segment of network infrastructure today. The fundamental shift from traditional north-south enterprise traffic to synchronised, all-to-all GPU collective communications has invalidated many established design assumptions and created an entirely new set of requirements around lossless transport, adaptive routing, and ultra-high-bandwidth fabrics.
| Cluster Size | GPU Count | Recommended Fabric | Transport Protocol | Reference Vendor |
|---|---|---|---|---|
| Small AI / Dev cluster | 8–64 GPUs (1–8 servers) | Single-switch 400G leaf, or NVLink domain only | RoCEv2 or IB HDR | Any 400G switch |
| Mid AI cluster | 64–512 GPUs | Rail-Optimised 2-tier (leaf + spine), 400G | RoCEv2 (preferred) or IB NDR | Arista 7050X3, Cisco Nexus 9300 |
| Large enterprise AI | 512–4,096 GPUs | Rail-Optimised 2-tier, 400G/800G, adaptive routing | RoCEv2 with DCQCN | Cisco Nexus 9364-SG3, Arista 7800R4, Nvidia Spectrum-4 |
| Hyperscale AI cluster | 4,096–100,000+ GPUs | Rail-Optimised 3-tier, 800G, packet spray / adaptive routing | RoCEv2 (hyperscalers) or IB XDR (Nvidia) | Arista 7800R4, Cisco G300, Broadcom TH5/J3-AI whitebox |
| HPC + AI convergence | Any scale | InfiniBand fabric (SHARP, adaptive routing) | InfiniBand NDR/XDR | Nvidia Quantum-3, DGX SuperPOD reference |
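The table above reduces to a simple decision function. This is a sketch only — the thresholds mirror the table rows, and real selections also weigh vendor relationships, power, and facility constraints:

```python
def recommend_fabric(gpu_count):
    """Map cluster size to a fabric tier, mirroring the selection table above."""
    if gpu_count <= 64:
        return "single 400G leaf or NVLink domain; RoCEv2 or IB HDR"
    if gpu_count <= 512:
        return "Rail-Optimised 2-tier 400G; RoCEv2 preferred"
    if gpu_count <= 4096:
        return "Rail-Optimised 2-tier 400G/800G, adaptive routing; RoCEv2 with DCQCN"
    return "Rail-Optimised 3-tier 800G, packet spray; RoCEv2 or IB XDR"

print(recommend_fabric(1024))
# Rail-Optimised 2-tier 400G/800G, adaptive routing; RoCEv2 with DCQCN
```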
16. Frequently Asked Questions
Q: Can I build an AI cluster on a regular enterprise Ethernet fabric?
Technically yes — the packets will be delivered. However, without lossless transport configuration (PFC + ECN + DCQCN), MTU 9000, and a non-blocking topology, you will experience significant throughput degradation due to packet loss and RDMA retransmits. An enterprise fabric not optimised for RDMA can deliver 30–60% lower training throughput compared to a purpose-configured AI fabric. The required configuration changes (PFC, ECN, DSCP marking, jumbo frames) are achievable on most modern enterprise switches — but require careful planning.
Q: Should I choose InfiniBand or RoCEv2 for a new AI cluster?
For most new enterprise and cloud AI deployments in 2025, RoCEv2 over Ethernet is the recommended choice for clusters of 64+ GPUs. The multi-vendor Ethernet ecosystem provides significant cost advantages, IP routing allows larger fabric scales, and operational tooling is far more mature. InfiniBand remains the best choice for pure HPC environments where MPI latency is the primary driver, or for organisations purchasing Nvidia DGX SuperPOD turnkey systems.
Q: What is the recommended oversubscription ratio for an AI fabric?
The GPU-to-leaf tier should be non-blocking (1:1) — no oversubscription. The leaf-to-spine tier can be 1:1 for maximum performance, but 2:1 is acceptable for well-managed job scheduling that confines training runs within pods. Core/aggregation tiers connecting pods to each other can use 4:1, as inter-pod traffic is typically light (primarily pipeline parallelism, not AllReduce).
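These ratios are easy to compute per switch tier. A sketch with illustrative port counts:

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Downlink capacity divided by uplink capacity for one switch tier."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 64-port 400G leaf split 32 down / 32 up: non-blocking GPU-to-leaf tier
print(oversubscription(32, 400, 32, 400))  # 1.0
# Same leaf split 48 down / 16 up: 3:1, too aggressive for the GPU tier
print(oversubscription(48, 400, 16, 400))  # 3.0
```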
Q: What switch buffer depth do I need for an AI fabric?
Buffer requirements depend on the number of GPUs, AllReduce tree depth, and link speeds. As a rule of thumb: for a 32-server 400G leaf switch, a minimum of 50 MB shared buffer is required to absorb AllReduce incast events without PFC triggering. For 800G fabrics or clusters beyond 256 GPUs, 500 MB to 2+ GB of shared buffer is recommended. Deep-buffer ASICs (Jericho3-AI at 2 GB, select Cisco platforms) are specifically designed for these requirements.
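The rule of thumb follows from incast arithmetic: if every server behind a leaf bursts into one egress port simultaneously, the shared buffer must absorb the sum of the in-flight bursts until ECN/DCQCN slows the senders. A deliberately simplified model — the per-sender burst size is an assumed figure, not a measured one:

```python
def incast_buffer_mb(n_senders, burst_kb_per_sender):
    """Worst-case shared buffer (MB) if all senders burst into one egress at once."""
    return n_senders * burst_kb_per_sender / 1024

# 32 servers each bursting ~2 MB of AllReduce traffic at the same egress port:
print(incast_buffer_mb(32, 2048))  # 64.0
```

The 64 MB result is in line with the ~50 MB rule of thumb for a 32-server leaf; larger clusters and 800G links scale the burst sum up, which is what drives the deep-buffer ASIC requirement.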
Q: How does training job topology affect network design?
Training job topology (data parallelism, tensor parallelism, pipeline parallelism) directly determines traffic patterns. Tensor parallelism generates very high bandwidth between a small number of tightly coupled GPUs — ideally within the same NVLink domain or the same leaf switch. Data parallelism generates AllReduce traffic across all GPUs in the job — requires full bisection bandwidth at the leaf tier. Pipeline parallelism generates sequential flow between pipeline stages — benefits from topology-aware job placement ensuring consecutive stages are in the same rack or adjacent racks.
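The data-parallel AllReduce load can be quantified with the standard ring-AllReduce formula, in which each GPU transmits 2(N−1)/N times the gradient size per step (reduce-scatter plus all-gather phases). A sketch with an illustrative model size:

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes, n_gpus):
    """Bytes each GPU transmits per ring AllReduce: 2 * (N - 1) / N * gradient size."""
    return 2 * (n_gpus - 1) / n_gpus * gradient_bytes

# 70B-parameter model with fp16 gradients (~140 GB) across an 8-GPU group:
gb_sent = ring_allreduce_bytes_per_gpu(140e9, 8) / 1e9
print(round(gb_sent, 1))  # 245.0 (GB transmitted per GPU per synchronisation step)
```

Numbers of this magnitude, repeated every optimiser step, are why the data-parallel dimension demands full bisection bandwidth at the leaf tier.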
Q: Can I use AI switching hardware (e.g., Broadcom TH5) for general enterprise workloads as well?
Yes — Tomahawk 5 and similar ASICs are fully capable of running standard IP routing, spanning tree, and VXLAN for general enterprise workloads. Many organisations run a unified fabric where the same switches handle both AI/RDMA workloads (in the AI pod) and general-purpose east-west traffic (in adjacent clusters), with separate VRFs and DSCP markings separating the traffic classes. The primary trade-off is cost — AI-class switches with deep buffers cost 2–3× more than standard enterprise ToR switches.
Technical content based on Nvidia DGX SuperPOD reference architecture documentation, Meta AI Research Supercluster design papers, Microsoft Research DCQCN publication, Cisco Silicon One G300 product documentation, Arista 7800R4 series technical briefs, Broadcom Tomahawk 5 and Jericho3-AI data sheets, and IEEE 802.3df/bs standards. Latency and bandwidth figures are representative values based on published benchmarks. All specifications current as of March 2026.