Leaf-Spine Architecture Explained for Modern Data Centers
Why the three-tier network is dead, how Clos fabric works, ECMP and oversubscription explained, VXLAN/EVPN overlay design, BGP underlay, and a full vendor comparison for Cisco ACI, NX-OS, and Arista EOS
By Route XP | Published: March 2026 | Updated: March 2026 | Data Center, Cisco ACI, Arista
- Why Three-Tier Architecture Failed the Modern Data Center
- The Clos Network: Mathematical Foundation
- Leaf-Spine: How It Works
- ECMP: Equal-Cost Multi-Path Load Balancing
- Oversubscription Ratios Explained
- BGP Underlay Design
- VXLAN/EVPN Overlay: L2 and L3 Over IP
- Vendor Solutions: Cisco ACI, NX-OS, Arista EOS
- Scaling Leaf-Spine: Super-Spine and Multi-Pod
- Design Decision Guide
- Frequently Asked Questions
1. Why Three-Tier Architecture Failed the Modern Data Center
For the better part of two decades, enterprise data centers were built on a three-tier hierarchical model: a Core layer providing high-speed routing between aggregation blocks, an Aggregation (or Distribution) layer connecting groups of access switches and hosting Layer 3 boundaries, and an Access layer connecting servers. This design was inherited from campus networking and worked well — until the nature of data center traffic fundamentally changed.
The three-tier model was optimized for north-south traffic — client machines outside the data center communicating with servers inside it. Traffic entered through the core, was distributed by the aggregation layer, and reached servers at the access layer. The bandwidth requirement at the top of the hierarchy was a fraction of the total server bandwidth, because individual users were the bottleneck.
| Problem in Three-Tier | Root Cause | Leaf-Spine Solution |
|---|---|---|
| East-west bandwidth bottleneck | Server-to-server traffic hairpins through aggregation and core layers | Every server pair is exactly 2 hops apart; full bisection bandwidth available |
| Spanning Tree Protocol (STP) blocking | Redundant uplinks must be blocked by STP to prevent loops — wasting 50% of link capacity | No STP between leaf and spine — all paths active via ECMP routing |
| Limited scalability | Core and aggregation are complex, large chassis — adding capacity requires forklift upgrades | Add a leaf switch to add server capacity; add a spine to add bandwidth — independently |
| Unpredictable latency | Variable hop count between servers depending on their physical location in the hierarchy | Uniform latency — every server-to-server path is always exactly 2 hops |
| VLAN spanning complexity | Extending VLANs across aggregation blocks requires careful STP tuning and VTP management | VXLAN overlay decouples L2 domains from physical topology entirely |
2. The Clos Network: Mathematical Foundation
The leaf-spine architecture is a direct application of the Clos network — a multistage switching topology first described by Bell Labs engineer Charles Clos in 1953 for telephone switching systems. Clos proved mathematically that a multistage network of smaller switches could achieve the same non-blocking performance as a single crossbar switch at a fraction of the hardware cost. The insight applies equally to data center Ethernet switching seven decades later.
A Clos network is defined by three parameters: n (the number of inputs/outputs per ingress stage switch), m (the number of middle-stage switches), and r (the number of ingress/egress stage switches). For a network to be strictly non-blocking — meaning any unused input can be connected to any unused output without rearranging existing connections — Clos proved that m ≥ 2n − 1.
In the context of a data center leaf-spine fabric, this translates into a hardware design rule: for a non-blocking fabric, the aggregate uplink bandwidth of each leaf must at least equal its aggregate server-facing bandwidth. A leaf switch carrying 16 ports' worth of server traffic needs the equivalent of 16 uplink ports — one to each of 16 spine switches. In practice, most data center deployments accept a degree of oversubscription (discussed in Section 5) and deploy fewer spines accordingly.
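The two Clos conditions can be sketched as a pair of helper functions — the function names and parameterization are ours, not from any vendor tooling:

```python
def strictly_non_blocking(n: int, m: int) -> bool:
    """Clos (1953): a three-stage fabric is strictly non-blocking —
    any free input reaches any free output without rearranging
    existing connections — when m >= 2n - 1, where n is the input
    count per ingress switch and m is the middle-stage switch count."""
    return m >= 2 * n - 1

def rearrangeably_non_blocking(n: int, m: int) -> bool:
    """The weaker condition packet fabrics actually build to:
    middle-stage (spine) count at least equals inputs per leaf,
    i.e. uplink capacity matches downlink capacity."""
    return m >= n

# A leaf with 16 server-facing ports needs 31 spines for strict
# non-blocking, but 16 already gives full bisection bandwidth.
print(strictly_non_blocking(16, 31))        # True
print(strictly_non_blocking(16, 16))        # False
print(rearrangeably_non_blocking(16, 16))   # True
```

The rearrangeable condition (m ≥ n) is the one data center designers actually use — packet switching rearranges "connections" on every forwarding decision, so the stricter 2n − 1 bound from circuit switching rarely applies.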
3. Leaf-Spine: How It Works
The leaf-spine fabric consists of exactly two switch tiers with a rigid connectivity rule: every leaf connects to every spine; no leaf connects to another leaf; no spine connects to another spine. This constraint is what gives the fabric its predictable, uniform performance characteristics.
Leaf Switches — The Server-Facing Tier
Leaf switches (also called Top-of-Rack or ToR switches in physical deployments) are the access layer. Every server, hypervisor host, storage node, and network service appliance (firewall, load balancer) connects to a leaf switch. Leaf switches have two categories of ports:
- Downlinks: Face the servers. Typically 1G, 10G, 25G, or 100G depending on server generation. In a 48-port leaf switch, 32–48 ports are typically downlinks
- Uplinks: Face the spine switches. Higher speed than downlinks — commonly 100G or 400G. In a standard 48×25G + 8×100G leaf, the 8 ports in the rightmost group are the spine uplinks, one per spine switch
A key rule: the Layer 3 boundary lives on the leaf switch in a modern leaf-spine design. Each leaf is a router, not just a switch. Servers in the same subnet communicate via the leaf switch's local forwarding table. Servers in different subnets communicate via the IP fabric between leaf switches (the spine tier). This is the opposite of the three-tier model, where Layer 3 boundaries lived at the aggregation layer.
Spine Switches — The Interconnect Tier
Spine switches have a single function: forward packets between leaf switches as fast as possible. They have no server-facing ports — only leaf-facing ports. Every spine connects to every leaf with exactly one link. Spines are typically high-radix, high-speed devices — 64-port 100G or 32-port 400G switches are common spine candidates. Because all paths between any two leaves are equal cost, spines are fully utilized by ECMP hashing.
The Two-Hop Guarantee
The defining characteristic of leaf-spine is the two-hop guarantee: traffic between servers on different leaves traverses exactly two switch hops — leaf → spine → leaf — with no variation (servers on the same leaf are a single hop). This predictable latency profile is critical for latency-sensitive workloads like trading systems, real-time analytics, and AI training clusters where jitter between GPU collective operations must be minimized.
Server-A ──> Leaf-1 ──> Spine-2 (ECMP selected) ──> Leaf-4 ──> Server-B
Hop 1 (L3 route) Hop 2 (L3 route)
Total hops: 2 — regardless of fabric size or server location
4. ECMP: Equal-Cost Multi-Path Load Balancing
ECMP is the mechanism that makes all spine uplinks simultaneously active in a leaf-spine fabric. In the three-tier model, STP blocked redundant links to prevent loops. In leaf-spine, there are no Layer 2 loops — all inter-switch links are Layer 3 routed interfaces. Routing protocols (BGP or OSPF) install multiple equal-cost routes to every destination prefix, one via each spine switch. The router hardware distributes flows across all equal-cost paths simultaneously.
ECMP Hashing: How Flows Are Distributed
ECMP does not split individual packets across multiple paths (that would cause reordering). Instead, it hashes each flow to a specific path and sends all packets within that flow on the same path. The hash input is typically the 5-tuple: source IP, destination IP, source port, destination port, and IP protocol. This ensures that packets belonging to a single TCP session always follow the same path — preserving order — while different flows between the same server pair can use different spines.
Leaf-1# show ip route 10.10.20.0/24
10.10.20.0/24, ubest/mbest: 4/0
*via 10.0.0.2, Eth1/49, [20/0], 02:14:03, bgp-65001, external
*via 10.0.0.4, Eth1/50, [20/0], 02:14:03, bgp-65001, external
*via 10.0.0.6, Eth1/51, [20/0], 02:14:03, bgp-65001, external
*via 10.0.0.8, Eth1/52, [20/0], 02:14:03, bgp-65001, external
# 4 equal-cost paths via 4 spine switches — all active simultaneously
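The flow-to-path mapping can be made concrete with a toy hash. Real switch ASICs use proprietary hardware hash functions; the `md5`-based `ecmp_path` below is purely illustrative of the per-flow (not per-packet) behavior:

```python
import hashlib

SPINES = ["Spine-1", "Spine-2", "Spine-3", "Spine-4"]

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, paths=SPINES):
    """Hash the 5-tuple and pick one equal-cost path. Every packet
    of a flow produces the same hash, so the flow stays on one spine
    and TCP ordering is preserved."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return paths[digest % len(paths)]

# All packets of one TCP session map to the same spine...
flow = ("10.10.10.5", "10.10.20.7", 49152, 443, "tcp")
assert ecmp_path(*flow) == ecmp_path(*flow)
# ...while a second flow between the same hosts (different source
# port) may hash to a different spine, spreading load.
print(ecmp_path("10.10.10.5", "10.10.20.7", 49153, 443, "tcp"))
```

This is also why a single elephant flow cannot exceed the bandwidth of one uplink: the hash pins it to a single spine path.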
ECMP Path Count and Bandwidth Scaling
The effective server-to-server bandwidth scales linearly with the number of spines. A leaf switch with 4 × 100G uplinks to 4 spines has an effective 400G of uplink bandwidth available (across all concurrent flows). Adding two more spines and two more uplinks brings this to 600G — without replacing any existing hardware. This is the elastic horizontal scaling that makes leaf-spine ideal for cloud-native data centers.
| Spine Count | Uplink Speed | Total Uplink Bandwidth / Leaf | Max Inter-Leaf Throughput |
|---|---|---|---|
| 2 spines | 100G | 200G | 200 Gbps (2 × 100G) |
| 4 spines | 100G | 400G | 400 Gbps (4 × 100G) |
| 8 spines | 100G | 800G | 800 Gbps (8 × 100G) |
| 4 spines | 400G | 1.6 Tbps | 1.6 Tbps (4 × 400G) — AI fabric standard |
5. Oversubscription Ratios Explained
Oversubscription is the ratio of total downlink (server-facing) bandwidth to total uplink (spine-facing) bandwidth on a leaf switch. A 3:1 oversubscription ratio means you have three times more server bandwidth than spine bandwidth — acceptable because not all servers transmit at line rate simultaneously in typical workloads.
Choosing the right oversubscription ratio for your workload is one of the most critical and nuanced leaf-spine design decisions. The right answer depends on traffic characteristics, workload type, and budget constraints.
| Ratio | Example Config (48×25G + 8×100G leaf) | Suitable For | Not Suitable For |
|---|---|---|---|
| 1:1 (Non-blocking) | Equal downlink and uplink bandwidth — all uplink ports used | AI/HPC training clusters, financial trading, real-time analytics | Cost-sensitive general-purpose DC — expensive in spine switches |
| 2:1 | 48×25G down (1,200G) / 8×100G up (800G) | Private cloud, virtualized compute, distributed databases | Workloads with sustained all-to-all communication bursts |
| 3:1 | 48×25G down (1,200G) / 4×100G up (400G) | Web serving, app tier, mixed general-purpose workloads | Storage-heavy workloads with high sustained east-west IO |
| 4:1 or higher | 48×25G down (1,200G) / 3×100G up (300G) | Dev/test, low-utilization workloads, lab environments | Any production workload — introduces congestion risk |
A practical formula for leaf oversubscription: Oversubscription = (Number of server ports × server port speed) ÷ (Number of spine uplinks × uplink speed). For a 48×25G leaf with 6×100G uplinks: (48 × 25G) / (6 × 100G) = 1,200G / 600G = 2:1.
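The formula is trivial to automate when comparing leaf SKUs — a minimal sketch (the function name is ours):

```python
def oversubscription(server_ports, server_speed_g, uplinks, uplink_speed_g):
    """Leaf oversubscription ratio:
    total downlink bandwidth / total uplink bandwidth."""
    downlink_g = server_ports * server_speed_g
    uplink_g = uplinks * uplink_speed_g
    return downlink_g / uplink_g

# The worked example from the text: 48 x 25G down, 6 x 100G up.
ratio = oversubscription(48, 25, 6, 100)
print(f"{ratio:g}:1")   # prints "2:1"

# Same leaf with only 4 uplinks cabled: 1,200G / 400G = 3:1.
print(f"{oversubscription(48, 25, 4, 100):g}:1")   # prints "3:1"
```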
6. BGP Underlay Design
The underlay is the physical IP routing fabric — the protocol and addressing scheme that carries packets between leaf and spine switches. While OSPF was common in early leaf-spine deployments, BGP (specifically eBGP) has become the de facto standard underlay protocol for modern data centers, recommended by RFC 7938 ("Use of BGP for Routing in Large-Scale Data Centers").
Why eBGP for Underlay?
- Simpler failure isolation: Each leaf-spine link is its own /31 (or /30) point-to-point subnet and its own BGP session. A link failure affects only that session and that peer — it does not trigger a fabric-wide SPF recalculation as OSPF would
- No flooding domain: Unlike OSPF LSA flooding, BGP only sends route updates when something changes. This scales far better in large fabrics with thousands of prefixes
- Policy flexibility: BGP route policies (communities, local preference, MED) allow fine-grained traffic engineering across the fabric without complex OSPF metric tuning
- AS number design: Each leaf gets a unique private AS number (from the 4-byte private range 4200000000–4294967294). All spines share a common AS number. This creates natural eBGP peering at every leaf-spine link and prevents BGP AS path looping between leaf switches
Leaf-1(config)# router bgp 4200000001
Leaf-1(config-router)# router-id 10.0.0.1
Leaf-1(config-router)# address-family ipv4 unicast
Leaf-1(config-router-af)# maximum-paths 8 # Allow up to 8 ECMP paths (4 spines in this example)
Leaf-1(config-router-af)# exit
Leaf-1(config-router)# neighbor 10.0.0.2 remote-as 4200000100 # Spine-1 AS
Leaf-1(config-router)# neighbor 10.0.0.4 remote-as 4200000100 # Spine-2 AS
Leaf-1(config-router)# neighbor 10.0.0.6 remote-as 4200000100 # Spine-3 AS
Leaf-1(config-router)# neighbor 10.0.0.8 remote-as 4200000100 # Spine-4 AS
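Configuration like the above is usually generated rather than typed. A hedged sketch of an underlay planner, assuming a /31-per-link scheme carved from an illustrative 10.0.0.0/24 supernet and the leaf-unique / spine-shared ASN model described above (all names and the supernet are our assumptions):

```python
from ipaddress import ip_network

LEAF_AS_BASE = 4200000001   # one unique private 4-byte ASN per leaf
SPINE_AS = 4200000100       # all spines share one ASN

def underlay_plan(num_leaves, num_spines, supernet="10.0.0.0/24"):
    """Carve a /31 point-to-point subnet per leaf-spine link and
    assign ASNs following the RFC 7938-style scheme: unique ASN per
    leaf, common ASN across the spine tier."""
    p2p = ip_network(supernet).subnets(new_prefix=31)
    plan = []
    for leaf in range(num_leaves):
        for spine in range(num_spines):
            net = next(p2p)
            leaf_ip, spine_ip = net.hosts()   # /31: exactly two addresses
            plan.append({
                "leaf": f"Leaf-{leaf + 1}", "leaf_as": LEAF_AS_BASE + leaf,
                "spine": f"Spine-{spine + 1}", "spine_as": SPINE_AS,
                "leaf_ip": str(leaf_ip), "spine_ip": str(spine_ip),
            })
    return plan

for link in underlay_plan(2, 4)[:4]:   # Leaf-1's four uplinks
    print(link["leaf_ip"], "->", link["spine"], link["spine_ip"])
```

From a structure like this, a Jinja2 template or Ansible playbook can render per-device BGP stanzas, which is how most EVPN fabrics are deployed in practice.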
7. VXLAN/EVPN Overlay: L2 and L3 Over IP
The BGP underlay provides IP reachability between leaf switches. But data center workloads also need Layer 2 connectivity — virtual machines that must stay in the same subnet as they vMotion between hosts, or containers that share a flat L2 broadcast domain. In the three-tier model, VLANs extended across trunks provided this. In leaf-spine, VXLAN (Virtual Extensible LAN) provides it — encapsulating original L2 Ethernet frames inside UDP packets that travel over the L3 underlay.
VXLAN Encapsulation
A VXLAN-encapsulated frame wraps the original Ethernet frame in four new headers:

Original frame:  [ Ethernet | IP | Payload ]
VXLAN frame:     [ Outer Ethernet | Outer IP | Outer UDP (dst 4789) | VXLAN header | Ethernet | IP | Payload ]

- VNI (VXLAN Network Identifier): a 24-bit segment ID carried in the VXLAN header — up to 16M segments, versus 4,094 usable VLANs
- Each VNI maps to a VLAN or VRF on the leaf switches
- Outer IP header: routable across the spine underlay — sourced and destined to the VTEP (VXLAN Tunnel Endpoint) loopback addresses of the leaf switches
- The encapsulation adds 50 bytes of overhead, so the underlay MTU must be raised accordingly (jumbo frames, typically 9216 bytes)
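The 8-byte VXLAN header itself is simple enough to pack in a few lines — a minimal sketch per the RFC 7348 layout (the helper name is ours):

```python
import struct

VXLAN_UDP_PORT = 4789   # IANA-assigned UDP destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348):
    1 flags byte (I bit set = valid VNI), 3 reserved bytes,
    24-bit VNI, 1 trailing reserved byte."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit field")
    return struct.pack("!II", 0x08000000, vni << 8)

h = vxlan_header(5000)
print(h.hex())          # flags byte 0x08, then the VNI shifted left 8 bits
print(len(h))           # 8 — the header is always 8 bytes
```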
EVPN: The Control Plane for VXLAN
VXLAN by itself is just the data plane encapsulation. It needs a control plane to distribute MAC and IP address information between VTEPs — otherwise leaf switches would have to flood unknown unicast traffic to discover remote MACs. BGP EVPN (Ethernet VPN — RFC 7432) provides this control plane, using BGP to distribute MAC/IP bindings between all leaf switches.
EVPN uses four primary route types for VXLAN fabrics:
| Route Type | Name | Purpose |
|---|---|---|
| Type 1 | Ethernet Auto-Discovery | Used for multihoming (ESI — Ethernet Segment Identifier) — enables dual-homed servers to use both leaf connections simultaneously with active-active EVPN multihoming |
| Type 2 | MAC/IP Advertisement | Distributes MAC address and optionally the associated IP address of a host, eliminating unknown unicast flooding and enabling distributed ARP suppression |
| Type 3 | Inclusive Multicast Route | Advertises each VTEP's membership in a VNI — allows BUM (Broadcast, Unknown Unicast, Multicast) traffic to be sent to all VTEPs participating in that VNI via ingress replication |
| Type 5 | IP Prefix Route | Distributes IP prefixes (subnets) between VTEPs for inter-VRF and external routing — enables symmetric IRB (Integrated Routing and Bridging) |
Symmetric vs Asymmetric IRB
Integrated Routing and Bridging (IRB) is the mechanism for routing between L2 VNIs (subnets) on the leaf-spine fabric. Two models exist:
- Asymmetric IRB: Both L2 bridging and L3 routing happen at the ingress leaf. The egress leaf only performs L2 bridging. Simpler to configure but requires every leaf to have every VNI instantiated — limiting fabric scale and increasing MAC table size
- Symmetric IRB: The ingress leaf routes into an L3 VNI (a routed VRF tunnel), the spine forwards the L3 VNI, and the egress leaf routes from the L3 VNI to the destination L2 VNI. Only requires each leaf to have the VNIs of servers it directly hosts, plus a common L3 VNI per VRF — scales to much larger fabrics. Relies on EVPN Type 2 routes carrying both L2 and L3 VNIs (with Type 5 routes for external prefixes) and is the recommended model for any fabric beyond ~50 leaf switches
8. Vendor Solutions: Cisco ACI, NX-OS, and Arista EOS
Three solutions dominate enterprise leaf-spine deployments, each representing a distinct philosophy on how the fabric should be built and managed.
| Attribute | Cisco ACI | Cisco NX-OS VXLAN/EVPN | Arista EOS VXLAN/EVPN |
|---|---|---|---|
| Control plane | Proprietary OpFlex policy model via APIC controller | Standard BGP EVPN (RFC 7432) | Standard BGP EVPN (RFC 7432) |
| Management model | Centralized — APIC controller is required; all policy via GUI/REST API/Ansible | Distributed — per-device CLI, Nexus Dashboard optional | Distributed — per-device CLI + CloudVision (CVP) optional |
| Policy abstraction | Excellent — EPG/Contract model abstracts L2/L3 from policy intent | Standard — VRF/VLAN/VNI configured per-device | Standard — VRF/VLAN/VNI configured per-device |
| Hardware flexibility | ACI-specific Nexus hardware only (9000 series in ACI mode) | Nexus 9000, 7000 series in NX-OS standalone mode | Full Arista 7000 series; SONiC support on white-box |
| Automation / IaC | Strong — APIC REST API, Terraform hashicorp/aci, Ansible aci_tenant modules | Strong — NXAPI, Ansible cisco.nxos, Terraform CiscoDevNet/nxos | Excellent — eAPI, Ansible arista.eos, CVP, Terraform, strong NetDevOps community |
| Learning curve | High — ACI policy model is fundamentally different from traditional networking | Moderate — familiar NX-OS CLI with VXLAN/EVPN additions | Moderate — Linux-like EOS CLI familiar to multi-vendor engineers |
| Best for | Large enterprises needing policy automation, microsegmentation at scale, and multi-tenancy | Cisco-invested enterprises that want standard EVPN without ACI complexity | Cloud-native, DevOps-forward environments; hyperscale-style builds; OpenConfig/gNMI |
9. Scaling Leaf-Spine: Super-Spine and Multi-Pod
A standard two-tier leaf-spine fabric is limited by the port count of the spine switches. A 64-port spine switch can support a maximum of 64 leaf switches. A 64-port 400G spine with 64 leaves each hosting 48 servers supports 3,072 servers maximum — sufficient for many enterprise data centers, but insufficient for large cloud or hyperscale deployments.
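The scale arithmetic above can be captured in a couple of helper functions (the names are ours; note that real 3-tier designs reserve some pod-spine ports for super-spine uplinks, which this sketch deliberately ignores):

```python
def two_tier_max_servers(spine_ports: int, servers_per_leaf: int) -> int:
    """In a 2-tier fabric, each leaf consumes one port on every
    spine, so the spine port count caps the leaf count."""
    return spine_ports * servers_per_leaf

def three_tier_max_servers(pods: int, spine_ports: int,
                           servers_per_leaf: int) -> int:
    """With a super-spine tier, each pod is a complete leaf-spine
    fabric; total capacity scales linearly with pod count."""
    return pods * two_tier_max_servers(spine_ports, servers_per_leaf)

# The example from the text: 64-port spines, 48 servers per leaf.
print(two_tier_max_servers(64, 48))        # 3072
# 16 such pods behind a super-spine tier:
print(three_tier_max_servers(16, 64, 48))  # 49152
```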
Super-Spine: Three-Tier Clos
The natural extension is a Super-Spine (or Core) tier — a third tier of switches that interconnects multiple leaf-spine pods. Each pod is a complete leaf-spine fabric. Super-spines connect pod spines to each other, forming a three-tier Clos topology. Traffic between servers in the same pod still traverses only 2 hops. Traffic between pods traverses 4 hops (leaf → spine → super-spine → spine → leaf). Scale increases to hundreds of thousands of ports.
Cisco ACI Multi-Pod and Multi-Site
Cisco ACI has two native scaling constructs beyond a single pod. ACI Multi-Pod extends a single ACI fabric across multiple physical pods connected via an Inter-Pod Network (IPN) — all pods share a single APIC cluster and a unified policy domain. ACI Multi-Site connects geographically separate ACI fabrics (each with its own APIC cluster) via a Nexus Dashboard Orchestrator, enabling stretched L2/L3 domains between data centers with separate fault domains.
| Approach | Max Scale | Hop Count | Use Case |
|---|---|---|---|
| Standard leaf-spine (2-tier) | ~3,000–5,000 servers | 2 hops (always) | Single DC, enterprise or mid-size cloud |
| Super-Spine (3-tier Clos) | 50,000–100,000+ servers | 2 (intra-pod) / 4 (inter-pod) | Large cloud, hyperscale DC, AI GPU clusters |
| ACI Multi-Pod | Up to 6 pods, ~1,000 leaves | 2 (intra-pod) / 4 (inter-pod via IPN) | Campus data center, DC expansion, HA across rooms |
| ACI Multi-Site | Up to 12 sites | 2–6 hops (intra and inter-site) | Geo-distributed DC, active-active DCI, DR |
10. Design Decision Guide
With the foundational concepts in place, the table below maps common data center profiles to the recommended leaf-spine design choices — from oversubscription ratio and uplink speed to control plane and overlay model.
| DC Profile | Oversubscription | Uplink Speed | Underlay | Overlay | Recommended Platform |
|---|---|---|---|---|---|
| Small enterprise (≤500 servers) | 3:1 | 100G | OSPF or eBGP | VXLAN/EVPN asymmetric IRB | Cisco Nexus 93180 leaf / 9336 spine |
| Mid enterprise / private cloud | 2:1–3:1 | 100G | eBGP | VXLAN/EVPN symmetric IRB | Cisco ACI or NX-OS EVPN; Arista 7050X3 |
| Large enterprise / multi-tenant | 2:1 | 400G | eBGP | VXLAN/EVPN symmetric IRB + Type 5 | Cisco ACI Multi-Pod; Arista 7800R series |
| AI / GPU cluster | 1:1 (non-blocking) | 400G / 800G | eBGP | RoCEv2 lossless (PFC + ECN + DCQCN) | Cisco G300, Arista 7800R4, Nvidia Spectrum-4 |
| Hyperscale / cloud | 1:1–2:1 per tier | 400G / 800G | eBGP / SONiC | VXLAN/EVPN or SR-MPLS | Whitebox + SONiC; Arista 7800R4; Broadcom TH5 |
11. Frequently Asked Questions
Q: Can I migrate from a three-tier design to leaf-spine without a full forklift upgrade?
Yes — and this is how most production migrations happen. The standard approach is to deploy new leaf-spine pods alongside the existing three-tier fabric, then progressively migrate workloads by rack or by application tier. The existing core/aggregation switches temporarily act as the "border leaf" connecting the old and new fabrics via BGP or static routing. Full cutover typically takes 6–18 months for a medium-sized enterprise data center.
Q: Does leaf-spine require VXLAN? Can I run it with plain VLANs?
You can run a leaf-spine fabric without VXLAN if all inter-leaf communication is pure Layer 3 (routed) and you do not need to extend Layer 2 domains between racks. For Kubernetes deployments using Calico BGP (pure L3 pod routing), plain leaf-spine with eBGP and no overlay is perfectly viable and operationally simpler. VXLAN/EVPN is needed when VMs require L2 adjacency across racks, for workload mobility between leaves, or for multi-tenancy isolation.
Q: How many spine switches do I need?
At minimum, 2 spines for redundancy — a single spine is a single point of failure. For a non-blocking fabric, you need as many spines as uplink ports on each leaf. For a 3:1 oversubscription ratio with a 48×25G + 8×100G leaf, using 4 of the 8 uplinks gives you 4 spines at 3:1. The practical sweet spot for most enterprise fabrics is 4–8 spines, providing 400G–800G of aggregate uplink bandwidth per leaf while maintaining reasonable hardware cost.
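The spine-count arithmetic in that answer generalizes to a one-line calculation (the function name is ours):

```python
import math

def spines_needed(server_ports, server_speed_g, uplink_speed_g, ratio):
    """Minimum spine count — assuming one uplink per spine — that
    keeps leaf oversubscription at or below the target ratio."""
    downlink_g = server_ports * server_speed_g
    return math.ceil(downlink_g / (ratio * uplink_speed_g))

# 48 x 25G leaf with 100G uplinks:
print(spines_needed(48, 25, 100, 3))   # 4 spines for 3:1, as above
print(spines_needed(48, 25, 100, 1))   # 12 spines for non-blocking
```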
Q: What is the difference between a border leaf and a regular leaf?
A border leaf (also called an external leaf or exit leaf) is a leaf switch that connects the VXLAN/EVPN fabric to external networks — the WAN router, internet edge firewall, or an upstream network. It runs the same leaf hardware and protocols but additionally peers with external BGP neighbors and redistributes external prefixes into the fabric. In Cisco ACI, the equivalent is the Border Leaf node with an L3Out configuration. Border leaves are always deployed in pairs for redundancy.
Q: Is STP completely eliminated in a leaf-spine design?
Between the leaf and spine tiers — yes. All leaf-to-spine links are Layer 3 routed interfaces; STP has no role there. However, STP (typically Rapid PVST+ or MST) still runs on the server-facing access ports of each leaf switch, within each VLAN or VNI. For server-facing ports, STP protections like PortFast, BPDU Guard, and Root Guard should always be enabled. The dual-homed server case (a server connecting to two leaves) is handled by EVPN multihoming (ESI-LAG) rather than STP.
Technical content based on Cisco ACI and NX-OS design guides, Arista EOS VXLAN/EVPN configuration guides, RFC 7938 (BGP for Large-Scale Data Centers), RFC 7432 (BGP EVPN), and Charles Clos's original 1953 switching network paper. CLI examples are representative of Cisco NX-OS 10.x and should be validated in a non-production environment. All content current as of March 2026.