BGP Neighbor Flapping Issues
Last Updated : March 2026 | By The Network DNA
BGP neighbor flapping refers to a condition where a BGP (Border Gateway Protocol) peering session repeatedly transitions between the Established and Idle/Active states — establishing the session, then dropping it, then re-establishing it in a continuous cycle. This instability can trigger massive route table churn, cause packet loss, impact SLA commitments, and — in severe cases — propagate instability across the entire internet routing table. In this blog, we will learn about intermittent BGP neighbor flapping issues, the root causes of repeated session drops, and the step-by-step methods to diagnose and resolve them.
BGP is the backbone protocol of the internet and is widely deployed in enterprise, data center, and service provider networks for both internal (iBGP) and external (eBGP) routing. Unlike OSPF or EIGRP, which typically reconverge in seconds or less, a BGP session depends on a TCP connection, negotiated Hold Timers, periodic Keepalives, and a multi-state finite state machine, so a disruption at any layer can reset the entire session. Flapping compounds this by triggering repeated withdrawals and re-announcements of prefixes, amplifying the impact far beyond the two routers involved.
What is BGP Neighbor Flapping?
A BGP neighbor (peer) session becomes "flapping" when the TCP session underlying the BGP connection drops and reforms repeatedly — often within seconds or minutes of each session establishment. Because BGP relies on a persistent TCP connection on port 179, anything that disrupts that TCP stream — even briefly — causes both peers to reset their BGP state machines, withdraw all previously advertised routes, and restart the entire OPEN / KEEPALIVE negotiation cycle.
BGP flapping is particularly damaging in iBGP full-mesh or Route Reflector topologies where one unstable peer causes route updates to be propagated to every other iBGP speaker in the AS — magnifying the instability across the entire network.
Causes of BGP Neighbor Flapping
Let's look at the possible root causes of BGP session instability in detail, grouped by category.
Network Layer Issues
- Unstable physical or logical link – intermittent interface flaps (line protocol up/down) on the path between BGP peers break the TCP session and reset the BGP state machine.
- Packet loss or high latency – congestion, buffer drops, or QoS misconfiguration on the transit path causes BGP Keepalive packets to be lost; once three consecutive Keepalives are missed, the Hold Timer expires and the session resets.
- Routing loop or route recursion – if the next-hop used to reach the BGP peer becomes unreachable — even momentarily — the underlying TCP session drops. This is particularly common in iBGP when the loopback used as the update-source loses reachability.
- MTU mismatch – large BGP UPDATE packets (common when exchanging full internet routing tables) can be silently dropped if an interface on the path has a lower MTU, causing the session to stall or reset after the OPEN phase.
- ISP or WAN link instability – for eBGP peers across an internet link, ISP-side congestion events, maintenance windows, or carrier route flaps can intermittently break Layer 3 reachability.
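For the MTU mismatch case above, it helps to know exactly how large a test packet can be before it no longer fits the path. The following is a minimal Python sketch of that arithmetic, assuming plain IPv4 with no IP options; the function names are illustrative and not taken from any vendor tool.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header
TCP_HEADER_BYTES = 20   # TCP header without options (assumption)


def max_icmp_payload(path_mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if path_mtu < overhead:
        raise ValueError("MTU is too small to carry an IPv4 ICMP echo")
    return path_mtu - overhead


def max_tcp_segment(path_mtu: int) -> int:
    """Largest TCP payload (MSS) that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + TCP_HEADER_BYTES
    if path_mtu < overhead:
        raise ValueError("MTU is too small to carry an IPv4 TCP segment")
    return path_mtu - overhead


# A standard 1500-byte Ethernet path versus a hypothetical 1400-byte tunnel.
for mtu in (1500, 1400):
    print(mtu, max_icmp_payload(mtu), max_tcp_segment(mtu))
```

If a DF-bit ping sized for the 1500-byte path fails while one sized for a smaller MTU succeeds, a lower-MTU hop is a likely suspect. Some ping implementations take the payload size and others the full datagram size, so check which one your platform expects.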
BGP Timer Misconfiguration
- Mismatched Hold Timer values – BGP peers negotiate the Hold Timer during the OPEN message exchange; if one peer has an extremely low Hold Timer (e.g., 10 seconds), the session is vulnerable to any momentary delay in Keepalive delivery.
- Keepalive interval too aggressive – the default Keepalive interval is one-third of the Hold Timer (typically 60 seconds). On congested links, even small jitter can cause missed Keepalives against very tight timers.
- ConnectRetry timer inconsistency – an overly short ConnectRetry timer causes rapid reconnection attempts, which can trigger route dampening on the remote peer and worsen instability.
- BFD (Bidirectional Forwarding Detection) misconfiguration – BFD is used to accelerate BGP failure detection, but aggressive BFD timers (sub-second) on high-latency or jittery links can falsely declare the peer unreachable and bring down an otherwise healthy BGP session.
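The Hold Timer behaviour described above can be sketched in a few lines of Python. This follows the RFC 4271 rules that the session uses the smaller of the two Hold Times offered in the OPEN messages and that a Hold Time must be either zero or at least three seconds. Deriving the Keepalive as one-third of the Hold Time is the commonly suggested ratio; real implementations may also take a locally configured Keepalive into account, so treat this as an approximation.

```python
from typing import Tuple


def negotiate_timers(local_hold: int, peer_hold: int) -> Tuple[int, int]:
    """Return (hold_time, keepalive_interval) in seconds for a session.

    The smaller offered Hold Time wins. A Hold Time of 0 disables
    Keepalives entirely; values of 1 or 2 seconds are not allowed.
    """
    for value in (local_hold, peer_hold):
        if value < 0 or value in (1, 2):
            raise ValueError(f"unacceptable Hold Time: {value}")
    hold = min(local_hold, peer_hold)
    keepalive = hold // 3  # one-third ratio (assumption, see lead-in)
    return hold, keepalive


# One peer configured with an aggressive 10-second Hold Time drags the
# whole session down to that value, whatever the other side is set to.
print(negotiate_timers(180, 10))   # (10, 3)
print(negotiate_timers(180, 90))   # (90, 30)
```

The first call shows why a single low Hold Timer is enough to make a session fragile: the peer with the default value has no say in the outcome.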
Hardware or Software Problems
- CPU overload on the router – BGP runs as a process on the router CPU. When the router's CPU is saturated — due to large routing table processing, high interface count, or other processes — it may fail to generate or respond to Keepalives in time, causing the Hold Timer to expire.
- Memory exhaustion – insufficient memory to hold the full BGP RIB (Routing Information Base) can cause BGP to crash or reset, particularly when receiving full internet routing tables (~1 million+ prefixes).
- Software bugs in BGP process – known bugs in specific IOS, Junos, or other vendor software versions can cause BGP process crashes, memory leaks in the BGP table, or incorrect state machine transitions. Always check vendor advisories.
- Faulty SFP or transceiver – a marginal optical transceiver causing intermittent bit errors on the physical interface can translate to CRC errors, interface resets, and TCP session drops — the BGP session breaks even if the interface remains up in the routing table.
- Stale BGP sessions after failover – after a router reload or a failed NSF (Non-Stop Forwarding) or NSR (Non-Stop Routing) switchover, stale BGP TCP sessions that were not gracefully closed can cause the peer to hold incorrect state and eventually reset.
Authentication & Policy Issues
- MD5 / TCP-AO password mismatch – if BGP MD5 or TCP-AO authentication is enabled and the key is changed on only one peer, all incoming TCP segments fail the authentication check and are silently dropped, causing the session to time out.
- Route policy causing session reset – a malformed or overly broad outbound route policy that strips mandatory BGP attributes (such as AS_PATH or NEXT_HOP) can cause the receiving peer to send a NOTIFICATION and reset the session.
- Maximum prefix limit exceeded – BGP peers configured with a `maximum-prefix` limit will tear down the session and enter an Idle state when the prefix count exceeds the configured threshold, a common scenario when receiving a full routing table from a new upstream provider.
- TTL security (GTSM) misconfiguration – Generalized TTL Security Mechanism requires eBGP peers to send packets with a specific TTL value. A mismatch between peers silently drops BGP packets, causing Hold Timer expiry.
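The maximum-prefix behaviour in the list above can be illustrated with a small Python sketch. The 75 percent warning threshold and the action names ("ok", "warn", "teardown") are assumptions chosen for illustration; check your platform's documentation for its actual default threshold and restart behaviour.

```python
def max_prefix_action(received: int, limit: int, warn_percent: int = 75) -> str:
    """Classify a received-prefix count against a configured limit.

    Returns "teardown" when the limit is exceeded, "warn" when the count
    is above the warning threshold, and "ok" otherwise.
    """
    if limit <= 0 or not 0 < warn_percent <= 100:
        raise ValueError("limit must be positive and warn_percent in 1..100")
    if received > limit:
        return "teardown"
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if received * 100 > limit * warn_percent:
        return "warn"
    return "ok"


# Hypothetical limit of 1,000,000 prefixes on an upstream session.
for count in (500_000, 900_000, 1_000_001):
    print(count, max_prefix_action(count, 1_000_000))
```

If the peer is then allowed to reconnect automatically and immediately sends the same oversized table, the session will cycle between Established and Idle, which looks exactly like flapping from the outside.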
Understanding BGP Session States
To effectively troubleshoot BGP flapping, it is essential to understand which state the peer repeatedly collapses into and what that state indicates about the failure mode.
| BGP State | Meaning | Flapping Implication |
|---|---|---|
| Idle | BGP is not attempting to connect; waiting for ConnectRetry timer | Often due to authentication failure, route policy error, or max-prefix exceeded |
| Connect | TCP SYN sent; waiting for TCP handshake to complete | TCP unreachable — routing or firewall blocking port 179 |
| Active | TCP handshake failed; actively retrying connection | Most common "stuck" flapping state — IP unreachable, wrong peer IP, or ACL blocking |
| OpenSent | OPEN message sent; waiting for peer's OPEN | MTU issue, AS number mismatch, or capability negotiation failure |
| OpenConfirm | OPEN received; waiting for KEEPALIVE to confirm | Authentication mismatch or timer negotiation error |
| Established | Session is fully up; routes being exchanged | If the session repeatedly reaches this state then drops, investigate Keepalive timing and link quality |
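The table above can be turned into a simple lookup for use in scripts or runbooks. This Python sketch just encodes the same state-to-hint mapping; the dictionary and function names are illustrative, and the hints are starting points for investigation rather than definitive diagnoses.

```python
# Likely causes per BGP state, taken from the table above.
FLAP_HINTS = {
    "idle": "Authentication failure, route policy error, or max-prefix exceeded",
    "connect": "TCP unreachable: routing or firewall blocking port 179",
    "active": "IP unreachable, wrong peer IP, or ACL blocking",
    "opensent": "MTU issue, AS number mismatch, or capability negotiation failure",
    "openconfirm": "Authentication mismatch or timer negotiation error",
    "established": "Investigate Keepalive timing and link quality",
}


def flap_hint(state: str) -> str:
    """Return the first thing to check for a peer stuck in `state`."""
    key = state.strip().lower()
    if key not in FLAP_HINTS:
        raise KeyError(f"unknown BGP state: {state!r}")
    return FLAP_HINTS[key]


print(flap_hint("Active"))
print(flap_hint("OpenSent"))
```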
Impact of BGP Flapping on the Network
BGP flapping is not a local event — its blast radius extends across the routing domain and, in eBGP scenarios, potentially across the entire internet:
- Route churn – every session reset generates a flood of route withdrawals followed by re-advertisements (both carried in BGP UPDATE messages), consuming CPU cycles on every router in the affected AS and in peer ASes.
- Traffic black-holing – during the interval between WITHDRAW and re-advertisement, packets destined for prefixes announced via the flapping peer are dropped at the point of routing inconsistency.
- Route dampening activation – RFC 2439 Route Flap Dampening (RFD) penalizes prefixes whose originating BGP session flaps repeatedly, eventually suppressing the prefix from the routing table entirely — causing an outage even after the session stabilizes.
- Cascading instability – in iBGP Route Reflector topologies, flapping of a single client BGP session can cause the Route Reflector to regenerate and re-advertise hundreds of thousands of prefixes to all other clients.
- SLA and application impact – real-time applications (VoIP, video conferencing, financial trading) are highly sensitive to the micro-outages caused by BGP reconvergence events.
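To see why route dampening can keep a prefix suppressed after the session has stabilized, consider the exponential decay model behind Route Flap Dampening. The Python sketch below assumes commonly cited vendor defaults (a penalty of 1000 per flap, a suppress limit of 2000, a reuse limit of 750, and a 15-minute half-life); these numbers are assumptions, so substitute the values configured on your own routers.

```python
import math

PENALTY_PER_FLAP = 1000.0   # assumed default
SUPPRESS_LIMIT = 2000.0     # assumed default
REUSE_LIMIT = 750.0         # assumed default
HALF_LIFE_MIN = 15.0        # assumed default, in minutes


def decayed_penalty(penalty: float, minutes: float,
                    half_life: float = HALF_LIFE_MIN) -> float:
    """Penalty remaining after `minutes` of stability (halves every half-life)."""
    return penalty * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty: float, reuse_limit: float = REUSE_LIMIT,
                        half_life: float = HALF_LIFE_MIN) -> float:
    """Minutes of stability needed before the penalty decays to the reuse limit."""
    if penalty <= reuse_limit:
        return 0.0
    return half_life * math.log2(penalty / reuse_limit)


# Three flaps in quick succession (decay between them ignored for simplicity).
penalty = 3 * PENALTY_PER_FLAP
print(penalty > SUPPRESS_LIMIT)        # True: the prefix is suppressed
print(minutes_until_reuse(penalty))    # 30.0 minutes before it is reused
```

Under these assumptions, three rapid flaps keep the prefix out of the routing table for about half an hour after the last one, even if the underlying fault was fixed immediately.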
How to Diagnose and Troubleshoot BGP Neighbor Flapping?
Follow this structured approach to identify and resolve the root cause of BGP flapping in your network:
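One useful early step is to quantify the flapping itself: how often the session drops and how long it stays up each time. The Python sketch below works on a list of (timestamp, state) transitions that you would extract from your own logs; the sample events are hypothetical and the function name is illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Event = Tuple[datetime, str]  # (time of transition, new BGP state)


def established_uptimes(events: List[Event]) -> List[timedelta]:
    """Return how long each Established period lasted before the session dropped."""
    uptimes: List[timedelta] = []
    up_since = None
    for when, state in sorted(events):
        if state == "Established":
            if up_since is None:
                up_since = when
        elif up_since is not None:
            # First non-Established state after being up marks the drop.
            uptimes.append(when - up_since)
            up_since = None
    return uptimes


# Hypothetical transitions for one peer.
t0 = datetime(2026, 3, 1, 10, 0, 0)
sample = [
    (t0, "Established"),
    (t0 + timedelta(seconds=180), "Idle"),
    (t0 + timedelta(seconds=200), "Active"),
    (t0 + timedelta(seconds=215), "Established"),
    (t0 + timedelta(seconds=395), "Idle"),
]
uptimes = established_uptimes(sample)
print(len(uptimes), [u.total_seconds() for u in uptimes])  # 2 [180.0, 180.0]
```

The drop timestamps can then be compared against interface, CPU, and memory events from the same time window to narrow down the cause.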
Best Practices to Prevent BGP Flapping
Proactive configuration hardening significantly reduces the likelihood and impact of BGP flapping:
- Use loopback interfaces for iBGP peering – peering using loopback addresses decouples the BGP session from a single physical interface failure. As long as any IGP path exists between peers, the BGP session remains up.
- Enable Graceful Restart (GR) – RFC 4724 Graceful Restart lets a restarting router keep its forwarding state while its peers temporarily retain the routes learned from it, preventing route withdrawals from propagating during planned maintenance or software upgrades.
- Configure Non-Stop Routing (NSR) – on platforms that support it, NSR maintains BGP state across a control-plane switchover (e.g., RP failover on Cisco ASR) without notifying peers, eliminating session resets during hardware redundancy events.
- Apply route dampening judiciously – configure Route Flap Dampening (RFD) on eBGP sessions receiving unstable prefixes, but review RFC 7196 guidance — aggressive dampening can worsen convergence times after legitimate failures.
- Set appropriate BGP timers per link type – keep the platform default timers (for example, Hold 180s / Keepalive 60s on Cisco IOS, or Hold 90s / Keepalive 30s on Junos) for stable enterprise links. Only reduce timers on known-good, low-latency paths where fast failure detection is genuinely required.
- Implement BGP route policies with care – test all inbound/outbound route-maps and prefix-lists in a lab before production deployment. A policy that unexpectedly strips mandatory attributes will trigger a NOTIFICATION and reset the session.
- Keep router software current – regularly apply vendor security advisories and bug-fix releases. Many BGP stability issues are caused by known software defects that are patched in more recent firmware versions.
- Monitor BGP session state continuously – deploy SNMP traps or streaming telemetry for `bgpBackwardTransition` events. Real-time alerting on session state changes enables faster root-cause analysis before the flapping impacts end users.
Conclusion
BGP neighbor flapping is one of the most disruptive events in enterprise and service provider networks — combining the immediate impact of packet loss with the cascading effect of route churn propagating across multiple ASes. Understanding the layered causes — from physical media errors and timer mismatches to authentication failures and software bugs — is the foundation of effective troubleshooting.
A disciplined approach resolves the vast majority of BGP flapping incidents: start with reachability verification, correlate log timestamps, check physical-layer counters, and validate BGP configuration consistency on both peers. Pairing this with proactive hardening measures (loopback peering, Graceful Restart, correct timer values, and continuous monitoring) transforms BGP from a fragile dependency into a resilient routing backbone.
Quick Reference — BGP Flapping Troubleshooting Summary
- Check `show bgp neighbors` for Last Reset reason and session uptime history
- Ping peer IP from the correct source interface with extended 1000-packet tests
- Correlate syslog BGP session-change timestamps with interface and CPU events
- Verify the configured and negotiated Hold Timer and Keepalive values on both peers
- Inspect interface error counters for CRC errors and resets
- Test path MTU with large DF-bit pings to detect MTU black holes
- Confirm MD5/TCP-AO password is identical and case-correct on both peers
- Check and raise maximum-prefix limits if applicable
- Monitor CPU/memory during a flapping event — consider reducing full-table BGP feeds
- Adjust or disable BFD on high-latency or jittery links