BGP Neighbor Flapping Issues

Last Updated : March 2026 | By The Network DNA

BGP neighbor flapping refers to a condition where a BGP (Border Gateway Protocol) peering session repeatedly transitions between the Established and Idle/Active states: the session establishes, drops, and re-establishes in a continuous cycle. This instability can trigger heavy route table churn, cause packet loss, threaten SLA commitments, and, in severe cases, propagate instability to other autonomous systems. In this blog, we will learn about intermittent BGP neighbor flapping issues, the root causes of repeated session drops, and the step-by-step methods to diagnose and resolve them.

BGP is the backbone protocol of the internet and is widely deployed in enterprise, data center, and service provider networks for both internal (iBGP) and external (eBGP) routing. Unlike OSPF or EIGRP, which are designed to reconverge quickly within a single routing domain, a BGP session depends on a long-lived TCP connection, Hold Timers, Keepalives, and a multi-stage state machine, so a disruption at any layer can reset the entire session. Flapping compounds this by triggering repeated withdrawals and re-announcements of prefixes, amplifying the impact far beyond the two routers involved.

Table of Contents

What is BGP Neighbor Flapping?
Causes of BGP Neighbor Flapping
Understanding BGP Session States
Impact of BGP Flapping on the Network
How to Diagnose and Troubleshoot BGP Neighbor Flapping?
Best Practices to Prevent BGP Flapping
Conclusion

What is BGP Neighbor Flapping?

A BGP neighbor (peer) session is said to be "flapping" when the TCP session underlying the BGP connection drops and re-forms repeatedly, often within seconds or minutes of each establishment. Because BGP relies on a persistent TCP connection on port 179, anything that disrupts that TCP stream, even briefly, causes both peers to reset their BGP state machines, withdraw the routes learned over that session, and restart the entire OPEN / KEEPALIVE negotiation cycle.

BGP session flapping cycle:

ESTABLISHED (routes advertised) → SESSION DROP (Hold Timer expires) → IDLE / ACTIVE (routes withdrawn) → RE-ESTABLISHING (TCP SYN / OPEN) → ESTABLISHED (routes re-advertised)

Note: Each reset cycle causes route withdrawals and re-advertisements, triggering CPU spikes on all BGP peers receiving the updates.

BGP flapping is particularly damaging in iBGP full-mesh or Route Reflector topologies where one unstable peer causes route updates to be propagated to every other iBGP speaker in the AS — magnifying the instability across the entire network.
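To make the distinction between "down once" and "flapping" concrete, here is a minimal Python sketch that counts how many times a session left the Established state inside a sliding time window. The function name `count_flaps`, the `(timestamp, state)` event layout, and the 10-minute window are assumptions for illustration and are not taken from any vendor tool.

```python
from datetime import datetime, timedelta


def count_flaps(events, window=timedelta(minutes=10)):
    """Return the largest number of drops out of Established in any one window.

    events: list of (timestamp, state) tuples sorted by time (hypothetical format).
    """
    # Record every moment the session moved from Established to another state.
    drops = []
    previous_state = None
    for timestamp, state in events:
        if previous_state == "Established" and state != "Established":
            drops.append(timestamp)
        previous_state = state

    # Slide a window across the drop timestamps and keep the busiest count.
    worst = 0
    start = 0
    for end, drop_time in enumerate(drops):
        while drop_time - drops[start] > window:
            start += 1
        worst = max(worst, end - start + 1)
    return worst
```

A monitoring job could treat a result above some site-specific threshold (say, three drops in ten minutes) as flapping rather than a single outage.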

Causes of BGP Neighbor Flapping

Let's look at the possible root causes of BGP session instability in detail, grouped by category.

Network Layer Issues

Unstable physical or logical link – intermittent interface flaps (line protocol up/down) on the path between BGP peers break the TCP session and reset the BGP state machine.
Packet loss or high latency – congestion, buffer drops, or QoS misconfiguration on the transit path causes BGP Keepalive packets to be lost; if no Keepalive or Update arrives before the Hold Timer runs out (three Keepalive intervals with the usual 3:1 ratio), the session resets.
Routing loop or route recursion – if the next-hop used to reach the BGP peer becomes unreachable — even momentarily — the underlying TCP session drops. This is particularly common in iBGP when the loopback used as the update-source loses reachability.
MTU mismatch – large BGP UPDATE packets (common when exchanging full internet routing tables) can be silently dropped if an interface on the path has a lower MTU, causing the session to stall or reset after the OPEN phase.
ISP or WAN link instability – for eBGP peers across an internet link, ISP-side congestion events, maintenance windows, or carrier route flaps can intermittently break Layer 3 reachability.

BGP Timer Misconfiguration

Mismatched Hold Timer values – BGP peers negotiate the Hold Timer during the OPEN message exchange, and the session uses the lower of the two proposed values; if one peer proposes an extremely low Hold Timer (e.g., 10 seconds), the session becomes vulnerable to any momentary delay in Keepalive delivery.
Keepalive interval too aggressive – the Keepalive interval is normally one-third of the Hold Timer (for example, 60 seconds against Cisco IOS's default 180-second Hold Timer, or 30 seconds against the 90 seconds suggested in RFC 4271). On congested links, even small jitter can cause missed Keepalives against very tight timers.
ConnectRetry timer inconsistency – an overly short ConnectRetry timer causes rapid reconnection attempts; if each attempt briefly succeeds before dropping again, the resulting prefix flaps can trigger route dampening on the remote peer and worsen instability.
BFD (Bidirectional Forwarding Detection) misconfiguration – BFD is used to accelerate BGP failure detection, but aggressive BFD timers (sub-second) on high-latency or jittery links can falsely declare the peer unreachable and bring down an otherwise healthy BGP session.
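The Hold Timer negotiation described above can be sketched in a few lines of Python. This is a simplified model based on RFC 4271 (the lower proposed Hold Time wins, 0 disables the timer, and 1 or 2 seconds is not allowed); the function name `negotiate_timers` is hypothetical, and real implementations may also cap the Keepalive at a locally configured value.

```python
def negotiate_timers(local_hold, remote_hold):
    """Return (hold_time, keepalive) in seconds as used by both peers.

    Simplified from RFC 4271: the session adopts the smaller proposed Hold Time.
    """
    hold = min(local_hold, remote_hold)
    if hold in (1, 2):
        raise ValueError("Hold Time must be 0 or at least 3 seconds")
    if hold == 0:
        # A Hold Time of 0 disables both the Hold Timer and Keepalives.
        return 0, 0
    # Common practice: send Keepalives at one third of the Hold Time.
    return hold, hold // 3
```

For example, a peer configured for 180 seconds that meets a peer proposing 10 seconds ends up with a 10-second Hold Timer, which is why a single aggressive configuration affects both sides.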

Hardware or Software Problems

CPU overload on the router – BGP runs as a process on the router CPU. When the router's CPU is saturated — due to large routing table processing, high interface count, or other processes — it may fail to generate or respond to Keepalives in time, causing the Hold Timer to expire.
Memory exhaustion – insufficient memory to hold the full BGP RIB (Routing Information Base) can cause BGP to crash or reset, particularly when receiving full internet routing tables (~1 million+ prefixes).
Software bugs in BGP process – known bugs in specific IOS/JunOS/vendor firmware versions can cause BGP process crashes, memory leaks in the BGP table, or incorrect state machine transitions. Always check vendor advisories.
Faulty SFP or transceiver – a marginal optical transceiver causing intermittent bit errors on the physical interface can translate to CRC errors, interface resets, and TCP session drops — the BGP session breaks even if the interface remains up in the routing table.
Stale BGP sessions after failover – after a router reload or an NSF (Non-Stop Forwarding) or NSR (Non-Stop Routing) failure, stale BGP TCP sessions that were not gracefully closed can cause the peer to hold incorrect state and eventually reset.

Authentication & Policy Issues

MD5 / TCP-AO password mismatch – if BGP MD5 or TCP-AO authentication is enabled and the password is changed on only one peer, all incoming TCP segments will fail authentication checks and be silently dropped, causing the session to time out.
Route policy causing session reset – a malformed or overly broad outbound route policy that strips mandatory BGP attributes (such as AS_PATH or NEXT_HOP) can cause the receiving peer to send a NOTIFICATION and reset the session.
Maximum prefix limit exceeded – BGP peers configured with a maximum-prefix limit will tear down the session and enter an Idle state when the prefix count exceeds the configured threshold — a common scenario when receiving a full routing table from a new upstream provider.
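TTL security (GTSM) misconfiguration – the Generalized TTL Security Mechanism requires eBGP peers to send packets with a TTL of 255 and to accept only packets that arrive with a TTL at or above the expected minimum. If only one peer enables it, or the configured hop count is wrong, BGP packets are silently dropped, causing Hold Timer expiry.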

Understanding BGP Session States

To effectively troubleshoot BGP flapping, it is essential to understand which state the peer repeatedly collapses into and what that state indicates about the failure mode.

BGP State	Meaning	Flapping Implication
Idle	BGP is not attempting to connect; the peer is administratively down or waiting to restart after an error	Often due to authentication failure, route policy error, or max-prefix exceeded
Connect	TCP SYN sent; waiting for TCP handshake to complete	TCP unreachable — routing or firewall blocking port 179
Active	TCP handshake failed; actively retrying connection	Most common "stuck" flapping state — IP unreachable, wrong peer IP, or ACL blocking
OpenSent	OPEN message sent; waiting for peer's OPEN	MTU issue, AS number mismatch, or capability negotiation failure
OpenConfirm	OPEN received; waiting for KEEPALIVE to confirm	KEEPALIVE not arriving (packet loss or peer CPU load), or timer negotiation error
Established	Session is fully up; routes being exchanged	If the session repeatedly reaches this state then drops, investigate Keepalive timing and link quality

Impact of BGP Flapping on the Network

BGP flapping is not a local event — its blast radius extends across the routing domain and, in eBGP scenarios, potentially across the entire internet:

Route churn – every session reset generates a flood of BGP WITHDRAW messages followed by UPDATE messages, consuming CPU cycles on every router in the affected AS and in peer ASes.
Traffic black-holing – during the interval between WITHDRAW and re-advertisement, packets destined for prefixes announced via the flapping peer are dropped at the point of routing inconsistency.
Route dampening activation – RFC 2439 Route Flap Dampening (RFD) penalizes prefixes whose originating BGP session flaps repeatedly, eventually suppressing the prefix from the routing table entirely — causing an outage even after the session stabilizes.
Cascading instability – in iBGP Route Reflector topologies, flapping of a single client BGP session can cause the Route Reflector to regenerate and re-advertise hundreds of thousands of prefixes to all other clients.
SLA and application impact – real-time applications (VoIP, video conferencing, financial trading) are highly sensitive to the micro-outages caused by BGP reconvergence events.
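To show why dampening can prolong an outage after the session itself has stabilized, the sketch below applies the exponential decay model described in RFC 2439. The default values (reuse limit 750, half-life 15 minutes) are assumed to match common Cisco IOS defaults, and the helper name `minutes_until_reuse` is hypothetical. The model ignores the maximum suppress time and any decay between flaps.

```python
import math


def minutes_until_reuse(penalty, reuse_limit=750, half_life_min=15):
    """Minutes for a dampening penalty to decay down to the reuse limit.

    Decay model: penalty(t) = penalty * 2 ** (-t / half_life_min)
    """
    if penalty <= reuse_limit:
        return 0.0  # Already at or below the reuse limit; nothing to wait for.
    return half_life_min * math.log2(penalty / reuse_limit)
```

With a per-flap penalty of 1000 (another assumed default), three quick flaps accumulate roughly 3000, which under this model takes two half-lives, about 30 minutes, to fall back to 750.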

How to Diagnose and Troubleshoot BGP Neighbor Flapping?

Follow this structured approach to identify and resolve the root cause of BGP flapping in your network:

Check BGP neighbor state and reset reason

Run show bgp neighbors <peer-IP> and look at the Last Reset field. Common messages: "Hold Timer Expired", "TCP connection closed by remote", "Notification: OPEN Message Error", or "BGP Notification received". Each message points to a distinct failure category.
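When many peers need checking, the mapping from Last Reset text to a failure category can be scripted. The following is a rough sketch: the match strings come from the messages listed above, while the category hints and the function name `classify_reset` are assumptions, and exact wording differs between platforms and software versions.

```python
# Ordered list of (substring, hint); the first match wins.
RESET_CATEGORIES = [
    ("hold timer expired", "keepalive loss: check link quality, CPU, and timers"),
    ("open message error", "OPEN negotiation: check AS number and capabilities"),
    ("closed by remote", "peer-initiated teardown: check the remote router's logs"),
    ("notification received", "peer sent NOTIFICATION: check policy and max-prefix"),
]


def classify_reset(last_reset):
    """Map a Last Reset string to a coarse troubleshooting hint."""
    text = last_reset.lower()
    for needle, hint in RESET_CATEGORIES:
        if needle in text:
            return hint
    return "unclassified: inspect the full neighbor output"
```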

Verify IP reachability to the BGP peer address

Ping the peer IP (or loopback if using loopback peering) with extended pings specifying the same source address used for the BGP session. Any packet loss, even intermittent, is critical. Use ping <peer-IP> repeat 1000 source <local-IP> on Cisco IOS.

Inspect BGP log messages and syslog

Enable BGP logging with bgp log-neighbor-changes (Cisco) or equivalent. Review syslog for timestamps of session drops — correlate them with interface up/down events, CPU spikes, or routing table changes logged at the same time.
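The timestamp correlation can be automated once logs are collected centrally. The sketch below pairs each BGP Down event with any interface-down event logged within a few seconds of it. It assumes every line begins with an ISO 8601 timestamp followed by a space (a hypothetical syslog-server format; device timestamps vary) and matches the Cisco-style %BGP-5-ADJCHANGE and %LINEPROTO-5-UPDOWN mnemonics. The function name `correlate` and the 5-second tolerance are illustrative choices.

```python
import re
from datetime import datetime, timedelta

BGP_DOWN = re.compile(r"%BGP-5-ADJCHANGE: neighbor (\S+) Down")
LINK_DOWN = re.compile(
    r"%LINEPROTO-5-UPDOWN: .*Interface (\S+), changed state to down"
)


def correlate(lines, tolerance=timedelta(seconds=5)):
    """Return [(peer, [interfaces that went down near the same time]), ...]."""
    link_events, bgp_events = [], []
    for line in lines:
        # Split "<ISO timestamp> <message>"; raises ValueError on other formats.
        stamp, _, message = line.partition(" ")
        when = datetime.fromisoformat(stamp)
        link_match = LINK_DOWN.search(message)
        bgp_match = BGP_DOWN.search(message)
        if link_match:
            link_events.append((when, link_match.group(1)))
        elif bgp_match:
            bgp_events.append((when, bgp_match.group(1)))

    results = []
    for bgp_time, peer in bgp_events:
        nearby = [
            intf
            for link_time, intf in link_events
            if abs(bgp_time - link_time) <= tolerance
        ]
        results.append((peer, nearby))
    return results
```

A BGP drop with an empty interface list suggests looking beyond the local physical layer, for example at timers, CPU, or the remote side.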

Verify BGP timer configuration on both peers

Confirm that the Hold Timer and Keepalive Interval are compatible and reasonable on both sides. RFC 4271 suggests a Hold Timer of 90 seconds with a Keepalive of 30 seconds (also the Junos default), while Cisco IOS defaults to 180 and 60 seconds. Avoid Hold Timers below 20 seconds unless BFD is used. Check with show bgp neighbors | include Hold time|Keepalive, adjusting the match strings to the capitalization your platform prints.

Check interface error counters and physical layer

Run show interfaces <intf> and look for incrementing CRC errors, input errors, resets, or carrier transitions. A faulty SFP, damaged cable, or mismatched duplex/speed setting will silently corrupt or drop TCP segments, causing Hold Timer expiry.
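What matters is whether the counters are incrementing, not their absolute values, so compare two snapshots taken a few minutes apart. A minimal sketch follows, assuming the counters have already been scraped into dictionaries; the key names and the function name `incrementing_counters` are hypothetical.

```python
def incrementing_counters(before, after,
                          watched=("crc", "input_errors", "resets")):
    """Return {counter: increase} for watched counters that grew.

    before/after: dicts of counter name -> value from two snapshots.
    """
    return {
        name: after.get(name, 0) - before.get(name, 0)
        for name in watched
        if after.get(name, 0) > before.get(name, 0)
    }
```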

Test for MTU issues with path MTU discovery

Send extended pings with the DF (Don't Fragment) bit set and large packet sizes to verify that the full 1500-byte (or jumbo frame) path is intact. BGP UPDATE messages for large routing tables can be several kilobytes — an MTU black hole along the path silently discards them after the session appears established.
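Rather than guessing packet sizes, the largest size that survives the path can be found by binary search. In the sketch below, `probe` is a placeholder callable that should send one DF-bit ping of the given IP packet size and return True on a reply; it and the other names are assumptions, so wire it to your own tooling. The second helper covers tools such as Linux ping, whose -s option takes the ICMP payload size rather than the full packet size.

```python
def largest_passing_size(probe, low=576, high=9000):
    """Binary-search the largest packet size for which probe(size) is True.

    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # Round up so the loop always progresses.
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def icmp_payload_for(mtu, ip_header=20, icmp_header=8):
    """ICMP payload size that fills an IPv4 packet of the given MTU."""
    return mtu - ip_header - icmp_header
```

A result noticeably below the interface MTU on either peer points to a tunnel or a misconfigured hop in between.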

Verify BGP authentication password consistency

If MD5 authentication (or TCP-AO) is configured, confirm that the password is identical on both peers, including case sensitivity and special characters. A mismatch causes all incoming TCP segments for port 179 to be discarded before BGP ever sees them, so the BGP process itself reports no error; look instead for TCP-level authentication failures in the log (for example, %TCP-6-BADAUTH on Cisco IOS).

Review maximum-prefix configuration

If the session drops immediately after reaching the Established state, check whether a maximum-prefix limit has been hit. The router will log a NOTIFICATION message. Either raise the limit or add a warning-only threshold to avoid hard resets: neighbor <IP> maximum-prefix <n> warning-only.
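The difference between a hard limit and warning-only mode can be modeled as below. This is an approximation for reasoning about headroom: the 75% warning threshold is assumed to match the usual Cisco IOS default, the exact boundary behavior may differ by platform, and the function name `max_prefix_status` is hypothetical.

```python
def max_prefix_status(received, limit, threshold_pct=75, warning_only=False):
    """Return 'ok', 'warning', or 'teardown' for a received prefix count."""
    if received > limit:
        # Over the limit: warning-only mode logs instead of resetting.
        return "warning" if warning_only else "teardown"
    if received * 100 > limit * threshold_pct:
        return "warning"
    return "ok"
```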

Monitor router CPU and memory utilization

Use show processes cpu sorted and show memory statistics during a flapping event. CPU exceeding 80% in the BGP process or critically low free memory can directly prevent Keepalive generation — particularly on routers receiving a full internet BGP table.

Review BFD configuration and adjust timers if needed

If BFD is configured for fast BGP failure detection, verify that the BFD timers are appropriate for the link's latency and jitter profile. On high-latency satellite or LTE links, increase the BFD minimum interval to at least 300–500 ms to avoid false-positive failure detections. Alternatively, disable BFD temporarily to isolate whether it is contributing to the flap.
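A quick sanity check is to compare the BFD detection time (interval multiplied by the detect multiplier) with the longest gap between packets that the link is known to produce. The sketch below simplifies the negotiation to a single interval, and `worst_case_gap_ms` is a hypothetical measurement you would supply from your own latency and jitter monitoring.

```python
def bfd_detection_time_ms(interval_ms, multiplier=3):
    """Approximate time before BFD declares the peer down."""
    return interval_ms * multiplier


def bfd_is_safe(interval_ms, multiplier, worst_case_gap_ms):
    """True if the detection time exceeds the longest observed packet gap."""
    return bfd_detection_time_ms(interval_ms, multiplier) > worst_case_gap_ms
```

For instance, a 50 ms interval with a multiplier of 3 gives 150 ms, which a link with occasional 200 ms gaps would exceed, while 300 ms with the same multiplier gives 900 ms of tolerance.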

Best Practices to Prevent BGP Flapping

Proactive configuration hardening significantly reduces the likelihood and impact of BGP flapping:

Use loopback interfaces for iBGP peering – peering using loopback addresses decouples the BGP session from a single physical interface failure. As long as any IGP path exists between peers, the BGP session remains up.
Enable Graceful Restart (GR) – RFC 4724 Graceful Restart lets peers retain a restarting router's routes and keep forwarding while its BGP process comes back, preventing route withdrawals from propagating during planned maintenance or software upgrades.
Configure Non-Stop Routing (NSR) – on platforms that support it, NSR maintains BGP state across a control-plane switchover (e.g., RP failover on Cisco ASR) without notifying peers, eliminating session resets during hardware redundancy events.
Apply route dampening judiciously – configure Route Flap Dampening (RFD) on eBGP sessions receiving unstable prefixes, but review RFC 7196 guidance — aggressive dampening can worsen convergence times after legitimate failures.
Set appropriate BGP timers per link type – use the platform default timers (for example, Hold 180s / Keepalive 60s on Cisco IOS, or Hold 90s / Keepalive 30s on Junos) for stable enterprise links. Only reduce timers on known-good, low-latency paths where fast failure detection is genuinely required.
Implement BGP route policies with care – test all inbound/outbound route-maps and prefix-lists in a lab before production deployment. A policy that unexpectedly strips mandatory attributes will trigger a NOTIFICATION and reset the session.
Keep router software current – regularly apply vendor security advisories and bug-fix releases. Many BGP stability issues are caused by known software defects that are patched in more recent firmware versions.
Monitor BGP session state continuously – deploy SNMP traps or streaming telemetry for bgpBackwardTransition events. Real-time alerting on session state changes enables faster root-cause analysis before the flapping impacts end users.

Conclusion

BGP neighbor flapping is one of the most disruptive events in enterprise and service provider networks — combining the immediate impact of packet loss with the cascading effect of route churn propagating across multiple ASes. Understanding the layered causes — from physical media errors and timer mismatches to authentication failures and software bugs — is the foundation of effective troubleshooting.

A disciplined approach resolves the vast majority of BGP flapping incidents: start with reachability verification, correlate log timestamps, check physical-layer counters, and validate BGP configuration consistency on both peers. Pairing this with proactive hardening measures (loopback peering, Graceful Restart, correct timer values, and continuous monitoring) transforms BGP from a fragile dependency into a resilient routing backbone.

 Quick Reference — BGP Flapping Troubleshooting Summary

Check show bgp neighbors for Last Reset reason and session uptime history
Ping peer IP from the correct source interface with extended 1000-packet tests
Correlate syslog BGP session-change timestamps with interface and CPU events
Verify Hold Timer and Keepalive values are compatible and reasonable on both peers
Inspect interface error counters for CRC errors and resets
Test path MTU with large DF-bit pings to detect MTU black holes
Confirm MD5/TCP-AO password is identical and case-correct on both peers
Check and raise maximum-prefix limits if applicable
Monitor CPU/memory during a flapping event — consider reducing full-table BGP feeds
Adjust or disable BFD on high-latency or jittery links