BGP Protocol Analysis
Protocol-reasoning-driven analysis skill for BGP peering, path selection, and
route propagation. Unlike device health checks that compare metrics against
thresholds, BGP analysis requires interpreting protocol state machines, walking
the best-path algorithm, and validating policy application across the control
plane.
Commands are labeled [Cisco], [JunOS], or [EOS] where syntax
diverges. Unlabeled statements apply to all three vendors.
When to Use
- BGP peer reported down or stuck in a non-Established state
- Suspected route leak — prefixes appearing in tables where they should not
- Path selection not matching expectations after policy changes
- Convergence too slow after planned maintenance or unplanned failover
- Post-change verification of BGP configuration (new peers, policy updates, community changes)
- Capacity planning for prefix table growth or session scaling
- Investigating asymmetric routing caused by inconsistent BGP attributes
Prerequisites
- SSH or console access to the router (read-only privilege sufficient)
- BGP process running on the device with at least one configured peer
- Knowledge of expected peer topology: which neighbors should be up, expected
prefix counts per peer, intended path selection outcomes
- Awareness of configured routing policy: route-maps, prefix-lists, community
filters, AS-path access lists, and local-preference assignments
- For iBGP: understanding of the route-reflector or full-mesh topology
Procedure
Follow this diagnostic flow sequentially. Each step builds on the data from
prior steps. The procedure moves from broad inventory to targeted diagnosis.
Step 1: BGP Session Inventory
Collect all peer states and compare against expected topology.
[Cisco]
show bgp ipv4 unicast summary
[JunOS]
show bgp summary
[EOS]
show ip bgp summary
Record each neighbor: address, AS number, state, prefixes received, up/down
time. Compare against expected topology — every configured peer should appear.
A peer missing from output means it was never configured or was removed. Any
peer not showing a numeric prefix count is not Established — proceed to Step 2
for that peer.
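The comparison against expected topology can be sketched in a few lines. This is a minimal illustration, not a parser for any vendor's output: it assumes the summary has already been reduced to a mapping from neighbor address to the last summary column (a prefix count or a state name, as in Cisco/EOS-style output), and the function name `classify_peers` is hypothetical.

```python
def classify_peers(observed, expected):
    """Split peers into missing, not-established, and established groups.

    observed: dict of neighbor address -> last summary column, e.g. "512"
              (a prefix count) or "Active" (a state name). Assumed format.
    expected: set of neighbor addresses that should be configured.
    """
    # Peers we expect but that do not appear in the output at all.
    missing = sorted(expected - set(observed))
    # A non-numeric field means the session is not Established.
    not_established = sorted(
        addr for addr, field in observed.items() if not field.strip().isdigit()
    )
    established = sorted(
        addr for addr, field in observed.items() if field.strip().isdigit()
    )
    return {
        "missing": missing,
        "not_established": not_established,
        "established": established,
    }
```

Peers in `not_established` are the ones to carry into Step 2; peers in `missing` point to a configuration gap rather than a session problem.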
Step 2: Peer State Diagnosis
For any peer not in Established state, the BGP FSM state reveals the failure
domain. This is the core diagnostic reasoning step.
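[Cisco]
show bgp ipv4 unicast neighbors [addr] | include state|last reset|error
[JunOS]
show bgp neighbor [addr] | match "State|Last Error|Last State"
[EOS]
show ip bgp neighbors [addr] | include state|last reset|error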
Interpret the FSM state to isolate the failure:
- Idle → BGP process not attempting connection. Causes: administratively
shut down, no route to peer address, configured remote AS does not match, or
maximum-prefix limit hit triggering teardown.
- Connect → TCP connection attempt in progress (SYN sent, waiting for
response). Peer is unreachable at Layer 3 or a firewall is blocking TCP port 179.
- Active → TCP connection attempt failed; the router is listening and
retrying. Same causes as Connect. Check: ACLs blocking port 179, peer not
configured for this neighbor, peer address unreachable, or an authentication
(MD5/TCP-AO) mismatch, which silently drops the TCP segments.
- OpenSent → TCP connected, OPEN message sent, waiting for the peer's OPEN.
If no OPEN arrives, the remote BGP is typically not configured for this
neighbor or is in admin shutdown. If the peer's OPEN is rejected (AS number
mismatch, AFI/SAFI capability mismatch, unacceptable hold time), the router
sends a NOTIFICATION and the session falls back to Idle.
- OpenConfirm → OPEN exchanged and accepted; waiting for the first KEEPALIVE.
This state is normally transient. A peer stuck here points to KEEPALIVE loss
or a NOTIFICATION from the remote end.
Check "last reset reason" and "last error" fields — they often provide the
definitive cause. Reference: references/state-machine.md for full FSM detail.
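The state-to-failure-domain reasoning above lends itself to a lookup table. The sketch below is illustrative only: the names `FSM_HINTS` and `first_checks` are hypothetical, the domain labels are a simplification, and both OPEN-phase states share one negotiation checklist rather than being split further.

```python
# Checks shared by the two OPEN-phase states (OpenSent, OpenConfirm).
NEGOTIATION_CHECKS = [
    "remote BGP not configured for this neighbor",
    "remote peer in admin shutdown",
    "AS number mismatch",
    "AFI/SAFI capability mismatch",
    "hold timer negotiation",
]

# FSM state (lowercase) -> (failure domain, first things to check).
FSM_HINTS = {
    "idle": ("local configuration or reachability", [
        "neighbor administratively shut down",
        "no route to peer address",
        "remote AS mismatch",
        "maximum-prefix limit tripped",
    ]),
    "connect": ("transport", [
        "Layer 3 reachability to peer",
        "firewall or ACL on TCP port 179",
    ]),
    "active": ("transport", [
        "ACLs blocking TCP port 179",
        "peer not configured for this neighbor",
        "peer address unreachable",
    ]),
    "opensent": ("session negotiation", NEGOTIATION_CHECKS),
    "openconfirm": ("session negotiation", NEGOTIATION_CHECKS),
}


def first_checks(state):
    """Return (failure_domain, checks) for an FSM state name, case-insensitive."""
    try:
        return FSM_HINTS[state.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown BGP FSM state: {state!r}") from None
```

A table like this only narrows the search; the last reset reason and last error fields remain the better source for the definitive cause.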
Step 3: Route Table Analysis
For Established peers, verify prefix exchange matches expectations.
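[Cisco]
show bgp ipv4 unicast neighbors [addr] routes | include Total
[JunOS]
show route receive-protocol bgp [addr] table summary
[EOS]
show ip bgp neighbors [addr] received-routes | include Total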
Compare received prefix count against baseline. Significant deviation indicates:
- Drop >10% → upstream is withdrawing routes (maintenance, filter change, or failure)
- Increase >10% → route leak or new prefixes originated upstream
- Zero received → peer is Established but sending no routes (no export policy
for locally originated routes on a JunOS peer, or an outbound filter on the
remote end blocking everything)
Check advertised prefix count similarly — confirm this router is sending the
expected number of routes to each peer.
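The deviation rules above reduce to a small classifier. This is a sketch under the assumption that a per-peer baseline count is already known; the function name `prefix_deviation` and the returned labels are hypothetical, and the 10% tolerance is taken from the list above.

```python
def prefix_deviation(received, baseline, tolerance=0.10):
    """Classify a received-prefix count against its baseline.

    Returns one of: "zero-received", "drop", "increase", "normal".
    """
    if baseline <= 0:
        raise ValueError("baseline must be a positive prefix count")
    if received == 0:
        # Established but no routes: check export policy and remote filters.
        return "zero-received"
    change = (received - baseline) / baseline
    if change < -tolerance:
        return "drop"
    if change > tolerance:
        return "increase"
    return "normal"
```

The same function can be applied to the advertised count per peer, with a separate baseline for the outbound direction.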
Step 4: Path Selection Verification
When traffic takes an unexpected path, walk the BGP best-path algorithm to
identify which attribute is making the selection.
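[Cisco]
show bgp ipv4 unicast [prefix] bestpath
[JunOS]
show route [prefix] detail | match "AS path|Local|MED|Weight|preference"
[EOS]
show ip bgp [prefix] detail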
The best-path algorithm evaluates in this order (first difference wins):
1. Weight (Cisco/EOS local, highest wins; JunOS does not use weight)
2. Local Preference (highest wins, default 100)
3. Locally originated (network/aggregate preferred over learned)
4. AS Path length (shortest wins)
5. Origin (IGP < EGP < Incomplete)
6. MED (lowest wins, compared only within same neighbor AS by default)
7. eBGP over iBGP (external preferred)
8. IGP metric to next-hop (lowest wins)
9. Router ID (lowest wins, tiebreaker)
Identify which attribute selects the current best path. If unexpected, check
the route-map or policy applying that attribute on ingress.
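Walking the algorithm by hand is easier to follow with a two-path comparison in code. The sketch below follows the simplified order listed above and is not any vendor's implementation: the `Path` fields and the `compare` function are hypothetical, and MED is compared only when both paths share the same first AS in the AS path.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"


def _neighbor_as(path):
    # The first AS in the path identifies the neighboring AS.
    return path.as_path[0] if path.as_path else None


def _rid(router_id):
    # Compare router IDs numerically, octet by octet.
    return tuple(int(octet) for octet in router_id.split("."))


# Each step yields a key where the LOWER value wins.
STEPS = [
    ("weight", lambda p: -p.weight),
    ("local_pref", lambda p: -p.local_pref),
    ("locally_originated", lambda p: 0 if p.locally_originated else 1),
    ("as_path_length", lambda p: len(p.as_path)),
    ("origin", lambda p: ORIGIN_RANK[p.origin]),
    ("med", lambda p: p.med),
    ("ebgp_over_ibgp", lambda p: 0 if p.ebgp else 1),
    ("igp_metric", lambda p: p.igp_metric),
    ("router_id", lambda p: _rid(p.router_id)),
]


def compare(a, b):
    """Return (winning_path, deciding_step) for two candidate paths."""
    for step, key in STEPS:
        if step == "med" and _neighbor_as(a) != _neighbor_as(b):
            continue  # MED is skipped across different neighbor ASes
        ka, kb = key(a), key(b)
        if ka != kb:
            return (a if ka < kb else b), step
    return a, "tie"
```

For example, `compare(Path("via-A", local_pref=200, as_path=(65001, 65002, 65003)), Path("via-B", as_path=(65010,)))` selects `via-A` at the `local_pref` step even though its AS path is longer, which is the kind of result that points back to an ingress policy.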
Step 5: Route Filtering Validation
Verify that route-maps, prefix-lists, and community filters apply as intended.
[Cisco]
show bgp ipv4 unicast neighbors [addr] policy
[JunOS]
show policy [policy-name] | display detail
[EOS]
show route-map [name]
For suspected route leaks: examine the RIB for prefixes that should not be
present. Check inbound filters on the peer that is the source. Common leak
causes: missing or misordered prefix-list entry, regex error in AS-path
filter, community match that is too broad.
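Checking a RIB dump for prefixes that should not be present can be automated with the standard `ipaddress` module. This is a simplified sketch: it assumes the allowed set is a list of covering aggregates and that any more-specific prefix inside them is acceptable, which ignores the `ge`/`le` length bounds a real prefix-list would carry. The function name `find_leaks` is hypothetical.

```python
import ipaddress


def find_leaks(rib_prefixes, allowed):
    """Return RIB prefixes that are not covered by any allowed aggregate."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    leaks = []
    for prefix in rib_prefixes:
        net = ipaddress.ip_network(prefix)
        covered = any(
            # subnet_of() requires both networks to be the same IP version.
            net.version == agg.version and net.subnet_of(agg)
            for agg in allowed_nets
        )
        if not covered:
            leaks.append(prefix)
    return leaks
```

Any prefix this returns is a candidate for tracing back to the peer that sent it and the inbound filter that admitted it.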
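For missing routes: verify the outbound policy on the advertising peer is not
filtering the prefix. On JunOS, a peer with no export policy advertises only
active BGP-learned routes by default; static, direct, and IGP routes are not
sent until an export policy matches them. This is the most common
JunOS-specific omission.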
Step 6: Convergence Assessment
Evaluate convergence behavior and route stability.
[Cisco]
show bgp ipv4 unicast dampening dampened-paths
[JunOS]
show route damping suppressed
[EOS]
show ip bgp dampening dampened-paths
Check for dampened (suppressed) routes — these indicate persistent flapping.
Review the BGP update activity: high update/withdrawal rates indicate churn.
After a planned change, measure convergence time from the change event to the
last BGP update. Compare against the target convergence window.
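Measuring convergence from the change event to the last BGP update can be sketched as follows. The example assumes update timestamps have already been extracted from logs or telemetry as `datetime` objects; how they are collected is outside this sketch, and the function name `convergence_seconds` is hypothetical.

```python
from datetime import datetime


def convergence_seconds(change_time, update_times):
    """Seconds from the change event to the last BGP update that followed it.

    Updates recorded before the change are ignored. Returns 0.0 when no
    update was seen after the change.
    """
    after = [t for t in update_times if t >= change_time]
    if not after:
        return 0.0
    return (max(after) - change_time).total_seconds()
```

The result is only as good as the observation window: if churn is still in progress when the timestamps are collected, the measured value understates convergence time.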
Threshold Tables
Operational parameter norms for BGP — these are protocol-level expectations, not
device resource thresholds.
| Parameter | Cisco Default | JunOS Default | EOS Default | Notes |
|---|---|---|---|---|
| Hold Timer | 180s | 90s | 180s | Negotiated to lower value |
| Keepalive Interval | 60s | 30s | 60s | Hold/3 by convention |
| ConnectRetry Timer | 120s | Varies | 120s | Time between TCP attempts |
| MRAI (eBGP) | 30s | 0s (immediate) | 30s | Minimum Route Advertisement Interval |
| MRAI (iBGP) | 5s | 0s | 5s | Lower than eBGP for faster iBGP convergence |
| Default Local Pref | 100 | 100 | 100 | Same across vendors |
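The hold timer negotiation noted in the table (lower value wins, keepalive at one third of hold) can be worked through in code. This is a minimal sketch with a hypothetical function name; it follows the RFC 4271 rule that a hold time must be either 0 or at least 3 seconds, and that a negotiated hold time of 0 disables keepalives.

```python
def negotiate_timers(local_hold, remote_hold):
    """Return (hold_time, keepalive_interval) in seconds for a session."""
    for value in (local_hold, remote_hold):
        if value != 0 and value < 3:
            raise ValueError("hold time must be 0 or at least 3 seconds")
    # Both sides use the smaller of the two proposed hold times.
    hold = min(local_hold, remote_hold)
    # Keepalive is one third of hold by convention; 0 means no keepalives.
    keepalive = hold // 3
    return hold, keepalive
```

A Cisco or EOS router at its 180s default peering with a JunOS router at 90s therefore runs the session at `negotiate_timers(180, 90)`, that is, a 90s hold time with 30s keepalives.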
Table Size Norms (IPv4 unicast):
| Deployment Type | Expected Prefixes | Warning | Critical |
|---|---|---|---|
| Internet edge (full table) | ~950K | >1M | >1.1M |
| Internet edge (partial) | 5K–100K | Varies | Per design |
| Enterprise WAN | 100–10K | >2x baseline | >5x baseline |
| Data center leaf | 50–5K | >2x baseline | >5x baseline |
Convergence Targets:
| Scenario | Target | Acceptable | Degraded |
|---|---|---|---|
| eBGP failover | < 90s | 90–180s | > 180s |
| iBGP reconvergence | < 30s | 30–60s | > 60s |
| Full table reload | < 5min | 5–10min | > 10min |
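The convergence targets translate directly into a rating function. The sketch below encodes the three rows of the table in seconds; the scenario keys and the function name `rate_convergence` are hypothetical labels chosen for illustration.

```python
# scenario -> (target upper bound, acceptable upper bound), in seconds
CONVERGENCE_TARGETS = {
    "ebgp_failover": (90, 180),
    "ibgp_reconvergence": (30, 60),
    "full_table_reload": (300, 600),
}


def rate_convergence(scenario, seconds):
    """Rate a measured convergence time as target, acceptable, or degraded."""
    target, acceptable = CONVERGENCE_TARGETS[scenario]
    if seconds < target:
        return "target"
    if seconds <= acceptable:
        return "acceptable"
    return "degraded"
```

A 45-second measurement rates as `target` for eBGP failover but only `acceptable` for iBGP reconvergence, so the scenario has to be identified before the number is judged.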
Decision Trees
Peer Not Established
CODEBLOCK18
Unexpected Route Selection
CODEBLOCK19
Report Template
CODEBLOCK20
Troubleshooting
Session Flapping
Peer cycles between Established and Idle/Active repeatedly. Common causes:
unstable underlying transport (IGP flap, link errors), aggressive hold timers
on congested control planes, or path MTU issues that drop large UPDATE
segments and stall the TCP stream until the hold timer expires. Check last
reset reason and correlate with interface or IGP events at the same timestamps.
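Correlating resets with interface or IGP events is a timestamp join. The sketch below assumes both sets of timestamps are already available as `datetime` objects and uses a lookback window, since a transport event can precede the resulting reset by up to the hold time; the function name `correlate_resets` and the 180-second default are illustrative assumptions.

```python
from datetime import datetime, timedelta


def correlate_resets(reset_times, event_times, lookback_seconds=180):
    """Map each BGP reset to the events that preceded it within the window."""
    window = timedelta(seconds=lookback_seconds)
    matches = {}
    for reset in reset_times:
        matches[reset] = [
            event
            for event in event_times
            # Keep events at or before the reset, no older than the window.
            if timedelta(0) <= reset - event <= window
        ]
    return matches
```

Resets with an empty match list are the interesting ones: with no transport event nearby, timers, control-plane load, or MTU become the more likely suspects.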
Route Oscillation
The same prefix alternates between two or more paths. Caused by inconsistent
MED comparison across route reflectors, or deterministic-MED not enabled when
multiple exit points exist to the same neighbor AS. Enable always-compare-med
and deterministic-med to stabilize.
Memory Pressure from Full Table
Full Internet table (~950K IPv4 prefixes) requires 1–2 GB of RIB memory
depending on path diversity. Symptoms: slow convergence, peer resets during
table reload. Mitigate with soft-reconfiguration inbound (trades memory for
stability) or ORF (Outbound Route Filtering) to reduce inbound load.
Community Stripping
Routes arrive without expected communities. Check each transit AS in the path:
many providers strip non-standard communities by default. Large communities
(RFC 8092) may survive where standard communities are stripped, but confirm
each provider's handling, since transit policy can strip them as well.
JunOS Default Export Policy
Without an explicit export policy, JunOS advertises only active BGP-learned
routes; static, direct, and IGP routes are not sent. If a peer shows
Established with zero prefixes sent, add an export policy that matches the
routes to originate. This is the most common JunOS-specific BGP issue and does
not occur in this form on Cisco or EOS.