Interface Health Assessment
Threshold-driven diagnostic skill for interface and link health. Covers the
physical and data-link layers — error counters, optical power levels, discard
rates, interface flaps, and bandwidth utilization. Each metric is evaluated
against four severity tiers (Normal / Warning / Critical / Emergency) with
vendor-specific collection commands.
Commands are labeled [Cisco], [JunOS], or [EOS] where syntax
diverges. Unlabeled statements apply to all three vendors. Detailed command
syntax is in references/cli-reference.md; full threshold tables with
per-optic-type ranges are in references/threshold-tables.md.
When to Use
- Interface reported down or flapping (repeated up/down transitions)
- Users reporting packet loss or degraded throughput on a link
- Monitoring alerts for CRC errors, input errors, or output drops
- Pre- and post-maintenance validation of link quality around a cable or optic swap
- Optical power alarms from DOM (Digital Optical Monitoring) readings
- Capacity planning — identifying interfaces approaching saturation
- Troubleshooting latency spikes that correlate with interface congestion
- Baseline collection for new link turn-ups or circuit migrations
Prerequisites
- SSH or console access to the device (read-only privilege sufficient)
- Interfaces to evaluate are identified (specific interfaces or all active)
- Baseline error counts or a prior snapshot for delta comparison; without a
  baseline, only instantaneous rates and absolute counters are available
- Knowledge of expected link parameters: speed, duplex, media type (copper vs
  fiber), SFP model, and cable distance
- For optical checks: SFP/QSFP modules with DOM support installed
Procedure
Work through each step sequentially. Early steps collect broad status; later
steps drill into specific failure domains identified by prior output.
Step 1: Interface Status Overview
Collect admin state, operational state, speed, duplex, and media type for all
interfaces under review.
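[Cisco]
show interfaces status
show interfaces [intf] | include line protocol|BW|duplex
[JunOS]
show interfaces terse
show interfaces [intf] | match Physical|Speed|Duplex|Link-level
[EOS]
show interfaces status
show interfaces [intf] | include line protocol|BW|duplex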
Record each interface: name, admin/oper state, speed, duplex, media type. Any
interface that is admin up but operationally down requires immediate
investigation — skip to the Decision Trees section for that interface. Duplex
mismatches (one end full, other half) cause late collisions and must be
resolved before error analysis is meaningful.
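The two Step 1 checks can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the `InterfaceStatus` record and both function names are hypothetical, and the sketch assumes the CLI output has already been parsed into these fields.

```python
from dataclasses import dataclass


@dataclass
class InterfaceStatus:
    """Hypothetical record of the Step 1 fields for one interface."""
    name: str
    admin_up: bool
    oper_up: bool
    duplex: str  # "full" or "half", as reported by the device


def needs_immediate_investigation(intf: InterfaceStatus) -> bool:
    """Admin up but operationally down: go straight to the decision trees."""
    return intf.admin_up and not intf.oper_up


def duplex_mismatch(local: InterfaceStatus, remote: InterfaceStatus) -> bool:
    """True when one end reports full duplex and the other half duplex."""
    return {local.duplex, remote.duplex} == {"full", "half"}


local = InterfaceStatus("Ethernet1", admin_up=True, oper_up=True, duplex="full")
remote = InterfaceStatus("ge-0/0/1", admin_up=True, oper_up=True, duplex="half")
down = InterfaceStatus("Ethernet2", admin_up=True, oper_up=False, duplex="full")

print(needs_immediate_investigation(down))  # True
print(duplex_mismatch(local, remote))       # True
```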
Step 2: Error Counter Analysis
Collect error counters and calculate per-interval rates. Raw counters are
cumulative since last clear — always compute a delta over a known interval
(minimum 5 minutes) for actionable rates.
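[Cisco]
show interfaces [intf] | include CRC|input errors|output errors|frame|runts|giants
show interfaces [intf] counters errors
[JunOS]
show interfaces [intf] extensive | match CRC|Errors|Framing|Runts|Giants
[EOS]
show interfaces [intf] counters errors
show interfaces [intf] | include CRC|input errors|output errors|runts|giants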
Key counters to evaluate:
- CRC errors: corrupted frames; indicates physical-layer problems (bad
  cable, dirty fiber, failing SFP, EMI)
- Input errors: superset including CRC, frame, overrun; aggregate
  indicator of receive-path health
- Output errors: transmission failures; often buffer exhaustion or
  interface congestion
- Frame errors: non-integer-octet frames; typically duplex mismatch or
  bad NIC
- Runts: undersized frames (<64 bytes); usually collision fragments or
  bad NIC
- Giants: oversized frames; MTU mismatch between endpoints
Compare rates against thresholds in references/threshold-tables.md. Any
counter incrementing steadily (not a stale historical value) at Warning level
or above warrants investigation.
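The delta calculation described in this step can be sketched as below. The `CounterSample` structure and `rate_per_5min` name are assumptions made for illustration; the sketch takes two readings of one cumulative counter and normalizes the increase to a 5-minute window so it can be compared against the per-5-minute thresholds.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative interface counter (hypothetical structure)."""
    timestamp_s: float  # seconds since epoch when the counter was read
    value: int          # raw cumulative counter value from the CLI


def rate_per_5min(earlier: CounterSample, later: CounterSample) -> float:
    """Return the counter increase normalized to a 5-minute window."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed < 300:
        raise ValueError("sample interval must be at least 5 minutes")
    delta = later.value - earlier.value
    if delta < 0:
        # The counter was cleared or the device reloaded between samples,
        # so the delta is meaningless and a new baseline is needed.
        raise ValueError("counter decreased; take a new baseline")
    return delta * 300.0 / elapsed


before = CounterSample(timestamp_s=0.0, value=1200)
after = CounterSample(timestamp_s=600.0, value=1230)
print(rate_per_5min(before, after))  # 30 errors in 10 minutes -> 15.0
```

Rejecting a negative delta rather than guessing at a counter wrap is a deliberate simplification; a stale historical value shows up here as a rate of 0.0.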
Step 3: Discard Analysis
Evaluate input and output discards separately — they have different root
causes.
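[Cisco]
show interfaces [intf] | include drops|discard|queue
show policy-map interface [intf]
[JunOS]
show interfaces queue [intf]
show class-of-service interface [intf]
[EOS]
show interfaces [intf] counters discards
show qos interface [intf]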
- Output discards: interface transmit ring full. Causes: sustained
  congestion (traffic exceeds link capacity), inadequate QoS scheduling,
  microbursts overwhelming shallow buffers.
- Input discards: receive ring full. Causes: CPU unable to process at
  line rate (control plane punt), input QoS policer drops, or receive buffer
  exhaustion.
- Queue drops: per-queue drops visible in QoS policy output. Identify
  which traffic class is affected to prioritize remediation.
High output discards with low utilization suggest microburst activity:
short-duration traffic spikes that don't appear in 5-minute utilization
averages but overflow interface buffers.
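A rough heuristic for that pattern might look like the sketch below. The function name and both default cut-offs are assumptions: the 30% figure mirrors the low-utilization case in the Troubleshooting section, and the floor of 10 discards is the top of the Normal tier in the summary table.

```python
def suspect_microbursts(avg_utilization_pct: float,
                        output_discards_per_5min: int,
                        low_util_pct: float = 30.0,
                        discard_floor: int = 10) -> bool:
    """Flag output discards that average utilization cannot explain."""
    return (avg_utilization_pct < low_util_pct
            and output_discards_per_5min > discard_floor)


print(suspect_microbursts(22.0, 450))  # True: low average, many discards
print(suspect_microbursts(85.0, 450))  # False: sustained congestion instead
print(suspect_microbursts(22.0, 4))    # False: discards within Normal tier
```

A positive result is only a prompt to look at per-queue drop counters; it does not confirm microbursts on its own.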
Step 4: Interface Reset and Flap Detection
Identify interfaces with recent or recurring resets.
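[Cisco]
show interfaces [intf] | include resets|Last input|Last output|last change
[JunOS]
show interfaces [intf] extensive | match Last flapped|Resets
[EOS]
show interfaces [intf] | include resets|Last input|Last output
show logging | include [intf].up|[intf].down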
Record reset counts and last-flap timestamps. Correlate flap events with error
counter spikes — a link that flaps and accumulates CRC errors on recovery
likely has a physical-layer issue (loose cable, marginal SFP). Frequent resets
without errors may indicate auto-negotiation failures or spanning-tree
reconvergence triggers.
Threshold: 3–10 resets/hour is Critical; >10 resets/hour is Emergency. See
references/threshold-tables.md for full severity tiers.
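One way to turn flap timestamps into a severity is sketched below. It assumes the timestamps have already been extracted from device logs into `datetime` objects, and it uses the reset tiers from the summary table (0, 1–2, 3–10, >10 per hour); the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def resets_in_last_hour(flap_times: Sequence[datetime], now: datetime) -> int:
    """Count flap events in the hour ending at `now`."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in flap_times if cutoff < t <= now)


def reset_severity(resets_per_hour: int) -> str:
    """Map a resets-per-hour count onto the four severity tiers."""
    if resets_per_hour > 10:
        return "Emergency"
    if resets_per_hour >= 3:
        return "Critical"
    if resets_per_hour >= 1:
        return "Warning"
    return "Normal"


now = datetime(2024, 1, 1, 12, 0)
flaps = [
    datetime(2024, 1, 1, 10, 30),  # older than one hour, ignored
    datetime(2024, 1, 1, 11, 5),
    datetime(2024, 1, 1, 11, 20),
    datetime(2024, 1, 1, 11, 58),
]
count = resets_in_last_hour(flaps, now)
print(count, reset_severity(count))  # 3 Critical
```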
Step 5: Optical Power Monitoring
For fiber interfaces with DOM-capable SFPs, collect Tx power, Rx power, laser
bias current, and module temperature.
[Cisco]
show interfaces [intf] transceiver detail
[JunOS]
show interfaces diagnostics optics [intf]
[EOS]
show interfaces [intf] transceiver detail
Key readings:
- Tx Power (dBm): transmit optical power. Out-of-range indicates SFP
  degradation or failure.
- Rx Power (dBm): received optical power. Low Rx with normal Tx on the
  remote end indicates fiber attenuation (dirty connector, bend loss, distance
  exceeded, bad splice).
- Laser Bias Current (mA): current driving the laser. Rising bias over
  time indicates SFP aging; high bias with low Tx power means the SFP is
  compensating for degradation.
- Temperature (°C): module operating temperature. Elevated temperature
  accelerates SFP aging and can cause transmission errors.
Compare readings against the per-optic-type tables in
references/threshold-tables.md. The tables provide manufacturer spec ranges
for common SFP types (1G-SX, 10G-SR, 10G-LR, 25G-SR, 100G-SR4).
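The Rx power row of the summary table can be expressed as a margin check, sketched below. The threshold values in the example are invented for illustration and are not taken from any optic's datasheet; real values come from the module's DOM output or references/threshold-tables.md. Two choices are assumptions of this sketch: a margin of exactly 1 dB is treated as Warning, and a reading between low-warn and low-alarm is treated as Critical.

```python
def rx_power_severity(rx_dbm: float,
                      low_warn_dbm: float,
                      low_alarm_dbm: float) -> str:
    """Classify Rx power by its margin above the low-warning threshold."""
    if rx_dbm < low_alarm_dbm:
        return "Emergency"
    margin_db = rx_dbm - low_warn_dbm  # difference of two dBm values is in dB
    if margin_db > 3.0:
        return "Normal"
    if margin_db >= 1.0:
        return "Warning"
    # Covers 0-1 dB of margin and readings already below low-warn.
    return "Critical"


# Hypothetical thresholds for illustration only.
LOW_WARN, LOW_ALARM = -9.9, -13.9

for rx in (-5.0, -8.0, -9.5, -14.5):
    print(rx, rx_power_severity(rx, LOW_WARN, LOW_ALARM))
```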
Step 6: Utilization Assessment
Measure bandwidth usage to identify congested or underutilized links.
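[Cisco]
show interfaces [intf] | include input rate|output rate|reliability
show interfaces [intf] summary
[JunOS]
show interfaces [intf] traffic
show interfaces [intf] statistics traffic
[EOS]
show interfaces [intf] | include input rate|output rate
show interfaces [intf] counters rates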
Calculate utilization as a percentage of interface speed. Note that CLI
"input/output rate" values are typically 5-minute weighted averages — they
smooth out microbursts. For burst detection, correlate with output discards
(Step 3) and use streaming telemetry or shorter polling intervals if available.
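The utilization arithmetic is simple enough to sketch directly. The function names are hypothetical; the tier boundaries are the utilization row of the summary table. On a full-duplex link the input and output rates would each be evaluated separately against the interface speed.

```python
def utilization_pct(rate_bps: float, speed_bps: float) -> float:
    """Return a traffic rate as a percentage of interface speed."""
    if speed_bps <= 0:
        raise ValueError("interface speed must be positive")
    return 100.0 * rate_bps / speed_bps


def utilization_severity(pct: float) -> str:
    """Map a utilization percentage onto the four severity tiers."""
    if pct > 90.0:
        return "Emergency"
    if pct > 75.0:
        return "Critical"
    if pct > 50.0:
        return "Warning"
    return "Normal"


# 6.2 Gbps output rate on a 10 Gbps interface.
pct = utilization_pct(6_200_000_000, 10_000_000_000)
print(pct, utilization_severity(pct))  # 62.0 Warning
```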
Threshold Tables
Summary of key thresholds used in this skill. Full per-optic-type tables and
detailed severity definitions are in references/threshold-tables.md.
| Metric | Normal | Warning | Critical | Emergency |
|---|---|---|---|---|
| CRC errors/5min | 0 | 1–5 | 6–50 | >50 |
| Input errors/5min | 0–2 | 3–20 | 21–100 | >100 |
| Output discards/5min | 0–10 | 11–100 | 101–1000 | >1000 |
| Interface resets/hr | 0 | 1–2 | 3–10 | >10 |
| Rx power margin above low-warn | >3 dB | 1–3 dB | 0–1 dB | Below low-alarm |
| Utilization % | 0–50% | 51–75% | 76–90% | >90% |
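For the counter-based rows, the table can be encoded as data and looked up generically. This is a sketch only: the dictionary keys and the `classify` function are hypothetical names, and each tuple holds the upper bound of the Normal, Warning, and Critical tiers from the summary table.

```python
TIERS = ("Normal", "Warning", "Critical", "Emergency")

# Upper bounds (inclusive) for Normal, Warning, Critical; above the last
# bound the metric is Emergency. Values copied from the summary table.
THRESHOLDS = {
    "crc_errors_per_5min": (0, 5, 50),
    "input_errors_per_5min": (2, 20, 100),
    "output_discards_per_5min": (10, 100, 1000),
}


def classify(metric: str, value: float) -> str:
    """Return the severity tier for a per-5-minute counter rate."""
    for tier, upper in zip(TIERS, THRESHOLDS[metric]):
        if value <= upper:
            return tier
    return TIERS[-1]


print(classify("crc_errors_per_5min", 0))         # Normal
print(classify("crc_errors_per_5min", 7))         # Critical
print(classify("output_discards_per_5min", 100))  # Warning
print(classify("input_errors_per_5min", 101))     # Emergency
```

Keeping the bounds in one table makes it easier to swap in the fuller values from references/threshold-tables.md without touching the lookup logic.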
Decision Trees
High Error Rate
CODEBLOCK18
High Discards
CODEBLOCK19
Optical Power Out of Range
CODEBLOCK20
Report Template
CODEBLOCK21
Troubleshooting
CRC Errors on Fiber with Normal Optical Power
Optical power within spec but CRC errors incrementing. Common causes:
wavelength mismatch between SFP types (e.g., SX connected to LR), dirty
connector on the inside of the SFP cage (not the fiber tip), or SFP
incompatibility with the switch (non-qualified optic). Try: clean the SFP
receptacle, verify both ends use the same SFP type, test with a
vendor-qualified optic.
Output Discards with Low Utilization
Interface shows <30% average utilization but output discards are climbing. This
is almost always microburst traffic — sub-second spikes that exceed link
capacity during the burst but average out below the utilization threshold.
Diagnose with: per-queue drop counters (shows which traffic class), interface
buffer allocation stats. Remediate with: QoS scheduling adjustments, increased
interface buffer depth, or traffic shaping at the ingress point.
Interface Stuck in Down/Down After Cable Swap
Admin up, operationally down after replacing a cable or SFP. Check: SFP is
fully seated (push firmly until click), fiber polarity is correct (Tx-to-Rx
crossover), SFP type matches remote end, speed/duplex is set to auto or
matches. On [Cisco], check show interfaces [intf] | include err-disabled
— the port may have been error-disabled by a protection feature (BPDU guard,
UDLD, link-flap detection). Recover with shutdown / no shutdown after
fixing the root cause.
Flapping Interface with No Errors
Interface cycles up/down every few seconds with zero error counters. This
suggests a negotiation or protocol issue, not a physical fault. Common causes:
auto-negotiation incompatibility (force speed/duplex on both ends), STP
topology changes causing repeated blocking/forwarding transitions, UDLD
aggressive mode detecting unidirectional link. Check spanning-tree state and
UDLD status on the interface.
Rising Laser Bias with Stable Tx Power
Laser bias current increasing over weeks/months while Tx power remains stable.
The SFP is compensating for laser degradation by driving more current. This is
normal aging but indicates the SFP will eventually fail — Tx power will drop
when the laser can no longer compensate. Plan proactive replacement before the
Tx power begins declining. Track the trend: if bias current exceeds 80% of the
manufacturer's max specification, schedule replacement within 30 days.
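The 80% rule above could be checked as in this sketch. The function name and the example currents are assumptions; the maximum bias figure has to come from the optic's datasheet or DOM alarm thresholds.

```python
def bias_replacement_due(bias_ma: float,
                         max_bias_ma: float,
                         limit_fraction: float = 0.80) -> bool:
    """True when laser bias exceeds the given fraction of the rated maximum."""
    if max_bias_ma <= 0:
        raise ValueError("maximum bias current must be positive")
    return bias_ma > limit_fraction * max_bias_ma


# Hypothetical readings against a hypothetical 50 mA maximum.
print(bias_replacement_due(42.0, 50.0))  # True: above 40 mA (80% of 50)
print(bias_replacement_due(35.0, 50.0))  # False
```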
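Interface Health Assessment
Threshold-driven diagnostic skill for interface and link health. Covers the physical and data-link layers: error counters, optical power levels, discard rates, interface flaps, and bandwidth utilization. Each metric is evaluated against four severity tiers (Normal / Warning / Critical / Emergency) with vendor-specific collection commands.
Commands are labeled [Cisco], [JunOS], or [EOS] where syntax diverges. Unlabeled statements apply to all three vendors. Detailed command syntax is in references/cli-reference.md; full threshold tables with per-optic-type ranges are in references/threshold-tables.md.
When to Use
- Interface reported down or flapping (repeated up/down transitions)
- Users reporting packet loss or degraded throughput on a link
- Monitoring alerts for CRC errors, input errors, or output drops
- Pre- and post-maintenance validation of link quality around a cable or optic swap
- Optical power alarms from DOM (Digital Optical Monitoring) readings
- Capacity planning: identifying interfaces approaching saturation
- Troubleshooting latency spikes that correlate with interface congestion
- Baseline collection for new link turn-ups or circuit migrations
Prerequisites
- SSH or console access to the device (read-only privilege sufficient)
- Interfaces to evaluate are identified (specific interfaces or all active)
- Baseline error counts or a prior snapshot for delta comparison; without a baseline, only instantaneous rates and absolute counters are available
- Knowledge of expected link parameters: speed, duplex, media type (copper vs fiber), SFP model, and cable distance
- For optical checks: SFP/QSFP modules with DOM support installed
Procedure
Work through each step sequentially. Early steps collect broad status; later steps drill into specific failure domains identified by prior output.
Step 1: Interface Status Overview
Collect admin state, operational state, speed, duplex, and media type for all interfaces under review.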
[Cisco]
show interfaces status
show interfaces [intf] | include line protocol|BW|duplex
[JunOS]
show interfaces terse
show interfaces [intf] | match Physical|Speed|Duplex|Link-level
[EOS]
show interfaces status
show interfaces [intf] | include line protocol|BW|duplex
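Record each interface: name, admin/oper state, speed, duplex, media type. Any interface that is admin up but operationally down requires immediate investigation; skip to the Decision Trees section for that interface. Duplex mismatches (one end full, other half) cause late collisions and must be resolved before error analysis is meaningful.
Step 2: Error Counter Analysis
Collect error counters and calculate per-interval rates. Raw counters are cumulative since last clear, so always compute a delta over a known interval (minimum 5 minutes) for actionable rates.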
[Cisco]
show interfaces [intf] | include CRC|input errors|output errors|frame|runts|giants
show interfaces [intf] counters errors
[JunOS]
show interfaces [intf] extensive | match CRC|Errors|Framing|Runts|Giants
[EOS]
show interfaces [intf] counters errors
show interfaces [intf] | include CRC|input errors|output errors|runts|giants
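Key counters to evaluate:
- CRC errors: corrupted frames; indicates physical-layer problems (bad cable, dirty fiber, failing SFP, EMI)
- Input errors: superset including CRC, frame, overrun; aggregate indicator of receive-path health
- Output errors: transmission failures; often buffer exhaustion or interface congestion
- Frame errors: non-integer-octet frames; typically duplex mismatch or bad NIC
- Runts: undersized frames (<64 bytes); usually collision fragments or bad NIC
- Giants: oversized frames; MTU mismatch between endpoints
Compare rates against thresholds in references/threshold-tables.md. Any counter incrementing steadily (not a stale historical value) at Warning level or above warrants investigation.
Step 3: Discard Analysis
Evaluate input and output discards separately; they have different root causes.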
[Cisco]
show interfaces [intf] | include drops|discard|queue
show policy-map interface [intf]
[JunOS]
show interfaces queue [intf]
show class-of-service interface [intf]
[EOS]
show interfaces [intf] counters discards
show qos interface [intf]
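- Output discards: interface transmit ring full. Causes: sustained congestion (traffic exceeds link capacity), inadequate QoS scheduling, microbursts overwhelming shallow buffers.
- Input discards: receive ring full. Causes: CPU unable to process at line rate (control plane punt), input QoS policer drops, or receive buffer exhaustion.
- Queue drops: per-queue drops visible in QoS policy output. Identify which traffic class is affected to prioritize remediation.
High output discards with low utilization suggest microburst activity: short-duration traffic spikes that don't appear in 5-minute utilization averages but overflow interface buffers.
Step 4: Interface Reset and Flap Detection
Identify interfaces with recent or recurring resets.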
[Cisco]
show interfaces [intf] | include resets|Last input|Last output|last change
[JunOS]
show interfaces [intf] extensive | match Last flapped|Resets
[EOS]
show interfaces [intf] | include resets|Last input|Last output
show logging | include [intf].up|[intf].down
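Record reset counts and last-flap timestamps. Correlate flap events with error counter spikes: a link that flaps and accumulates CRC errors on recovery likely has a physical-layer issue (loose cable, marginal SFP). Frequent resets without errors may indicate auto-negotiation failures or spanning-tree reconvergence triggers.
Threshold: 3–10 resets/hour is Critical; >10 resets/hour is Emergency. See references/threshold-tables.md for full severity tiers.
Step 5: Optical Power Monitoring
For fiber interfaces with DOM-capable SFPs, collect Tx power, Rx power, laser bias current, and module temperature.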
[Cisco]
show interfaces [intf] transceiver detail
[JunOS]
show interfaces diagnostics optics [intf]
[EOS]
show interfaces [intf] transceiver detail
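Key readings:
- Tx Power (dBm): transmit optical power. Out-of-range indicates SFP degradation or failure.
- Rx Power (dBm): received optical power. Low Rx with normal Tx on the remote end indicates fiber attenuation (dirty connector, bend loss, distance exceeded, bad splice).
- Laser Bias Current (mA): current driving the laser. Rising bias over time indicates SFP aging; high bias with low Tx power means the SFP is compensating for degradation.
- Temperature (°C): module operating temperature. Elevated temperature accelerates SFP aging and can cause transmission errors.
Compare readings against the per-optic-type tables in references/threshold-tables.md. The tables provide manufacturer spec ranges for common SFP types (1G-SX, 10G-SR, 10G-LR, 25G-SR, 100G-SR4).
Step 6: Utilization Assessment
Measure bandwidth usage to identify congested or underutilized links.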
[Cisco]
show interfaces [intf] | include input rate|output rate|reliability
show interfaces [intf] summary
[JunOS]
show interfaces [intf] traffic
show interfaces [intf] statistics traffic
[EOS]
show interfaces [intf] | include input rate|output rate
show interfaces [intf] counters rates
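Calculate utilization as a percentage of interface speed. Note that CLI input/output rate values are typically 5-minute weighted averages, so they smooth out microbursts. For burst detection, correlate with output discards (Step 3) and use streaming telemetry or shorter polling intervals if available.
Threshold Tables
Summary of key thresholds used in this skill. Full per-optic-type tables and detailed severity definitions are in references/threshold-tables.md.
| Metric | Normal | Warning | Critical | Emergency |
|---|---|---|---|---|
| CRC errors/5min | 0 | 1–5 | 6–50 | >50 |
| Input errors/5min | 0–2 | 3–20 | 21–100 | >100 |
| Output discards/5min | 0–10 | 11–100 | 101–1000 | >1000 |
| Interface resets/hr | 0 | 1–2 | 3–10 | >10 |
| Rx power margin above low-warn | >3 dB | 1–3 dB | 0–1 dB | Below low-alarm |
| Utilization % | 0–50% | 51–75% | 76–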