Kubernetes Debugging Skill
Overview
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Trigger Phrases
Use this skill when requests resemble:
- "My pod is in `CrashLoopBackOff`; help me find the root cause."
- "Service DNS works in one pod but not another."
- "Deployment rollout is stuck."
- "Pods are `Pending` and not scheduling."
- "Cluster health looks degraded after a change."
- "PVC is pending and pods cannot mount storage."
Prerequisites
Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.
Required
- `kubectl` installed and configured.
- An active cluster context.
- Read access to namespaces, pods, events, services, and nodes.
Quick preflight:
    kubectl config current-context
    kubectl auth can-i get pods -A
    kubectl auth can-i get events -A
    kubectl get ns
Optional but Recommended
- `jq` for more precise filtering in `./scripts/cluster_health.sh`.
- Metrics API (`metrics-server`) for `kubectl top`.
- In-container debug tools (`nslookup`, `getent`, `curl`, `wget`, `ip`) for deep network tests.
Fallback behavior:
- If optional tools are missing, scripts continue and print warnings with reduced output.
- If `kubectl top` is unavailable, continue with `kubectl describe` and events.
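The fallback rules above could be mirrored in a small local check before running the scripts. This is a minimal sketch, not the scripts' actual logic: `check_tools` and its `finder` parameter are hypothetical names, and the check only looks at local binaries, so it cannot tell whether the Metrics API is installed in the cluster.

```python
import shutil
from typing import Callable, Optional

REQUIRED_TOOLS = ["kubectl"]  # listed under "Required"
OPTIONAL_TOOLS = ["jq"]       # listed under "Optional but Recommended"


def check_tools(finder: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Report missing local binaries.

    `finder` is injectable (hypothetical design choice) so the check can be
    exercised without modifying PATH.
    """
    missing_required = [t for t in REQUIRED_TOOLS if finder(t) is None]
    missing_optional = [t for t in OPTIONAL_TOOLS if finder(t) is None]
    return {
        # A missing required tool blocks the run; a missing optional tool only warns.
        "blocked": bool(missing_required),
        "missing_required": missing_required,
        "warnings": [
            f"{t} not found; continuing with reduced output" for t in missing_optional
        ],
    }
```

Calling `check_tools()` with no arguments uses `shutil.which`, so the result reflects the current `PATH`.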
When to Use This Skill
Use this skill for:
- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)
Safety Rules for Disruptive Commands
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
- `kubectl delete pod ... --force --grace-period=0`
- `kubectl drain ...`
- `kubectl rollout restart ...`
- `kubectl rollout undo ...`
- `kubectl debug ... --copy-to=...`
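A confirmation gate for these commands could look like the sketch below. The command shapes it recognizes (forced pod deletion, `drain`, `rollout restart`, `rollout undo`, and `debug` with `--copy-to`) come from this document's list of commands requiring confirmation; the function name `requires_confirmation` and the matching logic are illustrative assumptions. The sketch also assumes the subcommand directly follows `kubectl`, so global flags placed before the subcommand are not handled.

```python
import shlex


def requires_confirmation(command: str) -> bool:
    """Return True when a kubectl command line matches a disruptive pattern."""
    args = shlex.split(command)
    if len(args) < 2 or args[0] != "kubectl":
        return False
    rest = args[1:]
    if rest[0] == "drain":
        return True
    if rest[:2] in (["rollout", "restart"], ["rollout", "undo"]):
        return True
    # Plain deletes are not flagged here; only forced deletes are.
    if rest[0] == "delete" and "--force" in rest:
        return True
    # Accepts both "--copy-to=name" and "--copy-to name".
    if rest[0] == "debug" and any(a.startswith("--copy-to") for a in rest):
        return True
    return False
```

A wrapper could call this before executing anything and stop to ask the user whenever it returns `True`.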
Before disruptive actions:
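    # Snapshot current state for rollback and incident notes
    kubectl get deploy,rs,pod,svc -n <namespace> -o wide
    kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
    kubectl get events -n <namespace> --sort-by=.lastTimestamp > before-events.txt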
Reference Navigation Map
Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | `./references/troubleshooting_workflow.md` | `General Debugging Workflow` |
| Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff` | `./references/troubleshooting_workflow.md` | `Pod Lifecycle Troubleshooting` |
| Service reachability or DNS failure | `./references/troubleshooting_workflow.md` | `Network Troubleshooting Workflow` |
| Node pressure or performance regression | `./references/troubleshooting_workflow.md` | `Resource and Performance Workflow` |
| PVC / PV / storage class issues | `./references/troubleshooting_workflow.md` | `Storage Troubleshooting Workflow` |
| Quick symptom-to-fix lookup | `./references/common_issues.md` | matching issue heading |
| Post-mortem fix options for known issues | `./references/common_issues.md` | `Solutions` sections |
Scripts Overview
| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| `./scripts/cluster_health.sh` | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | `--strict`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| `./scripts/network_debug.sh` | Pod-centric network and DNS diagnostics | `<pod-name>` (`<namespace>` defaults to `default`) | `--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit `--insecure` |
| `./scripts/pod_diagnostics.py` | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | `<pod-name>` | `-n/--namespace`, `-o/--output` | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |
Script Exit Codes
`./scripts/cluster_health.sh` and `./scripts/network_debug.sh` share the same contract:
- `0`: checks completed with no check failures (warnings allowed unless `--strict` is set).
- `1`: one or more checks failed, or warnings occurred in `--strict` mode.
- `2`: blocked preconditions (for example: missing `kubectl`, no active context, inaccessible namespace/pod).
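An automation wrapper could translate this exit-code contract into messages along the following lines. This is a sketch under the three-value contract this document defines for the two shell scripts (`0` completed, `1` failed checks or strict-mode warnings, `2` blocked preconditions); `describe_exit_code` and `run_and_classify` are hypothetical helper names, and the `runner` parameter exists only so the logic can be tried without a cluster.

```python
import subprocess
from typing import Callable, List, Tuple

EXIT_MEANINGS = {
    0: "ok: checks completed with no check failures",
    1: "failed: one or more checks failed, or warnings occurred in --strict mode",
    2: "blocked: preconditions not met (kubectl, context, or namespace/pod access)",
}


def describe_exit_code(code: int) -> str:
    """Map a script exit code to the meaning given in the contract."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_and_classify(argv: List[str], runner: Callable = subprocess.run) -> Tuple[int, str]:
    """Run a script and return (exit code, human-readable meaning)."""
    result = runner(argv)
    return result.returncode, describe_exit_code(result.returncode)
```

For example, `run_and_classify(["./scripts/cluster_health.sh", "--strict"])` would run the health check and return the code together with its meaning. A `blocked` result suggests fixing access or context before reading anything into the report.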
Deterministic Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
1. Preflight and Scope
    kubectl config current-context
    kubectl get ns
    kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
2. Identify the Problem Layer
Categorize the issue:
- Application Layer: Application crashes, errors, bugs
- Pod Layer: Pod not starting, restarting, or pending
- Service Layer: Network connectivity, DNS issues
- Node Layer: Node not ready, resource exhaustion
- Cluster Layer: Control plane issues, API problems
- Storage Layer: Volume mount failures, PVC issues
- Configuration Layer: ConfigMap, Secret, RBAC issues
3. Gather Diagnostics with the Right Script
Use the appropriate diagnostic script based on scope:
Pod-Level Diagnostics
Use `./scripts/pod_diagnostics.py` for comprehensive pod analysis:
    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:
- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration
Output can be saved for analysis:
    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
Cluster-Level Health Check
Use `./scripts/cluster_health.sh` for overall cluster diagnostics:
    ./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
This script checks:
- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)
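To illustrate the last check in the list, the sketch below scans the JSON produced by `kubectl get pods -A -o json` for containers waiting in `CrashLoopBackOff` or `ImagePullBackOff`. It is not the implementation used by `./scripts/cluster_health.sh`; `find_error_states` is a hypothetical helper that relies only on the standard Pod status fields (`status.containerStatuses[].state.waiting.reason`).

```python
from typing import Dict, List

ERROR_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def find_error_states(pod_list: dict) -> List[Dict[str, str]]:
    """Return one finding per container that is waiting for a known error reason."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses may be absent for pods that were never scheduled.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in ERROR_REASONS:
                findings.append(
                    {
                        "namespace": meta.get("namespace", ""),
                        "pod": meta.get("name", ""),
                        "container": cs.get("name", ""),
                        "reason": reason,
                    }
                )
    return findings
```

The input would typically be `json.loads(...)` applied to the captured kubectl output. Init containers are reported under a separate field (`initContainerStatuses`) and are not covered by this sketch.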
Network Diagnostics
Use `./scripts/network_debug.sh` for connectivity issues:
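    ./scripts/network_debug.sh <pod-name> <namespace>
    # Or force warning sensitivity / insecure TLS only when explicitly needed:
    ./scripts/network_debug.sh --strict <pod-name> <namespace>
    ./scripts/network_debug.sh --insecure <pod-name> <namespace>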
This script analyzes:
- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs
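When DNS resolution needs a manual spot check, an in-pod lookup can be attempted with whichever tool the container image provides. The sketch below only builds the `kubectl exec` command lines; it does not run them and is not how `./scripts/network_debug.sh` is implemented. `dns_probe_commands` is a hypothetical helper, and the default lookup target `kubernetes.default` assumes the standard API service name resolves through the pod's DNS search path.

```python
from typing import List


def dns_probe_commands(
    pod: str, namespace: str = "default", name: str = "kubernetes.default"
) -> List[List[str]]:
    """Build DNS lookup commands to try in order inside the pod."""
    base = ["kubectl", "exec", pod, "-n", namespace, "--"]
    return [
        base + ["nslookup", name],         # preferred when the image ships nslookup
        base + ["getent", "hosts", name],  # fallback for images without nslookup
    ]
```

Trying the commands in order and stopping at the first one that succeeds keeps the probe usable on minimal images.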
4. Follow Issue-Specific Reference Workflow
Based on the identified issue, consult ./references/troubleshooting_workflow.md:
- Pod Pending: Resource/scheduling workflow
- CrashLoopBackOff: Application crash workflow
- ImagePullBackOff: Image pull workflow
- Service issues: Network connectivity workflow
- DNS failures: DNS troubleshooting workflow
- Resource exhaustion: Performance investigation workflow
- Storage issues: PVC binding workflow
- Deployment stuck: Rollout workflow
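The mapping above is small enough to encode as a lookup table, which keeps tooling or notes consistent about which workflow was followed. The entries are copied from the list; `workflow_for` is a hypothetical helper, and the case-insensitive matching is an assumption made for convenience.

```python
from typing import Optional, Tuple

REFERENCE = "./references/troubleshooting_workflow.md"

WORKFLOWS = {
    "Pod Pending": "Resource/scheduling workflow",
    "CrashLoopBackOff": "Application crash workflow",
    "ImagePullBackOff": "Image pull workflow",
    "Service issues": "Network connectivity workflow",
    "DNS failures": "DNS troubleshooting workflow",
    "Resource exhaustion": "Performance investigation workflow",
    "Storage issues": "PVC binding workflow",
    "Deployment stuck": "Rollout workflow",
}


def workflow_for(symptom: str) -> Optional[Tuple[str, str]]:
    """Return (reference file, workflow name) for a symptom, or None if unknown."""
    key = symptom.strip().lower()
    for name, workflow in WORKFLOWS.items():
        if name.lower() == key:
            return REFERENCE, workflow
    return None
```

An unknown symptom returns `None`, which is a cue to go back to the problem-layer step rather than guess a workflow.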
5. Apply Targeted Fixes
Refer to ./references/common_issues.md for symptom-specific fixes.
6. Verify and Close
Run final verification:
CODEBLOCK7
Issue is done when user-visible behavior is healthy and no new critical warning events appear.
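One way to check the "no new warning events" half of this criterion is to filter events by timestamp relative to when the fix was applied. The sketch below works on the JSON from `kubectl get events -n <namespace> -o json`; `new_warning_events` is a hypothetical helper, and it treats every event of type `Warning` as relevant, which is broader than "critical" and is an assumption of this example.

```python
from datetime import datetime, timezone
from typing import List


def _parse_timestamp(value: str) -> datetime:
    # Kubernetes emits RFC 3339 UTC timestamps such as "2024-05-01T12:00:00Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def new_warning_events(event_list: dict, since: datetime) -> List[dict]:
    """Return Warning events whose last occurrence is after `since`."""
    fresh = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue
        # Some events populate eventTime instead of lastTimestamp.
        stamp = event.get("lastTimestamp") or event.get("eventTime")
        if not stamp:
            continue
        if _parse_timestamp(stamp) > since:
            fresh.append(event)
    return fresh
```

Pass `since` as a timezone-aware `datetime` (for example `datetime.now(timezone.utc)` captured just before applying the fix). An empty result, together with healthy user-visible behavior, supports closing the issue.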
Example Flows
Example 1: CrashLoopBackOff in payments Namespace
CODEBLOCK8
Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
Example 2: Service DNS/Connectivity Failure
CODEBLOCK9
Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
Essential Manual Commands
Pod Debugging
CODEBLOCK10
Service and Network Debugging
CODEBLOCK11
Resource Monitoring
CODEBLOCK12
Emergency Operations
CODEBLOCK13
Completion Criteria
Troubleshooting session is complete when all are true:
- [ ] Cluster context and namespace are confirmed.
- [ ] Relevant diagnostic script output is captured.
- [ ] Root cause is identified and tied to evidence (events/logs/config/state).
- [ ] Any disruptive action was preceded by snapshot and rollback plan.
- [ ] Fix verification commands show healthy state.
- [ ] Reference path used (`./references/troubleshooting_workflow.md` or `./references/common_issues.md`) is documented in notes.
Related Tools
Useful additional tools for Kubernetes debugging:
- kubectl-debug: Advanced debugging plugin
- stern: Multi-pod log tailing
- kubectx/kubens: Context and namespace switching
- k9s: Terminal UI for Kubernetes
- lens: Desktop IDE for Kubernetes
- Prometheus/Grafana: Monitoring and alerting
- Jaeger/Zipkin: Distributed tracing
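Kubernetes Debugging Skill
Overview
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Trigger Phrases
Use this skill when requests resemble:
- "My pod is in `CrashLoopBackOff`; help me find the root cause."
- "Service DNS works in one pod but not another."
- "Deployment rollout is stuck."
- "Pods are `Pending` and not scheduling."
- "Cluster health looks degraded after a change."
- "PVC is pending and pods cannot mount storage."
Prerequisites
Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.
Required
- `kubectl` installed and configured.
- An active cluster context.
- Read access to namespaces, pods, events, services, and nodes.
Quick preflight:
    kubectl config current-context
    kubectl auth can-i get pods -A
    kubectl auth can-i get events -A
    kubectl get ns
Optional but Recommended
- `jq` for more precise filtering in `./scripts/cluster_health.sh`.
- Metrics API (`metrics-server`) for `kubectl top`.
- In-container debug tools (`nslookup`, `getent`, `curl`, `wget`, `ip`) for deep network tests.
Fallback behavior:
- If optional tools are missing, scripts continue and print warnings with reduced output.
- If `kubectl top` is unavailable, continue with `kubectl describe` and events.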
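When to Use This Skill
Use this skill for:
- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)
Safety Rules for Disruptive Commands
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
- `kubectl delete pod ... --force --grace-period=0`
- `kubectl drain ...`
- `kubectl rollout restart ...`
- `kubectl rollout undo ...`
- `kubectl debug ... --copy-to=...`
Before disruptive actions:
    # Snapshot current state for rollback and incident notes
    kubectl get deploy,rs,pod,svc -n <namespace> -o wide
    kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
    kubectl get events -n <namespace> --sort-by=.lastTimestamp > before-events.txt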
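Reference Navigation Map
Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | `./references/troubleshooting_workflow.md` | `General Debugging Workflow` |
| Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff` | `./references/troubleshooting_workflow.md` | `Pod Lifecycle Troubleshooting` |
| Service reachability or DNS failure | `./references/troubleshooting_workflow.md` | `Network Troubleshooting Workflow` |
| Node pressure or performance regression | `./references/troubleshooting_workflow.md` | `Resource and Performance Workflow` |
| PVC / PV / storage class issues | `./references/troubleshooting_workflow.md` | `Storage Troubleshooting Workflow` |
| Quick symptom-to-fix lookup | `./references/common_issues.md` | matching issue heading |
| Post-mortem fix options for known issues | `./references/common_issues.md` | `Solutions` sections |
Scripts Overview
| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| `./scripts/cluster_health.sh` | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | `--strict`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| `./scripts/network_debug.sh` | Pod-centric network and DNS diagnostics | `<pod-name>` (`<namespace>` defaults to `default`) | `--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit `--insecure` |
| `./scripts/pod_diagnostics.py` | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | `<pod-name>` | `-n/--namespace`, `-o/--output` | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |
Script Exit Codes
`./scripts/cluster_health.sh` and `./scripts/network_debug.sh` share the same contract:
- `0`: checks completed with no check failures (warnings allowed unless `--strict` is set).
- `1`: one or more checks failed, or warnings occurred in `--strict` mode.
- `2`: blocked preconditions (for example: missing `kubectl`, no active context, inaccessible namespace/pod).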
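Deterministic Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
1. Preflight and Scope
    kubectl config current-context
    kubectl get ns
    kubectl auth can-i get pods -n <namespace>
If preflight fails, stop and fix access/context first.
2. Identify the Problem Layer
Categorize the issue:
- Application Layer: Application crashes, errors, bugs
- Pod Layer: Pod not starting, restarting, or pending
- Service Layer: Network connectivity, DNS issues
- Node Layer: Node not ready, resource exhaustion
- Cluster Layer: Control plane issues, API problems
- Storage Layer: Volume mount failures, PVC issues
- Configuration Layer: ConfigMap, Secret, RBAC issues
3. Gather Diagnostics with the Right Script
Use the appropriate diagnostic script based on scope:
Pod-Level Diagnostics
Use `./scripts/pod_diagnostics.py` for comprehensive pod analysis:
    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
This script gathers:
- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration
Output can be saved for analysis:
    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt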
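Cluster-Level Health Check
Use `./scripts/cluster_health.sh` for overall cluster diagnostics:
    ./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
This script checks:
- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)
Network Diagnostics
Use `./scripts/network_debug.sh` for connectivity issues:
    ./scripts/network_debug.sh <pod-name> <namespace>
    # Or force warning sensitivity / insecure TLS only when explicitly needed:
    ./scripts/network_debug.sh --strict <pod-name> <namespace>
    ./scripts/network_debug.sh --insecure <pod-name> <namespace>
This script analyzes:
- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs
4. Follow Issue-Specific Reference Workflow
Based on the identified issue, consult `./references/troubleshooting_workflow.md`:
- Pod Pending: Resource/scheduling workflow
- CrashLoopBackOff: Application crash workflow
- ImagePullBackOff: Image pull workflow
- Service issues: Network connectivity workflow
- DNS failures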