Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

Trigger Phrases

Use this skill when requests resemble:

- "My pod is in CrashLoopBackOff; help me find the root cause."
- "Service DNS works in one pod but not another."
- "Deployment rollout is stuck."
- "Pods are Pending and not scheduling."
- "Cluster health looks degraded after a change."
- "PVC is pending and pods cannot mount storage."

Prerequisites

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

Required

- kubectl installed and configured.
- An active cluster context.
- Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

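```bash
kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
```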

Optional but Recommended

- jq for more precise filtering in ./scripts/cluster_health.sh.
- Metrics API (metrics-server) for kubectl top.
- In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior:

- If optional tools are missing, scripts continue and print warnings with reduced output.
- If kubectl top is unavailable, continue with kubectl describe and events.
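
The fallback behavior above can be sketched roughly as follows. This is an illustration only, not the code used by the skill's scripts: the helper name `warn_if_missing` and the `HAVE_JQ` variable are hypothetical, and the only assumption is a shell that provides `command -v`.

```bash
# Hypothetical helper: warn about a missing optional tool without aborting.
# Returns 0 when the tool is on PATH, 1 otherwise.
warn_if_missing() {
  local tool="$1"
  local purpose="$2"
  if command -v "$tool" >/dev/null 2>&1; then
    return 0
  fi
  echo "WARN: optional tool '$tool' not found; $purpose will be skipped" >&2
  return 1
}

# Record availability once so later steps can branch on it instead of failing.
HAVE_JQ=0
if warn_if_missing jq "precise JSON filtering"; then
  HAVE_JQ=1
fi
```

A later step could then test `[ "$HAVE_JQ" -eq 1 ]` before piping output through jq and fall back to plain text otherwise.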

When to Use This Skill

Use this skill for:

- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)

Safety Rules for Disruptive Commands

Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.

Commands requiring explicit confirmation:

- kubectl delete pod ... --force --grace-period=0
- kubectl drain ...
- kubectl rollout restart ...
- kubectl rollout undo ...
- kubectl debug ... --copy-to=...

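Before disruptive actions:

```bash
# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by=.lastTimestamp > before-events.txt
```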
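
One way to enforce the explicit-confirmation rule is a small wrapper that refuses to run a command until the operator types `yes`. The sketch below is an assumption about how such a gate could look; the function name `confirm_then_run` is hypothetical and is not part of the skill's scripts.

```bash
# Hypothetical guard: run a disruptive command only after typed confirmation.
# Usage: confirm_then_run <command> [args...]
confirm_then_run() {
  local answer=""
  printf 'About to run: %s\n' "$*" >&2
  printf 'Type "yes" to continue: ' >&2
  # Read one line from stdin; anything other than "yes" aborts.
  read -r answer || true
  if [ "$answer" != "yes" ]; then
    echo "Aborted; nothing was executed." >&2
    return 1
  fi
  "$@"
}
```

For example, `confirm_then_run kubectl rollout restart deployment/<name> -n <namespace>` would print the command, wait for input, and only then execute it.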

Reference Navigation Map

Load only the section needed for the observed symptom.

Symptom / Need	Open	Start section
You need an end-to-end diagnosis path	./references/troubleshooting_workflow.md	General Debugging Workflow
Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`

Scripts Overview

Script	Purpose	Required args	Optional args	Output	Fallback behavior
./scripts/cluster_health.sh	Cluster-wide health snapshot (nodes, workloads, events, common failure states)	None	--strict, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Continues on check failures, tracks them in summary and exit code
./scripts/network_debug.sh	Pod-centric network and DNS diagnostics	<pod-name> (<namespace> defaults to default)	--strict, --insecure, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Uses secure API probe by default; insecure TLS requires explicit --insecure
./scripts/pod_diagnostics.py	Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context)	<pod-name>	-n/--namespace, -o/--output	Sectioned report to stdout or file	Fails fast on missing access; skips optional metrics/log blocks with clear messages

Script Exit Codes

./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

- 0: checks completed with no check failures (warnings allowed unless --strict is set).
- 1: one or more checks failed, or warnings occurred in --strict mode.
- 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).
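
When the scripts are called from automation, the exit-code contract above can be turned into a readable label. The following is a minimal sketch under that contract; the function name `describe_exit_code` and the label wording are illustrative assumptions.

```bash
# Translate the documented exit-code contract into a short label.
describe_exit_code() {
  case "$1" in
    0) echo "ok: checks completed with no failures" ;;
    1) echo "failed: one or more checks failed (or warnings in --strict mode)" ;;
    2) echo "blocked: preconditions not met" ;;
    *) echo "unknown: exit code $1 is outside the documented contract" ;;
  esac
}

# Example wiring (commented out so the snippet has no side effects):
# ./scripts/cluster_health.sh --strict > report.txt; rc=$?
# describe_exit_code "$rc"
```

Treating 2 separately from 1 lets a caller distinguish "fix access first" from "the cluster has a real problem".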

Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Preflight and Scope

```bash
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>
```

If preflight fails, stop and fix access/context first.

2. Identify the Problem Layer

Categorize the issue:

- Application Layer: Application crashes, errors, bugs
- Pod Layer: Pod not starting, restarting, or pending
- Service Layer: Network connectivity, DNS issues
- Node Layer: Node not ready, resource exhaustion
- Cluster Layer: Control plane issues, API problems
- Storage Layer: Volume mount failures, PVC issues
- Configuration Layer: ConfigMap, Secret, RBAC issues

3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

```bash
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
```

This script gathers:

- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration

Output can be saved for analysis:

```bash
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
```

Cluster-Level Health Check

Use ./scripts/cluster_health.sh for overall cluster diagnostics:

```bash
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
```

This script checks:

- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)

Network Diagnostics

Use ./scripts/network_debug.sh for connectivity issues:

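```bash
./scripts/network_debug.sh <pod-name> <namespace>

# Or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <pod-name> <namespace>
./scripts/network_debug.sh --insecure <pod-name> <namespace>
```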

This script analyzes:

- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs

4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult ./references/troubleshooting_workflow.md:

- Pod Pending: Resource/scheduling workflow
- CrashLoopBackOff: Application crash workflow
- ImagePullBackOff: Image pull workflow
- Service issues: Network connectivity workflow
- DNS failures: DNS troubleshooting workflow
- Resource exhaustion: Performance investigation workflow
- Storage issues: PVC binding workflow
- Deployment stuck: Rollout workflow
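
The mapping above can be written as a small lookup so a wrapper script can point to the right workflow. This is a sketch only: the function name `workflow_for_symptom` is hypothetical, and the lowercase keys (`service`, `dns`, `resources`, `storage`, `rollout`) are labels invented here for illustration. The workflow names are taken from the list above.

```bash
# Map an observed symptom to the workflow name listed in this document.
# Pod status reasons are matched as reported by kubectl; the lowercase
# keys are hypothetical labels for non-pod symptoms.
workflow_for_symptom() {
  case "$1" in
    Pending)          echo "Resource/scheduling workflow" ;;
    CrashLoopBackOff) echo "Application crash workflow" ;;
    ImagePullBackOff) echo "Image pull workflow" ;;
    service)          echo "Network connectivity workflow" ;;
    dns)              echo "DNS troubleshooting workflow" ;;
    resources)        echo "Performance investigation workflow" ;;
    storage)          echo "PVC binding workflow" ;;
    rollout)          echo "Rollout workflow" ;;
    *)                echo "No matching workflow for '$1'" >&2; return 1 ;;
  esac
}

# Example (requires cluster access), using the waiting reason of the first container:
# reason="$(kubectl get pod <pod-name> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')"
# workflow_for_symptom "$reason"
```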

5. Apply Targeted Fixes

Refer to ./references/common_issues.md for symptom-specific fixes.

6. Verify and Close

Run final verification:

CODEBLOCK7

The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
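
To check the "no new warning events" condition, one option is to count Warning rows in the events table. The helper below is a sketch: the name `count_warning_events` is hypothetical, and it assumes the default single-namespace column layout of `kubectl get events`, where the event type is the second whitespace-separated field of each data row.

```bash
# Count Warning rows in `kubectl get events` table output read from stdin.
# Assumes single-namespace output (no -A), where TYPE is the second field.
count_warning_events() {
  awk 'NR > 1 && $2 == "Warning" { n++ } END { print n + 0 }'
}

# Example (requires cluster access):
# kubectl get events -n <namespace> --sort-by=.lastTimestamp | count_warning_events
```

Comparing this count against the one taken from a pre-fix events snapshot gives a rough signal of whether new warnings appeared after the change.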

Example Flows

Example 1: CrashLoopBackOff in `payments` Namespace

CODEBLOCK8

Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.

Example 2: Service DNS/Connectivity Failure

CODEBLOCK9

Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.

Essential Manual Commands

Pod Debugging

CODEBLOCK10

Service and Network Debugging

CODEBLOCK11

Resource Monitoring

CODEBLOCK12

Emergency Operations

CODEBLOCK13

Completion Criteria

A troubleshooting session is complete when all of the following are true:

- [ ] Cluster context and namespace are confirmed.
- [ ] Relevant diagnostic script output is captured.
- [ ] Root cause is identified and tied to evidence (events/logs/config/state).
- [ ] Any disruptive action was preceded by snapshot and rollback plan.
- [ ] Fix verification commands show healthy state.
- [ ] Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Related Tools

Useful additional tools for Kubernetes debugging:

- kubectl-debug: Advanced debugging plugin
- stern: Multi-pod log tailing
- kubectx/kubens: Context and namespace switching
- k9s: Terminal UI for Kubernetes
- lens: Desktop IDE for Kubernetes
- Prometheus/Grafana: Monitoring and alerting
- Jaeger/Zipkin: Distributed tracing
