Persona: You are a Go observability engineer. You treat every unobserved production system as a liability — instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Coding / instrumentation (default): Add observability to new or existing code: declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
- Review mode — reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
- Audit mode — auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents — one per signal (metrics, logging, tracing, profiling, RUM) — to check coverage simultaneously.
Community default. A company skill that explicitly supersedes samber/cc-skills-golang@golang-observability skill takes precedence.
Go Observability Best Practices
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.
Best Practices Summary
1. Use structured logging with log/slog: production services MUST emit structured logs (JSON), not freeform strings
2. Choose the right log level: Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
3. Log with context: use slog.InfoContext(ctx, ...) to correlate logs with traces
4. Prefer Histogram over Summary for latency metrics: Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
5. Keep label cardinality low in Prometheus: NEVER use unbounded values (user IDs, full URLs) as label values
6. Track percentiles (P50, P90, P99, P99.9) using Histograms + histogram_quantile() in PromQL
7. Set up OpenTelemetry tracing on new projects: configure the TracerProvider early, then add spans everywhere
8. Add spans to every meaningful operation: service methods, DB queries, external API calls, message queue operations
9. Propagate context everywhere: context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
10. Enable profiling via environment variables: toggle pprof and continuous profiling on/off without redeploying
11. Correlate signals: inject trace_id into logs, use exemplars to link metrics to traces
12. A feature is not done until it is observable: declare metrics, add proper logging, create spans
13. Use awesome-prometheus-alerts as a starting point for infrastructure and dependency alerting: browse by technology, copy rules, customize thresholds
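The first three practices can be sketched with the standard library alone. The `LOG_LEVEL` environment variable, the `parseLevel` and `newLogger` helpers, and the field names are assumptions made for illustration; the guide itself does not prescribe them.

```go
package main

import (
	"context"
	"io"
	"log/slog"
	"os"
	"strings"
)

// parseLevel maps a textual level (for example from a hypothetical LOG_LEVEL
// environment variable) to a slog.Level, falling back to Info when the value
// is empty or unknown.
func parseLevel(s string) slog.Level {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "debug":
		return slog.LevelDebug
	case "warn", "warning":
		return slog.LevelWarn
	case "error":
		return slog.LevelError
	default:
		return slog.LevelInfo
	}
}

// newLogger builds a JSON logger that drops records below the given level.
func newLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

func main() {
	slog.SetDefault(newLogger(os.Stdout, parseLevel(os.Getenv("LOG_LEVEL"))))

	ctx := context.Background() // in a handler, pass the request's context
	// Key-value pairs become JSON fields; the message stays a constant string.
	slog.InfoContext(ctx, "order created", "order_id", "o-123", "items", 3)
	slog.WarnContext(ctx, "payment provider slow", "latency_ms", 1840)
	// Dropped unless LOG_LEVEL=debug.
	slog.DebugContext(ctx, "cache lookup", "hit", false)
}
```

Keeping the message constant and moving variable data into fields is what makes the output easy to filter and aggregate downstream.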
Cross-References
See samber/cc-skills-golang@golang-error-handling skill for the single handling rule. See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues. See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs. See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.
The Five Signals
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |
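The best practices recommend deriving percentiles from Histograms with histogram_quantile(). To make concrete what that estimate is, here is a rough, standard-library-only sketch of the bucket interpolation used for classic histograms. The `bucket` type, `estimateQuantile`, and the sample data are hypothetical, and the sketch ignores edge cases the real function handles (for example negative bucket bounds).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket mirrors one classic Prometheus histogram bucket: the cumulative
// count of observations less than or equal to upperBound.
type bucket struct {
	upperBound float64 // the "le" label; the last bucket is +Inf
	count      float64 // cumulative count
}

// estimateQuantile finds the bucket containing the requested rank and
// interpolates linearly inside it. buckets must be sorted by upperBound
// and end with a +Inf bucket.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Index of the first bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].count >= rank })
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: the best available answer is
		// the highest finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lower, below := 0.0, 0.0 // the first bucket is assumed to start at 0
	if i > 0 {
		lower, below = buckets[i-1].upperBound, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return lower
	}
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

// sampleBuckets is made-up data: 100 requests, half of them under 250ms.
func sampleBuckets() []bucket {
	return []bucket{{0.25, 50}, {0.5, 90}, {1, 99}, {math.Inf(1), 100}}
}

func main() {
	for _, q := range []float64{0.5, 0.75, 0.995} {
		fmt.Printf("q=%v -> %vs\n", q, estimateQuantile(q, sampleBuckets()))
	}
}
```

With this sample data the estimates are 0.25s, 0.40625s, and 1s. The last value is clamped to the highest finite bound, which is one reason bucket boundaries should bracket the latencies you care about.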
Detailed Guides
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
- Structured Logging: Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
- Metrics Collection: Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, and Summary). Deep dive: why Histograms beat Summaries (server-side aggregation, support for histogram_quantile in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
- Distributed Tracing: When and how to use the OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
- Profiling: On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles): how to enable it in production, secure it with auth, and toggle it via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
- Real User Monitoring: Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and a privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
- Alerting: Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with ~500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, omitting the for: duration that prevents flapping).
- Grafana Dashboards: Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and which operational question each dashboard answers.
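The profiling guide's idea of toggling pprof through an environment variable could look roughly like the sketch below. `PPROF_ENABLED`, `pprofEnabled`, `newDebugMux`, and `statusOf` are hypothetical names, and the variable is read once at startup, so a restart (though not a rebuild) is still assumed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof" // note: importing this also registers handlers on http.DefaultServeMux
	"os"
	"strconv"
)

// pprofEnabled interprets a toggle value such as "1", "true" or "false".
// Anything unparsable, including an empty string, counts as disabled.
func pprofEnabled(value string) bool {
	on, err := strconv.ParseBool(value)
	return err == nil && on
}

// newDebugMux returns a mux meant for a private listener (for example
// localhost only). The pprof routes exist only when enabled is true.
func newDebugMux(enabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	if enabled {
		mux.HandleFunc("/debug/pprof/", pprof.Index)
		mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
		mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
		mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
		mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	}
	return mux
}

// statusOf issues an in-memory GET against the mux and returns the status code.
func statusOf(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux(pprofEnabled(os.Getenv("PPROF_ENABLED")))
	fmt.Println("GET /debug/pprof/ ->", statusOf(mux, "/debug/pprof/"))
	// In a real service: go http.ListenAndServe("127.0.0.1:6060", mux)
}
```

Serving application traffic from its own mux rather than http.DefaultServeMux keeps the handlers registered by the pprof import from being exposed by accident.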
Correlating Signals
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
Logs + Traces: otelslog bridge
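go
import (
	"log/slog"
	"go.opentelemetry.io/contrib/bridges/otelslog"
)
// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))
// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}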
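To show the mechanism behind log and trace correlation without pulling in the OpenTelemetry SDK, here is a standard-library sketch of a slog.Handler wrapper that copies a trace ID from the context into each record. It is a stand-in for the idea, not the otelslog bridge itself: `traceIDKey`, `withTraceID`, `traceHandler`, `render`, and `correlated` are hypothetical names, and a real service would read the IDs from the active span.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// traceIDKey is a hypothetical context key used only in this sketch.
type traceIDKey struct{}

func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

// traceHandler wraps another slog.Handler and adds a trace_id attribute to
// every record whose context carries one.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r.AddAttrs(slog.String("trace_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the result, otherwise derived loggers
// would silently lose the trace_id injection.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.Handler.WithGroup(name)}
}

// render logs one record with the given trace ID (none if empty) and
// returns the JSON line that was produced.
func render(traceID string) string {
	ctx := context.Background()
	if traceID != "" {
		ctx = withTraceID(ctx, traceID)
	}
	var buf bytes.Buffer
	logger := slog.New(traceHandler{slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(ctx, "order created", "order_id", "o-123")
	return buf.String()
}

// correlated reports whether a rendered log line carries a trace_id field.
func correlated(line string) bool {
	return strings.Contains(line, `"trace_id":"`)
}

func main() {
	for _, id := range []string{"abc123", ""} {
		line := render(id)
		fmt.Printf("correlated=%v %s", correlated(line), line)
	}
}
```

The same shape explains why slog.InfoContext matters: a call made without the request context gives the handler nothing to correlate.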
Metrics + Traces: Exemplars
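go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// duration is a float64 in seconds; exemplars are only exposed in the OpenMetrics format.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})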
Migrating Legacy Loggers
If the project currently uses zap, logrus, or zerolog, migrate to log/slog. It is the standard library logger since Go 1.21, has a stable API, and the ecosystem has consolidated around it. Continuing with third-party loggers means maintaining an extra dependency for no benefit.
Migration strategy:
1. Add slog as the new logger with INLINECODE24
2. Use bridge handlers during migration to route slog output through the existing logger: samber/slog-zap, samber/slog-logrus, samber/slog-zerolog
3. Gradually replace all zap.L().Info(...) / logrus.Info(...) / log.Info().Msg(...) calls with INLINECODE28
4. Once fully migrated, remove the bridge handler and the old logger dependency
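The bridge step can be pictured with a small standard-library sketch. It is not how samber/slog-zap or the other bridges are implemented; `legacyLogger` is a hypothetical stand-in for the logger being retired, and `bridgeHandler` only shows the general shape: new code calls slog while output still flows through the old pipeline.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// legacyLogger is a hypothetical stand-in for zap, logrus or zerolog;
// it only records formatted lines.
type legacyLogger struct{ lines []string }

func (l *legacyLogger) Print(level, msg string) {
	l.lines = append(l.lines, level+" "+msg)
}

// bridgeHandler is a slog.Handler that forwards records to the legacy logger.
type bridgeHandler struct {
	legacy *legacyLogger
	attrs  []slog.Attr
}

func (h *bridgeHandler) Enabled(context.Context, slog.Level) bool { return true }

func (h *bridgeHandler) Handle(_ context.Context, r slog.Record) error {
	var b strings.Builder
	b.WriteString(r.Message)
	write := func(a slog.Attr) bool {
		fmt.Fprintf(&b, " %s=%v", a.Key, a.Value)
		return true
	}
	for _, a := range h.attrs { // attributes attached via logger.With
		write(a)
	}
	r.Attrs(write) // attributes passed at the call site
	h.legacy.Print(r.Level.String(), b.String())
	return nil
}

func (h *bridgeHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	merged := append(append([]slog.Attr{}, h.attrs...), attrs...)
	return &bridgeHandler{legacy: h.legacy, attrs: merged}
}

// WithGroup is ignored in this sketch; a real bridge would nest attributes.
func (h *bridgeHandler) WithGroup(string) slog.Handler { return h }

// demo logs through slog and returns what the legacy logger received.
func demo() []string {
	legacy := &legacyLogger{}
	logger := slog.New(&bridgeHandler{legacy: legacy})
	// Before migration: legacy.Print("INFO", "order created order_id=o-123")
	logger.Info("order created", "order_id", "o-123")
	logger.With("service", "checkout").Warn("retrying payment", "attempt", 2)
	return legacy.lines
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Because call sites depend only on slog, removing the bridge later means swapping the handler, not touching every log statement again.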
Definition of Done for Observability
A feature is not production-ready until it is observable. Before marking a feature as done, verify:
- [ ] Metrics declared: counters for operations/errors, histograms for latencies, gauges for saturation. Each metric var has PromQL queries and alert rules as comments above its declaration.
- [ ] Logging is proper: structured key-value pairs with slog, context variants used (slog.InfoContext), no PII in logs, errors MUST be either logged OR returned (NEVER both).
- [ ] Spans created: every service method, DB query, and external API call has a span with relevant attributes, errors recorded with span.RecordError().
- [ ] Dashboards and alerts exist: the PromQL from your metric comments is wired into Grafana dashboards and Prometheus alerting rules. Check awesome-prometheus-alerts for ready-to-use rules covering your infrastructure dependencies (databases, caches, brokers, proxies).
- [ ] RUM events tracked: key business events tracked server-side (PostHog/Segment), identity key is user_id (not email), consent checked before tracking.
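The "logged OR returned, never both" item in the checklist is easiest to see in code. In this sketch, which uses only the standard library, the lower layer wraps and returns the error while the boundary logs it once. `loadOrder`, `isNotFound`, `handleOrder`, and the status codes are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"os"
)

var errNotFound = errors.New("order not found")

// loadOrder is a lower-level function: it adds context and returns the
// error, but does not log it.
func loadOrder(id string) error {
	if id != "42" {
		return fmt.Errorf("load order %q: %w", id, errNotFound)
	}
	return nil
}

// isNotFound lets callers branch on the cause without string matching.
func isNotFound(err error) bool { return errors.Is(err, errNotFound) }

// handleOrder sits at the boundary: it is the one place that logs the
// error, so each failure yields exactly one log record.
func handleOrder(ctx context.Context, logger *slog.Logger, id string) int {
	err := loadOrder(id)
	switch {
	case err == nil:
		return 200
	case isNotFound(err):
		logger.WarnContext(ctx, "order lookup failed", "order_id", id, "error", err)
		return 404
	default:
		logger.ErrorContext(ctx, "order lookup failed", "order_id", id, "error", err)
		return 500
	}
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx := context.Background()
	fmt.Println(handleOrder(ctx, logger, "42"), handleOrder(ctx, logger, "7"))
}
```

If loadOrder also logged, every failure would appear twice in the aggregator with different context, which makes error rates harder to trust.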
Common Mistakes
CODEBLOCK2
CODEBLOCK3
CODEBLOCK4
CODEBLOCK5
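As one illustration of a mistake this guide warns about elsewhere (unbounded values such as full URLs used as label values), a raw request path can be normalized to a bounded route template before it is used as a label. This is a standard-library sketch; `knownRoutes` and `routeLabel` are hypothetical, and if your router already exposes the matched pattern, that is usually the simpler source.

```go
package main

import (
	"fmt"
	"strings"
)

// knownRoutes is a hypothetical, fixed set of route templates. Because the
// set is finite, a label built from it has bounded cardinality.
var knownRoutes = []string{"/orders", "/orders/:id", "/users/:id/orders"}

// routeLabel maps a raw request path to one of the known templates, or to
// "other" when nothing matches, so IDs never end up in label values.
func routeLabel(path string) string {
	segs := strings.Split(strings.Trim(path, "/"), "/")
	for _, tmpl := range knownRoutes {
		parts := strings.Split(strings.Trim(tmpl, "/"), "/")
		if len(parts) != len(segs) {
			continue
		}
		match := true
		for i, p := range parts {
			// A ":name" part matches any segment; other parts must be equal.
			if !strings.HasPrefix(p, ":") && p != segs[i] {
				match = false
				break
			}
		}
		if match {
			return tmpl
		}
	}
	return "other"
}

func main() {
	for _, p := range []string{"/orders/8812", "/users/7/orders", "/admin/secret"} {
		fmt.Printf("%-18s -> %s\n", p, routeLabel(p))
	}
}
```

The "other" bucket matters as much as the templates: without it, scanners probing random paths would still create one time series per path.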
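Persona: You are a Go observability engineer. You treat every unobserved production system as a liability: instrument proactively, correlate signals to diagnose, and never consider a feature done until it is observable.
Modes:
- Coding / instrumentation mode (default): Add observability to new or existing code: declare metrics, add spans, set up structured logging, wire pprof toggles. Follow the sequential instrumentation guide.
- Review mode: reviewing a PR's instrumentation changes. Check that new code exports the expected signals (metrics declared, spans opened and closed, structured log fields consistent). Sequential.
- Audit mode: auditing existing observability coverage across a codebase. Launch up to 5 parallel sub-agents, one per signal (metrics, logging, tracing, profiling, RUM), to check coverage simultaneously.
Community default. A company skill that explicitly supersedes the samber/cc-skills-golang@golang-observability skill takes precedence.
Go Observability Best Practices
Observability is the ability to understand a system's internal state from its external outputs. In Go services, this means five complementary signals: logs, metrics, traces, profiles, and RUM. Each signal answers different questions, and together they give you full visibility into both system behavior and user experience.
When using observability libraries (Prometheus client, OpenTelemetry SDK, vendor integrations), refer to the library's official documentation and code examples for current API signatures.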
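Best Practices Summary
1. Use structured logging with log/slog: production services MUST emit structured logs (JSON), not freeform strings
2. Choose the right log level: Debug for development, Info for normal operations, Warn for degraded states, Error for failures requiring attention
3. Log with context: use slog.InfoContext(ctx, ...) to correlate logs with traces
4. Prefer Histogram over Summary for latency metrics: Histograms support server-side aggregation and percentile queries. Every HTTP endpoint MUST have latency and error rate metrics.
5. Keep label cardinality low in Prometheus: NEVER use unbounded values (user IDs, full URLs) as label values
6. Track percentiles (P50, P90, P99, P99.9) using Histograms + histogram_quantile() in PromQL
7. Set up OpenTelemetry tracing on new projects: configure the TracerProvider early, then add spans everywhere
8. Add spans to every meaningful operation: service methods, DB queries, external API calls, message queue operations
9. Propagate context everywhere: context is the vehicle that carries trace_id, span_id, and deadlines across service boundaries
10. Enable profiling via environment variables: toggle pprof and continuous profiling on/off without redeploying
11. Correlate signals: inject trace_id into logs, use exemplars to link metrics to traces
12. A feature is not done until it is observable: declare metrics, add proper logging, create spans
13. Use awesome-prometheus-alerts as a starting point for infrastructure and dependency alerting: browse by technology, copy rules, customize thresholds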
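Cross-References
See samber/cc-skills-golang@golang-error-handling skill for the single handling rule. See samber/cc-skills-golang@golang-troubleshooting skill for using observability signals to diagnose production issues. See samber/cc-skills-golang@golang-security skill for protecting pprof endpoints and avoiding PII in logs. See samber/cc-skills-golang@golang-context skill for propagating trace context across service boundaries. See samber/cc-skills@promql-cli skill for querying and exploring PromQL expressions against Prometheus from the CLI.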
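The Five Signals
| Signal | Question it answers | Tool | When to use |
|---|---|---|---|
| Logs | What happened? | log/slog | Discrete events, errors, audit trails |
| Metrics | How much / how fast? | Prometheus client | Aggregated measurements, alerting, SLOs |
| Traces | Where did time go? | OpenTelemetry | Request flow across services, latency breakdown |
| Profiles | Why is it slow / using memory? | pprof, Pyroscope | CPU hotspots, memory leaks, lock contention |
| RUM | How do users experience it? | PostHog, Segment | Product analytics, funnels, session replay |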
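Detailed Guides
Each signal has a dedicated guide with full code examples, configuration patterns, and cost analysis:
- Structured Logging: Why structured logging matters for log aggregation at scale. Covers log/slog setup, log levels (Debug/Info/Warn/Error) and when to use each, request correlation with trace IDs, context propagation with slog.InfoContext, request-scoped attributes, the slog ecosystem (handlers, formatters, middleware), and migration strategies from zap/logrus/zerolog.
- Metrics Collection: Prometheus client setup and the four metric types (Counter for rate-of-change, Gauge for snapshots, Histogram for latency aggregation, and Summary). Deep dive: why Histograms beat Summaries (server-side aggregation, support for histogram_quantile in PromQL), naming conventions, the PromQL-as-comments convention (write queries above metric declarations for discoverability), production-grade PromQL examples, multi-window SLO burn rate alerting, and the high-cardinality label problem (why unbounded values like user IDs destroy performance).
- Distributed Tracing: When and how to use the OpenTelemetry SDK to trace request flows across services. Covers spans (creating, attributes, status recording), otelhttp middleware for HTTP instrumentation, error recording with span.RecordError(), trace sampling (why you can't collect everything at scale), propagating trace context across service boundaries, and cost optimization.
- Profiling: On-demand profiling with pprof (CPU, heap, goroutine, mutex, block profiles): how to enable it in production, secure it with auth, and toggle it via environment variables without redeploying. Continuous profiling with Pyroscope for always-on performance visibility. Cost implications of each profiling type and mitigation strategies.
- Real User Monitoring: Understanding how users actually experience your service. Covers product analytics (event tracking, funnels), Customer Data Platform integration, and critical compliance: GDPR/CCPA consent checks, data subject rights (user deletion endpoints), and a privacy checklist for tracking. Server-side event tracking (PostHog, Segment) and identity key best practices.
- Alerting: Proactive problem detection. Covers the four golden signals (latency, traffic, errors, saturation), awesome-prometheus-alerts as a rule library with about 500 ready-to-use rules by technology, Go runtime alerts (goroutine leaks, GC pressure, OOM risk), severity levels, and common mistakes that break alerting (using irate instead of rate, omitting the for: duration that prevents flapping).
- Grafana Dashboards: Prebuilt dashboards for Go runtime monitoring (heap allocation, GC pause frequency, goroutine count, CPU). Explains the standard dashboards to install, how to customize them for your service, and which operational question each dashboard answers.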
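Correlating Signals
Signals are most powerful when connected. A trace_id in your logs lets you jump from a log line to the full request trace. An exemplar on a metric links a latency spike to the exact trace that caused it.
Logs + Traces: otelslog bridge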
go
import (
	"log/slog"
	"go.opentelemetry.io/contrib/bridges/otelslog"
)
// Create a handler that automatically injects trace_id and span_id
handler := otelslog.NewHandler("my-service")
slog.SetDefault(slog.New(handler))
// Now every slog call with context includes trace correlation
slog.InfoContext(ctx, "order created", "order_id", orderID)
// Output includes: {"trace_id":"abc123", "span_id":"def456", "msg":"order created", ...}
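Metrics + Traces: Exemplars
go
// When recording a histogram observation, attach the trace_id as an exemplar
// so you can jump from a P99 spike directly to the offending trace.
// duration is a float64 in seconds; exemplars are only exposed in the OpenMetrics format.
histogram.WithLabelValues("POST", "/orders").(prometheus.ExemplarObserver).
	ObserveWithExemplar(duration, prometheus.Labels{"trace_id": traceID})
Migrating Legacy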