How This Skill Works
This skill performs multi-level validation and provides interactive query planning:
1. Syntax Validation: Checks for syntactically correct PromQL expressions
2. Semantic Validation: Ensures queries make logical sense (e.g., rate() on counters, not gauges)
3. Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
4. Optimization Suggestions: Recommends performance improvements
5. Query Explanation: Translates PromQL to plain English
6. Interactive Planning: Helps users clarify intent and refine queries
Workflow
When a user provides a PromQL query, follow this workflow:
Working Directory Requirement
Run validation commands from the repository root so relative paths resolve correctly:
    cd "$(git rev-parse --show-toplevel)"
If running from another location, use absolute paths to scripts/ files.
Step 1: Validate Syntax
Run the syntax validation script to check for basic correctness:
    python3 devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py
Output parsing notes:
- Exit 0: syntax valid
- Exit non-zero: syntax failure; include stderr and pinpoint token/position
- Prefer quoting the smallest failing fragment, then provide the corrected query
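The exit-code handling described above could be wrapped roughly as follows. This is a minimal sketch, not the skill's own code: `run_validator` is a hypothetical helper, and the two commands at the bottom are stand-ins so the example runs without the real script.

```python
import subprocess
import sys


def run_validator(command):
    """Run a validator command and summarize the outcome by exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return {"valid": True, "details": result.stdout.strip()}
    # Non-zero exit: keep stderr so the failing token/position can be quoted.
    return {"valid": False, "details": result.stderr.strip()}


# Stand-in commands (assumption: the real script follows the same exit-code contract).
ok = run_validator([sys.executable, "-c", "print('ok')"])
bad = run_validator(
    [sys.executable, "-c",
     "import sys; sys.stderr.write('parse error at char 5'); sys.exit(1)"]
)
```

In practice the command list would point at the syntax validation script instead of the stand-ins.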
The script will check for:
- Valid metric names and label matchers
- Correct operator usage
- Proper function syntax
- Valid time durations and ranges
- Balanced brackets and quotes
- Correct use of modifiers (offset, @)
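As an illustration of one item on this list, the sketch below checks for balanced brackets and quotes. It is an assumption about how such a check might look, not the script's actual implementation; `find_unbalanced` is a hypothetical name.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def find_unbalanced(query):
    """Return the index of the first unbalanced bracket or quote, or None."""
    stack = []          # (opening_char, index) for each open bracket
    quote = None        # currently open quote character, if any
    quote_start = -1
    i = 0
    while i < len(query):
        ch = query[i]
        if quote:
            # Backslash escapes apply inside "..." and '...', not inside `...`.
            if ch == "\\" and quote != "`":
                i += 2
                continue
            if ch == quote:
                quote = None
        elif ch in "\"'`":
            quote, quote_start = ch, i
        elif ch in "([{":
            stack.append((ch, i))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return i
            stack.pop()
        i += 1
    if quote:
        return quote_start
    return stack[0][1] if stack else None
```

Returning a character index makes it easy to quote the smallest failing fragment back to the user.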
Step 2: Check Best Practices
Run the best practices checker to detect anti-patterns and optimization opportunities:
    python3 devops-skills-plugin/skills/promql-validator/scripts/check_best_practices.py
Output parsing notes:
- Treat script sections as independent findings (cardinality, metric-type misuse, regex misuse, etc.)
- If script output is empty but the query is complex, add a manual sanity pass and mark it as manual-review
- Preserve script wording for finding labels, then add remediation in plain English
The script will identify:
- High cardinality queries without label filters
- Inefficient regex matchers that could be exact matches
- Missing rate()/increase() on counter metrics
- rate() used on gauge metrics
- Averaging pre-calculated quantiles
- Subqueries with excessive time ranges
- irate() over long time ranges
- Opportunities to add more specific label filters
- Complex queries that should use recording rules
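One of these checks, regex matchers that could be exact matches, might be approximated as below. This is a sketch under stated assumptions: `exact_match_candidates` is a hypothetical helper, and it only inspects double-quoted patterns.

```python
import re

# Label matchers of the form  label=~"pattern"  or  label!~"pattern".
REGEX_MATCHER = re.compile(
    r'([a-zA-Z_][a-zA-Z0-9_]*)\s*(=~|!~)\s*"((?:[^"\\]|\\.)*)"'
)
REGEX_META = set(".^$*+?()[]{}|\\")


def exact_match_candidates(query):
    """Return (label, pattern) pairs whose regex contains no metacharacters."""
    return [
        (label, pattern)
        for label, _op, pattern in REGEX_MATCHER.findall(query)
        if not REGEX_META & set(pattern)
    ]
```

A pattern with no metacharacters matches exactly one value, so `=` (or `!=`) expresses the same selection without invoking the regex engine.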
Step 3: Explain the Query
Parse and explain what the query does in plain English:
- What metrics are being queried
- What type of metrics they are (counter, gauge, histogram, summary)
- What functions are applied and why
- What the query calculates
- What labels will be in the output
- What the expected result structure looks like
Required Output Details (always include these explicitly):
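    Output Labels: [list the labels that will be in the result, or "none (fully aggregated to scalar)"]
    Expected Result Structure: [instant vector / range vector / scalar] with [N series / single value]
Example:
    Output Labels: job, instance
    Expected Result Structure: Instant vector with one series per job/instance combination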
Line-Number Citation Method (Required)
When citing examples/docs in recommendations, include file path + 1-based line numbers:
    examples/good_queries.promql:42
    docs/best_practices.md:88
Rules:
- Cite the most relevant single line (or the start line of a multi-line snippet)
- Keep citations tight; do not cite full files
- If line numbers are unavailable, state "line number unavailable" and provide the file path
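The citation rules can be captured in a few lines. The sketch below assumes the path:line form and a parenthesized fallback note; `cite` is a hypothetical helper, not part of the skill.

```python
def cite(path, line=None):
    """Format a citation as path:line, or flag the missing line number."""
    if line is None:
        return f"{path} (line number unavailable)"
    if line < 1:
        raise ValueError("line numbers are 1-based")
    return f"{path}:{line}"
```

For a multi-line snippet, pass the start line so the citation stays tight.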
Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)
Ask the user clarifying questions to verify the query matches their intent:
1. Understand the Goal: "What are you trying to monitor or measure?"
   - Request rate, error rate, latency, resource usage, etc.
2. Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"
   - This affects which functions to use
3. Clarify Time Range: "What time window do you need?"
   - Instant value, rate over time, historical analysis
4. Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"
   - by (job), by (instance), without (pod), etc.
5. Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"
   - Affects optimization priorities
IMPORTANT: Two-Phase Dialogue
After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):
⏸️ STOP HERE AND WAIT FOR USER RESPONSE
Do NOT proceed to Steps 5-7 until the user answers the clarifying questions.
This ensures the subsequent recommendations are tailored to the user's actual intent.
Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)
Only proceed to this step after the user has answered the clarifying questions from Step 4.
After understanding the user's intent:
- Explain what the current query actually does
- Highlight any mismatches between intent and implementation
- Suggest corrections if the query doesn't match the goal
- Offer alternative approaches if applicable
When relevant, mention known limitations:
- Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the _bytes suffix. Please confirm if this is correct.")
- Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")
Step 6: Offer Optimizations
Based on validation results:
- Suggest more efficient query patterns
- Recommend recording rules for complex/repeated queries
- Propose better label matchers to reduce cardinality
- Advise on appropriate time ranges
Reference Examples: When suggesting corrections, cite relevant examples using this format:
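    As shown in examples/bad_queries.promql (lines 91-97):
    ❌ BAD: avg(http_request_duration_seconds{quantile="0.95"})
    ✅ GOOD: Use histogram_quantile() with histogram buckets
Citation sources:
- examples/good_queries.promql - for well-formed patterns
- examples/optimization_examples.promql - for before/after comparisons
- examples/bad_queries.promql - for showing what to avoid
- docs/best_practices.md - for detailed explanations
- docs/anti_patterns.md - for anti-pattern deep dives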
Citation Format: file_path (lines X-Y) with the relevant code snippet quoted
Step 7: Let User Plan/Refine
Give the user control:
- Ask if they want to modify the query
- Offer to help rewrite it for better performance
- Provide multiple alternatives if applicable
- Explain trade-offs between different approaches
Key Validation Rules
Syntax Rules
1. Metric Names: Must match [a-zA-Z_:][a-zA-Z0-9_:]* or use UTF-8 quoting syntax (Prometheus 3.0+):
   - Quoted form: {"my.metric.with.dots"}
   - Using __name__ label: {__name__="my.metric.with.dots"}
2. Label Matchers: = (equal), != (not equal), =~ (regex match), !~ (regex not match)
3. Time Durations: [0-9]+(ms|s|m|h|d|w|y) - e.g., 5m, 1h, 7d
4. Range Vectors: metric_name[duration] - e.g., http_requests_total[5m]
5. Offset Modifier: offset <duration> - e.g., metric_name offset 5m
6. @ Modifier: @ <timestamp> or @ start() / @ end()
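The metric-name and duration patterns above translate directly into regular expressions. The sketch below is illustrative only; the function names are hypothetical, and it covers just the unquoted name form and single-unit durations.

```python
import re

METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
DURATION = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")


def is_valid_metric_name(name):
    """True if the name is legal without UTF-8 quoting."""
    return METRIC_NAME.fullmatch(name) is not None


def is_valid_duration(text):
    """True if the text is a single-unit duration such as 5m or 500ms."""
    return DURATION.fullmatch(text) is not None
```

Note that the single-unit pattern rejects compound durations such as 1h30m, which Prometheus accepts, so a stricter or looser check may be appropriate depending on the target version.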
Semantic Rules
1. rate() and irate(): Should only be used with counter metrics (metrics ending in _total, _count, _sum, or _bucket)
2. Counters: Should typically use rate() or increase(), not raw values
3. Gauges: Should not use rate() or increase()
4. Histograms: Use histogram_quantile() with le label and rate() on _bucket metrics
5. Summaries: Don't average quantiles; calculate from _sum and _count
6. Aggregations: Use by() or without() to control output labels
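The first three rules rest on a naming heuristic, which can be sketched as follows. The suffix list comes from the rules above; the function names and warning texts are assumptions made for illustration.

```python
COUNTER_SUFFIXES = ("_total", "_count", "_sum", "_bucket")


def looks_like_counter(metric_name):
    """Naming heuristic only: custom metrics may not follow it."""
    return metric_name.endswith(COUNTER_SUFFIXES)


def check_rate_usage(function, metric_name):
    """Return a warning string, or None if usage fits the naming heuristic.

    function is "rate", "irate", "increase", or None for a raw selector.
    """
    is_counter = looks_like_counter(metric_name)
    if function in ("rate", "irate", "increase") and not is_counter:
        return f"{function}() on {metric_name}: name does not look like a counter"
    if function is None and is_counter:
        return f"{metric_name}: counter-like name used without rate()/increase()"
    return None
```

Because the check is purely name-based, any warning it produces is a prompt to confirm the metric type with the user rather than a verdict.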
Performance Rules
1. Cardinality: Always use specific label matchers to reduce series count
2. Regex: Use = instead of =~ when possible for exact matches
3. Rate Range: Should be at least 4x the scrape interval (typically [2m] minimum)
4. irate(): Best for short ranges (<5m); use rate() for longer periods
5. Subqueries: Avoid excessive time ranges that process millions of samples
6. Recording Rules: Use for complex queries accessed frequently
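The rate-range rule is simple arithmetic once durations are converted to seconds. The sketch below assumes single-unit durations and uses hypothetical helper names; the scrape interval must be supplied by the caller, since it cannot be read from the query.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
DURATION = re.compile(r"([0-9]+)(ms|s|m|h|d|w|y)")


def to_seconds(duration):
    """Convert a single-unit duration such as 2m to seconds."""
    match = DURATION.fullmatch(duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def rate_range_ok(range_duration, scrape_interval, factor=4):
    """True if the rate() range covers at least factor scrape intervals."""
    return to_seconds(range_duration) >= factor * to_seconds(scrape_interval)
```

With a 30s scrape interval, [2m] is exactly four intervals and passes, while [1m] does not.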
Anti-Patterns to Detect
High Cardinality Issues
❌ Bad: http_requests_total{}
- Matches all time series without filtering
✅ Good: http_requests_total{job="api", instance="prod-1"}
- Specific label filters reduce cardinality
Regex Overuse
❌ Bad: http_requests_total{status=~"2.."}
- Regex is slower and less precise when one exact value is intended
✅ Good: http_requests_total{status="200"}
Missing rate() on Counters
❌ Bad: http_requests_total
- Counter raw values are not useful (always increasing)
✅ Good: rate(http_requests_total[5m])
- Rate shows requests per second
rate() on Gauges
❌ Bad: rate(memory_usage_bytes[5m])
- Gauges measure current state, not cumulative values
✅ Good: memory_usage_bytes
- Use gauge value directly or with INLINECODE57
Averaging Quantiles
❌ Bad: avg(http_request_duration_seconds{quantile="0.95"})
- Mathematically invalid to average pre-calculated quantiles
✅ Good: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- Calculate quantile from histogram buckets
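A small numeric example shows why averaging quantiles misleads. The latencies below are made-up values for two hypothetical instances, and `quantile` is a simple nearest-rank helper rather than anything Prometheus computes.

```python
import math


def quantile(samples, q):
    """Nearest-rank quantile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies (seconds): one busy, fast instance and one idle, slow one.
busy_instance = [0.1] * 1000
idle_instance = [1.0] * 10

# Averaging the per-instance p95 values weights both instances equally.
avg_of_p95 = (quantile(busy_instance, 0.95) + quantile(idle_instance, 0.95)) / 2

# The p95 over all requests reflects where the traffic actually went.
true_p95 = quantile(busy_instance + idle_instance, 0.95)
```

Here the averaged value lands near 0.55s even though 95% of all requests finished in 0.1s, because the idle instance's handful of slow requests counts as much as the busy instance's thousand.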
Excessive Subquery Ranges
❌ Bad: rate(metric[5m])[90d:1m]
- Processes millions of samples, very slow
✅ Good: Use recording rules or limit range to necessary duration
irate() Over Long Ranges
❌ Bad: irate(metric[1h])
- irate() only looks at last two samples, range is wasted
✅ Good: rate(metric[1h]) or irate(metric[5m])
- Use rate() for longer ranges or reduce irate() range
Mixed Metric Types
❌ Bad: avg(http_request_duration_seconds{quantile="0.95"}) / rate(node_memory_usage_bytes[1h]) + sum(http_requests_total)
- Combines summary quantiles, gauge metrics, and counters in arithmetic
- Produces meaningless results
✅ Good: Keep each metric type in separate, purpose-specific queries:
- Latency: INLINECODE65
- Memory: INLINECODE66
- Request rate: INLINECODE67
Output Format
Provide validation results in this structure:
CODEBLOCK7
Interactive Dialogue
After validation, engage in dialogue:
Claude: "I've validated your query. It's syntactically correct, but I notice it queries http_requests_total without any label filters. This could match thousands of time series. What specific service or endpoint are you trying to monitor?"
User: [provides intent]
Claude: "Great! Based on that, here's an optimized version: rate(http_requests_total{job="api-service", path="/users"}[5m]). This calculates the per-second rate of requests to the /users endpoint over the last 5 minutes. Does this match what you need?"
User: [confirms or asks for changes]
Claude: [provides refined query or alternatives]
Examples
See the examples/ directory for:
- good_queries.promql: Well-written queries following best practices
- bad_queries.promql: Common mistakes and anti-patterns (with corrections)
- optimization_examples.promql: Before/after optimization examples
Documentation
See the docs/ directory for:
- best_practices.md: Comprehensive PromQL best practices guide
- anti_patterns.md: Detailed anti-pattern reference with explanations
Important Notes
1. Be Interactive: Always ask clarifying questions to understand user intent
2. Be Educational: Explain WHY something is wrong, not just THAT it's wrong
3. Be Helpful: Offer to rewrite queries, don't just criticize
4. Be Context-Aware: Consider the user's use case (alerting vs dashboarding)
5. Be Thorough: Check all four levels (syntax, semantics, performance, intent)
6. Be Practical: Suggest realistic optimizations, not theoretical perfection
Integration
This skill can be used:
- Standalone for query review
- During monitoring setup to validate alert rules
- When troubleshooting slow Prometheus queries
- As part of code review for recording rules
- For teaching PromQL to team members
Validation Tools
The skill uses two main Python scripts:
1. validate_syntax.py: Pure syntax checking using regex patterns
2. check_best_practices.py: Semantic and performance analysis
Both scripts output JSON for programmatic parsing and human-readable messages for display.
Success Criteria
A successful validation session should:
1. Identify all syntax errors
2. Detect semantic problems
3. Suggest at least one optimization (if applicable)
4. Clearly explain what the query does
5. Verify the query matches user intent
6. Provide actionable next steps
Known Limitations
The validation scripts have some limitations to be aware of:
Metric Type Detection
- Heuristic-based: Metric types (counter, gauge, histogram, summary) are inferred from naming conventions (e.g., _total, _bytes)
- Custom metrics: Metrics with non-standard names may not be correctly classified
- Recommendation: When the script can't determine metric type, ask the user to clarify
High Cardinality Detection
- Conservative approach: The script flags metrics without label selectors, but some use cases legitimately query all series
- Recording rules: Queries using recording rule metrics (e.g., job:http_requests:rate5m) are valid without label filters
- Recommendation: Use judgment - if the user knows their cardinality is manageable, the warning can be safely ignored
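The recording-rule exception could be folded into the cardinality check along these lines. This is a sketch, not the script's logic: `needs_cardinality_warning` is a hypothetical helper, and treating any name containing a colon as a recording rule is an assumption based on the naming convention shown above.

```python
def needs_cardinality_warning(metric_name, matchers):
    """Decide whether an unfiltered selector deserves a cardinality warning.

    matchers: dict of label -> value taken from the selector's {...} block.
    """
    if matchers:
        return False
    # Assumption: names like job:http_requests:rate5m are recording rules,
    # which are already aggregated and safe to query without filters.
    if ":" in metric_name:
        return False
    return True
```

Even when the function returns True, the warning remains advisory, since the user may know the series count is small.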
Semantic Validation
- No runtime context: The scripts cannot verify if metrics actually exist or if label values are valid
- Schema-agnostic: No knowledge of specific Prometheus deployments or metric schemas
- Recommendation: For production validation, test queries against actual Prometheus instances
Script Detection Coverage
The scripts detect common anti-patterns but cannot catch:
- Business logic errors (e.g., calculating the wrong KPI)
- Context-specific optimizations (depends on scrape interval, retention, etc.)
- Custom function behavior from extensions
Remember
The goal is not just to validate queries, but to help users write better PromQL and understand their monitoring data. Always be educational, interactive, and helpful!
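How This Skill Works
This skill performs multi-level validation and provides interactive query planning:
1. Syntax Validation: Checks that PromQL expressions are syntactically correct
2. Semantic Validation: Ensures queries make logical sense (e.g., rate() applied to counters, not gauges)
3. Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
4. Optimization Suggestions: Recommends performance improvements
5. Query Explanation: Translates PromQL into plain language
6. Interactive Planning: Helps users clarify intent and refine queries
Workflow
When a user provides a PromQL query, follow this workflow:
Working Directory Requirement
Run validation commands from the repository root so relative paths resolve correctly:
    cd "$(git rev-parse --show-toplevel)"
If running from another location, use absolute paths to the scripts/ files.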
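Step 1: Validate Syntax
Run the syntax validation script to check for basic correctness:
    python3 devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py
Output parsing notes:
- Exit code 0: syntax valid
- Exit code non-zero: syntax failure; include stderr and pinpoint the token/position
- Prefer quoting the smallest failing fragment, then provide the corrected query
The script will check for:
- Valid metric names and label matchers
- Correct operator usage
- Proper function syntax
- Valid time durations and ranges
- Balanced brackets and quotes
- Correct use of modifiers (offset, @)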
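Step 2: Check Best Practices
Run the best practices checker to detect anti-patterns and optimization opportunities:
    python3 devops-skills-plugin/skills/promql-validator/scripts/check_best_practices.py
Output parsing notes:
- Treat script sections as independent findings (cardinality, metric-type misuse, regex misuse, etc.)
- If script output is empty but the query is complex, add a manual sanity pass and mark it as manual-review
- Preserve script wording for finding labels, then add remediation in plain language
The script will identify:
- High cardinality queries without label filters
- Inefficient regex matchers that could be exact matches
- Missing rate()/increase() on counter metrics
- rate() used on gauge metrics
- Averaging pre-calculated quantiles
- Subqueries with excessive time ranges
- irate() over long time ranges
- Opportunities to add more specific label filters
- Complex queries that should use recording rules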
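Step 3: Explain the Query
Parse and explain what the query does in plain language:
- What metrics are being queried
- What type of metrics they are (counter, gauge, histogram, summary)
- What functions are applied and why
- What the query calculates
- What labels will be in the output
- What the expected result structure looks like
Required Output Details (always include these explicitly):
    Output Labels: [list the labels in the result, or "none (fully aggregated to scalar)"]
    Expected Result Structure: [instant vector / range vector / scalar] with [N series / single value]
Example:
    Output Labels: job, instance
    Expected Result Structure: Instant vector with one series per job/instance combination
Line-Number Citation Method (Required)
When citing examples/docs in recommendations, include file path + 1-based line numbers:
    examples/good_queries.promql:42
    docs/best_practices.md:88
Rules:
- Cite the most relevant single line (or the start line of a multi-line snippet)
- Keep citations tight; do not cite full files
- If line numbers are unavailable, state "line number unavailable" and provide the file path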
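Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)
Ask the user clarifying questions to verify the query matches their intent:
1. Understand the Goal: "What are you trying to monitor or measure?"
   - Request rate, error rate, latency, resource usage, etc.
2. Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"
   - This affects which functions to use
3. Clarify Time Range: "What time window do you need?"
   - Instant value, rate over time, historical analysis
4. Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"
   - by (job), by (instance), without (pod), etc.
5. Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"
   - Affects optimization priorities
IMPORTANT: Two-Phase Dialogue
After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):
⏸️ STOP HERE AND WAIT FOR USER RESPONSE
Do NOT proceed to Steps 5-7 until the user answers the clarifying questions.
This ensures the subsequent recommendations are tailored to the user's actual intent.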
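Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)
Only proceed to this step after the user has answered the clarifying questions from Step 4.
After understanding the user's intent:
- Explain what the current query actually does
- Highlight any mismatches between intent and implementation
- Suggest corrections if the query doesn't match the goal
- Offer alternative approaches if applicable
When relevant, mention known limitations:
- Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the _bytes suffix. Please confirm if this is correct.")
- Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")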
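Step 6: Offer Optimizations
Based on validation results:
- Suggest more efficient query patterns
- Recommend recording rules for complex/repeated queries
- Propose better label matchers to reduce cardinality
- Advise on appropriate time ranges
Reference Examples: When suggesting corrections, cite relevant examples using this format:
    As shown in examples/bad_queries.promql (lines 91-97):
    ❌ BAD: avg(http_request_duration_seconds{quantile="0.95"})
    ✅ GOOD: Use histogram_quantile() with histogram buckets
Citation sources:
- examples/good_queries.promql - for well-formed patterns
- examples/optimization_examples.promql - for before/after comparisons
- examples/bad_queries.promql - for showing what to avoid
- docs/best_practices.md - for detailed explanations
- docs/anti_patterns.md - for anti-pattern deep dives
Citation Format: file_path (lines X-Y) with the relevant code snippet quoted
Step 7: Let User Plan/Refine
Give the user control:
- Ask if they want to modify the query
- Offer to help rewrite it for better performance
- Provide multiple alternatives if applicable
- Explain trade-offs between different approaches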
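Key Validation Rules
Syntax Rules
1. Metric Names: Must match [a-zA-Z_:][a-zA-Z0-9_:]* or use UTF-8 quoting syntax (Prometheus 3.0+):
   - Quoted form: {"my.metric.with.dots"}
   - Using __name__ label: {__name__="my.metric.with.dots"}
2. Label Matchers: = (equal), != (not equal), =~ (regex match), !~ (regex not match)
3. Time Durations: [0-9]+(ms|s|m|h|d|w|y) - e.g., 5m, 1h, 7d
4. Range Vectors: metric_name[duration] - e.g., http_requests_total[5m]
5. Offset Modifier: offset - e.g., metric_name offset 5m
6. @ Modifier: @ or @ start() / @ end()
Semantic Rules
1. rate() and irate(): Should only be used with counter metrics (metrics ending in _total, _count, _sum, or _bucket)
2. Counters: Should typically use rate() or increase(), not raw values
3. Gauges: Should not use rate() or increase()
4. Histograms: Use histogram_quantile() with the le label and rate() on _bucket metrics
5. Summaries: Don't average quantiles; calculate from _sum and _count
6. Aggregations: Use by() or without() to control output labels
Performance Rules
1. Cardinality: Always use specific label matchers to reduce series count
2. Regex: Use = instead of =~ when possible for exact matches
3. Rate Range: Should be at least 4x the scrape interval (typically a minimum of [2m]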