promql-validator

How This Skill Works

This skill performs multi-level validation and provides interactive query planning:

1. Syntax Validation: Checks for syntactically correct PromQL expressions
2. Semantic Validation: Ensures queries make logical sense (e.g., rate() on counters, not gauges)
3. Anti-Pattern Detection: Identifies common mistakes and inefficient patterns
4. Optimization Suggestions: Recommends performance improvements
5. Query Explanation: Translates PromQL to plain English
6. Interactive Planning: Helps users clarify intent and refine queries

Workflow

When a user provides a PromQL query, follow this workflow:

Working Directory Requirement

Run validation commands from the repository root so relative paths resolve correctly:

    cd "$(git rev-parse --show-toplevel)"

If running from another location, use absolute paths to scripts/ files.

Step 1: Validate Syntax

Run the syntax validation script to check for basic correctness:

    python3 devops-skills-plugin/skills/promql-validator/scripts/validate_syntax.py

Output parsing notes:

- Exit 0: syntax valid
- Exit non-zero: syntax failure; include stderr and pinpoint the token/position
- Prefer quoting the smallest failing fragment, then provide the corrected query

The script will check for:

- Valid metric names and label matchers
- Correct operator usage
- Proper function syntax
- Valid time durations and ranges
- Balanced brackets and quotes
- Correct use of modifiers (offset, @)
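
As one illustration of these checks, a balanced-brackets-and-quotes pass might look like the sketch below. It is not the skill's actual implementation; `find_unbalanced` is a hypothetical helper that assumes PromQL's three string delimiters (double quote, single quote, backtick) and ignores brackets that appear inside strings.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def find_unbalanced(query: str):
    """Return (position, message) for the first imbalance, or None if balanced."""
    stack = []          # (opening bracket, position) pairs still waiting to close
    quote = None        # the quote character we are currently inside, if any
    quote_pos = -1
    i = 0
    while i < len(query):
        ch = query[i]
        if quote:
            if ch == "\\" and quote != "`":
                i += 2      # skip the escaped character inside '...' or "..."
                continue
            if ch == quote:
                quote = None
        elif ch in "\"'`":
            quote, quote_pos = ch, i
        elif ch in "([{":
            stack.append((ch, i))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return i, f"unexpected '{ch}'"
            stack.pop()
        i += 1
    if quote:
        return quote_pos, f"unterminated {quote} quote"
    if stack:
        opener, pos = stack[-1]
        return pos, f"unclosed '{opener}'"
    return None


print(find_unbalanced('rate(http_requests_total{job="api"}[5m])'))  # None
print(find_unbalanced("rate(http_requests_total[5m]"))              # (4, "unclosed '('")
```

Returning a position alongside the message supports the guidance to pinpoint the failing token.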

Step 2: Check Best Practices

Run the best practices checker to detect anti-patterns and optimization opportunities:

    python3 devops-skills-plugin/skills/promql-validator/scripts/check_best_practices.py

Output parsing notes:

- Treat script sections as independent findings (cardinality, metric-type misuse, regex misuse, etc.)
- If script output is empty but the query is complex, add a manual sanity pass and mark it as manual-review
- Preserve script wording for finding labels, then add remediation in plain English

The script will identify:

- High cardinality queries without label filters
- Inefficient regex matchers that could be exact matches
- Missing rate()/increase() on counter metrics
- rate() used on gauge metrics
- Averaging pre-calculated quantiles
- Subqueries with excessive time ranges
- irate() over long time ranges
- Opportunities to add more specific label filters
- Complex queries that should use recording rules

Step 3: Explain the Query

Parse and explain what the query does in plain English:

- What metrics are being queried
- What type of metrics they are (counter, gauge, histogram, summary)
- What functions are applied and why
- What the query calculates
- What labels will be in the output
- What the expected result structure looks like

Required Output Details (always include these explicitly):

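    Output Labels: [list the labels in the result, or "none (fully aggregated to scalar)"]
    Expected Result Structure: [instant vector / range vector / scalar] with [N series / single value]

Example:

    Output Labels: job, instance
    Expected Result Structure: instant vector with one series per job/instance combination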

Line-Number Citation Method (Required)

When citing examples/docs in recommendations, include file path + 1-based line numbers:

    examples/good_queries.promql:42
    docs/best_practices.md:88

Rules:

- Cite the most relevant single line (or the start line of a multi-line snippet)
- Keep citations tight; do not cite full files
- If line numbers are unavailable, state "line number unavailable" and provide the file path
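
A citation helper that follows these rules could be sketched as below. `format_citation` is a hypothetical name, and the `path:line` output form is an assumption about how the file path and 1-based line number are joined.

```python
from typing import Optional


def format_citation(path: str, line: Optional[int] = None) -> str:
    """Format a citation as path:line, using a 1-based line number."""
    if line is None:
        # Fallback required by the rules when no line number is known.
        return f"{path} (line number unavailable)"
    if line < 1:
        raise ValueError("line numbers are 1-based")
    return f"{path}:{line}"


print(format_citation("examples/good_queries.promql", 42))
print(format_citation("docs/best_practices.md"))
```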

Step 4: Interactive Query Planning (Phase 1 - STOP AND WAIT)

Ask the user clarifying questions to verify the query matches their intent:

1. Understand the Goal: "What are you trying to monitor or measure?"

- Request rate, error rate, latency, resource usage, etc.

2. Verify Metric Type: "Is this a counter (always increasing), gauge (can go up/down), histogram, or summary?"

- This affects which functions to use

3. Clarify Time Range: "What time window do you need?"

- Instant value, rate over time, historical analysis

4. Confirm Aggregation: "Do you need to aggregate data across labels? If so, which labels?"

- by (job), by (instance), without (pod), etc.

5. Check Output Intent: "Are you using this for alerting, dashboarding, or ad-hoc analysis?"

- Affects optimization priorities

IMPORTANT: Two-Phase Dialogue
After presenting Steps 1-4 results (Syntax, Best Practices, Query Explanation, and Intent Questions):
⏸️ STOP HERE AND WAIT FOR USER RESPONSE
Do NOT proceed to Steps 5-7 until the user answers the clarifying questions.
This ensures the subsequent recommendations are tailored to the user's actual intent.

Step 5: Compare Intent vs Implementation (Phase 2 - After User Response)

Only proceed to this step after the user has answered the clarifying questions from Step 4.

After understanding the user's intent:

- Explain what the current query actually does
- Highlight any mismatches between intent and implementation
- Suggest corrections if the query doesn't match the goal
- Offer alternative approaches if applicable

When relevant, mention known limitations:

- Note when metric type detection is heuristic-based (e.g., "The script inferred this is a gauge based on the _bytes suffix. Please confirm if this is correct.")
- Acknowledge when high-cardinality warnings might be false positives (e.g., "This warning may not apply if you're using a recording rule or know your cardinality is low.")

Step 6: Offer Optimizations

Based on validation results:

- Suggest more efficient query patterns
- Recommend recording rules for complex/repeated queries
- Propose better label matchers to reduce cardinality
- Advise on appropriate time ranges

Reference Examples: When suggesting corrections, cite relevant examples using this format:

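    As shown in examples/bad_queries.promql (lines 91-97):
    ❌ Bad: avg(http_request_duration_seconds{quantile="0.95"})
    ✅ Good: Use histogram_quantile() with histogram buckets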

Citation sources:

- examples/good_queries.promql - for well-formed patterns
- examples/optimization_examples.promql - for before/after comparisons
- examples/bad_queries.promql - for showing what to avoid
- docs/best_practices.md - for detailed explanations
- docs/anti_patterns.md - for anti-pattern deep dives

Citation Format: file_path (lines X-Y) with the relevant code snippet quoted

Step 7: Let User Plan/Refine

Give the user control:

- Ask if they want to modify the query
- Offer to help rewrite it for better performance
- Provide multiple alternatives if applicable
- Explain trade-offs between different approaches

Key Validation Rules

Syntax Rules

1. Metric Names: Must match [a-zA-Z_:][a-zA-Z0-9_:]* or use UTF-8 quoting syntax (Prometheus 3.0+):

- Quoted form: {"my.metric.with.dots"}
- Using the __name__ label: {__name__="my.metric.with.dots"}

2. Label Matchers: = (equal), != (not equal), =~ (regex match), !~ (regex not match)
3. Time Durations: [0-9]+(ms|s|m|h|d|w|y) - e.g., 5m, 1h, 7d
4. Range Vectors: metric_name[duration] - e.g., http_requests_total[5m]
5. Offset Modifier: offset <duration> - e.g., metric_name offset 5m
6. @ Modifier: @ <timestamp> or @ start() / @ end()
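
The metric-name and duration patterns above translate directly into regular expressions. The sketch below is illustrative rather than the script's actual code; the constant and function names are hypothetical, and it handles only bare `metric[duration]` selectors without label matchers.

```python
import re

# Patterns copied from the rules above; the names are illustrative.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
DURATION = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")
RANGE_VECTOR = re.compile(
    rf"(?P<metric>{METRIC_NAME.pattern})\[(?P<duration>{DURATION.pattern})\]"
)


def is_metric_name(text: str) -> bool:
    return METRIC_NAME.fullmatch(text) is not None


def is_duration(text: str) -> bool:
    return DURATION.fullmatch(text) is not None


def parse_range_vector(text: str):
    """Return (metric, duration) for a bare `metric[duration]` selector, else None."""
    m = RANGE_VECTOR.fullmatch(text)
    return (m.group("metric"), m.group("duration")) if m else None


print(is_metric_name("my.metric.with.dots"))           # False: needs the quoted form
print(parse_range_vector("http_requests_total[5m]"))   # ('http_requests_total', '5m')
```

Note that PromQL also accepts compound durations such as 1h30m, which the single-unit pattern as written does not match.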

Semantic Rules

1. rate() and irate(): Should only be used with counter metrics (metrics ending in _total, _count, _sum, or _bucket)
2. Counters: Should typically use rate() or increase(), not raw values
3. Gauges: Should not use rate() or increase()
4. Histograms: Use histogram_quantile() with le label and rate() on _bucket metrics
5. Summaries: Don't average quantiles; calculate from _sum and _count
6. Aggregations: Use by() or without() to control output labels
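
The suffix conventions in these rules suggest a simple name-based heuristic. The sketch below is an assumption about how such inference could work, not the script's real logic: `infer_metric_type` and `check_rate_usage` are hypothetical names, and only the _bytes suffix is treated as a gauge hint.

```python
def infer_metric_type(name: str) -> str:
    """Guess a metric type from its name; returns 'unknown' when no rule applies."""
    if name.endswith("_bucket"):
        return "histogram"   # bucket series of a histogram
    if name.endswith(("_total", "_count", "_sum")):
        return "counter"     # cumulative series, suitable for rate()
    if name.endswith("_bytes"):
        return "gauge"       # heuristic only; some byte metrics are counters
    return "unknown"


def check_rate_usage(func: str, metric: str):
    """Return a warning for suspicious function/metric pairs, else None."""
    kind = infer_metric_type(metric)
    if func in ("rate", "irate", "increase") and kind == "gauge":
        return f"{func}() applied to '{metric}', which looks like a gauge"
    return None


print(infer_metric_type("http_requests_total"))        # counter
print(check_rate_usage("rate", "memory_usage_bytes"))
```

Because the result is a guess from naming alone, an 'unknown' or surprising classification is a cue to ask the user for the actual metric type.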

Performance Rules

1. Cardinality: Always use specific label matchers to reduce series count
2. Regex: Use = instead of =~ when possible for exact matches
3. Rate Range: Should be at least 4x the scrape interval (typically [2m] minimum)
4. irate(): Best for short ranges (<5m); use rate() for longer periods
5. Subqueries: Avoid excessive time ranges that process millions of samples
6. Recording Rules: Use for complex queries accessed frequently
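
The rate-range rule is plain arithmetic, sketched below with hypothetical helper names. With an assumed 30-second scrape interval, four intervals give the 2m minimum mentioned above.

```python
def min_rate_range_seconds(scrape_interval_seconds: int, factor: int = 4) -> int:
    """Smallest recommended rate() window: `factor` scrape intervals."""
    return factor * scrape_interval_seconds


def format_duration(seconds: int) -> str:
    """Render whole seconds as a single-unit PromQL duration such as 2m or 90s."""
    for unit, size in (("h", 3600), ("m", 60)):
        if seconds % size == 0:
            return f"{seconds // size}{unit}"
    return f"{seconds}s"


print(format_duration(min_rate_range_seconds(30)))  # 2m
print(format_duration(min_rate_range_seconds(15)))  # 1m
```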

Anti-Patterns to Detect

High Cardinality Issues

❌ Bad: http_requests_total{}

- Matches all time series without filtering

✅ Good: http_requests_total{job="api", instance="prod-1"}

- Specific label filters reduce cardinality

Regex Overuse

❌ Bad: http_requests_total{status=~"2.."}

- Regex matching is slower than an exact match, and this pattern matches every 2xx status rather than one specific code

✅ Good: http_requests_total{status="200"}

- Exact match is faster

Missing rate() on Counters

❌ Bad: http_requests_total

- Counter raw values are not useful (always increasing)

✅ Good: rate(http_requests_total[5m])

- Rate shows requests per second

rate() on Gauges

❌ Bad: rate(memory_usage_bytes[5m])

- Gauges measure current state, not cumulative values

✅ Good: memory_usage_bytes

- Use the gauge value directly or with a function designed for gauges

Averaging Quantiles

❌ Bad: avg(http_request_duration_seconds{quantile="0.95"})

- Mathematically invalid to average pre-calculated quantiles

✅ Good: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

- Calculate quantile from histogram buckets
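
A small numeric example shows why averaging per-instance quantiles goes wrong. The latencies below are hypothetical, and the sketch uses a nearest-rank quantile for simplicity, whereas histogram_quantile() interpolates within buckets.

```python
def nearest_rank_quantile(values, percent: int):
    """Smallest value with at least `percent`% of the data at or below it."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100   # ceil(percent * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds for two instances.
busy_instance = [10] * 100     # 100 fast requests
idle_instance = [1000] * 2     # 2 slow requests

p95_busy = nearest_rank_quantile(busy_instance, 95)
p95_idle = nearest_rank_quantile(idle_instance, 95)
average_of_p95s = (p95_busy + p95_idle) / 2
true_p95 = nearest_rank_quantile(busy_instance + idle_instance, 95)

print(average_of_p95s)  # 505.0
print(true_p95)         # 10
```

The average weights both instances equally even though one served fifty times more requests, so it lands far from the 95th percentile of the combined traffic.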

Excessive Subquery Ranges

❌ Bad: rate(metric[5m])[90d:1m]

- Processes millions of samples, very slow

✅ Good: Use recording rules or limit range to necessary duration
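
To see where "millions of samples" comes from, the cost of the subquery above can be estimated per series. This is a rough sketch with hypothetical helper names; the 15s scrape interval is an assumption, and a real engine may not read every sample again for each overlapping window.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def to_seconds(duration: str) -> int:
    """Convert a single-unit duration such as '90d' or '5m' to seconds."""
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def subquery_cost(range_: str, step: str, window: str, scrape_interval: str):
    """Return (evaluation steps, approximate raw samples touched) for one series."""
    steps = to_seconds(range_) // to_seconds(step)
    samples_per_window = to_seconds(window) // to_seconds(scrape_interval)
    return steps, steps * samples_per_window


# rate(metric[5m])[90d:1m] with an assumed 15s scrape interval
steps, samples = subquery_cost("90d", "1m", "5m", "15s")
print(steps)    # 129600
print(samples)  # 2592000
```

These figures are per matching series, so a selector that matches many series multiplies the cost accordingly.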

irate() Over Long Ranges

❌ Bad: irate(metric[1h])

- irate() only looks at last two samples, range is wasted

✅ Good: rate(metric[1h]) or irate(metric[5m])

- Use rate() for longer ranges or reduce irate() range
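
The difference can be illustrated with simplified stand-ins for the two functions. These are not Prometheus's implementations: the sketch ignores counter resets and the extrapolation that rate() performs, and the sample data is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter samples as (timestamp_seconds, value): steady at
# 1 request/s for 59 minutes, then 600 requests in the final minute.
window = [(t, t) for t in range(0, 3541, 60)] + [(3600, 3540 + 600)]

print(simple_irate(window))  # 10.0
print(simple_rate(window))   # 1.15
```

Over a one-hour window the irate-style result reflects only the final minute, which is why the rest of the range is wasted.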

Mixed Metric Types

❌ Bad: avg(http_request_duration_seconds{quantile="0.95"}) / rate(node_memory_usage_bytes[1h]) + sum(http_requests_total)

- Combines summary quantiles, gauge metrics, and counters in arithmetic
- Produces meaningless results

✅ Good: Keep each metric type in separate, purpose-specific queries:

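- Latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- Memory: node_memory_usage_bytes
- Request rate: sum(rate(http_requests_total[5m]))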

Output Format

Provide validation results in this structure:

CODEBLOCK7

Interactive Dialogue

After validation, engage in dialogue:

Claude: "I've validated your query. It's syntactically correct, but I notice it queries http_requests_total without any label filters. This could match thousands of time series. What specific service or endpoint are you trying to monitor?"

User: [provides intent]

Claude: "Great! Based on that, here's an optimized version: rate(http_requests_total{job="api-service", path="/users"}[5m]). This calculates the per-second rate of requests to the /users endpoint over the last 5 minutes. Does this match what you need?"

User: [confirms or asks for changes]

Claude: [provides refined query or alternatives]

Examples

See the examples/ directory for:

- good_queries.promql: Well-written queries following best practices
- bad_queries.promql: Common mistakes and anti-patterns (with corrections)
- optimization_examples.promql: Before/after optimization examples

Documentation

See the docs/ directory for:

- best_practices.md: Comprehensive PromQL best practices guide
- anti_patterns.md: Detailed anti-pattern reference with explanations

Important Notes

1. Be Interactive: Always ask clarifying questions to understand user intent
2. Be Educational: Explain WHY something is wrong, not just THAT it's wrong
3. Be Helpful: Offer to rewrite queries, don't just criticize
4. Be Context-Aware: Consider the user's use case (alerting vs dashboarding)
5. Be Thorough: Check all four levels (syntax, semantics, performance, intent)
6. Be Practical: Suggest realistic optimizations, not theoretical perfection

Integration

This skill can be used:

- Standalone for query review
- During monitoring setup to validate alert rules
- When troubleshooting slow Prometheus queries
- As part of code review for recording rules
- For teaching PromQL to team members

Validation Tools

The skill uses two main Python scripts:

1. validate_syntax.py: Pure syntax checking using regex patterns
2. check_best_practices.py: Semantic and performance analysis

Both scripts output JSON for programmatic parsing and human-readable messages for display.

Success Criteria

A successful validation session should:

1. Identify all syntax errors
2. Detect semantic problems
3. Suggest at least one optimization (if applicable)
4. Clearly explain what the query does
5. Verify the query matches user intent
6. Provide actionable next steps

Known Limitations

The validation scripts have some limitations to be aware of:

Metric Type Detection

- Heuristic-based: Metric types (counter, gauge, histogram, summary) are inferred from naming conventions (e.g., _total, _bytes)
- Custom metrics: Metrics with non-standard names may not be correctly classified
- Recommendation: When the script can't determine metric type, ask the user to clarify

High Cardinality Detection

- Conservative approach: The script flags metrics without label selectors, but some use cases legitimately query all series
- Recording rules: Queries using recording rule metrics (e.g., job:http_requests:rate5m) are valid without label filters
- Recommendation: Use judgment; if the user knows their cardinality is manageable, the warning can be safely ignored

Semantic Validation

- No runtime context: The scripts cannot verify if metrics actually exist or if label values are valid
- Schema-agnostic: No knowledge of specific Prometheus deployments or metric schemas
- Recommendation: For production validation, test queries against actual Prometheus instances

Script Detection Coverage

The scripts detect common anti-patterns but cannot catch:

- Business logic errors (e.g., calculating the wrong KPI)
- Context-specific optimizations (depends on scrape interval, retention, etc.)
- Custom function behavior from extensions

Remember

The goal is not just to validate queries, but to help users write better PromQL and understand their monitoring data. Always be educational, interactive, and helpful!

