Skill Reviewer
Audit agent skills (SKILL.md files) for quality, correctness, and completeness. Provides a structured review framework with scoring rubric, defect checklists, and improvement recommendations.
When to Use
- Reviewing a skill before publishing to the registry
- Evaluating a skill you downloaded from the registry
- Auditing your own skills for quality improvements
- Comparing skills in the same category
- Deciding whether a skill is worth installing
Review Process
Step 1: Structural Check
Verify the skill has the required structure. Read the file and check each item:
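STRUCTURE CHECKLIST:
[ ] Valid YAML frontmatter (opens and closes with ---)
[ ] name field present and a valid slug (lowercase, hyphenated)
[ ] description field present and non-empty
[ ] metadata field present with valid JSON
[ ] metadata.clawdbot.emoji is a single emoji
[ ] metadata.clawdbot.requires.anyBins lists real CLI tools
[ ] Title heading (# Title) immediately after frontmatter
[ ] Summary paragraph after the title
[ ] "When to Use" section present
[ ] At least 3 main content sections
[ ] "Tips" section present at the end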
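Several of these structural checks are mechanical enough to script. The following is a minimal sketch rather than an official validator: the helper names `frontmatter` and `check_structure` are hypothetical, and it assumes the "When to Use" and "Tips" sections are written as level-2 headings.

```bash
# Hypothetical helpers for a quick structural pass over a SKILL.md file.

# Print the lines between the first two '---' delimiters
# (prints nothing if line 1 is not '---').
frontmatter() {
  awk 'NR == 1 && $0 != "---" { exit }
       NR > 1 && $0 == "---"  { exit }
       NR > 1                 { print }' "$1"
}

# Print one PASS/FAIL line per check.
check_structure() {
  local file="$1" fm item name
  fm="$(frontmatter "$file")"
  # Required top-level frontmatter fields.
  for item in name description metadata; do
    if printf '%s\n' "$fm" | grep -q "^${item}:"; then
      echo "PASS ${item}"
    else
      echo "FAIL ${item}"
    fi
  done
  # The name must be a lowercase, hyphen-separated slug.
  name="$(printf '%s\n' "$fm" | sed -n 's/^name:[[:space:]]*//p' | head -n 1)"
  if printf '%s\n' "$name" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*$'; then
    echo "PASS slug"
  else
    echo "FAIL slug"
  fi
  # Assumption: both sections are level-2 headings.
  for item in 'When to Use' 'Tips'; do
    if grep -q "^## ${item}\$" "$file"; then
      echo "PASS section ${item}"
    else
      echo "FAIL section ${item}"
    fi
  done
}
```

Run it as `check_structure skills/my-skill/SKILL.md`. The remaining items (emoji, anyBins contents, section count) still need a human or agent read.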
Step 2: Frontmatter Quality
Description field audit
The description is the most impactful field. Evaluate it against these criteria:
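DESCRIPTION SCORING:
[2] Starts with what the skill does (active verb)
GOOD: "Write Makefiles for any project type."
BAD: "This skill covers Makefiles."
BAD: "A comprehensive guide to Make."
[2] Includes trigger phrases ("Use when...")
GOOD: "Use when setting up build automation, defining multi-target builds"
BAD: No trigger phrases at all
[2] Specific scope (mentions concrete tools, languages, or operations)
GOOD: "SQLite/PostgreSQL/MySQL: schema design, queries, CTEs, window functions"
BAD: "Database stuff"
[1] Reasonable length (50-200 characters)
TOO SHORT: "Make things" (no search surface)
TOO LONG: 300+ characters (gets truncated)
[1] Contains searchable keywords naturally
GOOD: "cron jobs, systemd timers, scheduling"
BAD: Keywords stuffed unnaturally
Score: __/8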
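Two of the description criteria, a length of 50-200 characters and the presence of a "Use when" trigger phrase, can be checked mechanically. The sketch below covers only those two; `score_description` is a hypothetical helper, and the other criteria (active verb, specific scope, natural keywords) still need judgment.

```bash
# Hypothetical helper: scores a description on the two mechanical criteria.
# Prints a number from 0 to 3 (1 point for length, 2 for a trigger phrase).
score_description() {
  local desc="$1" score=0
  local len=${#desc}   # character count in the current locale
  # [1] Reasonable length: 50-200 characters
  if [ "$len" -ge 50 ] && [ "$len" -le 200 ]; then
    score=$((score + 1))
  fi
  # [2] Contains a trigger phrase ("Use when ...")
  case "$desc" in
    *"Use when"*|*"use when"*) score=$((score + 2)) ;;
  esac
  echo "$score"
}

score_description "Write Makefiles for any project type. Use when setting up build automation or defining multi-target builds."
score_description "Database stuff"
```

The first call prints 3 and the second prints 0.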
Metadata audit
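METADATA SCORING:
[1] Emoji is relevant to the skill topic
[1] requires.anyBins lists tools the skill actually uses (not generic tools like bash)
[1] os array is accurate (don't claim win32 if the commands are Linux-only)
[1] JSON is valid (test with a JSON parser)
Score: __/4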
Step 3: Content Quality
Example density
Count code blocks and total lines:
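EXAMPLE DENSITY:
Lines: ___
Code blocks: ___
Ratio: 1 code block per ___ lines
TARGET: 1 code block per 8-15 lines
< 8 lines per block: may be too fragmented
> 20 lines per block: needs more examples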
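The density count can be automated. The sketch below is one way to do it under a simplifying assumption: every code block is delimited by lines that start with three backticks, so the number of blocks is half the number of fence lines. `example_density` is a hypothetical helper name.

```bash
# Hypothetical helper: prints "<total lines> <code blocks> <lines per block>".
example_density() {
  # The fence marker is assembled from single backticks so that this script
  # can itself be quoted inside a fenced block without ending it early.
  awk -v tick='`' '
    BEGIN { fence = tick tick tick }
    index($0, fence) == 1 { fences++ }   # opening and closing fences both count
    END {
      blocks = int(fences / 2)           # two fence lines per block
      if (blocks == 0) { print NR, 0, "n/a"; exit }
      printf "%d %d %.1f\n", NR, blocks, NR / blocks
    }
  ' "$1"
}
```

Run it as `example_density skills/my-skill/SKILL.md` and read the third field as the lines-per-block ratio.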
Example quality
For each code block, check:
EXAMPLE QUALITY CHECKLIST:
[ ] Language tag specified (bash, python, etc.)
[ ] Command is syntactically correct
[ ] Output shown in comments where helpful
[ ] Uses realistic values (not foo/bar/baz)
[ ] No placeholder values left (TODO, FIXME, xxx)
[ ] Self-contained (doesn't depend on undefined variables)
OR setup is shown/referenced
[ ] Covers the common case (not just edge cases)
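The language-tag check is easy to get wrong with a plain pattern match, because a closing fence looks exactly like an untagged opening fence. The sketch below tracks whether it is inside a block and reports only untagged openers. `untagged_fences` is a hypothetical helper, and it assumes fences are well paired and carry no trailing whitespace.

```bash
# Hypothetical helper: prints the line number of every opening fence
# that has no language tag.
untagged_fences() {
  # The fence marker is built from single backticks so this script can
  # sit inside a fenced block itself.
  awk -v tick='`' '
    BEGIN { fence = tick tick tick; inside = 0 }
    index($0, fence) == 1 {
      if (!inside) {
        # Opening fence: anything after the backticks is the language tag.
        if ($0 == fence) print NR
        inside = 1
      } else {
        inside = 0   # closing fence, never tagged
      }
    }
  ' "$1"
}
```

An empty result means every block is tagged.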
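Score each example 0-3:
- 0: Wrong or misleading
- 1: Works but trivial (no output, no context)
- 2: Good (correct, with output or explanation)
- 3: Excellent (copy-pasteable, realistic, covers edge cases)
Section organization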
ORGANIZATION SCORING:
[2] Organized by task/scenario (not by abstract concept)
GOOD: "## Encode and Decode" → "## Inspect Characters" → "## Convert Formats"
BAD: "## Theory" → "## Types" → "## Advanced"
[2] Most common operations come first
GOOD: Basic usage → Variations → Advanced → Edge cases
BAD: Configuration → Theory → Finally the basic usage
[1] Sections are self-contained (can be used independently)
[1] Consistent depth (not mixing h2 with h4 randomly)
Score: __/6
Cross-platform accuracy
PLATFORM CHECKLIST:
[ ] macOS differences noted where relevant
(sed -i '' vs sed -i, brew vs apt, BSD vs GNU flags)
[ ] Linux distro variations noted (apt vs yum vs pacman)
[ ] Windows compatibility addressed if os includes "win32"
[ ] Tool version assumptions stated (Docker v2 syntax, Python 3.x)
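Step 4: Actionability Assessment
The core question: can an agent follow these instructions and produce a correct result?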
ACTIONABILITY SCORING:
[3] Instructions are imperative ("Run X", "Create Y")
NOT: "You might consider..." or "It's recommended to..."
[3] Steps are ordered logically (prerequisites before actions)
[2] Error cases addressed (what to do when something fails)
[2] Output/result described (how to verify it worked)
Score: __/10
Step 5: Tips Section Quality
TIPS SCORING:
[2] 5-10 tips present
[2] Tips are non-obvious (not "read the documentation")
GOOD: "The number one Makefile bug: spaces instead of tabs"
BAD: "Make sure to test your code"
[2] Tips are specific and actionable
GOOD: "Use flock to prevent overlapping cron runs"
BAD: "Be careful with concurrent execution"
[1] No tips contradict the main content
[1] Tips cover gotchas/footguns specific to this topic
Score: __/8
Scoring Summary
SKILL REVIEW SCORECARD
═══════════════════════════════════════
Skill: [name]
Reviewer: [agent/human]
Date: [date]
Category Score Max
─────────────────────────────────────
Structure __ 11
Description __ 8
Metadata __ 4
Example density __ 3*
Example quality __ 3*
Organization __ 6
Actionability __ 10
Tips __ 8
─────────────────────────────────────
TOTAL __ 53+
* Example density and quality are per-sample,
not summed. Use the average across all examples.
RATING:
45+ Excellent — publish-ready
35-44 Good — minor improvements needed
25-34 Fair — significant gaps to address
< 25 Poor — needs major rework
VERDICT: [PUBLISH / REVISE / REWORK]
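Totalling the scorecard and looking up the rating band is simple arithmetic. The sketch below assumes the eight category scores are passed as whole numbers, with the averaged example scores rounded first, since shell arithmetic is integer-only. `rate_skill` is a hypothetical helper; the verdict is left to the reviewer.

```bash
# Hypothetical helper: sums the category scores and prints "<total> <rating>".
rate_skill() {
  local total=0 score
  for score in "$@"; do
    total=$((total + score))
  done
  # Rating bands from the scorecard.
  if   [ "$total" -ge 45 ]; then echo "$total Excellent"
  elif [ "$total" -ge 35 ]; then echo "$total Good"
  elif [ "$total" -ge 25 ]; then echo "$total Fair"
  else                           echo "$total Poor"
  fi
}

# Order: structure, description, metadata, density, quality,
# organization, actionability, tips
rate_skill 11 8 4 3 3 6 10 8
rate_skill 6 4 2 1 2 3 5 4
```

The first call prints `53 Excellent` and the second prints `27 Fair`.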
Common Defects
Critical (blocks publishing)
DEFECT: Invalid frontmatter
DETECT: YAML parse error, missing required fields
FIX: Validate YAML, ensure name/description/metadata all present
DEFECT: Broken code examples
DETECT: Syntax errors, undefined variables, wrong flags
FIX: Test every command in a clean environment
DEFECT: Wrong tool requirements
DETECT: metadata.requires lists tools not used in content, or omits tools that are used
FIX: Grep content for command names, update requires to match
DEFECT: Misleading description
DETECT: Description promises coverage the content doesn't deliver
FIX: Align description with actual content, or add missing content
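The "wrong tool requirements" check can be partly scripted. The sketch below covers one direction only: tools that are listed in requires but never mentioned in the body. `unused_tools` is a hypothetical helper that takes the file followed by the listed tool names, and it assumes the file begins with a frontmatter block delimited by `---`. It does not detect tools that are used but missing from the list.

```bash
# Hypothetical helper: prints each listed tool that never appears in the body.
# Usage: unused_tools <file> <tool> [<tool> ...]
unused_tools() {
  local file="$1" tool
  shift
  for tool in "$@"; do
    # Skip the frontmatter (everything up to the second '---') so the
    # requires list itself does not count as a use.
    awk 'seen >= 2 { print } $0 == "---" { seen++ }' "$file" \
      | grep -q -w -F -- "$tool" || echo "$tool"
  done
}
```

For example, `unused_tools skills/my-skill/SKILL.md docker curl jq` prints `jq` if the body uses only docker and curl. A match in prose counts as a use, so treat a clean result as a hint rather than proof.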
Major (should fix before publishing)
DEFECT: No "When to Use" section
IMPACT: Agent doesn't know when to activate the skill
FIX: Add 4-8 bullet points describing trigger scenarios
DEFECT: Text walls without examples
DETECT: Any section > 10 lines with no code block
FIX: Add concrete examples for every concept described
DEFECT: Examples missing language tags
DETECT: Opening code fence (three backticks) without a language identifier
FIX: Add bash, python, javascript, yaml, etc. to every code fence
DEFECT: No Tips section
IMPACT: Missing the distilled expertise that makes a skill valuable
FIX: Add 5-10 non-obvious, actionable tips
DEFECT: Abstract organization
DETECT: Sections named "Theory", "Overview", "Background", "Introduction"
FIX: Reorganize by task/operation: what the user is trying to DO
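The abstract-organization defect has a concrete detection rule, so it can be scanned for. The sketch below looks for Markdown headings whose entire text is one of the four names listed above. `abstract_headings` is a hypothetical helper, and a heading such as "Overview of Flags" is deliberately not matched.

```bash
# Hypothetical helper: prints "<line>:<heading>" for each heading that is
# exactly Theory, Overview, Background, or Introduction (case-insensitive).
# Prints nothing and returns non-zero when no such heading exists.
abstract_headings() {
  grep -n -i -E '^#{1,6} +(theory|overview|background|introduction) *$' "$1"
}
```

Run it as `abstract_headings skills/my-skill/SKILL.md`; each hit is a section to rename after the task it supports.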
Minor (nice to fix)
DEFECT: Placeholder values
DETECT: foo, bar, baz, example.com, 1.2.3.4, TODO, FIXME
FIX: Replace with realistic values (myapp, api.example.com, 192.168.1.100)
DEFECT: Inconsistent formatting
DETECT: Mixed heading levels, inconsistent code block style
FIX: Standardize heading hierarchy and formatting
DEFECT: Missing cross-references
DETECT: Mentions tools/concepts covered by other skills without referencing them
FIX: Add "See the X skill for more on Y" notes
DEFECT: Outdated commands
DETECT: docker-compose (v1), python (not python3), npm -g without npx alternative
FIX: Update to current tool versions and syntax
Comparative Review
When comparing skills in the same category:
COMPARATIVE CRITERIA:
1. Coverage breadth
Which skill covers more use cases?
2. Example quality
Which has more runnable, realistic examples?
3. Depth on common operations
Which handles the 80% case better?
4. Edge case coverage
Which addresses more gotchas and failure modes?
5. Cross-platform support
Which works across more environments?
6. Freshness
Which uses current tool versions and syntax?
WINNER: [skill A / skill B / tie]
REASON: [1-2 sentence justification]
```markdown
## Quick Review: [skill-name]
**Structure**: [OK / Issues: ...]
**Description**: [Strong / Weak: reason]
**Examples**: [X code blocks across Y lines; density OK/low/high]
**Actionability**: [Agent can/cannot follow these instructions because...]
**Top defect**: [The single most impactful thing to fix]
**Verdict**: [PUBLISH / REVISE / REWORK]
```
```bash
# 1. Validate frontmatter
head -20 skills/my-skill/SKILL.md
# Visually confirm YAML is valid

# 2. Count code blocks
grep -c '^```' skills/my-skill/SKILL.md
# This counts fence lines, two per block, so halve it to get the block count
# Divide total lines by the block count for density

# 3. Check for placeholders
grep -n -i -E 'todo|fixme|xxx|foo|bar|baz' skills/my-skill/SKILL.md

# 4. Check for missing language tags
grep -n '^```$' skills/my-skill/SKILL.md
# Every opening fence should have a language tag; a bare ``` is a defect
# Closing fences match too, so inspect only the line that opens each block

# 5. Verify tool requirements match content
# Extract requires from frontmatter, then grep for each tool in content

# 6. Test commands (sample 3-5 from the skill)
# Run them in a clean shell to verify they work

# 7. Run the scorecard mentally or in a file
# Target: 35+ for good, 45+ for excellent
```
```bash
# Install the skill
npx molthub@latest install skill-name
# Read it
cat skills/skill-name/SKILL.md
# Run the quick review template
# If score < 25, consider uninstalling and finding an alternative
```
## Tips
- The description field accounts for more real-world impact than all other fields combined. A perfect skill with a bad description will never be found via search.
- Count code blocks as your first quality signal. Skills with fewer than 8 code blocks are almost always too abstract to be useful.
- Test 3-5 commands from the skill in a clean environment. If more than one fails, the skill wasn't tested before publishing.
- "Organized by task" vs. "organized by concept" is the single biggest structural quality differentiator. Good skills answer "how do I do X?" — bad skills explain "what is X?"
- A skill with great tips but weak examples is better than one with thorough examples but no tips. Tips encode expertise that examples alone don't convey.
- Check the requires.anyBins against what the skill actually uses. A common defect is listing bash (which everything has) instead of the actual tools like docker, curl, or jq.
- Short skills (< 150 lines) usually aren't worth publishing: they don't provide enough value over a quick web search. If your skill is short, it might be better as a section in a larger skill.
- The best skills are ones you'd bookmark yourself. If you wouldn't use it, don't publish it.
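Skill Reviewer
Audit agent skills (SKILL.md files) for quality, correctness, and completeness. Provides a structured review framework with scoring rubric, defect checklists, and improvement recommendations.
When to Use
- Reviewing a skill before publishing to the registry
- Evaluating a skill you downloaded from the registry
- Auditing your own skills for quality improvements
- Comparing skills in the same category
- Deciding whether a skill is worth installing
Review Process
Step 1: Structural Check
Verify the skill has the required structure. Read the file and check each item: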
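STRUCTURE CHECKLIST:
[ ] Valid YAML frontmatter (opens and closes with ---)
[ ] name field present and a valid slug (lowercase, hyphenated)
[ ] description field present and non-empty
[ ] metadata field present with valid JSON
[ ] metadata.clawdbot.emoji is a single emoji
[ ] metadata.clawdbot.requires.anyBins lists real CLI tools
[ ] Title heading (# Title) immediately after frontmatter
[ ] Summary paragraph after the title
[ ] "When to Use" section present
[ ] At least 3 main content sections
[ ] "Tips" section present at the end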
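Step 2: Frontmatter Quality
Description field audit
The description is the most impactful field. Evaluate it against these criteria:
DESCRIPTION SCORING:
[2] Starts with what the skill does (active verb)
GOOD: "Write Makefiles for any project type."
BAD: "This skill covers Makefiles."
BAD: "A comprehensive guide to Make."
[2] Includes trigger phrases ("Use when...")
GOOD: "Use when setting up build automation, defining multi-target builds"
BAD: No trigger phrases at all
[2] Specific scope (mentions concrete tools, languages, or operations)
GOOD: "SQLite/PostgreSQL/MySQL: schema design, queries, CTEs, window functions"
BAD: "Database stuff"
[1] Reasonable length (50-200 characters)
TOO SHORT: "Make things" (no search surface)
TOO LONG: 300+ characters (gets truncated)
[1] Contains searchable keywords naturally
GOOD: "cron jobs, systemd timers, scheduling"
BAD: Keywords stuffed unnaturally
Score: __/8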
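Metadata audit
METADATA SCORING:
[1] Emoji is relevant to the skill topic
[1] requires.anyBins lists tools the skill actually uses (not generic tools like bash)
[1] os array is accurate (don't claim win32 if the commands are Linux-only)
[1] JSON is valid (test with a JSON parser)
Score: __/4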
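Step 3: Content Quality
Example density
Count code blocks and total lines:
EXAMPLE DENSITY:
Lines: ___
Code blocks: ___
Ratio: 1 code block per ___ lines
TARGET: 1 code block per 8-15 lines
< 8 lines per block: may be too fragmented
> 20 lines per block: needs more examples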
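Example quality
For each code block, check:
EXAMPLE QUALITY CHECKLIST:
[ ] Language tag specified (bash, python, etc.)
[ ] Command is syntactically correct
[ ] Output shown in comments where helpful
[ ] Uses realistic values (not foo/bar/baz)
[ ] No placeholder values left (TODO, FIXME, xxx)
[ ] Self-contained (doesn't depend on undefined variables)
OR setup is shown/referenced
[ ] Covers the common case (not just edge cases)
Score each example 0-3:
- 0: Wrong or misleading
- 1: Works but trivial (no output, no context)
- 2: Good (correct, with output or explanation)
- 3: Excellent (copy-pasteable, realistic, covers edge cases)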
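Section organization
ORGANIZATION SCORING:
[2] Organized by task/scenario (not by abstract concept)
GOOD: "## Encode and Decode" → "## Inspect Characters" → "## Convert Formats"
BAD: "## Theory" → "## Types" → "## Advanced"
[2] Most common operations come first
GOOD: Basic usage → Variations → Advanced → Edge cases
BAD: Configuration → Theory → Finally the basic usage
[1] Sections are self-contained (can be used independently)
[1] Consistent depth (not mixing h2 with h4 randomly)
Score: __/6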
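Cross-platform accuracy
PLATFORM CHECKLIST:
[ ] macOS differences noted where relevant
(sed -i '' vs sed -i, brew vs apt, BSD vs GNU flags)
[ ] Linux distro variations noted (apt vs yum vs pacman)
[ ] Windows compatibility addressed if os includes "win32"
[ ] Tool version assumptions stated (Docker v2 syntax, Python 3.x)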
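Step 4: Actionability Assessment
The core question: can an agent follow these instructions and produce a correct result?
ACTIONABILITY SCORING:
[3] Instructions are imperative ("Run X", "Create Y")
NOT: "You might consider..." or "It's recommended to..."
[3] Steps are ordered logically (prerequisites before actions)
[2] Error cases addressed (what to do when something fails)
[2] Output/result described (how to verify it worked)
Score: __/10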
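Step 5: Tips Section Quality
TIPS SCORING:
[2] 5-10 tips present
[2] Tips are non-obvious (not "read the documentation")
GOOD: "The number one Makefile bug: spaces instead of tabs"
BAD: "Make sure to test your code"
[2] Tips are specific and actionable
GOOD: "Use flock to prevent overlapping cron runs"
BAD: "Be careful with concurrent execution"
[1] No tips contradict the main content
[1] Tips cover gotchas/footguns specific to this topic
Score: __/8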
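Scoring Summary
SKILL REVIEW SCORECARD
═══════════════════════════════════════
Skill: [name]
Reviewer: [agent/human]
Date: [date]
Category Score Max
─────────────────────────────────────
Structure __ 11
Description __ 8
Metadata __ 4
Example density __ 3*
Example quality __ 3*
Organization __ 6
Actionability __ 10
Tips __ 8
─────────────────────────────────────
TOTAL __ 53+
* Example density and quality are per-sample,
not summed. Use the average across all examples.
RATING:
45+ Excellent: publish-ready
35-44 Good: minor improvements needed
25-34 Fair: significant gaps to address
< 25 Poor: needs major rework
VERDICT: [PUBLISH / REVISE / REWORK]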
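Common Defects
Critical (blocks publishing)
DEFECT: Invalid frontmatter
DETECT: YAML parse error, missing required fields
FIX: Validate YAML, ensure name/description/metadata all present
DEFECT: Broken code examples
DETECT: Syntax errors, undefined variables, wrong flags
FIX: Test every command in a clean environment
DEFECT: Wrong tool requirements
DETECT: metadata.requires lists tools not used in content, or omits tools that are used
FIX: Grep content for command names, update requires to match
DEFECT: Misleading description
DETECT: Description promises coverage the content doesn't deliver
FIX: Align description with actual content, or add missing content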
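Major (should fix before publishing)
DEFECT: No "When to Use" section
IMPACT: Agent doesn't know when to activate the skill
FIX: Add 4-8 bullet points describing trigger scenarios
DEFECT: Text walls without examples
DETECT: Any section > 10 lines with no code block
FIX: Add concrete examples for every concept described
DEFECT: Examples missing language tags
DETECT: Opening code fence (three backticks) without a language identifier
FIX: Add bash, python, javascript, yaml, etc. to every code fence
DEFECT: No Tips section
IMPACT: Missing the distilled expertise that makes a skill valuable
FIX: Add 5-10 non-obvious, actionable tips
DEFECT: Abstract organization
DETECT: Sections named "Theory", "Overview", "Background", "Introduction"
FIX: Reorganize by task/operation: what the user is trying to DO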
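Minor (nice to fix)
DEFECT: Placeholder values
DETECT: foo, bar, baz, example.com, 1.2.3.4, TODO, FIXME
FIX: Replace with realistic values (myapp, api.example.com, 192.168.1.100)
DEFECT: Inconsistent formatting
DETECT: Mixed heading levels, inconsistent code block style
FIX: Standardize heading hierarchy and formatting
DEFECT: Missing cross-references
DETECT: Mentions tools/concepts covered by other skills without referencing them
FIX: Add "See the X skill for more on Y" notes
DEFECT: Outdated commands
DETECT: docker-compose (v1), python (not python3), npm -g without npx alternative
FIX: Update to current tool versions and syntax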
Comparative Review
When comparing skills in the same category:
Comparative