Skill Evaluator
Evaluate skills across 25 criteria using a hybrid automated + manual approach.
Quick Start
1. Run automated checks
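```bash
python3 scripts/eval-skill.py /path/to/skill
python3 scripts/eval-skill.py /path/to/skill --json     # machine-readable output
python3 scripts/eval-skill.py /path/to/skill --verbose  # show all details
```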
Checks: file structure, frontmatter, description quality, script syntax, dependency audit, credential scan, env var documentation.
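To give a feel for what a check such as the credential scan might involve, here is a minimal sketch. It is not the implementation in eval-skill.py: the pattern names, the regular expressions, and the `scan_text` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for illustration only; a real scanner would
# use a broader and more carefully tuned set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# A quoted literal is flagged; a value read from the environment is not.
sample = 'import os\nAPI_KEY = "sk_live_0123456789abcdef"\ntoken = os.environ["TOKEN"]\n'
print(scan_text(sample))  # [(2, 'hardcoded_assignment')]
```

Regex scans like this tend to produce false positives, so treating hits as prompts for manual review is probably safer than failing a skill on them outright.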
2. Manual assessment
Use the rubric at references/rubric.md to score 25 criteria across 8 categories (0–4 each, 100 total). Each criterion has concrete descriptions per score level.
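The arithmetic behind the total is 25 criteria × 4 points = 100. A small sketch of how the manual scores could be summed and sanity-checked follows. The `total_score` helper and the placeholder criterion names are assumptions, as is the restriction to whole-number scores; the real criterion names and level descriptions live in references/rubric.md.

```python
MAX_PER_CRITERION = 4
CRITERIA_COUNT = 25


def total_score(scores):
    """Sum per-criterion scores after checking the count and the 0-4 range."""
    if len(scores) != CRITERIA_COUNT:
        raise ValueError(f"expected {CRITERIA_COUNT} scores, got {len(scores)}")
    for name, value in scores.items():
        # Assumption: scores are whole numbers, one per rubric level.
        if not isinstance(value, int) or not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name}: score must be an integer from 0 to 4, got {value!r}")
    return sum(scores.values())


# Placeholder names c1..c25 stand in for the real rubric criteria.
example = {f"c{i}": 3 for i in range(1, 26)}
print(total_score(example))  # 75
```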
3. Write the evaluation
Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md. Fill in automated results + manual scores.
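If you prefer to script the copy step, something along these lines could work. The `create_eval_file` helper is hypothetical, and refusing to overwrite an existing EVAL.md is a design choice of this sketch rather than documented behavior.

```python
import shutil
from pathlib import Path


def create_eval_file(skill_dir, template="assets/EVAL-TEMPLATE.md"):
    """Copy the template into skill_dir as EVAL.md without overwriting an existing file."""
    target = Path(skill_dir) / "EVAL.md"
    if target.exists():
        # Assumption: an existing evaluation should never be clobbered silently.
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.copyfile(template, target)
    return target


# Usage (run from the evaluator's root so the relative template path resolves):
# create_eval_file("/path/to/skill")
```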
Evaluation Process
- Run eval-skill.py: get the automated structural score
- Read the skill's SKILL.md: understand what it does
- Read/skim the scripts: assess code quality, error handling, testability
- Score each manual criterion using references/rubric.md, which gives concrete criteria per level
- Prioritize findings as P0 (blocks publishing) / P1 (should fix) / P2 (nice to have)
- Write EVAL.md in the skill directory with scores + findings
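The prioritization step can be sketched as a small data structure. The `Finding` class, its field names, and the example findings below are hypothetical; only the P0/P1/P2 levels and their meanings come from the process above.

```python
from dataclasses import dataclass

# Lower number sorts first: P0 blocks publishing, P1 should be fixed, P2 is nice to have.
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


@dataclass
class Finding:
    priority: str   # "P0", "P1", or "P2"
    criterion: str  # rubric criterion the finding relates to
    note: str       # short description of the problem


def sort_findings(findings):
    """Return findings ordered from most to least urgent (stable within a level)."""
    return sorted(findings, key=lambda f: PRIORITY_ORDER[f.priority])


def blocks_publishing(findings):
    """True if any finding is a P0."""
    return any(f.priority == "P0" for f in findings)


# Made-up findings, for illustration only.
findings = [
    Finding("P2", "Discoverability", "No usage example in the description"),
    Finding("P0", "Credentials", "API key hardcoded in a script"),
    Finding("P1", "Error Reporting", "Script exits silently on bad input"),
]
print([f.priority for f in sort_findings(findings)])  # ['P0', 'P1', 'P2']
print(blocks_publishing(findings))  # True
```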
Categories (8 categories, 25 criteria)
| # | Category | Source Framework | Criteria |
|---|---|---|---|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability - AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability - Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |
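Because every criterion is scored 0 to 4, each category's maximum follows from its number of criteria. The sketch below encodes the table as a dictionary and derives those maxima; the `CATEGORIES` name and the `category_maxima` helper are hypothetical and are not taken from eval-skill.py.

```python
# Category -> criteria, transcribed from the table above.
CATEGORIES = {
    "Functional Suitability": ["Completeness", "Correctness", "Appropriateness"],
    "Reliability": ["Fault Tolerance", "Error Reporting", "Recoverability"],
    "Performance / Context": ["Token Cost", "Execution Efficiency"],
    "Usability - AI Agent": ["Learnability", "Consistency", "Feedback", "Error Prevention"],
    "Usability - Human": ["Discoverability", "Forgiveness"],
    "Security": ["Credentials", "Input Validation", "Data Safety"],
    "Maintainability": ["Modularity", "Modifiability", "Testability"],
    "Agent-Specific": [
        "Trigger Precision", "Progressive Disclosure", "Composability",
        "Idempotency", "Escape Hatches",
    ],
}


def category_maxima(categories, max_per_criterion=4):
    """Map each category to the highest score it can contribute."""
    return {name: len(criteria) * max_per_criterion for name, criteria in categories.items()}


maxima = category_maxima(CATEGORIES)
print(maxima["Agent-Specific"])  # 20
print(sum(maxima.values()))      # 100
```

One consequence worth noting: Agent-Specific carries 20 of the 100 points, while Performance / Context and Usability - Human carry 8 each.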
Interpreting Scores
| Range | Verdict | Action |
|---|---|---|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
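The score bands translate directly into a lookup. This is a minimal sketch; the `verdict` function name is hypothetical, and the thresholds are the ones from the table.

```python
def verdict(score):
    """Map a 0-100 total score to its (verdict, action) pair."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return ("Excellent", "Publish confidently")
    if score >= 80:
        return ("Good", "Publishable, note known issues")
    if score >= 70:
        return ("Acceptable", "Fix P0s before publishing")
    if score >= 60:
        return ("Needs Work", "Fix P0+P1 before publishing")
    return ("Not Ready", "Significant rework needed")


print(verdict(84))  # ('Good', 'Publishable, note known issues')
print(verdict(59))  # ('Not Ready', 'Significant rework needed')
```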
Deeper Security Scanning
This evaluator covers security basics (credentials, input validation, data safety). For thorough security audits of skills under development, consider SkillLens (npx skilllens scan <path>). It checks for exfiltration, code execution, persistence, privilege bypass, and prompt injection, which complements the quality focus here.
Dependencies
- Python 3.6+ (for eval-skill.py)
- PyYAML (pip install pyyaml), for frontmatter parsing in automated checks
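A quick preflight check for these two dependencies might look like the sketch below. The `missing_dependencies` helper is hypothetical; it relies only on the standard library and on the fact that PyYAML is imported under the module name `yaml`.

```python
import importlib.util
import sys


def missing_dependencies():
    """Return human-readable problems; an empty list means the environment looks fine."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6+ is required")
    # PyYAML installs as the importable module "yaml".
    if importlib.util.find_spec("yaml") is None:
        problems.append("PyYAML is missing; install it with: pip install pyyaml")
    return problems


for problem in missing_dependencies():
    print(problem)
```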