Prompt A/B Lab
Purpose
Design, log, compare, and score prompt experiments so users can systematically improve outputs instead of guessing.
Trigger phrases
- 比较两个提示词 (compare two prompts)
- prompt ab test
- 提示词实验 (prompt experiment)
- 哪个 prompt 更好 (which prompt is better)
- 建一个评测表 (build an evaluation table)
Ask for these inputs
- prompt A and prompt B
- task
- evaluation criteria
- test set
- weights if any
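The inputs above can be collected in a small structure before any comparison starts. This is a minimal sketch, not the skill's own schema: the `ExperimentInputs` class, its field names, and the equal-weight default are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentInputs:
    """Hypothetical container for the inputs listed above."""
    prompt_a: str
    prompt_b: str
    task: str
    criteria: list[str]
    test_set: list[str]
    weights: dict[str, float] = field(default_factory=dict)  # optional

    def missing(self) -> list[str]:
        """Names of required inputs that are still empty (weights are optional)."""
        required = {
            "prompt_a": self.prompt_a,
            "prompt_b": self.prompt_b,
            "task": self.task,
            "criteria": self.criteria,
            "test_set": self.test_set,
        }
        return [name for name, value in required.items() if not value]

    def normalized_weights(self) -> dict[str, float]:
        """Equal weights when none are given; otherwise rescale to sum to 1.

        Assumes criteria is non-empty; call missing() first to check.
        """
        if not self.weights:
            return {c: 1.0 / len(self.criteria) for c in self.criteria}
        total = sum(self.weights[c] for c in self.criteria)
        return {c: self.weights[c] / total for c in self.criteria}


inputs = ExperimentInputs(
    prompt_a="Summarize the text in one sentence.",
    prompt_b="Summarize the text in one sentence for a non-expert.",
    task="summarization",
    criteria=["accuracy", "clarity"],
    test_set=[],
    weights={"accuracy": 3, "clarity": 1},
)
print(inputs.missing())             # ['test_set']
print(inputs.normalized_weights())  # {'accuracy': 0.75, 'clarity': 0.25}
```

Reporting only the empty fields fits the rule of asking for the minimum needed to proceed: here only the test set would be requested.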
Workflow
1. Define what success looks like before comparing prompts.
2. Generate an evaluation rubric and structured test table.
3. Log outputs per test case and compute weighted scores.
4. Summarize tradeoffs instead of declaring a winner too early.
5. Recommend the next experiment iteration.
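The scoring and tradeoff steps of the workflow could look roughly like the sketch below. It assumes each test case has already been scored per criterion on a 1-5 scale. The `compare` function, the record layout, and the sample numbers are hypothetical and illustrative only, not measurements.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def compare(results, weights):
    """Average each criterion per variant, then combine with the weights.

    results: one dict per test case, e.g.
        {"case": "c1", "A": {"accuracy": 4}, "B": {"accuracy": 3}}
    weights: {criterion: weight}; rescaled here, so they need not sum to 1.
    """
    total = sum(weights.values())
    summary = {"per_criterion": {}, "overall": {}}
    for variant in ("A", "B"):
        overall = 0.0
        for criterion, weight in weights.items():
            avg = mean(r[variant][criterion] for r in results)
            summary["per_criterion"].setdefault(criterion, {})[variant] = avg
            overall += avg * weight / total
        summary["overall"][variant] = overall
    return summary


# Illustrative scores only; real scores come from the logged outputs.
results = [
    {"case": "c1", "A": {"accuracy": 4, "brevity": 2}, "B": {"accuracy": 3, "brevity": 5}},
    {"case": "c2", "A": {"accuracy": 5, "brevity": 3}, "B": {"accuracy": 3, "brevity": 4}},
]
weights = {"accuracy": 0.6, "brevity": 0.4}

summary = compare(results, weights)
for criterion, avgs in summary["per_criterion"].items():
    leader = "A" if avgs["A"] > avgs["B"] else "B" if avgs["B"] > avgs["A"] else "tie"
    print(f"{criterion}: A={avgs['A']:.2f} B={avgs['B']:.2f} leader={leader}")
print({variant: round(score, 2) for variant, score in summary["overall"].items()})
```

In this made-up data, A leads on accuracy and B on brevity, and the weighted totals are close (about 3.7 against 3.6). That is the kind of result where a tradeoff summary is more useful than naming a winner.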
Output contract
- experiment plan
- scored comparison table
- rubric
- next-iteration suggestions
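One possible plain-text layout for the scored comparison table is sketched below. The column set (case, A, B, delta) and the `render_table` helper are assumptions for illustration; the skill does not prescribe a table format here, and the scores are placeholders.

```python
def render_table(rows):
    """Render (case, score_a, score_b) rows as an aligned plain-text table.

    delta is B minus A, so a positive value favors prompt B.
    """
    header = f"{'case':<12}{'A':>6}{'B':>6}{'delta':>8}"
    lines = [header, "-" * len(header)]
    for case, score_a, score_b in rows:
        lines.append(f"{case:<12}{score_a:>6.2f}{score_b:>6.2f}{score_b - score_a:>+8.2f}")
    return "\n".join(lines)


# Placeholder weighted scores, not real results.
table = render_table([("case-1", 3.7, 3.6), ("case-2", 2.0, 2.5)])
print(table)
```

Showing the per-case delta next to the raw scores makes it easier to see where the two prompts diverge, which feeds the next-iteration suggestions.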
Files in this skill
- Script: {baseDir}/scripts/prompt_experiment_logger.py
- Resource: {baseDir}/resources/eval_rubric.md
Operating rules
- Be concrete and action-oriented.
- Prefer preview / draft / simulation mode before destructive changes.
- If information is missing, ask only for the minimum needed to proceed.
- Never fabricate metrics, legal certainty, receipts, credentials, or evidence.
- Keep assumptions explicit.
Suggested prompts
- 比较两个提示词 (compare two prompts)
- prompt ab test
- 提示词实验 (prompt experiment)
Use of script and resources
Use the bundled script when it helps the user produce a structured file, manifest, CSV, or first-pass draft.
Use the resource file as the default schema, checklist, or preset when the user does not provide one.
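The bundled script's interface is not described in this document, so the following is an independent sketch of what a first-pass CSV log could look like, not a description of that script. The column names in `FIELDS` and the `write_log` helper are hypothetical, and the sample rows are placeholders.

```python
import csv
import io

# Hypothetical column layout; adjust to match the rubric actually in use.
FIELDS = ["case_id", "variant", "criterion", "score", "notes"]


def write_log(stream, rows):
    """Write one row per (test case, variant, criterion) score to a CSV stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer so nothing on disk is touched (preview mode).
buffer = io.StringIO()
write_log(buffer, [
    {"case_id": "c1", "variant": "A", "criterion": "accuracy", "score": 4, "notes": ""},
    {"case_id": "c1", "variant": "B", "criterion": "accuracy", "score": 3, "notes": "missed a detail"},
])
print(buffer.getvalue())
```

Writing to a buffer first matches the preference for preview or draft mode. The same function accepts a file opened with `open(path, "w", newline="")` once the user confirms the output location.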
Boundaries
- This skill supports planning, structuring, and first-pass artifacts.
- It should not claim that files were modified, messages were sent, or legal/financial decisions were finalized unless the user actually performed those actions.
Compatibility notes
- Directory-based AgentSkills/OpenClaw skill.
- Runtime dependency declared through metadata.openclaw.requires.
- Helper script is local and auditable: scripts/prompt_experiment_logger.py.
- Bundled resource is local and referenced by the instructions: resources/eval_rubric.md.