Prompt Engineer Toolkit

Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments.

Core Capabilities

- A/B prompt evaluation against structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs to review behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates

Key Workflows

1. Run Prompt A/B Test

Prepare JSON test cases and run:

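```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```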

Input can also be supplied as a JSON payload, either on stdin or through --input.
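The runner command carries {prompt} and {input} placeholders. How prompt_tester.py expands them internally is not documented here, so the following is only a sketch of one safe way a wrapper could do it: split the template into arguments first, then fill the placeholders token by token. The function name build_runner_argv is hypothetical.

```python
import re
import shlex


def build_runner_argv(template, prompt, input_text):
    """Split the command template first, then fill placeholders per token.

    Splitting before substitution keeps a prompt that contains spaces or
    quotes inside a single argv entry instead of letting it be re-split.
    """
    values = {"{prompt}": prompt, "{input}": input_text}
    pattern = re.compile("|".join(re.escape(key) for key in values))
    argv = []
    for token in shlex.split(template):
        # Single-pass substitution: text inserted for one placeholder is
        # never scanned again for the other placeholder.
        argv.append(pattern.sub(lambda m: values[m.group(0)], token))
    return argv


argv = build_runner_argv(
    "my-llm-cli --prompt {prompt} --input {input}",
    prompt="Classify the ticket and reply 'category: <name>'.",
    input_text="I was charged twice.",
)
print(argv)
```

The resulting list can be handed to subprocess.run(argv, capture_output=True, text=True) without going through a shell, which avoids quoting problems when prompts contain shell metacharacters.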

2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
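The four checks above can be sketched as a small scoring function. This is an illustration, not the tester's actual implementation: the names CaseScore, score_output, and mean_score, the equal weighting, the rule that any violation zeroes the case score, and the 2000-character limit are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class CaseScore:
    coverage: float   # fraction of expected markers found in the output
    violations: int   # number of forbidden phrases found in the output
    regex_ok: bool    # True if the output matches the required pattern
    length_ok: bool   # True if the output is non-empty and within the limit

    @property
    def total(self):
        # Assumed weighting: any forbidden-content hit zeroes the case.
        if self.violations:
            return 0.0
        return (self.coverage + float(self.regex_ok) + float(self.length_ok)) / 3


def score_output(output, expected_contains, forbidden_contains,
                 expected_regex=None, max_chars=2000):
    text = output.lower()
    hits = sum(1 for marker in expected_contains if marker.lower() in text)
    coverage = hits / len(expected_contains) if expected_contains else 1.0
    violations = sum(1 for phrase in forbidden_contains if phrase.lower() in text)
    regex_ok = expected_regex is None or re.search(expected_regex, output) is not None
    length_ok = 0 < len(output) <= max_chars
    return CaseScore(coverage, violations, regex_ok, length_ok)


def mean_score(scores):
    """Aggregate per-case totals into one number per prompt."""
    return sum(s.total for s in scores) / len(scores) if scores else 0.0


a = score_output("category: billing", ["billing"], ["password"], r"^category: \w+$")
b = score_output("It is billing. Your password is 1234.", ["billing"], ["password"],
                 r"^category: \w+$")
print(a.total, b.total)  # prints: 1.0 0.0
```

Because every check is a deterministic string or regex test, two runs over the same outputs always produce the same scores, which is what makes the A/B comparison repeatable.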

3. Version Prompts

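```bash
# Add a version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff two versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Show the changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```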

4. Regression Loop

1. Store the baseline version.
2. Propose prompt edits.
3. Re-run the A/B test.
4. Promote only if the score improves and the safety constraints are still met.

Script Interfaces

- python3 scripts/prompt_tester.py --help
  - Reads prompts/cases from stdin or --input
  - Optional external runner command
  - Emits text or JSON metrics
- python3 scripts/prompt_versioner.py --help
  - Manages prompt history (add, list, diff, changelog)
  - Stores metadata and content snapshots locally

Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

1. Picking prompts from single-case outputs; use a realistic, edge-case-rich test suite instead.
2. Changing prompt and model simultaneously; always isolate variables.
3. Missing must_not_contain (forbidden-content) checks in evaluation criteria.
4. Editing prompts without version metadata, author, or change rationale.
5. Skipping semantic diffs before deploying a new prompt version.
6. Optimizing one benchmark while harming edge cases; track the full suite.
7. Swapping the model without rerunning the baseline A/B suite.

Before promoting any prompt, confirm:

- [ ] Task intent is explicit and unambiguous.
- [ ] Output schema/format is explicit.
- [ ] Safety and exclusion constraints are explicit.
- [ ] No contradictory instructions.
- [ ] No unnecessary verbosity that wastes tokens.
- [ ] A/B score improves and violation count stays at zero.

References

Evaluation Design

Each test case should define:

- input: realistic production-like input
- expected_contains: required markers/content
- forbidden_contains: disallowed phrases or unsafe content
- expected_regex: required structural patterns

This enables deterministic grading across prompt variants.
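A test suite with these fields might be assembled and checked as follows. This is a sketch under assumptions: the file is taken to be a JSON list of objects, and validate_cases, save_cases, and the sample cases are hypothetical; the exact layout prompt_tester.py expects should be confirmed with its --help output.

```python
import json
import re
from pathlib import Path

REQUIRED_KEYS = {"input", "expected_contains", "forbidden_contains", "expected_regex"}

# Sample cases for a hypothetical support-ticket classifier.
cases = [
    {
        "input": "I was charged twice for my subscription this month.",
        "expected_contains": ["billing"],
        "forbidden_contains": ["I cannot help"],
        "expected_regex": r"^category: [a-z_]+$",
    },
    {
        "input": "The app crashes when I open the settings page.",
        "expected_contains": ["bug"],
        "forbidden_contains": ["I cannot help"],
        "expected_regex": r"^category: [a-z_]+$",
    },
]


def validate_cases(cases):
    """Return a list of problems; an empty list means the suite is well-formed."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {i}: missing {sorted(missing)}")
            continue
        try:
            re.compile(case["expected_regex"])
        except re.error as exc:
            problems.append(f"case {i}: invalid expected_regex ({exc})")
    return problems


def save_cases(path, cases):
    """Write the suite as JSON, refusing to save a malformed suite."""
    problems = validate_cases(cases)
    if problems:
        raise ValueError("; ".join(problems))
    Path(path).write_text(json.dumps(cases, indent=2), encoding="utf-8")


print(validate_cases(cases))
```

Calling save_cases("testcases.json", cases) would then produce a file that can be passed to --cases-file. Validating before saving catches a malformed regex at authoring time rather than in the middle of an A/B run.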

Versioning Policy

- Use semantic prompt identifiers per feature (support_classifier, ad_copy_shortform).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.
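The policy above can be illustrated with a minimal append-only store. This is not how prompt_versioner.py is implemented; the PromptStore class, its single-file JSON layout, and the field names are assumptions made for the sketch.

```python
import difflib
import json
from datetime import datetime, timezone
from pathlib import Path


class PromptStore:
    """Append-only prompt history kept in one JSON file (assumed layout)."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.history = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.history = {}

    def add(self, name, content, author, note):
        # Every revision must say who changed it and why.
        if not author or not note:
            raise ValueError("author and change note are required")
        versions = self.history.setdefault(name, [])
        # New versions are only ever appended; old entries are never edited.
        versions.append({
            "version": len(versions) + 1,
            "author": author,
            "note": note,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "content": content,
        })
        self.path.write_text(json.dumps(self.history, indent=2), encoding="utf-8")
        return versions[-1]["version"]

    def diff(self, name, from_version, to_version):
        versions = self.history[name]
        old = versions[from_version - 1]["content"].splitlines()
        new = versions[to_version - 1]["content"].splitlines()
        return "\n".join(difflib.unified_diff(
            old, new, f"{name}@v{from_version}", f"{name}@v{to_version}", lineterm=""))

    def changelog(self, name):
        return [(v["version"], v["author"], v["note"]) for v in self.history[name]]
```

For example, after store = PromptStore("prompt_history.json") and two store.add("support_classifier", ...) calls, store.diff("support_classifier", 1, 2) returns a unified diff of the two snapshots to review before promotion.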

Rollout Strategy

1. Create a baseline prompt version.
2. Propose a candidate prompt.
3. Run the A/B suite against the same cases.
4. Promote only if the winner improves the average and keeps the violation count at zero.
5. Track post-release feedback and feed new failure cases back into the test suite.
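The promotion rule in step 4 can be written down as an explicit gate. A minimal sketch: SuiteResult, should_promote, and the optional min_gain margin are hypothetical names, and the numbers in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    mean_score: float  # average case score over the full suite
    violations: int    # total forbidden-content hits over the full suite


def should_promote(baseline, candidate, min_gain=0.0):
    """Promote only if the candidate has zero violations and a higher average."""
    if candidate.violations > 0:
        return False
    return candidate.mean_score > baseline.mean_score + min_gain


baseline = SuiteResult(mean_score=0.82, violations=0)
better = SuiteResult(mean_score=0.88, violations=0)
unsafe = SuiteResult(mean_score=0.95, violations=1)

print(should_promote(baseline, better))  # prints: True
print(should_promote(baseline, unsafe))  # prints: False
```

Checking violations before the score means a candidate with a higher average but a single forbidden-content hit is still rejected, which matches the zero-violation requirement in the checklist.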

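Prompt Engineer Toolkit

Overview

Use this skill to turn prompts from ad-hoc drafts into production assets with repeatable testing, version control, and regression safety. It emphasizes measurable quality over intuition. Apply it in the following scenarios: when launching a new LLM feature that needs reliable output; when prompt quality degrades after a model or instruction change; when several team members edit prompts and need history and diffs; when you need evidence-based selection of a production prompt; or when you want consistent prompt governance across environments.

Core Capabilities

- A/B prompt evaluation based on structured test cases
- Quantitative scoring for adherence, relevance, and safety checks
- Prompt version tracking with immutable history and changelog
- Prompt diffs for reviewing behavior-impacting edits
- Reusable prompt templates and selection guidance
- Regression-friendly workflows for model/prompt updates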

Key Workflows

1. Run Prompt A/B Test

Prepare JSON test cases and run:

```bash
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text
```

Input can also be supplied as a JSON payload, either on stdin or through --input.

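2. Choose Winner With Evidence

The tester scores the output of each case and aggregates:

- expected content coverage
- forbidden content violations
- regex/format compliance
- output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.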

3. Version Prompts

```bash
# Add a version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff two versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Show the changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier
```

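4. Regression Loop

1. Store the baseline version.
2. Propose prompt edits.
3. Re-run the A/B test.
4. Promote only if the score improves and the safety constraints are still met.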

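Script Interfaces

- python3 scripts/prompt_tester.py --help
  - Reads prompts/cases from stdin or --input
  - Optional external runner command
  - Emits text or JSON metrics
- python3 scripts/prompt_versioner.py --help
  - Manages prompt history (add, list, diff, changelog)
  - Stores metadata and content snapshots locally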

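Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

1. Picking prompts from single-case outputs; use a realistic, edge-case-rich test suite instead.
2. Changing prompt and model simultaneously; always isolate variables.
3. Missing must_not_contain (forbidden-content) checks in evaluation criteria.
4. Editing prompts without recording version metadata, author, or change rationale.
5. Skipping semantic diffs before deploying a new prompt version.
6. Optimizing one benchmark while harming edge cases; track the full suite.
7. Swapping the model without rerunning the baseline A/B suite.

Before promoting any prompt, confirm:

- [ ] Task intent is explicit and unambiguous.
- [ ] Output schema/format is explicit.
- [ ] Safety and exclusion constraints are explicit.
- [ ] No contradictory instructions.
- [ ] No unnecessary verbosity that wastes tokens.
- [ ] A/B score improves and violation count stays at zero.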

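References

Evaluation Design

Each test case should define:

- input: realistic production-like input
- expected_contains: required markers/content
- forbidden_contains: disallowed phrases or unsafe content
- expected_regex: required structural patterns

This enables deterministic grading across prompt variants.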

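Versioning Policy

- Use semantic prompt identifiers per feature (support_classifier, ad_copy_shortform).
- Record author + change note for every revision.
- Never overwrite historical versions.
- Diff before promoting a new prompt to production.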

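Rollout Strategy

1. Create a baseline prompt version.
2. Propose a candidate prompt.
3. Run the A/B suite against the same cases.
4. Promote only if the winner improves the average and keeps the violation count at zero.
5. Track post-release feedback and feed new failure cases back into the test suite.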
