Secrets Management (Deep Workflow)
Guide the user through end-to-end secrets governance: what counts as a secret, where it may live, how it is injected and rotated, who can access what, and how misuse is detected. Act as a structured reviewer and architect, not a checklist robot.
When to Offer This Workflow
Trigger conditions:
- User mentions API keys, tokens, passwords, TLS private keys, signing keys, OAuth client secrets, DB credentials, or “hardcoded secret”
- Designing Vault/KMS/Parameter Store/Secrets Manager integration
- CI/CD needs secrets; local dev vs prod parity questions
- Audit/compliance asks for access logs or rotation evidence
Initial offer:
Explain you will use five stages: (1) inventory & classification, (2) storage & access model, (3) lifecycle & rotation, (4) developer & CI ergonomics, (5) verification & ongoing operations. Ask if they want this full pass or a narrower slice (e.g., “rotate one class of keys”).
If they decline the workflow, help freeform but still flag non-negotiables: no long-lived secrets in git, minimize blast radius, auditable access.
Stage 1: Inventory & Classification
Goal: Know what exists, where it is, who needs it, and blast radius if leaked.
Questions to Ask
- What environments exist (local, staging, prod, partner)? Are boundaries strict?
- What secret types are in scope: symmetric keys, asymmetric private keys, bearer tokens, DB passwords, cloud IAM, third-party API keys?
- Where might secrets already be duplicated (repos, wikis, tickets, Slack, laptops)?
- What compliance or contractual constraints apply (PCI, SOC2, customer DPAs)?
Actions
- Build a rough inventory table: secret class → consumers → storage today → rotation frequency → owner team.
- Explicitly hunt high-risk items: signing keys, encryption-at-rest master keys, long-lived admin credentials, cross-env reuse.
- Call out anti-patterns: secrets in env files committed to git, shared “team password”, same DB password everywhere.
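To make the inventory table concrete, here is a minimal Python sketch that records one row per secret class and flags the gaps listed above: no owner, slow or missing rotation, storage in git, and cross-environment reuse. The field names, the 90-day default, and the substring check for git are illustrative assumptions, not a standard schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Classification levels agreed in this stage, from least to most sensitive.
LEVELS = ("public", "internal", "confidential", "regulated")


@dataclass
class SecretRecord:
    secret_class: str                 # e.g. "payments-db-password" (illustrative)
    consumers: list[str]              # services or teams that read it
    storage: str                      # where it lives today
    rotation_days: int | None         # None means it has never been rotated
    owner: str | None                 # owning team; None means unowned
    classification: str = "internal"
    environments: list[str] = field(default_factory=list)  # envs sharing this exact value


def review(records: list[SecretRecord], max_rotation_days: int = 90) -> list[str]:
    """Return human-readable findings for common inventory gaps."""
    findings = []
    for r in records:
        if r.classification not in LEVELS:
            findings.append(f"{r.secret_class}: unknown classification {r.classification!r}")
        if r.owner is None:
            findings.append(f"{r.secret_class}: no owner")
        if r.rotation_days is None or r.rotation_days > max_rotation_days:
            findings.append(f"{r.secret_class}: rotation missing or slower than {max_rotation_days} days")
        if "git" in r.storage.lower():  # crude heuristic for "committed to a repo"
            findings.append(f"{r.secret_class}: stored in git")
        if len(set(r.environments)) > 1:
            findings.append(f"{r.secret_class}: same value reused across {sorted(set(r.environments))}")
    return findings
```

For example, a DB password kept in a committed env file, shared between staging and prod, with no owner and no rotation, produces four findings; a secret in a secret manager with an owner and 30-day rotation produces none.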
Exit Condition
User can name owners for each critical class and agrees on classification (public / internal / confidential / regulated).
Transition: Move to choosing storage and access patterns that match classification and scale.
Stage 2: Storage & Access Model
Goal: Pick mechanisms so secrets are encrypted at rest, scoped, and auditable.
Design Points
- Central secret store (e.g., Vault) vs cloud-native managers (e.g., AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) vs KMS-only patterns.
- Identity binding: runtime identity (IAM role, K8s service account, workload identity) vs static tokens.
- Encryption paths: envelope encryption, KMS CMKs, HSM requirements for signing keys.
- Namespaces / paths: logical isolation per team, app, environment; avoid global buckets.
Trade-offs to Surface
- Latency & availability: secret fetch on startup vs sidecar vs CSI driver; failure modes when the store is down.
- Break-glass: who can decrypt in emergency, with what approval and logging.
- Multi-region: replication, failover, and consistency for secret references.
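One way to reason about the fetch-on-startup failure mode is a small cache with an explicit stale budget: serve a fresh value when possible, fall back to the last known value for a bounded time if the store is unreachable, and fail closed after that. This Python sketch assumes a hypothetical `fetch` callable standing in for your store's client; the TTL and stale limits are placeholders to set per secret class.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    """TTL cache around a secret fetch, with a bounded stale fallback.

    `fetch` is a hypothetical callable standing in for your store client
    (Vault, a cloud secret manager, a CSI-mounted file, etc.).
    """

    def __init__(self, fetch: Callable[[], str], ttl_s: float = 300.0,
                 max_stale_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._max_stale_s = max_stale_s
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is not None and now - self._fetched_at < self._ttl_s:
            return self._value  # fresh cache hit, no call to the store
        try:
            self._value = self._fetch()
            self._fetched_at = now
            return self._value
        except Exception:
            # Store unreachable: keep serving the last value, but only within the stale budget.
            if self._value is not None and now - self._fetched_at < self._max_stale_s:
                return self._value
            raise  # nothing cached, or cached value too old: fail closed
```

Whether serving stale values is acceptable is itself a policy decision: for credentials that may have been revoked, a shorter `max_stale_s` trades availability for faster enforcement.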
Exit Condition
A written access model: principals → permissions → secret paths → justification. No “everyone read/write production.”
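The written access model can also be kept as data and checked mechanically. The sketch below encodes each grant as principal, actions, path pattern, and justification, evaluates requests with default deny, and lints for the "everyone on production" anti-pattern. The `env/team/app/name` path layout, the `*` principal wildcard, and the action names are assumptions for illustration; real stores (Vault policies, cloud IAM) have their own policy languages.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Grant:
    principal: str          # runtime identity, or "*" for everyone (hypothetical convention)
    actions: frozenset      # e.g. frozenset({"read"}) or frozenset({"read", "write"})
    path_glob: str          # e.g. "prod/billing/api/*"; note fnmatch's * also matches "/"
    justification: str


def is_allowed(grants: list[Grant], principal: str, action: str, path: str) -> bool:
    """Default deny: allowed only if some grant covers this principal, action, and path."""
    return any(
        g.principal in (principal, "*") and action in g.actions and fnmatchcase(path, g.path_glob)
        for g in grants
    )


def lint(grants: list[Grant]) -> list[str]:
    """Flag grants without a justification and overly broad production grants."""
    problems = []
    for g in grants:
        if not g.justification.strip():
            problems.append(f"{g.principal} on {g.path_glob}: missing justification")
        env = g.path_glob.split("/", 1)[0]
        if env in ("prod", "*") and (g.principal == "*" or g.path_glob in ("*", "prod/*")):
            problems.append(f"{g.principal} on {g.path_glob}: overly broad production grant")
    return problems
```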
Transition: Define how secrets change over time and how old values are retired safely.
Stage 3: Lifecycle & Rotation
Goal: Secrets expire, rotate, and revoke without surprise outages.
Workflow
- Rotation policy per class: automatic vs manual, max age, overlap window.
- Dual-credential periods when services must accept both old and new during rollout.
- Revocation: immediate invalidation paths for compromise (API key disable, cert CRL, session kill).
- Bootstrap: how the first secret reaches the runtime in a new environment without a chicken-and-egg problem (e.g., cloud IAM identity → fetch the remaining secrets).
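The dual-credential period can be sketched as a verifier that accepts both the current and the previous value until the overlap window is explicitly closed. This is a minimal in-memory Python illustration with a hypothetical `RotatingCredential` class; in practice both values live in your secret store, and the rollout order (issue new, update consumers, retire old) is driven by your deploy process.

```python
import hmac
import secrets
from typing import Optional


class RotatingCredential:
    """Accepts the current and the previous secret until the overlap window is closed."""

    def __init__(self, initial: str) -> None:
        self.current = initial
        self.previous: Optional[str] = None

    def rotate(self) -> str:
        """Issue a new value; the old one stays valid until retire_previous() is called."""
        self.previous = self.current
        self.current = secrets.token_urlsafe(32)
        return self.current

    def retire_previous(self) -> None:
        """Close the overlap window once every consumer has switched to the new value."""
        self.previous = None

    def verify(self, presented: str) -> bool:
        candidates = [self.current] if self.previous is None else [self.current, self.previous]
        # compare_digest resists timing attacks that could reveal the secret's value.
        return any(hmac.compare_digest(presented.encode(), c.encode()) for c in candidates)
```

The same shape applies to API keys your own service issues; for third-party credentials it only works if the provider allows two keys to be active at once.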
Pitfalls to Call Out
- Rotating a DB password without draining connection pools → thundering-herd reconnect failures.
- Clients caching JWT signing keys without key ID (kid) rotation support.
- Secrets embedded in container images or build artifacts.
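To illustrate key ID rotation support, here is a stdlib-only Python sketch that signs and verifies HS256 JWTs and selects the verification key by the `kid` header, so tokens signed under the old and new keys both verify while both keys are in the key set. The `kid` values and helper names are hypothetical, and the sketch omits checks a real verifier needs (expiry, audience, issuer); production code should use a maintained JWT library. With asymmetric algorithms the same idea applies to a JWKS of public keys.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, kid: str, key: bytes) -> str:
    """Produce a compact HS256 JWT whose header names the signing key via kid."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = f"{_b64url(json.dumps(header).encode())}.{_b64url(json.dumps(claims).encode())}"
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, keyset: dict) -> dict:
    """Verify with the key selected by kid, so old and new keys can coexist."""
    header_b64, claims_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")  # never let the token choose the algorithm
    key = keyset.get(header.get("kid"))
    if key is None:
        # A real client would refresh its key set once here before rejecting.
        raise KeyError(f"unknown kid {header.get('kid')!r}")
    expected = hmac.new(key, f"{header_b64}.{claims_b64}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(claims_b64))
```

Removing the old `kid` from the key set is the retirement step. A client that caches a single key with no `kid` lookup has no way to accept both during the overlap.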
Exit Condition
User has a rotation runbook outline and knows the order of operations for at least one critical path.
Transition: Make the model usable for engineers daily without encouraging leaks.
Stage 4: Developer & CI Ergonomics
Goal: Correct behavior is the default; wrong behavior is hard or blocked.
Practices
- Local dev: short-lived dev credentials, personal sandboxes, .env.example without values, secret scanners in pre-commit/CI.
- CI: OIDC to cloud (no long-lived cloud keys in CI secrets if avoidable), scoped tokens, environment-specific secrets.
- Code review: watch for secrets passed as parameters, missing log redaction, and error messages that leak tokens.
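For the logging-redaction point, here is a minimal Python sketch using the standard `logging.Filter` hook: it renders each message, masks values that look like bearer tokens or `key=value` secrets, and passes the record on. The regular expressions are assumptions to tune to your own token formats; pattern-based redaction is a safety net, not a substitute for not logging secrets in the first place.

```python
import logging
import re

# Illustrative patterns only; tune them to the token formats your systems issue.
_PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)[A-Za-z0-9\-._~+/]+=*"),
    re.compile(r"(?i)((?:password|passwd|secret|api[_-]?key|token)\s*[=:]\s*)\S+"),
]


def redact(text: str) -> str:
    """Replace the value part of each match with a fixed marker."""
    for pattern in _PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Renders each record's message, redacts it, and lets the record through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True
```

Attach the filter to handlers rather than to a single logger: a filter on a logger is only consulted for records logged directly to that logger, not for records propagated from its children.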
Tooling Mentions (when relevant)
- Git secret scanning (e.g., gitleaks, trufflehog), chosen per org policy.
- Dynamic secrets / database roles if using Vault-style patterns.
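To show what such scanners do at their core, here is a toy Python line scanner with three illustrative rules: AWS access key IDs, PEM private key headers, and generic `password=`/`token=` style assignments. It is a sketch for understanding, not a replacement for gitleaks or trufflehog, whose rule sets are much larger and tuned against false positives. The rule names and the 12-character threshold are assumptions.

```python
import re
from typing import Iterable, Iterator, Tuple

# Three illustrative rules; real scanners ship much larger, tuned rule sets.
RULES = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private-key-block": re.compile(r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"),
    "generic-assignment": re.compile(
        r"(?i)(?:password|secret|api[_-]?key|token)\s*[=:]\s*['\"]?[A-Za-z0-9/+_\-]{12,}"
    ),
}


def scan_text(text: str, source: str = "<input>") -> Iterator[Tuple[str, int, str]]:
    """Yield (source, line_number, rule_name) for each line matching a rule."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                yield source, line_no, rule


def main(paths: Iterable[str]) -> int:
    """Scan files and return 1 on any finding, so a hook or CI step can fail on it."""
    found = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for src, line_no, rule in scan_text(fh.read(), path):
                print(f"{src}:{line_no}: possible secret ({rule})")
                found += 1
    return 1 if found else 0
```

Because `main` returns a non-zero status on any finding, it could be wired into a pre-commit hook or CI step; how it is invoked is left to your tooling.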
Exit Condition
Clear developer story: “I clone repo → I authenticate → I get least-privilege creds → I never paste prod keys locally unless policy allows.”
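The clone step of this story usually starts from the .env.example file mentioned under Practices, so it is worth checking mechanically that the example never carries real values. A minimal Python sketch; the placeholder set is a hypothetical team convention, non-secret defaults (such as a log level) would need an allowlist, and the simple `KEY=value` parsing does not cover every dotenv dialect (multi-line values, inline comments).

```python
from __future__ import annotations

import re

_ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$")
# Values treated as placeholders (hypothetical team convention).
_PLACEHOLDERS = {"", '""', "''", "changeme", "<set-me>"}


def check_env_example(text: str) -> list[str]:
    """Return names of variables in a .env.example that carry non-placeholder values."""
    offenders = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comments may describe a variable but never hold a value
        match = _ASSIGNMENT.match(line)
        if match and match.group(2) not in _PLACEHOLDERS:
            offenders.append(match.group(1))
    return offenders
```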
Transition: Prove the design works and stays healthy over time.
Stage 5: Verification & Operations
Goal: Evidence that controls work; readiness when things go wrong.
Verification
- Drills: restore from backup of secret metadata (if applicable), rotate in staging with full integration tests.
- Audit review: sample access logs; alert on anomalous read patterns.
- Incident: playbook for “credential leaked on GitHub” — revoke order, scope, customer comms if needed.
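For "alert on anomalous read patterns", a simple starting heuristic is to compare a recent window of secret reads to a baseline: flag principals reading paths they never read before, and principals whose read volume jumps sharply. This Python sketch assumes audit events already normalized to hypothetical `(principal, secret_path)` tuples over windows of comparable length; the 5x factor is a placeholder.

```python
from __future__ import annotations

from collections import Counter


def anomalous_reads(baseline: list[tuple[str, str]], window: list[tuple[str, str]],
                    spike_factor: float = 5.0) -> list[str]:
    """Compare a recent window of (principal, secret_path) reads against a baseline."""
    seen = set(baseline)
    base_counts = Counter(principal for principal, _ in baseline)
    window_counts = Counter(principal for principal, _ in window)
    alerts = []
    # Any principal/path pair absent from the baseline is a first-time access.
    for principal, path in sorted(set(window) - seen):
        alerts.append(f"new access: {principal} -> {path}")
    # Volume spike relative to the same principal's baseline (floor of 1 read).
    for principal, n in sorted(window_counts.items()):
        if n > spike_factor * max(base_counts[principal], 1):
            alerts.append(f"volume spike: {principal} read {n} times")
    return alerts
```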
Metrics / Signals (examples)
- Failed authentication spikes after rotation
- Secret fetch error rates from apps
- Time-to-revoke for a simulated leak
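Two of these signals can be computed from data you likely already have. This Python sketch measures time-to-revoke for a leak drill and compares authentication failure rates before and after a rotation; the `(failed, total)` input shape, the 3x factor, and the minimum-failure floor are illustrative assumptions.

```python
from __future__ import annotations

from datetime import datetime


def time_to_revoke(leak_detected: datetime, revoked: datetime) -> float:
    """Minutes from detection of a (simulated) leak to confirmed revocation."""
    return (revoked - leak_detected).total_seconds() / 60


def auth_failure_spike(before: tuple[int, int], after: tuple[int, int],
                       factor: float = 3.0, min_failures: int = 10) -> bool:
    """True if the failure rate after rotation exceeds factor x the rate before.

    Each argument is (failed, total) authentication attempts over comparable
    windows; min_failures suppresses alerts on tiny absolute numbers.
    """
    failed_before, total_before = before
    failed_after, total_after = after
    if total_after == 0 or failed_after < min_failures:
        return False
    rate_before = failed_before / total_before if total_before else 0.0
    return failed_after / total_after > factor * rate_before
```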
Exit Condition
User can answer: “If this key leaks at 3am, what are steps 1–5 and who is paged?”
Final Review Checklist
- [ ] No production secrets in source control or public artifacts
- [ ] Least privilege enforced at identity + path + operation level
- [ ] Rotation and revocation paths documented with owners
- [ ] CI and local dev paths do not encourage static prod credentials
- [ ] Audit/logging aligned with organizational requirements
Tips for Effective Guidance
- Prefer concrete sequences (bootstrap → fetch → use → rotate) over abstract “use a vault.”
- Always ask blast radius and who can decrypt.
- When the user lacks org context, give options with trade-offs, not single-vendor gospel.
Handling Deviations
- “We only need one API key”: still classify, store centrally, and set expiry where possible.
- “Too heavy for our stage”: the minimum viable setup is separate secrets per environment, a secret manager, a secret scanner in CI, and no keys in the repo.