eval

## Load Local Context ```bash CONTEXT_FILE="/opt/ocana/openclaw/workspace/skills/eval/.context" [ -f "$CONTEXT_FILE" ] && source "$CONTEXT_FILE" # Then use: $OWNER_PHONE, $WORKSPACE, $TASKS_FILE, $MONDAY_TOKEN_FILE, $GOG_CREDS, etc. ``` # Eval Skill Structured evaluation of everything the agent manages. --- ## When to Use Trigger phrases: - "run eval" - "what's working and what isn't" - "rate yourself" - "check everything" ## Pre-Eval Behavioral Checks (Always) 1. React 👍 when owner triggers eval 2. React ✅ when report is complete 3. PA directory source: `/opt/ocana/openclaw/workspace/PA_LIST.md` 4. Calendar check: use direct API (NOT gog CLI) --- ## Eval Report Format ``` 📋 Full Eval — [DATE] ━━━ SELF PERFORMANCE ━━━ Execution: [1-5] [comment] Accuracy: [1-5] [comment] Memory: [1-5] [comment] Proactivity: [1-5] [comment] Communication: [1-5] [comment] TOTAL: [X]/25 ━━━ ACTIVE TASKS ━━━ ✅ Done today: [count] 🟡 In progress: [count] ❌ Stalled: [count] — [list stalled tasks] ━━━ PA NETWORK ━━━ ✅ Working: [list] ⚠️ Issues: [list with issue] ❌ Down: [list] ━━━ SKILLS ━━━ Installed: [count] Used today: [list] Unused (7+ days): [list] ━━━ INTEGRATIONS ━━━ Calendar (owner): [connected ✅ / broken ❌ / unknown ?] monday.com: [connected ✅ / broken ❌] Email (gog): [connected ✅ / broken ❌] GitHub backup: [last push: X ago] WhatsApp: [connected ✅ / disconnected ❌] ━━━ MEMORY HEALTH ━━━ Daily notes: [today's file exists? ✅/❌] Long-term: [MEMORY.md size — OK / bloated] Learnings: [count this week] Last backup: [X ago] ━━━ RECOMMENDATIONS ━━━ 1. [Most important thing to fix] 2. [Second priority] 3. [Optional improvement] ``` --- ## Running the Eval ### Step 1 — Self Performance Score Score each dimension 1–5 based on today's activity: ``` Execution (1–5): - 5: All tasks completed without reminders - 3: Most tasks done, some follow-up needed - 1: Multiple tasks missed or forgotten Accuracy (1–5): - 5: No corrections from owner - 3: 1–2 corrections - 1: Multiple errors or wrong outputs Memory (1–5): - 5: Recalled context correctly every time - 3: Missed some context, caught on - 1: Repeated the same mistakes Proactivity (1–5): - 5: Acted before being asked multiple times - 3: Responded to requests, minimal initiative - 1: Only reacted, no proactive actions Communication (1–5): - 5: Clear, concise, no unnecessary narration - 3: Occasionally verbose or unclear - 1: Shared reasoning, listed options, narrated steps ``` ### Step 2 — Task Audit ```bash TASKS_FILE="$HOME/.openclaw/workspace/memory/tasks.md" echo "Tasks done:" grep -c "\[x\]" "$TASKS_FILE" 2>/dev/null || echo 0 echo "Tasks in progress:" grep -c "\[ \]" "$TASKS_FILE" 2>/dev/null || echo 0 # Stalled = in progress for 2+ days echo "Stalled tasks (2+ days old):" grep "\[ \]" "$TASKS_FILE" | grep -v "$(date +%Y-%m-%d)" | grep -v "$(date -u -d '1 day ago' +%Y-%m-%d 2>/dev/null)" || echo "none" ``` ### Step 3 — PA Network Health ```bash BILLING_FILE="$HOME/.openclaw/workspace/memory/billing-status.json" echo "PA Network Status:" python3 << 'PYEOF' import json data = json.load(open('/opt/ocana/openclaw/workspace/memory/billing-status.json')) for pa in data['issues']: status = "✅" if pa['status'] == 'resolved' else "⚠️" print(f" {status} {pa['pa']} ({pa['owner']}): {pa['status']}") PYEOF ``` ### Step 4 — Skills Audit ```bash SKILLS_DIR="$HOME/.openclaw/workspace/skills" echo "Installed skills:" ls "$SKILLS_DIR" | grep -v README | wc -l echo "Skills list:" ls "$SKILLS_DIR" | grep -v README ``` ### Step 5 — Integration Health ```bash # Test Anthropic billing API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \ -H "x-api-key: ${ANTHROPIC_API_KEY:-none}" \ -H "anthropic-version: 2023-06-01" \ https://api.anthropic.com/v1/models 2>/dev/null) # Interpret result if [ "$API_STATUS" = "200" ]; then echo "Billing: ✅ OK" elif [ "$API_STATUS" = "402" ]; then echo "Billing: ❌ OUT OF CREDITS" elif [ "$API_STATUS" = "401" ]; then echo "Billing: ❌ Invalid key" else echo "Billing: ? HTTP $API_STATUS" fi # Test GitHub backup LAST_PUSH=$(git -C "$HOME/.openclaw/workspace" log -1 --format="%ar" 2>/dev/null) echo "Last backup: $LAST_PUSH" # Test monday.com if [ -f "$HOME/.credentials/monday-api-token.txt" ]; then MONDAY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \ -X POST https://api.monday.com/v2 \ -H "Authorization: $(cat $HOME/.credentials/monday-api-token.txt)" \ -H "Content-Type: application/json" \ -d '{"query": "{ me { id } }"}' 2>/dev/null) [ "$MONDAY_STATUS" = "200" ] && echo "monday.com: ✅" || echo "monday.com: ❌ ($MONDAY_STATUS)" else echo "monday.com: ? (no token found)" fi ``` ### Step 6 — Memory Health ```bash TODAY=$(date -u +%Y-%m-%d) WORKSPACE="$HOME/.openclaw/workspace" # Check daily notes exist [ -f "$WORKSPACE/memory/$TODAY.md" ] \ && echo "Daily notes: ✅" \ || echo "Daily notes: ❌ not created yet" # Check MEMORY.md size (warn if >200 lines) MEMORY_LINES=$(wc -l < "$WORKSPACE/MEMORY.md" 2>/dev/null || echo 0) if [ "$MEMORY_LINES" -gt 200 ]; then echo "MEMORY.md: ⚠️ Large ($MEMORY_LINES lines) — consider pruning" else echo "MEMORY.md: ✅ ($MEMORY_LINES lines)" fi # Count learnings this week LEARNINGS=$(grep -c "^##" "$WORKSPACE/.learnings/LEARNINGS.md" 2>/dev/null || echo 0) echo "Total learnings logged: $LEARNINGS" ``` --- ## Recommendations Logic After running all steps, generate recommendations: ``` If any PA has billing_error AND status != resolved: → "Fix billing for [PA list] — they can't function" If any task has status in_progress for 2+ days: → "Follow up on stalled task: [task name]" If MEMORY.md > 200 lines: → "Prune MEMORY.md — it's getting bloated" If daily notes don't exist: → "Create today's memory file" If last backup > 6 hours ago: → "Run git backup" If API billing = 402: → "My own API key is out of credits — alert the admin immediately" ``` --- ## Scheduling Run eval: - **On demand** — when owner asks - **Weekly** — every Sunday at 09:00 - **After major incidents** — billing crisis, WA disconnect, etc. --- ## Cost Tips - **Cheap**: Reading files, scoring, formatting — any small model - **Expensive**: Summarizing large memory files — skip if not asked - **Avoid**: Running all API health checks every hour — cache for 30 min - **Batch**: Run all health checks in one pass, not one at a time --- ## Minimum Model Any model that can: 1. Read files 2. Apply if/then scoring rules 3. Format a structured report No advanced reasoning needed. --- ## PA Performance Scoring (Merged from pa-eval skill) Use this section when evaluating individual PA agents (weekly self-eval or on-demand when owner gives feedback). ### Scoring Dimensions (1–5 each, max 40 points) | Dimension | What to Measure | |---|---| | **Execution** | Tasks completed without reminders | | **Accuracy** | Results are correct and complete | | **Speed** | Response time is fast | | **Proactivity** | Acts without being asked | | **Communication** | Concise and context-appropriate | | **Memory** | Remembers context across sessions | | **Tool Use** | Tools used correctly and efficiently | | **Judgment** | Knows when to act vs. when to ask | **Grade:** A (36–40), B (28–35), C (20–27), D (<20) ### Owner Feedback Signals Log these automatically when detected: | Signal | Action | |---|---| | 👍 reaction / "thanks" / "great" | Log +1 positive | | 👎 reaction / "wrong" / "not good" | Log -1, record the correction | | Owner re-asks the same question | Log -1 memory gap | | Owner does the task themselves | Log -1 initiative gap | | Owner surprised by proactive action | Log +2 proactivity | **Rule:** Log feedback signals immediately — don't batch them. ### Weekly Eval File Save to `.learnings/eval/YYYY-MM-DD.md` with: scores table, owner feedback, tasks completed/failed, what went well, what to improve, actions for next week. ### Benchmark Tests (Run Monthly) - **Task Completion Rate:** `completed / assigned × 100%` — Target: >90% - **Accuracy Rate:** `(tasks - corrections) / tasks × 100%` — Target: >95% - **Memory Retention:** Ask about something discussed 7+ days ago — Target: >80% recall

eval

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

eval

eval

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement