Iterative Skill Refinement

Core Principle

If you only improve against a fixed benchmark, you're training to the test. Every improvement must generalize beyond the tasks that revealed it.

The Improvement Loop

1. EXPERIMENT — Run baseline vs skill on diverse tasks
2. ASSESS    — Blind assess (use blind-skill-assessment)
   └─ Skill wins consistently? → DONE (see Convergence)
3. DIAGNOSE  — Root cause on dimensions where skill lost
   └─ Don't fix symptoms. Ask: "What class of bugs does this represent?"
   └─ Example: "bugs at hole boundaries" → missing cross-hole verification
4. TRIAGE    — Rank causes by breadth. Fix the widest-impact cause first.
5. EDIT      — One targeted change for one root cause. Log it (see Revision Log).
6. RE-RUN    — New experiments with improved skill
   └─ 2+ cycles on same tasks? Add new tasks (see Anti-Overfitting)
   └─ 2+ cycles with same judges? Rotate personas or dimensions
7. GOTO 2

Anti-Overfitting Checklist

Before committing any skill edit, answer all four:

Check	Pass	Fail
Would this help on a completely different task?	Structural improvement	Overfitting
Does this add a general step/rule, not task-specific wording?	Genuine	Overfitting
After 2+ cycles on same tasks, did you add new tasks?	Fresh signal	Stale benchmark
After 2+ cycles with same judges, did you rotate personas?	Diverse signal	Judge-fitted

Litmus Test

"Adds a structural step catching a class of bugs" = genuine. "Tunes wording to pass a specific test" = overfitting.

Convergence — When to Stop

Stop when all hold:

Skill wins consistently across diverse tasks (not just original set)
New task sets reveal no new failure modes
Last 2 cycles produced no revisions
Gains per cycle are diminishing

Revision Log

Every skill edit gets a row. No undocumented changes.

Date	What changed	Triggered by	Validated by
example	Added VERIFY step	Phase 3: bugs at hole seams	Phase 3b: HDD won 5/5

Example: HDD VERIFY Step

Blind assessment: baseline won 4/5. Bug Hunter found bugs at hole boundaries — shared state and error paths crossing seams. Root cause: no cross-hole verification step. Edit: added VERIFY step (check state, scopes, error paths after each fill). Re-run: HDD won 5/5. New tasks confirmed. Converged.

Red Flags

STOP if you catch yourself doing any of these

Proposing fixes without a re-experimentation plan
Treating all failures with equal weight (no triage)
Editing the skill without logging what triggered and validated it
Same tasks for 3+ cycles without adding new ones
Framing success as "fewer bugs" instead of "beats baseline"
No convergence criteria defined before starting

STOP. Diagnose root cause. Plan validation. Then edit.