Iterative Skill Refinement
Core Principle
If you only improve against a fixed benchmark, you're training to the test. Every improvement must generalize beyond the tasks that revealed it.
The Improvement Loop
1. EXPERIMENT — Run baseline vs skill on diverse tasks
2. ASSESS — Blind assess (use blind-skill-assessment)
└─ Skill wins consistently? → DONE (see Convergence)
3. DIAGNOSE — Root cause on dimensions where skill lost
└─ Don't fix symptoms. Ask: "What class of bugs does this represent?"
└─ Example: "bugs at hole boundaries" → missing cross-hole verification
4. TRIAGE — Rank causes by breadth. Fix the widest-impact cause first.
5. EDIT — One targeted change for one root cause. Log it (see Revision Log).
6. RE-RUN — New experiments with improved skill
└─ 2+ cycles on same tasks? Add new tasks (see Anti-Overfitting)
└─ 2+ cycles with same judges? Rotate personas or dimensions
7. GOTO 2
Anti-Overfitting Checklist
Before committing any skill edit, answer all four:
| Check | Pass | Fail |
|---|---|---|
| Would this help on a completely different task? | Structural improvement | Overfitting |
| Does this add a general step/rule, not task-specific wording? | Genuine | Overfitting |
| After 2+ cycles on same tasks, did you add new tasks? | Fresh signal | Stale benchmark |
| After 2+ cycles with same judges, did you rotate personas? | Diverse signal | Judge-fitted |
Litmus Test
"Adds a structural step catching a class of bugs" = genuine. "Tunes wording to pass a specific test" = overfitting.
Convergence — When to Stop
Stop when all hold:
- Skill wins consistently across diverse tasks (not just original set)
- New task sets reveal no new failure modes
- Last 2 cycles produced no revisions
- Gains per cycle are diminishing
Revision Log
Every skill edit gets a row. No undocumented changes.
| Date | What changed | Triggered by | Validated by |
|---|---|---|---|
| example | Added VERIFY step | Phase 3: bugs at hole seams | Phase 3b: HDD won 5/5 |
Example: HDD VERIFY Step
Blind assessment: baseline won 4/5. Bug Hunter found bugs at hole boundaries — shared state and error paths crossing seams. Root cause: no cross-hole verification step. Edit: added VERIFY step (check state, scopes, error paths after each fill). Re-run: HDD won 5/5. New tasks confirmed. Converged.
Red Flags
STOP if you catch yourself doing any of these
- Proposing fixes without a re-experimentation plan
- Treating all failures with equal weight (no triage)
- Editing the skill without logging what triggered and validated it
- Same tasks for 3+ cycles without adding new ones
- Framing success as "fewer bugs" instead of "beats baseline"
- No convergence criteria defined before starting
STOP. Diagnose root cause. Plan validation. Then edit.