Experiment-Driven Skill Development
Three Claude Code skills for developing, validating, and improving AI agent skills through blind A/B experimentation.
The Problem
How do you know if a Claude Code skill actually works? Compliance checks (does the agent follow the rules?) are necessary but insufficient. A skill can be followed perfectly and still produce worse results than baseline.
You need blind experiments. Run the same task with and without the skill, judge the outputs without knowing which is which, and let the data decide.
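As a rough illustration of the blinding step, the sketch below randomly assigns the labels A and B to the two outputs and keeps the answer key separate from what the judge sees. The function names (`blind_pair`, `unblind`) and the placeholder strings are hypothetical and not part of the skills themselves, which may implement randomization differently.

```python
import random


def blind_pair(baseline_output, skill_output, rng):
    """Assign labels "A" and "B" at random.

    Returns (labeled, answer_key): `labeled` is what the judge sees,
    `answer_key` records which label hides which source.
    """
    entries = [("baseline", baseline_output), ("skill", skill_output)]
    rng.shuffle(entries)  # random order decides which output becomes "A"
    labeled = {label: text for label, (_, text) in zip("AB", entries)}
    answer_key = {label: source for label, (source, _) in zip("AB", entries)}
    return labeled, answer_key


def unblind(verdict_label, answer_key):
    """Translate the judge's chosen label back to "baseline" or "skill"."""
    return answer_key[verdict_label]


# Hypothetical usage: the strings stand in for real agent outputs.
rng = random.Random()
labeled, answer_key = blind_pair(
    "output produced without the skill",
    "output produced with the skill",
    rng,
)
verdict = "A"  # placeholder for the judge's choice, made from `labeled` only
print(unblind(verdict, answer_key))
```

The answer key is consulted only after the verdict is recorded, so the judge cannot favor the skill or the baseline by name.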
The Skills
experiment-set-design — what to test, how to progress phases
├── blind-skill-assessment — how to fairly judge results
└── iterative-skill-refinement — the improvement loop (uses both above)
| Skill | When to Use |
|---|---|
| Blind Skill Assessment | Comparing two versions of agent output to determine which is better |
| Experiment Set Design | Designing experiments to test whether a skill is effective |
| Iterative Skill Refinement | Improving a skill that underperforms in blind assessment |
Each skill is independently useful. Together they form a complete methodology for evidence-based skill development.
Quick Start
Installation
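# Copy into your project (create the target directory first)
mkdir -p your-project/.claude/skills
cp -r skills/. your-project/.claude/skills/
# Or install globally for your user
mkdir -p ~/.claude/skills
cp -r skills/. ~/.claude/skills/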
Example Workflow
- Write a skill from a description or intuition
- Design experiments (use experiment-set-design): 3-5 tasks with baselines
- Run Phase 1: does the agent follow the skill? (compliance)
- Run Phase 2: does it follow under pressure? (stress)
- Run Phase 3: does following produce better results? (blind quality review)
- If skill loses: diagnose, edit, re-run (use iterative-skill-refinement)
- Judge results with randomized labels and multi-persona scoring (use blind-skill-assessment)
- Converge: skill wins consistently, no new failure modes
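The judging and convergence steps in the workflow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the persona names, the numeric scores, the averaging rule, and the functions `task_winner` and `has_converged` are all hypothetical, and the skills may define scoring and convergence differently.

```python
from statistics import mean


def task_winner(persona_scores, answer_key):
    """Unblind one task and return "skill", "baseline", or "tie".

    persona_scores maps a persona name to its scores for labels "A" and "B".
    answer_key maps each label back to "skill" or "baseline".
    """
    averages = {
        label: mean(scores[label] for scores in persona_scores.values())
        for label in ("A", "B")
    }
    if averages["A"] == averages["B"]:
        return "tie"
    best_label = max(averages, key=averages.get)
    return answer_key[best_label]


def has_converged(winners, new_failure_modes):
    """Assumed rule: the skill wins every task and no new failure mode was logged."""
    return all(w == "skill" for w in winners) and not new_failure_modes


# Made-up scores for a single task, for illustration only.
example_scores = {
    "correctness_reviewer": {"A": 3, "B": 4},
    "maintainability_reviewer": {"A": 2, "B": 4},
    "newcomer": {"A": 4, "B": 4},
}
example_key = {"A": "baseline", "B": "skill"}
winner = task_winner(example_scores, example_key)
print(winner, has_converged([winner], new_failure_modes=[]))
```

Averaging across personas is only one possible aggregation; a majority vote per persona would be an equally reasonable choice and may disagree with the average on close tasks.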
Test Results
All three skills were developed using their own methodology (TDD for skills):
| Skill | Baseline Gaps Found | GREEN Test Result |
|---|---|---|
| blind-skill-assessment | No randomization, single perspective, holistic verdicts | 5/5 process steps followed |
| experiment-set-design | No baseline principle, no phases, no anti-overfitting | All 6 gaps addressed |
| iterative-skill-refinement | No structured loop, no re-experimentation, no convergence | Full 7-step loop followed |
Integration test: all three skills compose correctly on a novel scenario.
Origin
Extracted from the methodology of the Hole Driven Development skill project: 35 experiments across 5 languages, culminating in 5/5 blind review wins after iterative improvement.