Experiment Set Design
Core Principle
Compliance doesn't prove quality. Quality requires blind comparison against baseline.
The Baseline Requirement
NON-NEGOTIABLE
Every experiment needs a baseline — same prompt, same task, no skill loaded. No baseline = invalid experiment.
Task Selection Checklist
Each task must satisfy ALL of:
- [ ] Skill's guidance should produce different behavior on this task
- [ ] Complex enough that architecture decisions matter (not a one-liner)
- [ ] Multiple valid approaches exist (skill's approach is distinguishable)
- [ ] Covers a different domain or language than other tasks in the set
Aim for 3-5 tasks per phase. Start small, add tasks when overfitting is suspected.
Three-Phase Progression
Phase 1: COMPLIANCE Phase 2: STRESS Phase 3: QUALITY
"Does it follow?" "Does it follow under "Does following help?"
pressure?"
PASS/FAIL per rule PASS/FAIL per rule Blind review (use
+ competing instructions blind-skill-assessment)
+ edge cases Winner per dimension
| | |
v v v
All PASS?─── no ──> Fix skill, re-run phase
|
yes
|
v
Advance to next phase
Phase 1 — Compliance. Check each rule. Binary PASS/FAIL.
Phase 2 — Stress. Same checks + competing instructions ("just write the whole thing"), time pressure, conflicting edge cases.
Phase 3 — Quality. Run with and without skill. Blind-assess using blind-skill-assessment. Skill must win on aggregate.
Advancement: All PASS before advancing. Any FAIL → improve skill, re-run current phase.
Task Set Diversity
Vary to prevent overfitting:
- Domain: CLI tool, web handler, data pipeline, algorithm
- Complexity: Single function, multi-module, concurrent
- Language: At least 2 if the skill is language-agnostic
After 2+ improvement cycles on the same task set, add new tasks.
Revision Tracking
For each run, record: skill version (git hash), tasks run, phase, results, what changed since last run.
Example: HDD Skill Phases
| Phase | Tasks | Method | Result |
|---|---|---|---|
| 1: Compliance | 5 tasks (Python, Haskell, Go) | Transcript check: holes visible? One-at-a-time? | 4/5 PASS, 1 FAIL |
| 1b: Re-run | Same 5 after skill edit | Same checks | 5/5 PASS |
| 2: Stress | 5 tasks + competing instructions | Same checks under pressure | 5/5 PASS |
| 3: Quality | 5 tasks, baseline vs skill | Blind 3-persona review | Skill won 4/5 |
Red Flags
STOP if you catch yourself doing any of these
- Designing experiments with no baseline comparison
- All tasks at the same difficulty or in the same domain
- Jumping to Phase 3 without passing Phase 1
- Same task set for 3+ improvement cycles without adding tasks
- No record of which skill version produced which results
STOP. Add baselines. Vary your tasks. Track your versions.