Experiment Set Design

Core Principle

Compliance doesn't prove quality. Quality requires blind comparison against a baseline.

The Baseline Requirement

NON-NEGOTIABLE

Every experiment needs a baseline — same prompt, same task, no skill loaded. No baseline = invalid experiment.
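The pairing can be sketched as a tiny harness type. This is illustrative, not a real API: `Run` and `paired_runs` are hypothetical names, and `skill=None` stands in for "no skill loaded".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    task_id: str
    prompt: str
    skill: Optional[str]  # None = baseline (no skill loaded)


def paired_runs(task_id: str, prompt: str, skill: str) -> list:
    """Every experiment is a pair: identical task and prompt,
    differing only in whether the skill is loaded."""
    return [
        Run(task_id, prompt, skill=None),   # baseline
        Run(task_id, prompt, skill=skill),  # treatment
    ]
```

If your harness can't produce this pair for a given experiment, the experiment is invalid by construction.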

Task Selection Checklist

Each task must satisfy ALL of:

  • [ ] Skill's guidance should produce different behavior on this task
  • [ ] Complex enough that architecture decisions matter (not a one-liner)
  • [ ] Multiple valid approaches exist (skill's approach is distinguishable)
  • [ ] Covers a different domain or language than other tasks in the set

Aim for 3-5 tasks per phase. Start small and add tasks when you suspect overfitting.
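The checklist can be enforced mechanically. A minimal sketch, assuming a hypothetical `Task` record with fields for each checklist item (the names are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    domain: str            # e.g. "cli", "web", "pipeline", "algorithm"
    language: str
    multi_approach: bool   # multiple valid approaches exist
    skill_sensitive: bool  # skill's guidance should change behavior here


def validate_task_set(tasks: list) -> list:
    """Return a list of checklist violations; empty means the set is usable."""
    problems = []
    if not 3 <= len(tasks) <= 5:
        problems.append(f"expected 3-5 tasks, got {len(tasks)}")
    for t in tasks:
        if not t.skill_sensitive:
            problems.append(f"{t.name}: skill would not change behavior")
        if not t.multi_approach:
            problems.append(f"{t.name}: only one valid approach")
    if len({t.domain for t in tasks}) < len(tasks):
        problems.append("tasks overlap in domain")
    return problems
```

Running this before a phase catches a homogeneous task set early, instead of after you've burned an improvement cycle on it.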

Three-Phase Progression

Phase 1: COMPLIANCE        Phase 2: STRESS              Phase 3: QUALITY
"Does it follow?"          "Does it follow under        "Does following help?"
                           pressure?"
PASS/FAIL per rule         PASS/FAIL per rule           Blind review (use
                           + competing instructions     blind-skill-assessment)
                           + edge cases                 Winner per dimension
        |                          |                           |
        v                          v                           v
   All PASS? ─── no ──> Fix skill, re-run phase
        |
       yes
        |
        v
   Advance to next phase

Phase 1 — Compliance. Check each rule. Binary PASS/FAIL.

Phase 2 — Stress. Same checks + competing instructions ("just write the whole thing"), time pressure, conflicting edge cases.

Phase 3 — Quality. Run with and without skill. Blind-assess using blind-skill-assessment. Skill must win on aggregate.

Advancement: every task must PASS before moving on. Any FAIL → improve the skill, then re-run the current phase.
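The advancement rule is the same gate at every phase. A minimal sketch, assuming results are keyed by task name with `True` meaning PASS (`phase_gate` is a hypothetical helper):

```python
def phase_gate(results: dict) -> str:
    """All PASS -> advance to the next phase.
    Any FAIL -> fix the skill and re-run this phase, naming the failures."""
    if results and all(results.values()):
        return "advance"
    failed = sorted(t for t, ok in results.items() if not ok)
    return "fix skill, re-run phase (failed: " + ", ".join(failed) + ")"
```

Note the gate never skips ahead: a 4/5 result in Phase 1 sends you back to the skill, not forward to stress testing.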

Task Set Diversity

Vary to prevent overfitting:

  • Domain: CLI tool, web handler, data pipeline, algorithm
  • Complexity: Single function, multi-module, concurrent
  • Language: At least 2 if the skill is language-agnostic

After 2+ improvement cycles on the same task set, add new tasks.

Revision Tracking

For each run, record: skill version (git hash), tasks run, phase, results, what changed since last run.
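One record per run is enough. A minimal sketch of the shape, with illustrative field names; `skill_version` is the skill's git hash (e.g. from `git rev-parse --short HEAD`):

```python
import datetime


def run_record(skill_version: str, phase: str, tasks: list,
               results: str, changes: str) -> dict:
    """One revision-tracking entry per experiment run."""
    return {
        "skill_version": skill_version,          # git hash of the skill
        "phase": phase,                          # e.g. "1: Compliance"
        "tasks": tasks,                          # task names run
        "results": results,                      # e.g. "4/5 PASS"
        "changed_since_last_run": changes,       # what you edited and why
        "recorded_at": datetime.datetime.now(
            datetime.timezone.utc).isoformat(),
    }
```

Without the `skill_version` field you cannot tell which edit produced which result, which is exactly the red flag listed below.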

Example: HDD Skill Phases

| Phase         | Tasks                            | Method                                         | Result           |
|---------------|----------------------------------|------------------------------------------------|------------------|
| 1: Compliance | 5 tasks (Python, Haskell, Go)    | Transcript check: holes visible? One-at-a-time? | 4/5 PASS, 1 FAIL |
| 1b: Re-run    | Same 5 after skill edit          | Same checks                                    | 5/5 PASS         |
| 2: Stress     | 5 tasks + competing instructions | Same checks under pressure                     | 5/5 PASS         |
| 3: Quality    | 5 tasks, baseline vs skill       | Blind 3-persona review                         | Skill won 4/5    |

Red Flags

STOP if you catch yourself doing any of these

  • Designing experiments with no baseline comparison
  • All tasks at the same difficulty or in the same domain
  • Jumping to Phase 3 without passing Phase 1
  • Same task set for 3+ improvement cycles without adding tasks
  • No record of which skill version produced which results

STOP. Add baselines. Vary your tasks. Track your versions.