Test Results

All three skills were developed using TDD for skills: write a failing test to capture the baseline (RED), write the skill, verify that the test passes (GREEN), and close loopholes (REFACTOR).

Baseline Behavior (RED Phase)

Each skill was tested by running a scenario WITHOUT the skill loaded, documenting what the agent naturally does wrong.

Blind Skill Assessment

Given two code implementations and asked "which is better?", the agent:

Gap                     | What happened
------------------------|---------------------------------------------------
No label randomization  | Used labels as given (ordering bias)
Single perspective      | Assessed from one viewpoint
Holistic verdict only   | "Version A is better" with no per-dimension scores
No upfront rubric       | Ad-hoc evaluation
No confidence markers   | All conclusions stated with equal certainty
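
The label-randomization gap can be illustrated with a minimal sketch. It assumes candidates arrive as a name-to-content mapping; the function names `blind_labels` and `decode` and the neutral "Option N" labels are hypothetical and are not taken from the skill itself.

```python
import random


def blind_labels(candidates, rng):
    """Shuffle candidates and hide their original names behind neutral labels.

    Returns (blinded, key):
      blinded maps neutral label -> content shown to the judge
      key     maps neutral label -> original name, kept away from the judge
    """
    names = list(candidates)
    rng.shuffle(names)  # randomize presentation order
    blinded, key = {}, {}
    for i, name in enumerate(names):
        label = f"Option {i + 1}"  # hypothetical neutral label scheme
        blinded[label] = candidates[name]
        key[label] = name
    return blinded, key


def decode(label, key):
    """Map a neutral label back to the original name after judging."""
    return key[label]


# Hypothetical inputs standing in for the two implementations.
candidates = {"Version A": "def f(): ...", "Version B": "def g(): ..."}
blinded, key = blind_labels(candidates, random.Random())
```

The judge would see only `blinded`; `key` is consulted after scoring, so the original names cannot influence the verdict.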

Experiment Set Design

Asked to design a test plan for a skill, the agent:

Gap                       | What happened
--------------------------|--------------------------------------------------------
No baseline principle     | Designed contrastive tests but didn't mandate baselines
No anti-overfitting       | All tests fixed upfront, no rotation
No phase progression      | All tests at same level (compliance only)
No assessment variation   | Single measurement approach
Impractical sample sizes  | Proposed 72-120 runs
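
Two of these gaps, missing baselines and missing rotation, can be made concrete with a small sketch. It is one possible reading rather than the skill's actual procedure: the `Run` type, the `plan_round` function, the `holdout_size` parameter, and the task names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    task: str
    with_skill: bool  # False means a baseline run without the skill


def plan_round(tasks, round_index, holdout_size=1):
    """Plan one round of runs.

    A different slice of tasks is held out each round (rotation), and every
    active task is run both without the skill (baseline) and with it.
    Assumes task names are distinct and holdout_size <= len(tasks).
    """
    n = len(tasks)
    start = (round_index * holdout_size) % n
    held_out = [tasks[(start + i) % n] for i in range(holdout_size)]
    active = [t for t in tasks if t not in held_out]
    runs = [Run(t, with_skill) for t in active for with_skill in (False, True)]
    return runs, held_out


# Hypothetical task names.
tasks = ["task-a", "task-b", "task-c", "task-d"]
runs0, held0 = plan_round(tasks, 0)
runs1, held1 = plan_round(tasks, 1)
```

With four tasks and one held out, each round needs six runs, which keeps the sample size far below the 72-120 runs the unassisted agent proposed.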

Iterative Skill Refinement

Given failure data and asked "how to improve?", the agent:

Gap                      | What happened
-------------------------|------------------------------------------
No structured loop       | Jumped straight to proposing fixes
No re-experimentation    | Proposed changes without validation plan
No overfitting warning   | Fixes targeted 5 specific bugs only
No baseline framing      | "Fewer bugs" not "beats baseline"
No convergence criteria  | No definition of "done"
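
A structured loop with re-experimentation, baseline framing, and an explicit definition of "done" might look like the sketch below. The stopping rule (beat the baseline, then stop after `patience` rounds without a gain of at least `min_gain`) is an assumption chosen for illustration, as are the `refine`, `evaluate`, and `propose_fix` names and the toy scoring function.

```python
def refine(evaluate, propose_fix, skill, baseline_score,
           min_gain=1, patience=2, max_rounds=10):
    """Measure, propose a change, re-measure, and keep the change only if it helps.

    "Done" is defined up front: the skill beats the baseline and `patience`
    consecutive rounds each gained less than `min_gain`, or `max_rounds`
    rounds have been used.
    """
    best_score = evaluate(skill)
    history = [best_score]
    stalled = 0
    for _ in range(max_rounds):
        if best_score > baseline_score and stalled >= patience:
            break  # converged
        candidate = propose_fix(skill)
        score = evaluate(candidate)  # re-experiment instead of assuming the fix works
        if score - best_score >= min_gain:
            skill, best_score, stalled = candidate, score, 0
        else:
            stalled += 1
        history.append(best_score)
    return skill, best_score, history


# Toy stand-ins: a "skill" is a revision number, and the score is the number
# of passing scenarios, capped at 9.
final_skill, final_score, history = refine(
    evaluate=lambda rev: min(9, 5 + rev),
    propose_fix=lambda rev: rev + 1,
    skill=0,
    baseline_score=6,
)
```

In this toy run the score climbs from 5 to 9, passes the baseline of 6 along the way, and the loop stops after two rounds with no further gain.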

GREEN Phase Results

Each skill was tested by running the same scenario WITH the skill loaded.

Skill                       | Steps Followed                                   | Key Improvement
----------------------------|--------------------------------------------------|------------------------------------------------------------------------------------
blind-skill-assessment      | 5/5 (BLIND, RUBRIC, JUDGE, DECODE, AGGREGATE)    | Agent randomized labels, used 3 personas, scored per-dimension with confidence
experiment-set-design       | 6/6 gaps addressed                               | Baselines non-negotiable, 3-phase progression, anti-overfitting rules
iterative-skill-refinement  | 7/7 loop steps                                   | Structured loop, triage by breadth, anti-overfitting checklist, convergence criteria
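
As a rough illustration of per-dimension scoring across several personas, the following sketch averages each dimension and attaches a confidence marker based on how closely the personas agree. The persona names, dimensions, scores, and the agreement threshold are all hypothetical; the skill's real aggregation rule may differ.

```python
from statistics import mean


def aggregate(scores, agreement_spread=1):
    """Combine per-persona, per-dimension scores.

    scores: {persona: {option: {dimension: score}}}
    Returns {option: {dimension: (mean_score, confidence)}}, where confidence
    is "high" if the personas' scores differ by at most `agreement_spread`
    and "low" otherwise.
    """
    collected = {}
    for persona_scores in scores.values():
        for option, dims in persona_scores.items():
            for dim, value in dims.items():
                collected.setdefault(option, {}).setdefault(dim, []).append(value)
    return {
        option: {
            dim: (
                mean(values),
                "high" if max(values) - min(values) <= agreement_spread else "low",
            )
            for dim, values in dims.items()
        }
        for option, dims in collected.items()
    }


# Hypothetical scores from three personas on a 1-5 scale.
scores = {
    "maintainer": {"Option 1": {"correctness": 4, "readability": 5}},
    "security reviewer": {"Option 1": {"correctness": 4, "readability": 2}},
    "newcomer": {"Option 1": {"correctness": 5, "readability": 2}},
}
summary = aggregate(scores)
```

Here the personas agree on correctness, so that dimension is marked "high" confidence, while readability is marked "low" because the scores range from 2 to 5.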

Integration Test

A novel scenario (improving a "defensive-error-handling" skill) exercised all three skills together:

  • iterative-skill-refinement orchestrated the improvement loop
  • experiment-set-design governed task planning and baseline requirements
  • blind-skill-assessment governed how results would be judged

All three composed naturally with clear handoff points and no conflicts.

REFACTOR Phase

All GREEN tests passed cleanly: no shortcuts, rationalizations, or skipped steps were observed, so there were no loopholes to close.