# Blind Skill Assessment
## Core Principle
No baseline, no experiment. Every assessment compares two versions under blinded conditions with structured scoring.
## Process
1. BLIND — Randomly assign labels A/B. Record mapping privately. Strip origin hints (filenames, "baseline"/"skill" comments).
2. RUBRIC — State scoring dimensions before reading the code.
3. JUDGE — Three personas score both versions (1-5 per dimension).
4. DECODE — Reveal A/B mapping only after ALL scoring is complete.
5. AGGREGATE — Tally dimension wins, compute per-persona averages.
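A minimal sketch of the BLIND and DECODE steps, assuming the two versions are available as source strings (the labels and variable names here are illustrative, not prescribed):

```python
import random

def blind_assign(first_listed, second_listed):
    """Randomly map two inputs to labels A and B.

    Returns (public, private): `public` is what judges see (label -> source
    only); `private` records which input got which label and must stay
    hidden until ALL scoring is complete.
    """
    inputs = [("first-listed", first_listed), ("second-listed", second_listed)]
    random.shuffle(inputs)  # coin flip: do not always make the first input A
    public = {"A": inputs[0][1], "B": inputs[1][1]}
    private = {"A": inputs[0][0], "B": inputs[1][0]}
    return public, private

public, private = blind_assign("def merge3(): ...", "def merge3(): ...")
# Show judges `public` only; open `private` at the DECODE step.
```

The caller is still responsible for stripping origin hints from the sources before handing `public` to the judges.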
## Default Personas
| Persona | Focus | Dimensions |
|---|---|---|
| Bug Hunter | Correctness | Bugs, edge cases, error handling, race conditions |
| Architect | Design | Modularity, separation of concerns, extensibility |
| Pragmatist | Clarity | Readability, naming, documentation, maintainability |
Each persona scores 1-5 per dimension for both A and B, then picks a per-dimension winner. Append confidence: [h] high, [m] medium, [l] low.
## Scoring Anchors
| Score | Meaning |
|---|---|
| 1 | Broken |
| 2 | Significant issues |
| 3 | Adequate |
| 4 | Good, minor issues |
| 5 | Excellent |
## Anti-Bias Rules
- Judges never see origin labels ("baseline", "skill-version", "v1", "v2")
- Randomize which version is A — do not always assign the first-listed input as A
- Score each dimension independently — no holistic verdicts
- Sanitize identifying information (file paths, naming conventions that leak origin)
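One blunt way to sanitize inputs before judging is a token scrub; the patterns below are illustrative, not exhaustive, and should be extended per project:

```python
import re

# Tokens that leak origin: explicit version labels in comments,
# filenames, or identifiers. This list is a sketch, not a guarantee.
ORIGIN_HINTS = re.compile(r"baseline|skill[-_ ]?version|\bv[12]\b", re.IGNORECASE)

def sanitize(source: str) -> str:
    """Replace origin-leaking tokens with a neutral placeholder."""
    return ORIGIN_HINTS.sub("REDACTED", source)

print(sanitize("# baseline implementation of merge3"))
# → # REDACTED implementation of merge3
```

A scrub like this catches obvious labels but not subtler leaks such as distinctive naming conventions; those still need a manual pass.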
## Custom Personas
Add a domain-specific persona when the defaults don't cover the task's concerns. Each needs:
- Name — role title
- Focus area — one sentence
- Scoring anchors — what 1 and 5 mean for this persona
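The three required fields translate directly into a small record type; the Profiler persona below is a hypothetical illustration, not one of the defaults:

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str      # role title
    focus: str     # one-sentence focus area
    anchor_1: str  # what a score of 1 means for this persona
    anchor_5: str  # what a score of 5 means for this persona

# Hypothetical domain-specific persona for a performance-sensitive task:
profiler = Persona(
    name="Profiler",
    focus="Runtime and memory behavior on the hot path.",
    anchor_1="Pathological complexity or unbounded allocation.",
    anchor_5="Appropriate asymptotics with documented costs.",
)
```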
## Aggregation
- Per-experiment: Tally dimension wins across all personas. The version with more wins takes the experiment.
- Cross-experiment: Tally experiment wins as "X/N won by [version]."
- Per-persona: Average scores to identify where improvement is strongest/weakest.
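The per-experiment and per-persona views can both be computed from a flat list of per-dimension records; the data here is this document's merge3() worked example:

```python
from collections import defaultdict

# One record per (persona, dimension): scores for A and B plus the
# per-dimension winner.
scores = [
    ("Bug Hunter", "Edge cases",    4, 3, "A"),
    ("Bug Hunter", "Error path",    3, 3, "tie"),
    ("Architect",  "Modularity",    4, 3, "A"),
    ("Architect",  "API surface",   3, 4, "B"),
    ("Pragmatist", "Readability",   3, 4, "B"),
    ("Pragmatist", "Documentation", 4, 3, "A"),
]

# Per-experiment: tally per-dimension winners across all personas.
wins = defaultdict(int)
for _, _, _, _, winner in scores:
    wins[winner] += 1

# Per-persona: average A and B scores to locate the improvement.
def persona_averages(persona):
    a = [row[2] for row in scores if row[0] == persona]
    b = [row[3] for row in scores if row[0] == persona]
    return sum(a) / len(a), sum(b) / len(b)

print(dict(wins))                      # {'A': 3, 'tie': 1, 'B': 2}
print(persona_averages("Bug Hunter"))  # (3.5, 3.0)
```

With A=3 wins, B=2, and one tie, A takes this experiment; the per-persona averages show the gain is concentrated with the Bug Hunter.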
## Example

Task: Compare two implementations of `merge3()`.
### Label Assignment (private)
Coin flip: tails → Version A = skill-version, Version B = baseline
### Bug Hunter — Correctness

| Dimension | A | B | Notes |
|---|---|---|---|
| Edge cases | 4[h] | 3[h] | A handles empty-file edge case, B does not |
| Error path | 3[m] | 3[m] | both miss error on binary input |

Winner: A
### Architect — Design

| Dimension | A | B | Notes |
|---|---|---|---|
| Modularity | 4[h] | 3[h] | A separates hunk extraction cleanly |
| API surface | 3[m] | 4[m] | B's top-level API has better early-exit |

Winner: tie
### Pragmatist — Clarity

| Dimension | A | B | Notes |
|---|---|---|---|
| Readability | 3[h] | 4[h] | B's variable names are clearer |
| Documentation | 4[m] | 3[m] | A has better docstring coverage |

Winner: tie
### Decode
A = skill-version, B = baseline
### Result
Dimension wins: A=3, B=2, ties=1 → A wins this experiment.
## Red Flags

STOP if you catch yourself doing any of these:
- Evaluating without randomizing labels
- Reading origin labels before scoring is complete
- Giving only a holistic verdict with no per-dimension scores
- Skipping the upfront rubric ("I'll just see what stands out")
- Using a single perspective instead of multiple personas
STOP. Randomize. Score per-dimension. Then proceed.