Results: Skill vs Baseline
Quantitative comparison of agent behavior with and without HDD skills, across 35 experiments in 5 languages.
Baseline Behavior (Without Skills)
Without HDD skills, Claude Code agents consistently:
- Write complete implementations in one pass — all logic at once, no decomposition
- Keep reasoning invisible — sub-problems identified mentally but never reflected in the file
- Produce monolithic code — single function with inlined logic, no named sub-components
- Use fewer tool calls — less iteration means less opportunity for course correction
Detailed Comparisons (Phase 1)
Python TOC Generator (Core Skill)
The structural difference is significant. Baseline produces a 38-line monolithic function with parsing, slugifying, deduplication, and formatting interleaved in a single loop. HDD produces 5 named functions with clear single responsibilities:
```python
# Baseline: everything interleaved in one loop
def generate_toc(markdown):
    for line in lines:
        match = re.match(...)            # parsing
        slug = re.sub(...)               # slugifying
        if slug in slug_counts: ...      # deduplicating
        toc_entries.append(f"...")       # formatting
```

```python
# With HDD: decomposed into named helpers
def generate_toc(markdown):
    headings = _extract_headings(markdown)
    for level, text in headings:
        slug = _make_slug(text)
        slug = _deduplicate_slug(slug, slug_counts)
        toc_entries.append(_format_entry(level, text, slug))
```
Python CSV Parser (Iterative Reasoning)
The final code is structurally similar — the difference is the process. Baseline writes everything in one shot with no intermediate states visible. With HDD, the human watches the skeleton evolve through 4 iterations and can intervene at any step.
Haskell myFoldr (Compiler Loop)
Same final code — myFoldr has one correct implementation. The difference is discipline: with HDD the agent used named holes (`_empty`, `_cons`, `_rest`), compiled after every single fill, and caught a misleading GHC suggestion (`z` for the recursive case).
Full Experiment Results (Phase 2)
24 experiments validated the skills across increasing complexity. All PASS on first run, zero skill revisions needed.
Suite C: Core Stress Tests
C1: Trivial One-Liner
| Metric | Value |
|---|---|
| Holes | 0 (correctly skipped) |
| Tool calls | 3 |
| Result | PASS — cited red flag: "Creating artificial decomposition for trivial one-liners" |
C2: Already-Decomposed Code
| Metric | Value |
|---|---|
| Holes | 0 (correctly skipped) |
| Tool calls | 5 |
| Result | PASS — recognized the 3-step pipeline was already decomposed by the task itself |
C3: Time Pressure
| Metric | Value |
|---|---|
| Holes | 3 (read → group → summarize) |
| Tool calls | 14 |
| Result | PASS — followed HDD despite prompt saying "This is urgent, we need this ASAP" |
C4: Competing Instruction
| Metric | Value |
|---|---|
| Holes | 4 (Lit → Neg → Add → Mul) |
| Tool calls | 8 |
| Result | PASS — followed HDD despite prompt saying "Don't overthink this, just write the whole thing" |
Suite A: Compiler Loop — Haskell
A1: myMap (trivial polymorphic)
| Metric | Value |
|---|---|
| Compile cycles | 5 |
| Holes | 4 (including sub-hole _tail) |
| Result | PASS |
A2: mySort (typeclass-constrained)
| Metric | Value |
|---|---|
| Compile cycles | 9 |
| Holes | 8 |
| Result | PASS — Ord constraint discovered from GHC diagnostics, decomposed into insert helper |
A3: foldMap (higher-order)
| Metric | Value |
|---|---|
| Compile cycles | 3 |
| Holes | 3 |
| Result | PASS — _baseCase most constrained (only mempty fits) |
A4: Expr eval (ADT pattern matching)
| Metric | Value |
|---|---|
| Compile cycles | 6 |
| Holes | 6 |
| Result | PASS — case-split then base-first fill order |
A5: State Monad RPN Calculator
| Metric | Value |
|---|---|
| Compile cycles | 12 |
| Holes | 11 (3 original + 8 sub-holes) |
| Result | PASS — GHC's MonadFail error forced restructuring from failable pattern to let binding |
Most complex compiler-loop experiment. `_push` was filled first (most constrained: only `modify (x:)` fits), resolving type ambiguity that made `_pop` and `_evalRPN` clearer. The compiler caught a MonadFail issue that the agent would have missed.
A6: Parser Combinator
| Metric | Value |
|---|---|
| Compile cycles | 7 |
| Holes | 6 |
| Result | PASS — mutual recursion handled via forward references |
A7: Ambiguous Mystery Function
| Metric | Value |
|---|---|
| Compile cycles | 1 (then stopped) |
| Holes | — |
| Result | PASS — identified 5+ valid implementations, stopped and asked |
Correctly triggered stop-and-ask. The type `(a -> a) -> a -> Int -> a` with the comment "apply a function some number of times" admits multiple valid fills (iterated application, single application, identity, fold-based). The agent identified all of them and asked the human.
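A Python rendering of the same ambiguity makes the divergence concrete (the experiment itself was in Haskell; these function names are illustrative, not the experiment's code):

```python
# Two of the several implementations that satisfy the vague comment
# "apply a function some number of times" (names are illustrative).

def apply_n_times(f, x, n):
    # Interpretation 1: iterated application, f(f(...f(x)...))
    for _ in range(n):
        x = f(x)
    return x

def apply_once(f, x, n):
    # Interpretation 2: apply once; n is ignored
    return f(x)
```

With an increment function and n = 3, interpretation 1 yields 3 and interpretation 2 yields 1 — silently picking either would be a guess, which is why the skill mandates stop-and-ask.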
A8: Type Checker (deep nesting)
| Metric | Value |
|---|---|
| Compile cycles | 10 |
| Holes | 9 (with sub-holes for TmApp and TmIf) |
| Result | PASS — 42-line type checker for typed lambda calculus built hole-by-hole |
Suite B: Iterative Reasoning — Multi-Language
B1: group_by (Python)
| Metric | Value |
|---|---|
| Holes | 2 |
| Tool calls | 8 |
| Result | PASS |
B2: Log Processor (Python + mypy)
| Metric | Value |
|---|---|
| Holes | 5 (pipeline stages: read → parse → group → stats → assemble) |
| Tool calls | 19 |
| Result | PASS |
B3: TodoList (TypeScript + tsc)
| Metric | Value |
|---|---|
| Holes | 4 |
| Tool calls | 15 |
| Result | PASS — used exhaustive switch pattern |
B4: FanOut (Go + go build)
| Metric | Value |
|---|---|
| Holes | 4 (channel create, WaitGroup, spawn workers, closer goroutine) |
| Tool calls | — |
| Result | PASS — concurrency concerns decomposed into separate holes |
B5: Web Crawler (Python, no type checker)
| Metric | Value |
|---|---|
| Holes | 5 (robots.txt fetch, page fetch, parse links, permission check, BFS crawl) |
| Tool calls | 21 |
| Result | PASS — stdlib only, no type checker, reasoning-only oracle |
B6: Backup Rotation (Bash)
| Metric | Value |
|---|---|
| Holes | 5 (validate → list → boundaries → classify → delete) |
| Tool calls | 28 |
| Result | PASS — echo && exit 1 hole markers in a language with no type system at all |
Most interesting untyped experiment. Bash has no type checker and no static analysis — the agent's reasoning is the sole oracle. Despite this, the `echo "HOLE_N" && exit 1` markers provided structure and incremental testability. Each concern got a named section with contracts documented in comments.
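The same fail-loudly marker idea translates directly to any untyped language. A hedged sketch in Python (hole and function names are illustrative, not the experiment's actual code):

```python
def _hole(name):
    # An unfilled hole fails loudly and names the missing concern,
    # playing the role of Bash's `echo "HOLE_N" && exit 1`.
    raise NotImplementedError(f"HOLE not filled: {name}")

def rotate_backups(paths):
    valid = _hole("validate")        # hole 1: validate inputs
    backups = _hole("list")          # hole 2: enumerate backup files
    keep, drop = _hole("classify")   # hole 3: classify keep vs delete
    return drop
```

Running the skeleton immediately reports the first unfilled hole, giving incremental testability even with no type system.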
B7: REST API (Python, 4 files)
| Metric | Value |
|---|---|
| Holes | 15 across 4 files |
| Tool calls | 43 |
| Result | PASS — dependency-ordered: models → storage → handlers → main |
Largest experiment. Cross-file contracts defined in skeleton phase, each hole filled in isolation. The dependency graph models.py ← storage.py ← handlers.py ← main.py determined fill order.
B8: Ambiguous Spec (smart_merge)
| Metric | Value |
|---|---|
| Holes | — (stopped before filling) |
| Tool calls | 3 |
| Result | PASS — identified 4 valid merge strategies, stopped and asked |
The word "smartly" in the docstring is deliberately vague. The agent identified shallow merge, deep/recursive merge, conflict-detecting merge, and type-aware merge as valid strategies, each with sub-ambiguities of its own.
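A minimal sketch of why "smartly" needs disambiguation — two of the four strategies applied to the same input (illustrative code, not the experiment's):

```python
def shallow_merge(a, b):
    # Strategy 1: b's top-level keys overwrite a's wholesale.
    return {**a, **b}

def deep_merge(a, b):
    # Strategy 2: recurse wherever both sides hold a dict.
    out = dict(a)
    for k, v in b.items():
        if isinstance(out.get(k), dict) and isinstance(v, dict):
            out[k] = deep_merge(out[k], v)
        else:
            out[k] = v
    return out
```

On `a = {"cfg": {"host": "x", "port": 1}}` and `b = {"cfg": {"port": 2}}`, shallow merge drops `"host"` while deep merge keeps it — a detail "smartly" does not decide.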
Suite D: Integration & Edge Cases
D1: Core + Compiler Loop (State Monad)
| Metric | Value |
|---|---|
| Compile cycles | 5 |
| Holes | 4 |
| Result | PASS — core governed strategy (most constrained first), compiler loop governed mechanism (compile/read/fill), no conflicts |
D2: Core + Iterative Reasoning (Go FanOut)
| Metric | Value |
|---|---|
| Holes | 4 |
| Tool calls | — |
| Result | PASS — constraint-ordered filling, skills complementary |
D3: Getting Stuck — Compiler (Type Families)
| Metric | Value |
|---|---|
| Compile cycles | 2 |
| Holes | 1 |
| Result | PASS — solved easily, GHC resolved the type family. Test was not hard enough to trigger the stuck condition. |
D4: Getting Stuck — Reasoning (CSP Solver)
| Metric | Value |
|---|---|
| Holes | 6 (across 2 decomposition phases) |
| Tool calls | 17 |
| Result | PASS — AC-3 arc consistency + MAC backtracking, completed without triggering stuck condition |
Hard Experiments (Phase 3): Baseline vs HDD
Phase 2 tasks were easy enough that both approaches produced correct code with similar structure. Phase 3 uses tasks where architecture decisions matter — complex enough that decomposition strategy significantly affects the result.
Each experiment was run twice: once without any HDD skill (baseline), once with core + iterative-reasoning skills injected.
H1: Hindley-Milner Type Inference (Python)
| | Baseline | With HDD |
|---|---|---|
| Code lines | 254 | 168 |
| AST/Type definitions | Manual `__init__`, `__eq__`, `__hash__` | `@dataclass(frozen=True)` |
| Helper functions | `_resolve()`, `_bind()` | None (logic inlined in `unify`) |
| Extras | `__repr__` on all types, smoke test block | None |
| Algorithm | Identical (Algorithm W) | Identical (Algorithm W) |
The algorithmic core is identical — both implement Algorithm W with unification, the occurs check, and let-polymorphism. The 34% code reduction comes from the HDD agent choosing `@dataclass(frozen=True)` for the classes that represent types, which provides `__eq__` and `__hash__` for free. The baseline wrote all equality and hashing methods manually, plus `__repr__` methods and a smoke-test block.
★ Insight ─────────────────────────────────────
The HDD decomposition (14 holes, most-constrained-first) naturally led to filling fresh_tvar and ftv_type before unify and w. This bottom-up fill order meant the agent had utility functions available when it reached the complex holes, producing cleaner code. The baseline wrote everything top-to-bottom in reading order.
─────────────────────────────────────────────────
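The dataclass point in miniature — a hedged sketch (class names are illustrative, not the experiment's actual type definitions):

```python
from dataclasses import dataclass

class TVarManual:
    # What the baseline wrote, schematically: every method by hand.
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, TVarManual) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

@dataclass(frozen=True)
class TVar:
    # The HDD choice: __init__, __eq__, and __hash__ are generated,
    # and frozen=True makes instances safely usable as dict/set keys.
    name: str
```

Multiplied across every AST node and type constructor, the generated methods account for much of the 86-line difference.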
H2: Concurrent Producer-Consumer Pipeline (Go)
| | Baseline | With HDD |
|---|---|---|
| Code lines | 97 | 101 |
| Structure | `Run()` + extracted `runStage()` function | All inline in `Run()` with HOLE comments |
| Worker loop | `for item := range in` | `select` + `<-ctx.Done()` |
| Error handling | Drain + continue | Return immediately on cancel |
| On error | Returns `nil, firstErr` (discards results) | Returns `results, firstErr` (partial results) |
Both are structurally similar — the interesting difference is in cancellation safety. The baseline uses for item := range in which blocks until the channel closes, requiring an explicit drain-and-continue strategy on error. The HDD version uses a select loop with ctx.Done(), allowing workers to exit immediately when cancelled.
★ Insight ─────────────────────────────────────
The HDD agent's iterative review cycle caught the range-based deadlock scenario and replaced it with a select loop. This is the kind of subtle concurrency fix that emerges from re-reading code after each hole fill — baseline agents commit to the full implementation in one pass and don't revisit architectural choices.
─────────────────────────────────────────────────
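A Python analogue of the cancellation-safe worker loop (the experiment itself was in Go; here a `queue.Queue` and a `threading.Event` stand in for the channel and `ctx.Done()`):

```python
import queue
import threading

def worker(in_q, out, stop):
    # Analogue of the select + <-ctx.Done() loop: never block
    # indefinitely on the input channel; poll with a timeout and
    # re-check the cancellation signal on each iteration.
    while not stop.is_set():
        try:
            item = in_q.get(timeout=0.05)
        except queue.Empty:
            continue
        if item is None:   # sentinel: upstream closed the channel
            return
        out.append(item * 2)
```

A worker written as a plain blocking `get()` loop (the analogue of `for item := range in`) would sit in `get()` forever after cancellation unless someone drains or closes the queue — the drain-and-continue obligation the baseline incurred.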
H3: Three-Way Merge (Python)
| | Baseline | With HDD |
|---|---|---|
| Code lines | 193 | 122 |
| Functions | 6 (`_lcs_table`, `_compute_blocks`, `_collapse_equal`, `_collect_overlap_group`, `_flatten_repl`, `_ensure_newline`) | 2 (`_lcs_opcodes`, `merge3`) |
| Strategy | Region-splitting: align boundaries, compare element-wise | Hunk-walking: extract change hunks, walk base with dual cursors |
| Complexity | Splits equal regions at cut points from the other side | Direct interval intersection for overlap detection |
Fundamentally different architectures. The baseline uses a region-splitting strategy — it converts diffs into regions with is_change flags, collects all boundary points from both sides, splits regions at those boundaries, then compares aligned regions element-wise. The HDD version uses a simpler hunk-walking strategy — it extracts only the changed hunks from each side, then walks the base with two cursors detecting overlaps via interval intersection.
The 37% code reduction isn't just conciseness — it reflects a genuinely simpler algorithm that emerged from the hole decomposition. The HDD agent's skeleton had 3 top-level holes (diff, extract hunks, merge walk), and the merge walk hole naturally decomposed into overlap detection + case dispatch, producing the cleaner dual-cursor approach.
★ Insight ─────────────────────────────────────
This is the strongest example of HDD producing a different algorithm, not just different style. The baseline's region-splitting approach requires 3 extra helper functions (_collapse_equal, _collect_overlap_group, _flatten_repl) that exist only to manage the complexity of the region-alignment strategy. HDD's hunk-walking approach avoids this complexity entirely.
─────────────────────────────────────────────────
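The overlap test at the heart of the hunk-walking strategy is nearly a one-liner — a sketch assuming each hunk carries a half-open `(start, end)` range over the base sequence:

```python
def hunks_overlap(a, b):
    # Half-open intervals [start, end) intersect iff each one
    # starts before the other ends.
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end
```

Under this convention, touching hunks such as `(0, 2)` and `(2, 5)` do not overlap, so adjacent edits on the two sides merge cleanly rather than conflicting.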
H4: Incremental Build System (Python)
| | Baseline | With HDD |
|---|---|---|
| Code lines | 166 | 186 |
| Dep resolution + cycle detection | Combined in single `_topo_sort` | Separated: `_resolve_deps`, `_detect_cycle`, `_topo_sort` |
| Dep coordination | `threading.Event` per task | Polling loop with 0.001s sleep |
| Cache storage | Per-file (one `.json` per cache key) | Single `cache.json` file |
| Cache key includes | Task name + `inspect.getsource(fn)` + dep results | Task name + dep results |
| Extra features | `invalidate()` + `_transitive_dependents()` | Dry-run mode returning `{"_dry_run": [...]}` |
HDD is 12% larger in code lines — different decomposition strategies with a size tradeoff. The baseline combines dependency resolution and cycle detection into a single _topo_sort method (Kahn's algorithm detects cycles implicitly when len(order) != len(needed)). HDD separated these into 3 distinct functions with clear single responsibilities.
The baseline's threading.Event approach for dependency coordination is more efficient than HDD's polling loop. However, the baseline includes inspect.getsource(fn) in cache keys — a clever touch that detects when the function body changes, though it's fragile across Python versions and decorators.
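A sketch of the baseline's cache-key idea (the real implementation passes the function itself and calls `inspect.getsource(fn)`; here the source text is a parameter so the sketch stays self-contained):

```python
import hashlib

def cache_key(task_name, fn_source, dep_results):
    # Hashing the task's source text into the key means editing the
    # task body automatically invalidates its cache entry.
    payload = task_name + fn_source + repr(sorted(dep_results.items()))
    return hashlib.sha256(payload.encode()).hexdigest()
```

The fragility noted above follows directly: `inspect.getsource` raises when no source file is available (REPL, frozen apps) and returns the wrapper's source for some decorated functions.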
H5: Composable Thread-Safe Rate Limiter (Python)
| | Baseline | With HDD |
|---|---|---|
| Code lines | 174 | 121 |
| Blocking strategy | `threading.Condition` with calculated wait | Spin-sleep loops |
| Composite atomicity | Two-phase locking: check all, then commit all | Rollback: acquire each, `_give_back()` on failure |
| Composite complexity | ~100 lines, type-checks each limiter type | ~25 lines, type-agnostic |
| Extensibility | Adding a new limiter type requires modifying `CompositeRateLimiter` | Adding a new limiter type only requires implementing `_give_back()` |
Radically different atomicity strategies. The baseline's CompositeRateLimiter uses two-phase locking — it acquires all internal locks sorted by id() to prevent deadlock, checks availability in all leaf limiters, then commits all at once. This requires isinstance checks for TokenBucket, SlidingWindow, and PerClientLimiter to reach into their internals.
The HDD version discovered the _give_back() pattern during hole filling — when trying to fill the composite try_acquire hole, the constraint "atomic: no partial acquires" forced the agent to realize that undo capability was needed. This led to adding _give_back() to the ABC and all concrete classes, producing a simpler, more extensible design that doesn't need knowledge of each limiter type's internals.
★ Insight ─────────────────────────────────────
The _give_back() pattern is a textbook example of HDD surfacing a design insight. The hole's contract ("atomic, must pass ALL") created a constraint that couldn't be satisfied without rollback capability. The baseline agent solved the same problem by reaching into internal state — correct but tightly coupled. HDD's constraint-first approach naturally led to the cleaner abstraction.
─────────────────────────────────────────────────
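A stripped-down sketch of the rollback pattern (single-threaded for brevity — the real limiters also hold locks; the method name follows the experiment's `_give_back()`):

```python
class TokenBucket:
    def __init__(self, tokens):
        self.tokens = tokens

    def try_acquire(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def _give_back(self):
        # Undo one acquire -- the capability the composite relies on.
        self.tokens += 1

def composite_try_acquire(limiters):
    # Acquire each limiter in turn; on the first refusal, give back
    # everything already taken so no partial acquire leaks.
    taken = []
    for lim in limiters:
        if not lim.try_acquire():
            for t in taken:
                t._give_back()
            return False
        taken.append(lim)
    return True
```

The composite never inspects a limiter's internals — any object exposing `try_acquire`/`_give_back` composes, which is the extensibility win the table records.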
Phase 3 Summary
| Experiment | Baseline | HDD | Code Diff | Architecture Diff |
|---|---|---|---|---|
| H1: Type Inference | 254 | 168 | -34% | Same algorithm, better data class choices |
| H2: Go Pipeline | 97 | 101 | +4% | Caught cancellation deadlock in review |
| H3: Three-Way Merge | 193 | 122 | -37% | Fundamentally different algorithm |
| H4: Build System | 166 | 186 | +12% | Cleaner separation of concerns, more code |
| H5: Rate Limiter | 174 | 121 | -30% | Discovered _give_back abstraction |
Line counts exclude comments, docstrings, and blank lines.
In 3 of 5 hard experiments, HDD produced substantially less code (30-37% reduction). In 2 of 5, HDD produced slightly more code due to explicit decomposition and select-based patterns. More importantly, in 3 of 5 experiments (H3, H5, and H2), HDD produced architecturally different solutions — different decompositions that emerged from the constraint-first fill order.
Blind Code Review
Both versions were blind-reviewed by three AI judge personas (Bug Hunter, Architect, Pragmatist). Labels were randomized, so judges didn't know which version used HDD. Scores are 1–5.
Result: Baseline 4 · HDD 1
| | 🔍 Bugs | 🏗️ Design | 📖 Clarity | |
|---|---|---|---|---|
| H1: Type Inference | | | | |
| Baseline | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | Winner |
| HDD | ★★☆☆☆ | ★★★★☆ | ★★★★★ | |
| H2: Go Pipeline | | | | |
| Baseline | ★★★★★ | ★★★★☆ | ★★★★☆ | Winner |
| HDD | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | |
| H3: Three-Way Merge | | | | |
| Baseline | ★★★★☆ | ★★★★☆ | ★★☆☆☆ | Winner |
| HDD | ★★☆☆☆ | ★★★☆☆ | ★★★★☆ | |
| H4: Build System | | | | |
| Baseline | ★★★★★ | ★★★☆☆ | ★★★★★ | Winner |
| HDD | ★★☆☆☆ | ★★★★☆ | ★★☆☆☆ | |
| H5: Rate Limiter | | | | |
| Baseline | ★★★★☆ | ★★☆☆☆ | ★★☆☆☆ | |
| HDD | ★★★☆☆ | ★★★★☆ | ★★★★★ | Winner |
| Persona | Baseline avg | HDD avg |
|---|---|---|
| 🔍 Bug Hunter | 4.4 | 2.4 |
| 🏗️ Architect | 3.2 | 3.6 |
| 📖 Pragmatist | 3.2 | 3.8 |
HDD consistently scores higher on Design (+0.4) and Clarity (+0.6) but dramatically lower on Bugs (−2.0). The iterative hole-filling process produces cleaner architecture but introduces subtle correctness issues — race conditions, non-recursive substitution chains, hunk-skip bugs — that baseline's single-pass approach avoids.
Key bugs found by judges in HDD versions:
- H1: `apply_subst` does single-step lookup instead of recursive chain-following
- H2: Context leak in `NewPipeline` (`cancel` stored on struct), worker drain deadlock on error
- H3: Unconditional `oi += 1; ti += 1` after overlap can skip hunks
- H4: Race condition in `results` dict reads without lock, busy-wait polling
- H5: `PerClientLimiter` releases lock between bucket lookup and acquire
This finding drives the next iteration of skill prompts: HDD needs explicit correctness verification steps after each hole fill.
Phase 3b: After VERIFY Step
Root cause analysis of Phase 3 bugs revealed a common pattern: each hole fill was locally correct, but cross-hole interactions had bugs — race conditions, non-recursive substitution chains, hunk-skip bugs, resource leaks. The HDD skills had no verification step after filling.
Fix: Added a VERIFY step to all three skills. After each fill, the agent checks:
- Shared mutable state — is access synchronized?
- Resource lifecycle — are acquire/release scopes matched across holes?
- Error/cancel paths — do they clean up resources from other holes?
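Check 1 in miniature — the H4-style fix of guarding a results dict that multiple worker threads touch (illustrative sketch, not the experiment's code):

```python
import threading

results = {}
results_lock = threading.Lock()

def record(task, value):
    # Every cross-thread write to the shared dict goes through one lock.
    with results_lock:
        results[task] = value

def snapshot():
    # Reads take the same lock, so no torn or stale views.
    with results_lock:
        return dict(results)
```

Each hole that touches `results` is individually correct without the lock; only the cross-hole interaction (concurrent writers plus an unlocked reader) is buggy — exactly the class of defect the VERIFY step targets.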
All five experiments were re-run with the improved skills, the same prompts, and the same blind-judging methodology.
Result: HDD v2 5 · Baseline 0 (was Baseline 4 · HDD 1)
| | 🔍 Bugs | 🏗️ Design | 📖 Clarity | |
|---|---|---|---|---|
| H1: Type Inference | | | | |
| Baseline | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | |
| HDD v2 | ★★★★☆ | ★★★★☆ | ★★★★★ | Winner |
| H2: Go Pipeline | | | | |
| Baseline | ★★★☆☆ | ★★★★☆ | ★★★★☆ | |
| HDD v2 | ★★★★☆ | ★★★☆☆ | ★★★☆☆ | Winner |
| H3: Three-Way Merge | | | | |
| Baseline | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | |
| HDD v2 | ★★★★☆ | ★★★★★ | ★★★★★ | Winner |
| H4: Build System | | | | |
| Baseline | ★★★★☆ | ★★★☆☆ | ★★★★☆ | |
| HDD v2 | ★★★☆☆ | ★★★★☆ | ★★★★☆ | Winner |
| H5: Rate Limiter | | | | |
| Baseline | ★★☆☆☆ | ★★☆☆☆ | ★★☆☆☆ | |
| HDD v2 | ★★★☆☆ | ★★★★☆ | ★★★★☆ | Winner |
| Persona | Baseline avg | HDD v1 avg | HDD v2 avg | Change |
|---|---|---|---|---|
| 🔍 Bug Hunter | 3.0 | 2.4 | 3.6 | +1.2 |
| 🏗️ Architect | 3.0 | 3.6 | 4.0 | +0.4 |
| 📖 Pragmatist | 3.2 | 3.8 | 4.2 | +0.4 |
The VERIFY step and monolithic algorithm guidance closed the bug gap (+1.2 Bug Hunter) while boosting design (+0.4 Architect) and clarity (+0.4 Pragmatist). HDD v2 now leads on all three dimensions.
Notable VERIFY catches during v2 experiments:
- H2: Data race on `ch` variable (multiple goroutines writing), workers continuing after cancellation
- H4: `invalidate_downstream` would delete cache for freshly-built tasks (fixed with `already_built` parameter)
- H5: `_can_acquire_locked` needed for two-phase probe-then-commit atomicity in composite
H3 was initially a baseline win. After adding monolithic algorithm guidance (keep tightly-coupled state machines as a single hole), the re-run produced a cleaner merge walk that the judges rated higher on all three dimensions.
Convergence
24/24 PASS in Phase 2. Zero skill revisions needed.
Phase 3 confirmed these results scale to hard problems: in 3/5 experiments HDD produced 30-37% less code, and in 3/5 experiments produced architecturally different solutions. Initial blind code review revealed HDD introduces subtle correctness bugs (Baseline 4 · HDD 1). Adding a VERIFY step and monolithic algorithm guidance fixed the gap (HDD v2 5 · Baseline 0).
The five rules governing HDD:
- "Holes must be visible" — prevents mental-only decomposition
- "Use named holes" — improves trackability in compiler feedback
- "Each distinct concern gets a hole" — prevents under-decomposition
- "Verify after filling" — catches cross-hole interaction bugs (Phase 3b)
- "Don't decompose monolithic algorithms" — tightly-coupled state machines stay as one hole (Phase 3b H3 re-run)
What the Numbers Show
More iteration = more opportunities for correction
The 2–4x increase in tool calls is a feature, not overhead. Each iteration is a checkpoint where the agent re-reads the file state, reassesses which hole to fill next, and reasons about constraints before committing. Baseline agents commit to the entire implementation in one step.
Visible holes enable human oversight
With HDD, the human sees the skeleton evolve in their editor. They can intervene if a hole is decomposed incorrectly before the agent fills it. Baseline behavior shows the final code — by then it's too late to influence the approach.
Ambiguity detection prevents wrong guesses
Both ambiguity tests (A7, B8) correctly triggered the stop-and-ask behavior. The agents identified 4–6 valid interpretations each and asked the human to choose. Without HDD skills, agents silently pick one interpretation.
Constraint ordering reduces errors
Filling the most constrained hole first means the agent makes easy, deterministic fills early, narrowing the remaining holes' contracts. Example from A5: filling _push first (only modify (x:) fits) resolved type ambiguity for _pop and _evalRPN.
Skills compose without conflict
Integration tests (D1, D2) showed core + extending skill work at different levels: core governs strategy (decompose, most constrained first), extending skill governs mechanism (compile/read/fill or reason/write/validate). No conflicts observed.