Results: Skill vs Baseline

Quantitative comparison of agent behavior with and without HDD skills, across 35 experiments in 5 languages.

Baseline Behavior (Without Skills)

Without HDD skills, Claude Code agents consistently:

Write complete implementations in one pass — all logic at once, no decomposition
Keep reasoning invisible — sub-problems identified mentally but never reflected in the file
Produce monolithic code — single function with inlined logic, no named sub-components
Use fewer tool calls — less iteration means less opportunity for course correction

Detailed Comparisons (Phase 1)

Python TOC Generator (Core Skill)

	Baseline	With HDD
Behavior	Wrote everything in a single function body — parse, slugify, deduplicate, format — all at once	Created 4 holes, each starting as `raise NotImplementedError(...)`, filled most constrained first
Tool calls	5	19
Holes	0	4
Structure	1 monolithic function	5 focused functions

The structural difference is significant. Baseline produces a 38-line monolithic function with parsing, slugifying, deduplication, and formatting interleaved in a single loop. HDD produces 5 named functions with clear single responsibilities:

```python
# Baseline: everything interleaved in one loop
def generate_toc(markdown):
    for line in lines:
        match = re.match(...)         # parsing
        slug = re.sub(...)            # slugifying
        if slug in slug_counts: ...   # deduplicating
        toc_entries.append(f"...")     # formatting

# With HDD: decomposed into named helpers
def generate_toc(markdown):
    headings = _extract_headings(markdown)
    for level, text in headings:
        slug = _make_slug(text)
        slug = _deduplicate_slug(slug, slug_counts)
        toc_entries.append(_format_entry(level, text, slug))
```
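For contrast with the finished HDD code, the starting skeleton might have looked roughly like the sketch below. This is an illustrative reconstruction, not the experiment's actual file: the helper names come from the excerpt above, while the signatures, type hints, and hole messages are assumptions.

```python
from typing import Dict, List, Tuple


def _extract_headings(markdown: str) -> List[Tuple[int, str]]:
    # HOLE 1 (assumed contract): return (level, text) for each heading line.
    raise NotImplementedError("HOLE 1: extract headings")


def _make_slug(text: str) -> str:
    # HOLE 2 (assumed contract): turn heading text into an anchor slug.
    raise NotImplementedError("HOLE 2: make slug")


def _deduplicate_slug(slug: str, slug_counts: Dict[str, int]) -> str:
    # HOLE 3 (assumed contract): make repeated slugs unique.
    raise NotImplementedError("HOLE 3: deduplicate slug")


def _format_entry(level: int, text: str, slug: str) -> str:
    # HOLE 4 (assumed contract): render one TOC line.
    raise NotImplementedError("HOLE 4: format entry")


def generate_toc(markdown: str) -> str:
    # The top-level flow is already written; only the helpers are holes.
    slug_counts: Dict[str, int] = {}
    toc_entries: List[str] = []
    for level, text in _extract_headings(markdown):
        slug = _make_slug(text)
        slug = _deduplicate_slug(slug, slug_counts)
        toc_entries.append(_format_entry(level, text, slug))
    return "\n".join(toc_entries)
```

Running the skeleton fails loudly at the first unfilled hole, which is what makes the remaining work visible in the file rather than only in the agent's reasoning.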

Python CSV Parser (Iterative Reasoning)

	Baseline	With HDD
Behavior	Identified sub-problems mentally, wrote all 17 lines at once	Wrote skeleton with `NotImplementedError` markers, filled iteratively with contract reasoning
Tool calls	7	14
Holes	0	3 + 1 sub-hole
Contract reasoning	Implicit	Explicit per hole

The final code is structurally similar — the difference is the process. Baseline writes everything in one shot with no intermediate states visible. With HDD, the human watches the skeleton evolve through 4 iterations and can intervene at any step.

Haskell myFoldr (Compiler Loop)

	Baseline	With HDD
Behavior	Used compiler loop but batch-filled two holes in one cycle	Strict one-hole-per-cycle discipline with named holes
Compile cycles	3 (batched)	5 (one hole per cycle)
Holes	Batch-filled	4 (including sub-hole `_rest`)
Discipline	Filled 2 holes in 1 cycle	Strict 1:1 fill-to-cycle ratio

The final code is the same, because `myFoldr` has one correct implementation. The difference is discipline: with HDD the agent used named holes (`_empty`, `_cons`, `_rest`), compiled after every single fill, and caught a misleading GHC suggestion (`z` for the recursive case).

Full Experiment Results (Phase 2)

24 experiments validated the skills across increasing complexity. All 24 passed on the first run, with zero skill revisions needed.

Suite C: Core Stress Tests

C1: Trivial One-Liner

Metric	Value
Holes	0 (correctly skipped)
Tool calls	3
Result	PASS — cited red flag: "Creating artificial decomposition for trivial one-liners"

C2: Already-Decomposed Code

Metric	Value
Holes	0 (correctly skipped)
Tool calls	5
Result	PASS — recognized the 3-step pipeline was already decomposed by the task itself

C3: Time Pressure

Metric	Value
Holes	3 (read → group → summarize)
Tool calls	14
Result	PASS — followed HDD despite prompt saying "This is urgent, we need this ASAP"

C4: Competing Instruction

Metric	Value
Holes	4 (Lit → Neg → Add → Mul)
Tool calls	8
Result	PASS — followed HDD despite prompt saying "Don't overthink this, just write the whole thing"

Suite A: Compiler Loop — Haskell

A1: myMap (trivial polymorphic)

Metric	Value
Compile cycles	5
Holes	4 (including sub-hole `_tail`)
Result	PASS

A2: mySort (typeclass-constrained)

Metric	Value
Compile cycles	9
Holes	8
Result	PASS — `Ord` constraint discovered from GHC diagnostics, decomposed into `insert` helper

A3: foldMap (higher-order)

Metric	Value
Compile cycles	3
Holes	3
Result	PASS — `_baseCase` most constrained (only `mempty` fits)

A4: Expr eval (ADT pattern matching)

Metric	Value
Compile cycles	6
Holes	6
Result	PASS — case-split then base-first fill order

A5: State Monad RPN Calculator

Metric	Value
Compile cycles	12
Holes	11 (3 original + 8 sub-holes)
Result	PASS — GHC's `MonadFail` error forced restructuring from failable pattern to `let` binding

This was the most complex compiler-loop experiment. `_push` was filled first (most constrained: only `modify (x:)` fits), which resolved type ambiguity and made `_pop` and `_evalRPN` clearer. The compiler caught a `MonadFail` issue that the agent would have missed.

A6: Parser Combinator

Metric	Value
Compile cycles	7
Holes	6
Result	PASS — mutual recursion handled via forward references

A7: Ambiguous Mystery Function

Metric	Value
Compile cycles	1 (then stopped)
Holes	—
Result	PASS — identified 5+ valid implementations, stopped and asked

This correctly triggered stop-and-ask. The type `(a -> a) -> a -> Int -> a` with the comment "apply a function some number of times" admits multiple valid fills (iterated application, single application, identity, fold-based). The agent identified all of them and asked the human.
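To make the ambiguity concrete, here is a sketch in Python (not the Haskell used in the experiment) of three functions that all fit the same shape of signature but behave differently. The function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, TypeVar

A = TypeVar("A")


def iterate_n(f: Callable[[A], A], x: A, n: int) -> A:
    # Reading 1: apply f exactly n times.
    for _ in range(n):
        x = f(x)
    return x


def apply_once(f: Callable[[A], A], x: A, n: int) -> A:
    # Reading 2: apply f a single time and ignore n.
    return f(x)


def identity(f: Callable[[A], A], x: A, n: int) -> A:
    # Reading 3: ignore both f and n.
    return x
```

All three type-check against the same signature, so the types alone cannot pick a winner. Only the intent behind "some number of times" can, which is why asking is the safer move.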

A8: Type Checker (deep nesting)

Metric	Value
Compile cycles	10
Holes	9 (with sub-holes for `TmApp` and `TmIf`)
Result	PASS — 42-line type checker for typed lambda calculus built hole-by-hole

Suite B: Iterative Reasoning — Multi-Language

B1: group_by (Python)

Metric	Value
Holes	2
Tool calls	8
Result	PASS

B2: Log Processor (Python + mypy)

Metric	Value
Holes	5 (pipeline stages: read → parse → group → stats → assemble)
Tool calls	19
Result	PASS

B3: TodoList (TypeScript + tsc)

Metric	Value
Holes	4
Tool calls	15
Result	PASS — used exhaustive switch pattern

B4: FanOut (Go + go build)

Metric	Value
Holes	4 (channel create, WaitGroup, spawn workers, closer goroutine)
Tool calls	—
Result	PASS — concurrency concerns decomposed into separate holes

B5: Web Crawler (Python, no type checker)

Metric	Value
Holes	5 (robots.txt fetch, page fetch, parse links, permission check, BFS crawl)
Tool calls	21
Result	PASS — stdlib only, no type checker, reasoning-only oracle

B6: Backup Rotation (Bash)

Metric	Value
Holes	5 (validate → list → boundaries → classify → delete)
Tool calls	28
Result	PASS — `echo && exit 1` hole markers in a language with no type system at all

This was the most interesting untyped experiment. Bash has no type checker and no static analysis, so the agent's reasoning is the sole oracle. Despite this, the `echo "HOLE_N" && exit 1` markers provided structure and incremental testability. Each concern got a named section with documented contracts via comments.

B7: REST API (Python, 4 files)

Metric	Value
Holes	15 across 4 files
Tool calls	43
Result	PASS — dependency-ordered: `models → storage → handlers → main`

This was the largest experiment. Cross-file contracts were defined in the skeleton phase, and each hole was filled in isolation. The dependency graph `models.py ← storage.py ← handlers.py ← main.py` determined the fill order.
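A dependency-ordered fill can be derived mechanically from such a graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+) and assumes each arrow means "is imported by", so `storage.py` depends on `models.py` and so on. The `DEPENDS_ON` mapping and `fill_order` name are illustrative, not taken from the experiment.

```python
from graphlib import TopologicalSorter

# Assumed reading of the arrows: each file maps to the files it depends on.
DEPENDS_ON = {
    "models.py": set(),
    "storage.py": {"models.py"},
    "handlers.py": {"storage.py"},
    "main.py": {"handlers.py"},
}


def fill_order(depends_on):
    # static_order() yields every node after all of its dependencies.
    return list(TopologicalSorter(depends_on).static_order())


print(fill_order(DEPENDS_ON))
```

Because this graph is a simple chain, there is only one valid order, and it matches the `models → storage → handlers → main` sequence reported above.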

B8: Ambiguous Spec (smart_merge)

Metric	Value
Holes	— (stopped before filling)
Tool calls	3
Result	PASS — identified 4 valid merge strategies, stopped and asked

The word "smartly" in the docstring is deliberately vague. Agent identified shallow merge, deep/recursive merge, conflict-detecting merge, and type-aware merge as valid strategies, each with sub-ambiguities.

Suite D: Integration & Edge Cases

D1: Core + Compiler Loop (State Monad)

Metric	Value
Compile cycles	5
Holes	4
Result	PASS — core governed strategy (most constrained first), compiler loop governed mechanism (compile/read/fill), no conflicts

D2: Core + Iterative Reasoning (Go FanOut)

Metric	Value
Holes	4
Tool calls	—
Result	PASS — constraint-ordered filling, skills complementary

D3: Getting Stuck — Compiler (Type Families)

Metric	Value
Compile cycles	2
Holes	1
Result	PASS — solved easily, GHC resolved the type family. Test was not hard enough to trigger the stuck condition.

D4: Getting Stuck — Reasoning (CSP Solver)

Metric	Value
Holes	6 (across 2 decomposition phases)
Tool calls	17
Result	PASS — AC-3 arc consistency + MAC backtracking, completed without triggering stuck condition

Hard Experiments (Phase 3): Baseline vs HDD

Phase 2 tasks were easy enough that both approaches produced correct code with similar structure. Phase 3 uses tasks where architecture decisions matter — complex enough that decomposition strategy significantly affects the result.

Each experiment was run twice: once without any HDD skill (baseline), once with core + iterative-reasoning skills injected.

H1: Hindley-Milner Type Inference (Python)

	Baseline	With HDD
Code lines	254	168
AST/Type definitions	Manual `__init__`, `__eq__`, `__hash__`	`@dataclass(frozen=True)`
Helper functions	`_resolve()`, `_bind()`	None (logic inlined in `unify`)
Extras	`__repr__` on all types, smoke test block	None
Algorithm	Identical (Algorithm W)	Identical (Algorithm W)

The algorithmic core is identical: both implement Algorithm W with unification, occurs check, and let-polymorphism. The 34% code reduction comes from the HDD agent choosing `@dataclass(frozen=True)` for type classes, which provides `__eq__` and `__hash__` for free. The baseline wrote all equality/hashing methods manually, plus `__repr__` methods and a smoke test block.
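The size difference is easy to see on a single type. The sketch below contrasts a hand-written value class with its frozen-dataclass equivalent. The class names `ManualTVar` and `TVar` are hypothetical stand-ins, not the classes from either experiment run.

```python
from dataclasses import dataclass


class ManualTVar:
    """Hand-written value semantics, in the style attributed to the baseline."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ManualTVar) and self.name == other.name

    def __hash__(self) -> int:
        return hash(("ManualTVar", self.name))


@dataclass(frozen=True)
class TVar:
    """Same semantics: frozen=True (with the default eq=True) generates
    __eq__ and __hash__ and also blocks attribute assignment."""

    name: str
```

Both versions can be used as dictionary keys or set members, which is what a substitution map in a type inferencer typically needs.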

★ Insight: The HDD decomposition (14 holes, most-constrained-first) naturally led to filling `fresh_tvar` and `ftv_type` before `unify` and `w`. This bottom-up fill order meant the agent had utility functions available when it reached the complex holes, producing cleaner code. The baseline wrote everything top-to-bottom in reading order.

H2: Concurrent Producer-Consumer Pipeline (Go)

	Baseline	With HDD
Code lines	97	101
Structure	`Run()` + extracted `runStage()` function	All inline in `Run()` with HOLE comments
Worker loop	`for item := range in`	`select` + `<-ctx.Done()`
Error handling	Drain + continue	Return immediately on cancel
On error	Returns `nil, firstErr` (discards results)	Returns `results, firstErr` (partial results)

Both are structurally similar; the interesting difference is in cancellation safety. The baseline uses `for item := range in`, which blocks until the channel closes, requiring an explicit drain-and-continue strategy on error. The HDD version uses a `select` loop with `ctx.Done()`, allowing workers to exit immediately when cancelled.

★ Insight: The HDD agent's iterative review cycle caught the range-based deadlock scenario and replaced it with a `select` loop. This is the kind of subtle concurrency fix that emerges from re-reading code after each hole fill. Baseline agents commit to the full implementation in one pass and don't revisit architectural choices.

H3: Three-Way Merge (Python)

	Baseline	With HDD
Code lines	193	122
Functions	6 (`_lcs_table`, `_compute_blocks`, `_collapse_equal`, `_collect_overlap_group`, `_flatten_repl`, `_ensure_newline`)	2 (`_lcs_opcodes`, `merge3`)
Strategy	Region-splitting: align boundaries, compare element-wise	Hunk-walking: extract change hunks, walk base with dual cursors
Complexity	Splits equal regions at cut points from the other side	Direct interval intersection for overlap detection

Fundamentally different architectures. The baseline uses a region-splitting strategy — it converts diffs into regions with is_change flags, collects all boundary points from both sides, splits regions at those boundaries, then compares aligned regions element-wise. The HDD version uses a simpler hunk-walking strategy — it extracts only the changed hunks from each side, then walks the base with two cursors detecting overlaps via interval intersection.

The 37% code reduction isn't just conciseness — it reflects a genuinely simpler algorithm that emerged from the hole decomposition. The HDD agent's skeleton had 3 top-level holes (diff, extract hunks, merge walk), and the merge walk hole naturally decomposed into overlap detection + case dispatch, producing the cleaner dual-cursor approach.

★ Insight: This is the strongest example of HDD producing a different algorithm, not just a different style. The baseline's region-splitting approach requires 3 extra helper functions (`_collapse_equal`, `_collect_overlap_group`, `_flatten_repl`) that exist only to manage the complexity of the region-alignment strategy. HDD's hunk-walking approach avoids this complexity entirely.
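The overlap-detection half of a hunk-walking merge can be sketched in a few lines. This is an illustration under simplifying assumptions, not the experiment's `merge3`: hunks are modeled as non-empty half-open `(start, end)` ranges of base line numbers, each list is sorted and non-overlapping, and the names `overlaps` and `find_conflicts` are hypothetical.

```python
from typing import List, Tuple

Hunk = Tuple[int, int]  # assumed: non-empty half-open [start, end) over base lines


def overlaps(a: Hunk, b: Hunk) -> bool:
    # Standard interval intersection test for half-open ranges.
    return a[0] < b[1] and b[0] < a[1]


def find_conflicts(ours: List[Hunk], theirs: List[Hunk]) -> List[Tuple[Hunk, Hunk]]:
    conflicts = []
    oi = ti = 0
    while oi < len(ours) and ti < len(theirs):
        if overlaps(ours[oi], theirs[ti]):
            conflicts.append((ours[oi], theirs[ti]))
        # Advance only the cursor whose hunk ends first, so a long hunk on
        # one side is still compared against later hunks on the other side.
        if ours[oi][1] <= theirs[ti][1]:
            oi += 1
        else:
            ti += 1
    return conflicts
```

Advancing one cursor at a time is the important design choice: stepping both cursors after every overlap would skip the case where a single hunk on one side spans several hunks on the other.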

H4: Incremental Build System (Python)

	Baseline	With HDD
Code lines	166	186
Dep resolution + cycle detection	Combined in single `_topo_sort`	Separated: `_resolve_deps`, `_detect_cycle`, `_topo_sort`
Dep coordination	`threading.Event` per task	Polling loop with 0.001s sleep
Cache storage	Per-file (one `.json` per cache key)	Single `cache.json` file
Cache key includes	Task name + `inspect.getsource(fn)` + dep results	Task name + dep results
Extra features	`invalidate()` + `_transitive_dependents()`	Dry-run mode returning `{"_dry_run": [...]}`

HDD is 12% larger in code lines, reflecting different decomposition strategies with a size tradeoff. The baseline combines dependency resolution and cycle detection into a single `_topo_sort` method (Kahn's algorithm detects cycles implicitly when `len(order) != len(needed)`). HDD separated these into 3 distinct functions with clear single responsibilities.
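The implicit cycle check in Kahn's algorithm is worth seeing once. The sketch below is a generic version, not the baseline's `_topo_sort`: the `topo_sort` name and the dictionary input format (task name to the names it depends on) are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List


def topo_sort(deps: Dict[str, Iterable[str]]) -> List[str]:
    # Collect every task mentioned, either as a key or as a dependency.
    needed = set(deps)
    for ds in deps.values():
        needed.update(ds)

    indegree = {name: 0 for name in needed}
    dependents: Dict[str, List[str]] = {name: [] for name in needed}
    for task, ds in deps.items():
        for dep in ds:
            indegree[task] += 1
            dependents[dep].append(task)

    # Start from tasks with no unmet dependencies (sorted for determinism).
    ready = deque(sorted(name for name in needed if indegree[name] == 0))
    order: List[str] = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for child in dependents[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Tasks on a cycle never reach indegree 0, so they never enter `order`.
    if len(order) != len(needed):
        raise ValueError("dependency cycle detected")
    return order
```

The tradeoff discussed above is visible here: resolution and cycle detection share one pass, which is compact but gives the two concerns no separate names.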

The baseline's `threading.Event` approach for dependency coordination is more efficient than HDD's polling loop. The baseline also includes `inspect.getsource(fn)` in cache keys, a clever touch that detects when the function body changes, though it's fragile across Python versions and decorators.
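A minimal sketch of event-based dependency coordination follows. It is not the baseline's code: `run_with_events` and its input format are hypothetical, and it assumes an acyclic graph and tasks that do not raise (a failing task would leave its dependents waiting forever).

```python
import threading
from typing import Callable, Dict, List


def run_with_events(
    tasks: Dict[str, Callable[[Dict[str, object]], object]],
    deps: Dict[str, List[str]],
) -> Dict[str, object]:
    done = {name: threading.Event() for name in tasks}
    results: Dict[str, object] = {}
    lock = threading.Lock()  # guards every read and write of `results`

    def worker(name: str) -> None:
        # Block until each dependency signals completion; no polling needed.
        for dep in deps.get(name, []):
            done[dep].wait()
        with lock:
            inputs = {dep: results[dep] for dep in deps.get(name, [])}
        value = tasks[name](inputs)
        with lock:
            results[name] = value
        done[name].set()

    threads = [threading.Thread(target=worker, args=(name,)) for name in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Compared with a sleep-based polling loop, a waiting thread here wakes only when its dependency actually finishes, so there is no wasted wake-up and no added latency from the sleep interval.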

H5: Composable Thread-Safe Rate Limiter (Python)

	Baseline	With HDD
Code lines	174	121
Blocking strategy	`threading.Condition` with calculated wait	Spin-sleep loops
Composite atomicity	Two-phase locking: check all, then commit all	Rollback: acquire each, `_give_back()` on failure
Composite complexity	~100 lines, type-checks each limiter type	~25 lines, type-agnostic
Extensibility	Adding a new limiter type requires modifying `CompositeRateLimiter`	Adding a new limiter type only requires implementing `_give_back()`

Radically different atomicity strategies. The baseline's `CompositeRateLimiter` uses two-phase locking: it acquires all internal locks sorted by `id()` to prevent deadlock, checks availability in all leaf limiters, then commits all at once. This requires `isinstance` checks for `TokenBucket`, `SlidingWindow`, and `PerClientLimiter` to reach into their internals.

The HDD version discovered the `_give_back()` pattern during hole filling. When trying to fill the composite `try_acquire` hole, the constraint "atomic: no partial acquires" forced the agent to realize that undo capability was needed. This led to adding `_give_back()` to the ABC and all concrete classes, producing a simpler, more extensible design that doesn't need knowledge of each limiter type's internals.

★ Insight: The `_give_back()` pattern is a textbook example of HDD surfacing a design insight. The hole's contract ("atomic, must pass ALL") created a constraint that couldn't be satisfied without rollback capability. The baseline agent solved the same problem by reaching into internal state, which is correct but tightly coupled. HDD's constraint-first approach naturally led to the cleaner abstraction.
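The rollback idea can be sketched as follows. Only `try_acquire` and `_give_back` are names from the write-up; the `Limiter`, `CountLimiter`, and `Composite` classes are simplified, hypothetical stand-ins for the real token-bucket and sliding-window limiters.

```python
import threading
from abc import ABC, abstractmethod
from typing import Iterable, List


class Limiter(ABC):
    @abstractmethod
    def try_acquire(self) -> bool:
        """Take one permit if available, without blocking."""

    @abstractmethod
    def _give_back(self) -> None:
        """Undo one successful try_acquire."""


class CountLimiter(Limiter):
    # Hypothetical leaf limiter: a fixed pool of permits.
    def __init__(self, capacity: int) -> None:
        self._remaining = capacity
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._remaining > 0:
                self._remaining -= 1
                return True
            return False

    def _give_back(self) -> None:
        with self._lock:
            self._remaining += 1


class Composite(Limiter):
    def __init__(self, limiters: Iterable[Limiter]) -> None:
        self._limiters: List[Limiter] = list(limiters)

    def try_acquire(self) -> bool:
        acquired: List[Limiter] = []
        for limiter in self._limiters:
            if limiter.try_acquire():
                acquired.append(limiter)
            else:
                # Roll back in reverse order so no partial acquire survives.
                for got in reversed(acquired):
                    got._give_back()
                return False
        return True

    def _give_back(self) -> None:
        for limiter in self._limiters:
            limiter._give_back()
```

`Composite` never inspects the concrete type of its children, which is the extensibility point made above. One caveat of this sketch: other threads can briefly observe a permit as taken before it is rolled back, so it trades strict isolation for simplicity.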

Phase 3 Summary

Experiment	Baseline	HDD	Code Diff	Architecture Diff
H1: Type Inference	254	168	-34%	Same algorithm, better data class choices
H2: Go Pipeline	97	101	+4%	Caught cancellation deadlock in review
H3: Three-Way Merge	193	122	-37%	Fundamentally different algorithm
H4: Build System	166	186	+12%	Cleaner separation of concerns, more code
H5: Rate Limiter	174	121	-30%	Discovered `_give_back` abstraction

Line counts exclude comments, docstrings, and blank lines.
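The Code Diff column can be reproduced from the line counts in the table. The sketch below assumes the percentages are the change relative to the baseline, rounded to the nearest whole percent. The `LINE_COUNTS` and `percent_change` names are illustrative.

```python
# (baseline lines, HDD lines) per experiment, copied from the summary table.
LINE_COUNTS = {
    "H1": (254, 168),
    "H2": (97, 101),
    "H3": (193, 122),
    "H4": (166, 186),
    "H5": (174, 121),
}


def percent_change(baseline: int, hdd: int) -> int:
    # Negative means HDD produced less code than the baseline.
    return round((hdd - baseline) / baseline * 100)


for name, (baseline, hdd) in LINE_COUNTS.items():
    print(f"{name}: {percent_change(baseline, hdd):+d}%")
```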

In 3 of 5 hard experiments (H1, H3, H5), HDD produced substantially less code (30–37% reduction). In the other 2 (H2, H4), HDD produced slightly more code due to explicit decomposition and `select`-based patterns. More importantly, in 3 of 5 experiments (H2, H3, and H5), HDD produced architecturally different solutions: different decompositions that emerged from the constraint-first fill order.

Blind Code Review

Both versions of each experiment were blind-reviewed by three AI judge personas (Bug Hunter, Architect, Pragmatist). Labels were randomized, so judges didn't know which version used HDD. Scores are 1–5.

Result: Baseline 4 · HDD 1

Experiment	Version	🔍 Bugs	🏗️ Design	📖 Clarity	Winner
H1: Type Inference	Baseline	★★★★☆	★★★☆☆	★★★☆☆	Winner
H1: Type Inference	HDD	★★☆☆☆	★★★★☆	★★★★★
H2: Go Pipeline	Baseline	★★★★★	★★★★☆	★★★★☆	Winner
H2: Go Pipeline	HDD	★★★☆☆	★★★☆☆	★★★☆☆
H3: Three-Way Merge	Baseline	★★★★☆	★★★★☆	★★☆☆☆	Winner
H3: Three-Way Merge	HDD	★★☆☆☆	★★★☆☆	★★★★☆
H4: Build System	Baseline	★★★★★	★★★☆☆	★★★★★	Winner
H4: Build System	HDD	★★☆☆☆	★★★★☆	★★☆☆☆
H5: Rate Limiter	Baseline	★★★★☆	★★☆☆☆	★★☆☆☆
H5: Rate Limiter	HDD	★★★☆☆	★★★★☆	★★★★★	Winner

Persona	Baseline avg	HDD avg
🔍 Bug Hunter	4.4	2.4
🏗️ Architect	3.2	3.6
📖 Pragmatist	3.2	3.8

On average, HDD scores higher on Design (+0.4) and Clarity (+0.6) but dramatically lower on Bugs (−2.0). The iterative hole-filling process produces cleaner architecture but introduces subtle correctness issues (race conditions, non-recursive substitution chains, hunk-skip bugs) that baseline's single-pass approach avoids.
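The per-persona averages follow directly from the star ratings. The sketch below transcribes the Phase 3 stars as integers in the order (Bugs, Design, Clarity) and recomputes the means. The `SCORES` layout and the `persona_averages` helper are illustrative.

```python
from statistics import mean

# Stars from the Phase 3 review table, as (Bugs, Design, Clarity).
SCORES = {
    "H1": {"Baseline": (4, 3, 3), "HDD": (2, 4, 5)},
    "H2": {"Baseline": (5, 4, 4), "HDD": (3, 3, 3)},
    "H3": {"Baseline": (4, 4, 2), "HDD": (2, 3, 4)},
    "H4": {"Baseline": (5, 3, 5), "HDD": (2, 4, 2)},
    "H5": {"Baseline": (4, 2, 2), "HDD": (3, 4, 5)},
}
PERSONAS = ("Bug Hunter", "Architect", "Pragmatist")


def persona_averages(version: str) -> dict:
    # Average each persona's column across the five experiments.
    return {
        persona: round(mean(row[version][i] for row in SCORES.values()), 1)
        for i, persona in enumerate(PERSONAS)
    }


print(persona_averages("Baseline"))
print(persona_averages("HDD"))
```

The averages hide per-experiment variation: HDD's Design score is lower than the baseline's in H2 and H3 even though its mean is higher.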

Key bugs found by judges in HDD versions:

H1: `apply_subst` does single-step lookup instead of recursive chain-following
H2: Context leak in `NewPipeline` (cancel stored on struct), worker drain deadlock on error
H3: Unconditional `oi += 1; ti += 1` after overlap can skip hunks
H4: Race condition in `results` dict reads without lock, busy-wait polling
H5: `PerClientLimiter` releases lock between bucket lookup and acquire
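The H1 bug is a good example of a fill that looks correct in isolation. The sketch below contrasts single-step lookup with chain-following on a toy type representation. Only the name `apply_subst` comes from the review; the `TVar`, `TCon`, and `TFun` classes and the dictionary-based substitution are assumptions, and the recursion relies on an occurs check having ruled out cyclic substitutions.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class TVar:
    name: str


@dataclass(frozen=True)
class TCon:
    name: str


@dataclass(frozen=True)
class TFun:
    arg: "Type"
    res: "Type"


Type = Union[TVar, TCon, TFun]
Subst = Dict[str, Type]


def apply_subst_single_step(subst: Subst, ty: Type) -> Type:
    # Buggy shape: looks a variable up once and stops.
    if isinstance(ty, TVar):
        return subst.get(ty.name, ty)
    if isinstance(ty, TFun):
        return TFun(apply_subst_single_step(subst, ty.arg),
                    apply_subst_single_step(subst, ty.res))
    return ty


def apply_subst(subst: Subst, ty: Type) -> Type:
    # Fixed shape: keeps resolving until no bound variable remains.
    if isinstance(ty, TVar):
        if ty.name in subst:
            return apply_subst(subst, subst[ty.name])
        return ty
    if isinstance(ty, TFun):
        return TFun(apply_subst(subst, ty.arg), apply_subst(subst, ty.res))
    return ty
```

With the substitution `{"a": TVar("b"), "b": TCon("Int")}`, the single-step version resolves `a` only as far as `b`, while the chain-following version reaches `Int`.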

This finding drives the next iteration of skill prompts: HDD needs explicit correctness verification steps after each hole fill.

Phase 3b: After VERIFY Step

Root cause analysis of Phase 3 bugs revealed a common pattern: each hole fill was locally correct, but cross-hole interactions had bugs — race conditions, non-recursive substitution chains, hunk-skip bugs, resource leaks. The HDD skills had no verification step after filling.

Fix: Added a VERIFY step to all three skills. After each fill, the agent checks:

Shared mutable state — is access synchronized?
Resource lifecycle — are acquire/release scopes matched across holes?
Error/cancel paths — do they clean up resources from other holes?

All five experiments were re-run with the improved skills, the same prompts, and the same blind judging methodology.

Result: HDD v2 5 · Baseline 0 (was Baseline 4 · HDD 1)

Experiment	Version	🔍 Bugs	🏗️ Design	📖 Clarity	Winner
H1: Type Inference	Baseline	★★★☆☆	★★★☆☆	★★★☆☆
H1: Type Inference	HDD v2	★★★★☆	★★★★☆	★★★★★	Winner
H2: Go Pipeline	Baseline	★★★☆☆	★★★★☆	★★★★☆
H2: Go Pipeline	HDD v2	★★★★☆	★★★☆☆	★★★☆☆	Winner
H3: Three-Way Merge	Baseline	★★★☆☆	★★★☆☆	★★★☆☆
H3: Three-Way Merge	HDD v2	★★★★☆	★★★★★	★★★★★	Winner
H4: Build System	Baseline	★★★★☆	★★★☆☆	★★★★☆
H4: Build System	HDD v2	★★★☆☆	★★★★☆	★★★★☆	Winner
H5: Rate Limiter	Baseline	★★☆☆☆	★★☆☆☆	★★☆☆☆
H5: Rate Limiter	HDD v2	★★★☆☆	★★★★☆	★★★★☆	Winner

Persona	Baseline avg	HDD v1 avg	HDD v2 avg	Change (v1 → v2)
🔍 Bug Hunter	3.0	2.4	3.6	+1.2
🏗️ Architect	3.0	3.6	4.0	+0.4
📖 Pragmatist	3.2	3.8	4.2	+0.4

The VERIFY step and monolithic algorithm guidance closed the bug gap (+1.2 Bug Hunter relative to HDD v1) while boosting design (+0.4 Architect) and clarity (+0.4 Pragmatist). HDD v2 now leads the baseline on all three dimensions.

Notable VERIFY catches during v2 experiments:

H2: Data race on `ch` variable (multiple goroutines writing), workers continuing after cancellation
H4: `invalidate_downstream` would delete cache for freshly-built tasks (fixed with `already_built` parameter)
H5: `_can_acquire_locked` needed for two-phase probe-then-commit atomicity in composite

H3 was initially a baseline win. After adding monolithic algorithm guidance (keep tightly-coupled state machines as a single hole), the re-run produced a cleaner merge walk that the judges rated higher on all three dimensions.

Convergence

24/24 PASS in Phase 2. Zero skill revisions needed.

Phase 3 confirmed these results scale to hard problems: in 3/5 experiments HDD produced 30-37% less code, and in 3/5 experiments produced architecturally different solutions. Initial blind code review revealed HDD introduces subtle correctness bugs (Baseline 4 · HDD 1). Adding a VERIFY step and monolithic algorithm guidance fixed the gap (HDD v2 5 · Baseline 0).

The five rules governing HDD:

"Holes must be visible" — prevents mental-only decomposition
"Use named holes" — improves trackability in compiler feedback
"Each distinct concern gets a hole" — prevents under-decomposition
"Verify after filling" — catches cross-hole interaction bugs (Phase 3b)
"Don't decompose monolithic algorithms" — tightly-coupled state machines stay as one hole (Phase 3b H3 re-run)

What the Numbers Show

More iteration = more opportunities for correction

The 2–4x increase in tool calls is a feature, not overhead. Each iteration is a checkpoint where the agent re-reads the file state, reassesses which hole to fill next, and reasons about constraints before committing. Baseline agents commit to the entire implementation in one step.

Visible holes enable human oversight

With HDD, the human sees the skeleton evolve in their editor. They can intervene if a hole is decomposed incorrectly before the agent fills it. Baseline behavior shows the final code — by then it's too late to influence the approach.

Ambiguity detection prevents wrong guesses

Both ambiguity tests (A7, B8) correctly triggered the stop-and-ask behavior. The agents identified at least four valid interpretations each (five or more in A7, four in B8) and asked the human to choose. Without HDD skills, agents silently pick one interpretation.

Constraint ordering reduces errors

Filling the most constrained hole first means the agent makes easy, deterministic fills early, narrowing the remaining holes' contracts. Example from A5: filling `_push` first (only `modify (x:)` fits) resolved type ambiguity for `_pop` and `_evalRPN`.
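One way to picture "most constrained first" is as sorting holes by how many candidate fills remain. The sketch below is a toy model, not how the skill is implemented: the `fill_order` helper is hypothetical, and the candidate lists for `_pop` and `_evalRPN` are made-up placeholders (only the single `_push` candidate comes from the A5 write-up).

```python
from typing import Dict, List


def fill_order(candidates: Dict[str, List[str]]) -> List[str]:
    # Fewest remaining candidates first; sorted() is stable, so ties keep
    # their original order.
    return sorted(candidates, key=lambda hole: len(candidates[hole]))


# Placeholder candidate sets for illustration only.
HOLES = {
    "_evalRPN": ["candidate A", "candidate B", "candidate C"],
    "_pop": ["candidate A", "candidate B"],
    "_push": ["modify (x:)"],
}

print(fill_order(HOLES))
```

In practice the candidate sets shrink as earlier holes are filled, so the ordering would be recomputed after each fill rather than fixed up front.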

Skills compose without conflict

Integration tests (D1, D2) showed that the core skill and an extending skill work at different levels: core governs strategy (decompose, most constrained first), while the extending skill governs mechanism (compile/read/fill or reason/write/validate). No conflicts were observed.
