9. Benchmarks¶

All benchmarks on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (6C/12T), NixOS, GCC 15.2.0, GHC 9.10.3, powersave governor. Best-of-N timing to reduce CPU frequency variance.

9.1. Microbenchmarks¶

Source: bench_overhead.c

Fork/Join Overhead (us/iter)¶

Threads	Native libgomp	RTS-backed	Ratio
1	0.464	0.033	14.1x faster *
2	0.765	0.499	1.5x faster
4	0.931	0.945	1.02
8	1.461	1.692	1.16

* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.

Barrier Latency (us/iter)¶

Threads	Native libgomp	RTS-backed	Ratio
1	0.276	0.002	138.0x faster *
2	0.242	0.137	1.8x faster
4	0.248	0.27	1.09
8	0.434	0.47	1.08

* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.

Parallel For + Reduction (1M sin(), best of 10, ms)¶

Threads	Native libgomp	RTS-backed	Ratio
1	15.973	15.388	0.96
2	7.453	7.641	1.03
4	3.777	3.879	1.03
8	3.507	3.5	1.00

Critical Section (1000 lock/unlock per thread, ms)¶

Threads	Native libgomp	RTS-backed	Ratio
1	0.021	0.026	1.24
2	0.069	0.264	3.83
4	0.352	0.327	1.1x faster
8	0.942	1.318	1.40

9.2. DGEMM¶

Source: bench_dgemm.c, omp_compute.c

Same naive triple-loop DGEMM compiled identically, linked against either native libgomp or our runtime. Checksums match exactly.

4 Threads¶

N	Native (ms)	RTS (ms)	Ratio	GFLOPS (RTS)
128	0.93	1.05	1.13x	4.01
256	10.83	13.6	1.26x	2.47
512	78.59	83.66	1.06x	3.21
1024	670.05	654.66	0.98x	3.28

Interleaved re-runs confirm the two runtimes trade leads: the difference is CPU frequency noise, not runtime overhead.

DGEMM Head-to-Head: Native libgomp vs RTS

Scaling (RTS-backed, DGEMM 1024x1024)¶

Threads	Time (ms)	GFLOPS	Speedup
1	2965.94	0.72	1.0x
2	1880.42	1.14	1.6x
4	654.66	3.28	4.5x

9.3. FFI Scaling¶

Source: HsMain.hs

Haskell calling parallel sinsum via safe FFI:

Threads	Time (ms)	Speedup
1	32.5	1.0x
2	17.3	1.9x
4	9.9	3.3x
8	5.0	6.5x

Near-linear scaling through the FFI boundary, confirming the runtime correctly parallelizes work dispatched from Haskell.

9.4. Parallelism Crossover¶

Source: HsCrossover.hs

When does OpenMP from Haskell beat sequential C? We measured sinsum (compute-bound, ~11ns per element) at various sizes with 4 threads.

Elements	Sequential	Parallel	Speedup
100	0.7 us	2.8 us	0.26x
200	3.1 us	2.9 us	1.08x
500	8.8 us	5.8 us	1.50x
1000	10.2 us	5.2 us	1.96x
5000	64.6 us	21.7 us	2.98x
100000	1528.7 us	385.5 us	3.97x

The crossover is at ~500 elements — above this, OpenMP parallel execution from Haskell is faster than sequential C called via unsafe FFI. The fixed overhead is ~1.8us (86ns safe FFI + 1712ns OpenMP fork/join).

Parallelism Crossover: Sequential vs Parallel

9.5. GHC Native Parallelism vs OpenMP¶

Source: HsParCompare.hs

For the same compute-bound sinsum workload, how does Haskell's forkIO with manual work splitting compare to OpenMP via safe FFI?

GHC NCG (default code generator)¶

Elements	Seq Haskell	Seq C	Par Haskell	Par OpenMP	Hs/OMP ratio
10K	327.0 us	148.0 us	107.3 us	42.9 us	2.50x
100K	3283.6 us	1549.8 us	881.1 us	375.8 us	2.34x
1M	31922.8 us	15417.0 us	8637.3 us	6626.1 us	1.30x
10M	336020.5 us	153998.0 us	85324.7 us	40187.1 us	2.12x

With the default NCG backend, OpenMP is consistently ~2x faster than parallel Haskell. Sequential Haskell is also ~2.2x slower than sequential C. The gap comes entirely from per-element code quality, not parallelism overhead — both achieve near-ideal scaling on 4 threads.

GHC LLVM backend (`-fllvm`)¶

Elements	Seq Haskell	Seq C	Par Haskell	Par OpenMP	Hs/OMP ratio
10K	104 us	103 us	43 us	28 us	1.5x
100K	1109 us	1101 us	303 us	278 us	1.1x
1M	11156 us	10998 us	2972 us	2866 us	1.04x
10M	111751 us	111112 us	30924 us	28747 us	1.08x

Compiling with -fllvm (LLVM 20.1) eliminates the gap entirely. Sequential Haskell matches sequential C, and parallel Haskell reaches parity with OpenMP. The 2x gap under NCG is purely a code generator limitation — GHC's inner loop compiles to 17 instructions vs GCC/LLVM's 10. See Appendix A.3 for the full assembly analysis.

9.6. Task Execution¶

Source: HsTaskDemo.hs, test_omp_tasks.c

Deferred task execution with work-stealing barriers (4 threads, best of 5):

Tasks	Sequential	Parallel	Speedup
100	1.5 ms	0.4 ms	3.81x
500	7.7 ms	2.5 ms	3.12x
1,000	15.6 ms	5.0 ms	3.12x
5,000	78.1 ms	19.9 ms	3.93x
10,000	155.9 ms	42.6 ms	3.66x

Near-linear scaling (3.4-4.0x on 4 threads). Correctness verified against sequential reference with exact match.

9.7. Calling Convention Overhead¶

Source: HsCmmDemo.hs, omp_prims.cmm

Convention	ns/call	Relative	Mechanism
`foreign import prim` (Cmm)	~0	—	Direct register read, GHC optimizes away
`foreign import ccall unsafe`	3.1	—	Save/restore STG registers
`foreign import ccall safe`	89.8	29x vs unsafe	+ suspendThread/resumeThread

9.8. Batched Calls¶

Source: HsCmmBatch.hs, omp_batch.cmm

Amortizing the ~68ns safe FFI overhead by batching N C calls within a single suspendThread/resumeThread cycle:

Batch size	Standard safe	Cmm batched	Speedup
1	98.3 ns	97.9 ns	1.0x
2	100.0 ns	50.3 ns	2.0x
5	97.3 ns	20.8 ns	4.7x
10	97.4 ns	12.4 ns	7.8x
20	102.6 ns	7.6 ns	13.4x
50	104.6 ns	4.7 ns	22.2x
100	97.8 ns	3.7 ns	26.4x

At batch=100, per-call overhead drops to 2.7 ns — within 35% of unsafe FFI cost (~2 ns). The results match the theoretical prediction (68 + N × 2) / N closely at every batch size.