9. Benchmarks¶
All benchmarks on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (6C/12T), NixOS, GCC 15.2.0, GHC 9.10.3, powersave governor. Best-of-N timing to reduce CPU frequency variance.
9.1. Microbenchmarks¶
Source: bench_overhead.c
Fork/Join Overhead (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.464 | 0.033 | 14.1x faster * |
| 2 | 0.765 | 0.499 | 1.5x faster |
| 4 | 0.931 | 0.945 | 1.02 |
| 8 | 1.461 | 1.692 | 1.16 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Barrier Latency (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.276 | 0.002 | 138.0x faster * |
| 2 | 0.242 | 0.137 | 1.8x faster |
| 4 | 0.248 | 0.27 | 1.09 |
| 8 | 0.434 | 0.47 | 1.08 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Parallel For + Reduction (1M sin(), best of 10, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 15.973 | 15.388 | 0.96 |
| 2 | 7.453 | 7.641 | 1.03 |
| 4 | 3.777 | 3.879 | 1.03 |
| 8 | 3.507 | 3.5 | 1.00 |
Critical Section (1000 lock/unlock per thread, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.021 | 0.026 | 1.24 |
| 2 | 0.069 | 0.264 | 3.83 |
| 4 | 0.352 | 0.327 | 1.1x faster |
| 8 | 0.942 | 1.318 | 1.40 |
9.2. DGEMM¶
Source: bench_dgemm.c, omp_compute.c
Same naive triple-loop DGEMM compiled identically, linked against either native libgomp or our runtime. Checksums match exactly.
4 Threads¶
| N | Native (ms) | RTS (ms) | Ratio | GFLOPS (RTS) |
|---|---|---|---|---|
| 128 | 0.93 | 1.05 | 1.13x | 4.01 |
| 256 | 10.83 | 13.6 | 1.26x | 2.47 |
| 512 | 78.59 | 83.66 | 1.06x | 3.21 |
| 1024 | 670.05 | 654.66 | 0.98x | 3.28 |
Interleaved re-runs confirm the two runtimes trade leads: the difference is CPU frequency noise, not runtime overhead.
Scaling (RTS-backed, DGEMM 1024x1024)¶
| Threads | Time (ms) | GFLOPS | Speedup |
|---|---|---|---|
| 1 | 2965.94 | 0.72 | 1.0x |
| 2 | 1880.42 | 1.14 | 1.6x |
| 4 | 654.66 | 3.28 | 4.5x |
9.3. FFI Scaling¶
Source: HsMain.hs
Haskell calling parallel sinsum via safe FFI:
| Threads | Time (ms) | Speedup |
|---|---|---|
| 1 | 32.5 | 1.0x |
| 2 | 17.3 | 1.9x |
| 4 | 9.9 | 3.3x |
| 8 | 5.0 | 6.5x |
Near-linear scaling through the FFI boundary, confirming the runtime correctly parallelizes work dispatched from Haskell.
9.4. Parallelism Crossover¶
Source: HsCrossover.hs
When does OpenMP from Haskell beat sequential C? We measured sinsum (compute-bound, ~11ns per element) at various sizes with 4 threads.
| Elements | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 0.7 us | 2.8 us | 0.26x |
| 200 | 3.1 us | 2.9 us | 1.08x |
| 500 | 8.8 us | 5.8 us | 1.50x |
| 1000 | 10.2 us | 5.2 us | 1.96x |
| 5000 | 64.6 us | 21.7 us | 2.98x |
| 100000 | 1528.7 us | 385.5 us | 3.97x |
The crossover is at ~500 elements — above this, OpenMP parallel execution from Haskell is faster than sequential C called via unsafe FFI. The fixed overhead is ~1.8us (86ns safe FFI + 1712ns OpenMP fork/join).
9.5. GHC Native Parallelism vs OpenMP¶
Source: HsParCompare.hs
For the same compute-bound sinsum workload, how does Haskell's forkIO with
manual work splitting compare to OpenMP via safe FFI?
GHC NCG (default code generator)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 327.0 us | 148.0 us | 107.3 us | 42.9 us | 2.50x |
| 100K | 3283.6 us | 1549.8 us | 881.1 us | 375.8 us | 2.34x |
| 1M | 31922.8 us | 15417.0 us | 8637.3 us | 6626.1 us | 1.30x |
| 10M | 336020.5 us | 153998.0 us | 85324.7 us | 40187.1 us | 2.12x |
With the default NCG backend, OpenMP is consistently ~2x faster than parallel Haskell. Sequential Haskell is also ~2.2x slower than sequential C. The gap comes entirely from per-element code quality, not parallelism overhead — both achieve near-ideal scaling on 4 threads.
GHC LLVM backend (-fllvm)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 104 us | 103 us | 43 us | 28 us | 1.5x |
| 100K | 1109 us | 1101 us | 303 us | 278 us | 1.1x |
| 1M | 11156 us | 10998 us | 2972 us | 2866 us | 1.04x |
| 10M | 111751 us | 111112 us | 30924 us | 28747 us | 1.08x |
Compiling with -fllvm (LLVM 20.1) eliminates the gap entirely. Sequential
Haskell matches sequential C, and parallel Haskell reaches parity with OpenMP.
The 2x gap under NCG is purely a code generator limitation — GHC's inner loop
compiles to 17 instructions vs GCC/LLVM's 10. See
Appendix A.3
for the full assembly analysis.
9.6. Task Execution¶
Source: HsTaskDemo.hs, test_omp_tasks.c
Deferred task execution with work-stealing barriers (4 threads, best of 5):
| Tasks | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 1.5 ms | 0.4 ms | 3.81x |
| 500 | 7.7 ms | 2.5 ms | 3.12x |
| 1,000 | 15.6 ms | 5.0 ms | 3.12x |
| 5,000 | 78.1 ms | 19.9 ms | 3.93x |
| 10,000 | 155.9 ms | 42.6 ms | 3.66x |
Near-linear scaling (3.4-4.0x on 4 threads). Correctness verified against sequential reference with exact match.
9.7. Calling Convention Overhead¶
Source: HsCmmDemo.hs, omp_prims.cmm
| Convention | ns/call | Relative | Mechanism |
|---|---|---|---|
foreign import prim (Cmm) |
~0 | — | Direct register read, GHC optimizes away |
foreign import ccall unsafe |
3.1 | — | Save/restore STG registers |
foreign import ccall safe |
89.8 | 29x vs unsafe | + suspendThread/resumeThread |
9.8. Batched Calls¶
Source: HsCmmBatch.hs, omp_batch.cmm
Amortizing the ~68ns safe FFI overhead by batching N C calls within a single
suspendThread/resumeThread cycle:
| Batch size | Standard safe | Cmm batched | Speedup |
|---|---|---|---|
| 1 | 98.3 ns | 97.9 ns | 1.0x |
| 2 | 100.0 ns | 50.3 ns | 2.0x |
| 5 | 97.3 ns | 20.8 ns | 4.7x |
| 10 | 97.4 ns | 12.4 ns | 7.8x |
| 20 | 102.6 ns | 7.6 ns | 13.4x |
| 50 | 104.6 ns | 4.7 ns | 22.2x |
| 100 | 97.8 ns | 3.7 ns | 26.4x |
At batch=100, per-call overhead drops to 2.7 ns — within 35% of unsafe FFI
cost (~2 ns). The results match the theoretical prediction (68 + N × 2) / N
closely at every batch size.