ghc-openmp: GHC's Runtime System as an OpenMP Runtime¶
An OpenMP runtime backed by GHC RTS Capabilities
GitHub · API Reference (Haddock)
Contents¶
- Abstract
- Motivation
- Background
- Architecture
- Optimization: From 24x Slower to Parity
- Haskell Integration
- Low-Level Techniques
- Shared Memory Demos
- Benchmarks
- Implementation Timeline
- Notable Bugs and Fixes
- Limitations
- Related Work
- Conclusions
- Appendix A.1: Implemented ABI Surface
- Appendix A.2: GOMP ABI Primer
- Appendix A.3: NCG vs LLVM Code Generation
- Appendix A.4: GHC RTS Internals
- Appendix A.5: Sense-Reversing Barrier
- Appendix A.6: Zero-Copy FFI with Pinned ByteArray
- Appendix A.7: Linear Typed Arrays
1. Abstract¶
We implement a drop-in OpenMP runtime library that uses GHC's Runtime System
as its threading infrastructure. Standard C code compiled with
gcc -fopenmp runs on GHC Capabilities instead of libgomp's
pthreads. The runtime implements the GCC GOMP_* ABI and the
omp_* user API, supporting parallel regions, worksharing loops,
barriers, critical sections, tasks, and sections.
After lock-free optimization, the runtime achieves performance parity
with native libgomp on both microbenchmarks and real numerical workloads
(dense matrix multiplication). Haskell programs call OpenMP-parallelized C code
via FFI, with both runtimes sharing the same thread pool. OpenMP workers
call back into Haskell via FunPtr with automatic Capability
acquisition. GHC's stop-the-world garbage collector does not pause OpenMP
workers because they do not hold Capabilities. The culmination is
type-safe shared memory: using GHC's linear types, Haskell and OpenMP C
code operate on disjoint regions of the same array with compile-time
proof of safety and zero synchronization overhead.
2. Motivation¶
GHC's RTS has a mature, production-quality thread pool with per-Capability run queues, work-stealing spark pools, NUMA awareness, and sophisticated scheduling. OpenMP runtimes (libgomp, libomp) maintain their own, separate thread pools. When a Haskell program calls OpenMP-annotated C code via FFI, two independent thread pools compete for the same CPU cores.
If the OpenMP runtime used GHC's thread pool directly, we would get:
- Unified resource management — one thread pool, not two
- Seamless interop — Haskell green threads and OpenMP parallel regions coexist naturally
- GHC as a platform — C programs benefit from GHC's scheduler, and Haskell programs get access to OpenMP's parallel-for without reinventing it
- Type-safe shared memory — linear types can prove disjoint access at compile time, eliminating synchronization barriers when Haskell and C share data
This project investigates whether this is feasible and what the performance cost is.
3. Background¶
3.1. GHC RTS Capabilities¶
A Capability is GHC's central execution unit: one OS thread, one
run queue of lightweight Haskell threads (TSOs), and one work-stealing spark
pool. The number of Capabilities is set by +RTS -N. Each
Capability has a 0-indexed number (cap->no) that maps directly to
OpenMP's omp_get_thread_num().
Key RTS APIs for embedding:
hs_init_ghc(&argc, &argv, conf); // Boot the RTS
Capability *cap = rts_lock(); // Acquire a Capability
rts_unlock(cap); // Release it
rts_setInCallCapability(i, 1); // Pin OS thread to Capability i
uint32_t getNumCapabilities(void); // Current Capability count
uint32_t getNumberOfProcessors(void); // CPU count
hs_init_ghc() is reference-counted: calling it when the RTS is
already running (as in a Haskell host program) simply increments the counter
and returns. This is the key to transparent interop — our runtime
auto-detects whether it is being hosted by a C program or a Haskell program.
3.2. The libgomp ABI¶
Source: ghc_omp_runtime_rts.c
GCC transforms OpenMP pragmas into calls to GOMP_* functions.
For example:
#pragma omp parallel
{ body; }
// becomes:
void outlined_fn(void *data) { body; }
GOMP_parallel(outlined_fn, &data, num_threads, flags);
A minimum viable runtime needs only 9 symbols (GOMP_parallel,
GOMP_barrier, GOMP_critical_start/end,
GOMP_single_start, GOMP_task,
GOMP_taskwait, omp_get_num_threads,
omp_get_thread_num). Full OpenMP 4.5 coverage requires ~85
symbols. Our implementation provides ~97.
3.3. Cmm and foreign import prim¶
Source: omp_prims.cmm
Cmm (C minus minus) is GHC's low-level intermediate representation — a portable assembly language that sits between STG and native code. GHC compiles all Haskell to Cmm before generating machine code.
GHC provides three FFI calling conventions with different overhead:
| Convention | Mechanism | Overhead |
|---|---|---|
| foreign import ccall safe | Releases Capability, calls C, reacquires | ~68 ns |
| foreign import ccall unsafe | Saves STG registers, calls C, restores | ~2 ns |
| foreign import prim | Direct STG register passing, no boundary | ~0 ns |
The prim convention is the fastest: arguments pass directly in GHC's STG
registers (R1, R2, ...) with no calling convention switch. Functions written
in Cmm can access RTS internals like MyCapability() directly. GHC treats
prim calls as pure expressions and can optimize them away entirely
(loop-invariant code motion, common subexpression elimination).
The inline-cmm library lets you
embed Cmm code directly in Haskell modules via a [cmm| ... |] quasiquoter
(similar to inline-c for C code). It automatically generates the
foreign import prim declaration and compiles the Cmm via Template Haskell.
4. Architecture¶
Source: ghc_omp_runtime_rts.c
Figure 1: Runtime architecture. Workers are plain OS threads pinned to
GHC Capabilities. After rts_lock()/rts_unlock() init, they do NOT
hold Capabilities — invisible to GC.
Design decision: We chose a hybrid approach — a C shim calling GHC RTS APIs — over modifying GHC's RTS source directly or using
foreign export. This keeps the runtime as a single .c file with no GHC fork required.
4.1. Worker Pool Design¶
N-1 worker threads are created at initialization. Each is pinned to a GHC
Capability via rts_setInCallCapability(i, 1), performs one
rts_lock()/rts_unlock() cycle to register with the RTS, then enters
a spin-wait loop on an atomic generation counter.
The master thread (Capability 0) dispatches work by:
- Storing the function pointer and data
- Atomically incrementing the generation counter (release fence)
- Broadcasting a condvar (for sleeping workers)
- Participating in the start barrier, executing
fn(data), and hitting the end barrier
4.2. Synchronization Primitives¶
All barriers use a sense-reversing centralized barrier: each thread maintains a local sense flag. Threads atomically decrement a shared counter; the last thread flips the global sense, releasing all waiters. This is fully lock-free on the fast path.
Workers use a spin-then-sleep strategy: spin for ~4000
iterations (configurable via OMP_WAIT_POLICY: passive=100,
active=10000) on the generation counter (using _mm_pause), then
sched_yield() after the spin threshold, then fall back to a
pthread_cond_wait for power efficiency during idle periods.
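The wait loop can be sketched as follows — a minimal sketch with illustrative names (wait_for_generation and SPIN_LIMIT are not the runtime's actual identifiers), and with the final pthread_cond_wait tier elided:

```c
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 4000   /* OMP_WAIT_POLICY would tune this: passive=100, active=10000 */

/* Spin on the generation counter until it moves past my_gen, pausing the
 * pipeline while spinning, then yielding the core once the budget is spent.
 * (The real runtime additionally falls back to pthread_cond_wait.) */
static unsigned wait_for_generation(atomic_uint *gen, unsigned my_gen) {
    int spins = 0;
    unsigned g;
    while ((g = atomic_load_explicit(gen, memory_order_acquire)) == my_gen) {
        if (++spins < SPIN_LIMIT) {
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();    /* _mm_pause: reduce spin power/contention */
#endif
        } else {
            sched_yield();             /* be polite after the spin budget */
        }
    }
    return g;                          /* the new generation observed */
}
```

The acquire load pairs with the master's release increment (§5.1), so a worker that observes the new generation also sees the function pointer and data stored before it.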
Worksharing loops support static, dynamic, guided, and runtime scheduling.
Guided scheduling uses a CAS-based loop with exponentially-decreasing
chunk sizes (remaining / nthreads), giving large initial chunks and
progressively smaller ones for load balancing.
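The chunk-size arithmetic can be sketched as follows (next_guided_chunk is an illustrative name; the real runtime claims each chunk with a CAS loop on a shared iteration counter, which is elided here):

```c
/* Guided schedule: each grab takes remaining/nthreads iterations, clamped
 * below by chunk_min and above by what is actually left. */
static long next_guided_chunk(long remaining, int nthreads, long chunk_min) {
    long chunk = remaining / nthreads;
    if (chunk < chunk_min) chunk = chunk_min;
    if (chunk > remaining) chunk = remaining;   /* don't overrun the tail */
    return chunk;
}
```

For 1000 iterations on 4 threads this yields 250, 187, 140, ... — large chunks up front, small ones at the end for load balancing.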
Serialized nested parallelism: Inner parallel regions execute with 1
thread. Full omp_get_level(), omp_get_active_level(),
omp_get_ancestor_thread_num(), and omp_get_team_size() support with
per-thread nesting state up to 8 levels deep.
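A toy model of that per-thread nesting state — nest_state_t, enter_region, and ancestor_thread_num are illustrative names, not the runtime's actual identifiers:

```c
#define MAX_NEST 8
typedef struct {
    int level;              /* current nesting depth */
    int tnum[MAX_NEST];     /* thread num recorded at each level */
    int tsize[MAX_NEST];    /* team size recorded at each level */
} nest_state_t;

/* Push one level of nesting state on entry to a (possibly serialized) region. */
static void enter_region(nest_state_t *s, int thread_num, int team_size) {
    if (s->level < MAX_NEST) {
        s->tnum[s->level] = thread_num;
        s->tsize[s->level] = team_size;
    }
    s->level++;
}

/* Mirrors omp_get_ancestor_thread_num: -1 for out-of-range levels. */
static int ancestor_thread_num(const nest_state_t *s, int lvl) {
    if (lvl < 0 || lvl >= s->level || lvl >= MAX_NEST) return -1;
    return s->tnum[lvl];
}
```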
4.3. Task Queues and Work Stealing¶
Source: ghc_omp_runtime_rts.c, test_omp_tasks.c
OpenMP tasks (#pragma omp task) enable fork-join parallelism where one
thread creates work items and other threads steal them. Our runtime supports
deferred execution: tasks are queued to per-Capability work-stealing queues
and executed by idle threads waiting at barriers.
Each Capability has its own task queue protected by an atomic_flag spinlock,
with an atomic pending counter for fast-path bypass:
- GOMP_task: When if_clause is true and we're in a parallel region, copies data to the heap (via cpyfn or memcpy) and pushes to the local queue. Otherwise executes inline. Task descriptors are allocated from a pre-allocated pool (4096 entries) with per-Capability lock-free free lists.
- Work stealing: Idle threads first pop from their own queue, then steal from other threads' queues via a linear scan from a pseudo-random start.
- Barrier task stealing: spin_barrier_wait_tasks checks g_tasks_pending before attempting steals (avoiding expensive atomic operations when no tasks exist). The last thread arriving drains remaining tasks before releasing the barrier.
- End-of-parallel stealing: The pool's end-barrier uses the task-stealing variant, since GCC may omit explicit GOMP_barrier calls after #pragma omp single.
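A minimal model of one such queue — QCAP, task_queue_t, and the function names are illustrative, and the pending-counter fast path and descriptor pool are elided:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 64
typedef struct {
    void *items[QCAP];
    int head, tail;          /* thieves steal from head, the owner pops from tail */
    atomic_flag lock;        /* zero-initialized state is "clear" on mainstream ABIs */
} task_queue_t;

static void q_lock(task_queue_t *q) {
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                    /* spin: the critical sections below are tiny */
}
static void q_unlock(task_queue_t *q) {
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static bool q_push(task_queue_t *q, void *task) {
    q_lock(q);
    bool ok = (q->tail - q->head) < QCAP;
    if (ok) q->items[q->tail++ % QCAP] = task;
    q_unlock(q);
    return ok;
}

static void *q_pop(task_queue_t *q) {    /* owner: LIFO for cache locality */
    q_lock(q);
    void *t = NULL;
    if (q->tail > q->head) t = q->items[--q->tail % QCAP];
    q_unlock(q);
    return t;
}

static void *q_steal(task_queue_t *q) {  /* thief: FIFO from the other end */
    q_lock(q);
    void *t = NULL;
    if (q->tail > q->head) t = q->items[q->head++ % QCAP];
    q_unlock(q);
    return t;
}
```

Owners pop newest-first (hot caches); thieves take oldest-first, which tends to steal the largest remaining subtrees of work.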
Benchmark results are in Section 9.6.
5. Optimization: From 24x Slower to Parity¶
The Phase 2 runtime was functional but slow: fork/join took 24 us vs libgomp's 1 us. The bottleneck was mutex+condvar on every operation. We eliminated all locks from the hot path:
5.1. Lock-free Work Dispatch¶
// Master: store work, then release-fence generation increment
g_pool.fn = fn;
g_pool.data = data;
atomic_fetch_add(&g_pool.generation, 1, memory_order_release);
// Worker: spin on generation (acquire-fence)
while (atomic_load(&g_pool.generation, memory_order_acquire) == my_gen)
_mm_pause();
No mutex on the hot path. The condvar broadcast is only for workers that fell asleep after 4000 spin iterations.
5.2. Sense-Reversing Barrier¶
The sense-reversing centralized barrier follows Mellor-Crummey & Scott's algorithm ("Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS 9(1), 1991):
void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
*local_sense = 1 - *local_sense;
if (atomic_fetch_sub(&b->count, 1, memory_order_acq_rel) == 1) {
// Last thread: reset counter, flip global sense
atomic_store(&b->count, b->size, memory_order_relaxed);
atomic_store(&b->sense, *local_sense, memory_order_release);
} else {
// Spin until sense matches
while (atomic_load(&b->sense, memory_order_acquire) != *local_sense)
_mm_pause();
}
}
Pure atomic operations, no locks. The centralized design has O(N) wakeup but is optimal for small team sizes (typical OpenMP use).
5.3. Results¶
| Metric (4 threads) | Phase 2 | Phase 3 | Native libgomp |
|---|---|---|---|
| Fork/join | 24.35 us | 0.81 us | 0.97 us |
| Barrier | 7.01 us | 0.25 us | 0.51 us |
| Parallel for (1M sin) | 6.71 ms | 3.91 ms | 3.85 ms |
| Critical section | 0.39 ms | 0.38 ms | 0.92 ms |
After optimization, the RTS-backed runtime matches or beats native libgomp on all benchmarks.
With parity established, the runtime becomes a platform. The following sections build the capabilities — FFI integration, GC isolation, zero-copy sharing, linear types — that culminate in Haskell and OpenMP collaborating on shared data structures (§8).
6. Haskell Integration¶
This section covers the integration between Haskell and the OpenMP runtime: calling conventions, initialization, concurrent execution, garbage collection behavior, and bidirectional callbacks.
See also the Haddock API reference for the GHC.OpenMP module.
6.1. FFI Calling Convention¶
Source: HsMain.hs
Haskell calls OpenMP C code via foreign import ccall safe.
The safe keyword is critical: it tells GHC to release the
calling Capability before entering the foreign code, and reacquire it on
return. This means:
- Other Haskell green threads can run on the released Capability
- The C code enters GOMP_parallel, which dispatches to the worker pool — including potentially the Capability just released
- No deadlock: workers don't need to hold Capabilities to execute C compute kernels
6.2. RTS Initialization¶
When called from a Haskell host, hs_init_ghc() is already done
by GHC before main. Our runtime's ensure_rts() calls
hs_init_ghc() again, which simply increments the reference count
and returns. The runtime discovers the existing Capabilities via
getNumCapabilities() and spawns workers for Caps 1..N-1.
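The reference-counting behavior can be modeled in miniature — ensure_rts here is a toy stand-in with illustrative names (the real one calls hs_init_ghc, which performs its own counting internally):

```c
#include <stdatomic.h>

static atomic_int rts_refs = 0;   /* how many clients have requested the RTS */
static int boot_count = 0;        /* how many times we actually booted it */

/* Only the first caller boots; every later call just bumps the count.
 * In the real runtime the body of the if would be hs_init_ghc(...). */
static void ensure_rts(void) {
    if (atomic_fetch_add(&rts_refs, 1) == 0)
        boot_count++;
}
```

This is why the same runtime binary works both standalone (it boots the RTS itself) and hosted by a Haskell program (the RTS is already up, so the call is a no-op increment).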
6.3. Concurrent Execution¶
Source: HsConcurrent.hs
-- Haskell green thread: pure computation
_ <- forkIO $ do
let !result = haskellSinSum 1200000
putMVar hsDone result
-- OpenMP FFI call (safe: releases Capability)
_ <- forkIO $ do
result <- c_parallel_sinsum 12000000
putMVar ompDone result
-- Both run simultaneously!
Measured: sequential 68ms → concurrent 58ms, with 10ms of overlapping execution confirmed.
6.4. Garbage Collection Isolation¶
Source: HsGCStress.hs
A key concern: GHC's stop-the-world GC pauses all threads holding Capabilities. Would this stall OpenMP workers?
Answer: No. OpenMP workers do not hold Capabilities during parallel execution. After their initial
rts_lock()/rts_unlock() registration, they are plain OS threads spinning on atomic variables. GC only synchronizes Capability-holding threads — our workers are invisible.
Experimental Validation¶
We ran 500 OpenMP parallel regions (each ~400us) concurrently with:
| Scenario | p50 (us) | p99 (us) | max (us) |
|---|---|---|---|
| Baseline (OpenMP alone) | 314–478 | 636–658 | 692–783 |
| + allocation pressure (50K rounds) | 313–543 | 538–651 | 585–691 |
| + forced major GC (20 × performGC) | 315–556 | 549–744 | 574–2262 |
Allocation pressure has negligible impact (within noise). Forced major GCs produced one outlier spike of 2262us on one run and none on another. The spike correlates with the GHC RTS reporting a 1.6ms max GC pause — likely the OS thread making the FFI call had its Capability briefly paused at a region boundary.
GHC RTS statistics: 99.7% productivity, GC time <0.5% of elapsed.
6.5. Bidirectional Callbacks¶
Source: HsCallback.hs, omp_compute.c
The previous sections demonstrated Haskell calling OpenMP. OpenMP workers can also call back into Haskell from within a parallel region.
Mechanism¶
Haskell creates a FunPtr via
foreign import ccall "wrapper":
foreign import ccall "wrapper"
mkCallback :: (CInt -> IO CDouble)
-> IO (FunPtr (CInt -> IO CDouble))
sinCb <- mkCallback (\i -> return (sin (fromIntegral i * 0.001)))
GHC generates a C stub that wraps the Haskell closure with automatic Capability management:
// Generated wrapper (simplified):
CDouble wrapper(CInt arg) {
Capability *cap = rts_lock(); // acquire Capability
// ... evaluate Haskell closure ...
rts_unlock(cap); // release Capability
return result;
}
The C code calls this FunPtr from inside an OpenMP parallel for:
double parallel_reduce_callback(hs_callback_t callback, int n) {
double sum = 0.0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (int i = 0; i < n; i++)
sum += callback(i); // each worker calls into Haskell
return sum;
}
Correctness¶
All results verified against pure C and pure Haskell reference implementations:
| Test | Result | Status |
|---|---|---|
| parallel_map (1000 sin values) | Element-wise match to 1e-10 | OK |
| parallel_reduce (100K sin sum) | 1839.343386 (matches pure C) | OK |
| polynomial callback (10K) | 1109840.005000 (matches Haskell) | OK |
Performance¶
| Threads | Pure C (ms) | Callback (ms) | Overhead | Per-callback |
|---|---|---|---|---|
| 1 | 1.69 | 46.60 | 27.6x | ~0.47 us |
| 2 | 1.17 | 60.43 | 51.8x | ~0.60 us |
| 4 | 0.71 | 57.91 | 82.1x | ~0.58 us |
The per-callback cost of ~0.5us is the rts_lock()/rts_unlock()
round-trip. This is constant regardless of what the Haskell function does.
For callbacks that perform milliseconds of work (e.g., looking up a Haskell
data structure, evaluating a complex expression), the overhead is negligible.
For tight inner loops like 100K trivial sin() calls, pure C should be used
instead.
Practical guideline: Use Haskell callbacks when each invocation does ≥100us of work. Below that, the rts_lock/rts_unlock overhead dominates. Structure code so that OpenMP handles the hot numerical loop in C, and calls Haskell for complex logic at coarser granularity.
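The ≥100us guideline follows from simple arithmetic — a sketch, assuming the ~0.5us lock cost measured above (callback_overhead_fraction is an illustrative name):

```c
/* Fraction of each callback's wall time spent in rts_lock()/rts_unlock(),
 * given work_us microseconds of useful Haskell work per invocation. */
static double callback_overhead_fraction(double work_us) {
    const double lock_us = 0.5;   /* ~0.5 us round-trip, from Section 6.5 */
    return lock_us / (lock_us + work_us);
}
```

At 100us of work per callback the overhead is under 0.5%; at 0.5us of work it is 50%, which is the regime the benchmark table above measures.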
7. Low-Level Techniques¶
This section describes advanced techniques for reducing overhead at the Haskell-C boundary: zero-overhead Cmm primitives and batched FFI calls.
7.1. Cmm Primitives¶
Source: omp_prims.cmm, HsCmmDemo.hs
GHC provides three calling conventions for foreign code, each with different
overhead. We wrote a Cmm primitive that reads Capability_no(MyCapability())
— the same value as omp_get_thread_num() — to measure the overhead of each
tier (see Section 9.7).
The Cmm Primitive¶
The primitive is called from Haskell via a foreign import prim declaration (source: omp_prims.cmm).
The inline-cmm library provides a
[cmm| ... |] quasiquoter that eliminates the need for separate .cmm files
and manual foreign import prim declarations — Template Haskell handles
compilation and linking automatically, producing identical zero-overhead results.
Key Findings¶
Prim calls are truly free: GHC treats foreign import prim functions as
pure expressions. The Cmm function compiles to a single memory load from
BaseReg, which GHC can hoist out of loops entirely via LICM (loop-invariant
code motion). In a tight loop, 100M prim calls complete in <1ms.
The safe FFI tax is ~65ns: Each foreign import ccall safe call costs ~65ns
more than unsafe — the price of suspendThread()/resumeThread() which
release and reacquire the Capability. For OpenMP regions doing >1us of work,
this is negligible (<7%).
The callback overhead gap: The ~500ns per callback overhead from
Section 6.5 is ~7x larger than the raw safe FFI
cost (~68ns). The difference comes from rts_lock()/rts_unlock() performing
additional work beyond suspendThread()/resumeThread(): Task structure
allocation, global Capability search, and lock acquisition. A Cmm-level fast
path could potentially reduce this.
7.2. Batched Safe Calls¶
Source: omp_batch.cmm, HsCmmBatch.hs
The safe FFI tax of ~68ns per call comes from suspendThread()/resumeThread(),
which release and reacquire the Capability. For workloads that make many short
C calls, this overhead dominates.
By writing the suspend/resume manually in Cmm, we can batch N C calls within a single Capability release/reacquire cycle, amortizing the ~68ns overhead across all N calls.
The Batching Primitive¶
#include "Cmm.h"
cmm_batched_tid(W_ n) {
W_ tok; W_ result; W_ new_base; W_ stack;
W_ i; W_ t;
/* Save Sp to TSO — GC needs valid stack pointer */
stack = StgTSO_stackobj(CurrentTSO);
StgStack_sp(stack) = Sp;
(tok) = ccall suspendThread(BaseReg "ptr", 0);
result = 0; i = 0;
goto loop_check;
loop_body:
(t) = ccall omp_get_thread_num();
result = result + t;
i = i + 1;
loop_check:
if (i < n) goto loop_body;
(new_base) = ccall resumeThread(tok);
BaseReg = new_base;
/* Restore Sp — GC may have moved the stack */
stack = StgTSO_stackobj(CurrentTSO);
Sp = StgStack_sp(stack);
return (result);
}
Implementation Details¶
Three details were critical for correctness:
- Save/restore Sp: suspendThread releases the Capability, allowing GC to run. The GC needs a valid Sp in the TSO to scan the suspended thread's stack. Without this, the GC follows a stale stack pointer and crashes.
- No "ptr" on tok: The token from suspendThread is void*, not a GC-traceable pointer. Annotating it as "ptr" would tell GHC's Cmm compiler to treat it as a GC root, causing the GC to follow a non-heap pointer.
- State# threading: foreign import prim is pure by default — GHC can CSE or hoist the call. Threading State# RealWorld through the type signature makes GHC treat it as effectful while adding zero runtime cost (State# is erased at the Cmm level).
Benchmark results showing speedups up to 27x are in Section 9.8.
8. Shared Memory Demos¶
Source: HsSharedMem1.hs, HsSharedMem2.hs, HsSharedMem3.hs, HsSharedMem4.hs, HsSharedMem5.hs, omp_shared.c
Everything in this project builds to this: Haskell and OpenMP C code operating on the same data, concurrently, with type-level proof that their access patterns are safe. The unified runtime (§6) makes shared memory possible; zero-copy FFI (§A.6) makes it practical; linear types (§A.7) make it correct by construction.
Five demos show this progression — from sequential handoff, through
defensive synchronization, to compile-time proof of disjoint access,
safety guarantees, and finally pure Haskell parallelism via GHC sparks.
All use the same workload: element-wise
f(x) = sin(x) * cos(x) + sqrt(|x|), applied by both Haskell and C/OpenMP
to portions of a shared array.
Demo 1: The basic pattern works¶
Sequential handoff — Haskell fills, C transforms, Haskell reads:
arrIn <- newPinnedDoubles n -- pinned: not moved by GC
arrOut <- newPinnedDoubles n
forM_ [0..n-1] $ \i -> writeD arrIn i (fromIntegral i * 0.001)
c_transform_all (ptrOf arrIn) (ptrOf arrOut) (fromIntegral n)
s <- sumArray arrOut n -- read back in Haskell
No concurrent access, no synchronization. This establishes that the
plumbing works: a pinned ByteArray created in Haskell is directly
readable and writable by C/OpenMP code via mutableByteArrayContents#.
Demo 2: The problem — defensive synchronization¶
Haskell and C/OpenMP each process a disjoint half of the output array.
Without compile-time proof of disjointness, a GOMP_barrier() is needed
for memory visibility — even though the regions never actually overlap:
hsTransformRange arrIn arrOut 0 half -- Haskell: [0, half)
c_transform_range_barrier pIn pOut half (n - half) -- C: [half, n) + barrier
This "defensive synchronization" is the cost of not having type-level guarantees about disjoint access. The barrier is cheap (~0.3 us), but it represents a fundamental limitation: the programmer must manually reason about which regions are disjoint, and the compiler cannot help.
Demo 3: The solution — linear types eliminate barriers¶
Same partition, but using Data.Array.Linear's split/combine to prove
disjointness at the type level:
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = linearTransform rwL arrIn arrL -- Haskell: left half
!() = unsafePerformIO $ -- C: right half
unsafeWithPtr arrOut $ \pOut ->
unsafeWithPtr arrIn $ \pIn ->
c_transform_range pIn pOut
(fromIntegral half) (fromIntegral (n - half))
in combine st rwL' rwR -- no barrier
The type system enforces:
- rwL grants exclusive write access to the left half
- rwR grants exclusive access to the right half — consumed by combine
- No Haskell code can use rwR to access arrR while C is processing it
- combine is zero-cost (no allocation, no copying)
Demo 4: Safety — why barriers exist¶
Two examples showing that removing barriers (nowait) without proof of
disjointness silently introduces bugs.
Part A — Off-by-one overlap. Each partition writes [off..off+chunk+1)
instead of [off..off+chunk). Boundary elements are written by two
partitions. With nowait, this is a data race; with barriers, it is
deterministic but still wrong (double-write at boundaries). With linear
split, overlapping ranges are impossible — the type system rejects them:
| Variant | N=10K, P=4 | N=100K, P=16 | N=1M, P=64 |
|---|---|---|---|
| Disjoint (correct) | 0.00 | 0.00 | 0.00 |
| Overlap + barrier | 3.06 | 9.26 | 31.8 |
| Overlap + nowait | 3.06 | 9.26 | 31.8 |
Part B — Two-pass stencil. Pass 1 writes out[i] = f(in[i])
(independent per element); pass 2 reads neighbors
out[i] = avg(out[i-1..i+1]) (crosses partition boundaries). Without
a barrier between passes, pass 2 reads stale data. With linear types,
combine forces pass 1 to complete before pass 2 can access the parent
token needed for cross-boundary reads:
let rw1 = linearMultiPass1 rw arrIn arrOut 0 numParts -- partitioned pass 1
-- combine happened inside — rw1 is the parent token
in linearPass2 rw1 arrOut -- pass 2 needs parent
| Variant | N=10K | N=100K | N=1M |
|---|---|---|---|
| C nowait (worst of 10 runs) | 2.68e-2 | 1.09e-2 | 4.44e-3 |
| Haskell linear vs ref | 0.00 | 0.00 | 0.00 |
The C nowait version produces wrong results (stale reads at partition
boundaries). The Haskell linear version is both correct and barrier-free.
Demo 5: GHC spark parallelism¶
Demos 2–4 use C/OpenMP for the parallel half. Demo 5 shows that the same
split/combine pattern works for pure Haskell parallelism via GHC sparks.
parCombine replaces combine — it sparks the left partition and evaluates
the right on the current thread, using spark#/seq# with noDuplicate#
for safety:
parPartition rw arrIn arrOut base depth =
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = parPartition rwL arrIn arrL base (depth - 1)
rwR' = parPartition rwR arrIn arrR (base + size arrL) (depth - 1)
in parCombine st rwL' rwR' -- spark left, eval right
Spark scaling (N=1,000,000, 4 capabilities):
| Depth | Partitions | Sequential | Parallel | Speedup |
|---|---|---|---|---|
| 0 | 1 | 41.9 ms | 43.6 ms | 0.96x |
| 1 | 2 | 44.4 ms | 22.3 ms | 1.99x |
| 2 | 4 | 42.5 ms | 13.8 ms | 3.09x |
| 3 | 8 | 42.1 ms | 15.0 ms | 2.80x |
| 5 | 32 | 46.1 ms | 13.5 ms | 3.41x |
| 6 | 64 | 45.7 ms | 13.7 ms | 3.34x |
Near-ideal 2x at 2 partitions, ~3x at 4+ partitions on 4 cores. No C/OpenMP involved — parallelism is GHC's work-stealing scheduler.
Results (Demos 1–3, 4 threads, i7-10750H)¶
Barrier vs linear — iteration loop:
| N | Iters | With barrier | Linear | Saved |
|---|---|---|---|---|
| 10,000 | 1000 | 202.5 ms | 194.2 ms | 4.1% |
| 100,000 | 100 | 231.9 ms | 221.7 ms | 4.4% |
| 1,000,000 | 10 | 224.6 ms | 221.8 ms | 1.3% |
Partition scaling (N=1,000,000) — linear types:
| Partitions | Time (ms) |
|---|---|
| 2 | 39.0 |
| 4 | 39.8 |
| 8 | 38.9 |
| 16 | 38.3 |
| 32 | 38.4 |
Scaling is flat: zero barriers regardless of partition count, because
split/combine is pure arithmetic on offset/length views into the same
underlying buffer.
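The offset/length arithmetic can be sketched as follows (slice_t and halve_slice are illustrative names, not Data.Array.Linear's API):

```c
/* A slice is just an (offset, length) view into a shared buffer.
 * Splitting allocates nothing, and the two halves are adjacent and
 * disjoint by construction — they cover the parent exactly. */
typedef struct { long off, len; } slice_t;

static void halve_slice(slice_t s, slice_t *l, slice_t *r) {
    long mid = s.len / 2;
    l->off = s.off;        l->len = mid;
    r->off = s.off + mid;  r->len = s.len - mid;
}
```

combine is the inverse: forget the two views and recover the parent's (offset, length) — again pure arithmetic, which is why partition count has no effect on runtime.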
Interpretation¶
The modest 1–4% improvement in Demo 3 vs Demo 2 confirms that barrier elimination is not the point — the barrier was already fast. The point is correctness and composability:
-
Safety (Demo 4): The type checker rejects programs that access overlapping regions (Part A) and enforces ordering between passes that share data across partition boundaries (Part B). These are real bugs that
nowaitintroduces silently — linear types catch them at compile time. -
Zero-cost abstraction:
split/combineinvolves no allocation, copying, or runtime checks. The tokens (RW s) are erased at runtime. The generated code is identical to the unsafe version minus the barrier. -
Composable parallelism (Demo 5): The same
split/combinepattern works with GHC sparks (parCombine) for pure Haskell parallelism, achieving near-ideal scaling. The programmer writes the same code regardless of whether parallelism comes from OpenMP (C FFI) or GHC sparks — linear types guarantee safety in both cases.
9. Benchmarks¶
All benchmarks on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (6C/12T), NixOS, GCC 15.2.0, GHC 9.10.3, powersave governor. Best-of-N timing to reduce CPU frequency variance.
9.1. Microbenchmarks¶
Source: bench_overhead.c
Fork/Join Overhead (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.464 | 0.033 | 14.1x faster * |
| 2 | 0.765 | 0.499 | 1.5x faster |
| 4 | 0.931 | 0.945 | 1.02 |
| 8 | 1.461 | 1.692 | 1.16 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Barrier Latency (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.276 | 0.002 | 138.0x faster * |
| 2 | 0.242 | 0.137 | 1.8x faster |
| 4 | 0.248 | 0.27 | 1.09 |
| 8 | 0.434 | 0.47 | 1.08 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Parallel For + Reduction (1M sin(), best of 10, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 15.973 | 15.388 | 0.96 |
| 2 | 7.453 | 7.641 | 1.03 |
| 4 | 3.777 | 3.879 | 1.03 |
| 8 | 3.507 | 3.5 | 1.00 |
Critical Section (1000 lock/unlock per thread, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.021 | 0.026 | 1.24 |
| 2 | 0.069 | 0.264 | 3.83 |
| 4 | 0.352 | 0.327 | 1.1x faster |
| 8 | 0.942 | 1.318 | 1.40 |
9.2. DGEMM¶
Source: bench_dgemm.c, omp_compute.c
Same naive triple-loop DGEMM compiled identically, linked against either native libgomp or our runtime. Checksums match exactly.
4 Threads¶
| N | Native (ms) | RTS (ms) | Ratio | GFLOPS (RTS) |
|---|---|---|---|---|
| 128 | 0.93 | 1.05 | 1.13x | 4.01 |
| 256 | 10.83 | 13.6 | 1.26x | 2.47 |
| 512 | 78.59 | 83.66 | 1.06x | 3.21 |
| 1024 | 670.05 | 654.66 | 0.98x | 3.28 |
Interleaved re-runs confirm the two runtimes trade leads: the difference is CPU frequency noise, not runtime overhead.
Scaling (RTS-backed, DGEMM 1024x1024)¶
| Threads | Time (ms) | GFLOPS | Speedup |
|---|---|---|---|
| 1 | 2965.94 | 0.72 | 1.0x |
| 2 | 1880.42 | 1.14 | 1.6x |
| 4 | 654.66 | 3.28 | 4.5x |
9.3. FFI Scaling¶
Source: HsMain.hs
Haskell calling parallel sinsum via safe FFI:
| Threads | Time (ms) | Speedup |
|---|---|---|
| 1 | 32.5 | 1.0x |
| 2 | 17.3 | 1.9x |
| 4 | 9.9 | 3.3x |
| 8 | 5.0 | 6.5x |
Near-linear scaling through the FFI boundary, confirming the runtime correctly parallelizes work dispatched from Haskell.
9.4. Parallelism Crossover¶
Source: HsCrossover.hs
When does OpenMP from Haskell beat sequential C? We measured sinsum (compute-bound, ~11ns per element) at various sizes with 4 threads.
| Elements | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 0.7 us | 2.8 us | 0.26x |
| 200 | 3.1 us | 2.9 us | 1.08x |
| 500 | 8.8 us | 5.8 us | 1.50x |
| 1000 | 10.2 us | 5.2 us | 1.96x |
| 5000 | 64.6 us | 21.7 us | 2.98x |
| 100000 | 1528.7 us | 385.5 us | 3.97x |
Parity is reached between 100 and 200 elements, and the speedup becomes robust (1.5x and up) from ~500 elements — above that, OpenMP parallel execution from Haskell beats sequential C called via unsafe FFI. The fixed overhead is ~1.8us (86ns safe FFI + 1712ns OpenMP fork/join).
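A back-of-envelope model reproduces the table: assuming ~11 ns/element of sequential work, ~1.8 us of fixed fork/join overhead, and ideal 4-way speedup, break-even falls near 220 elements — between the measured 100-element row (slower) and 200-element row (already faster). The function name is illustrative:

```c
/* Break-even problem size: parallel wins once the per-element time saved
 * by using `threads` cores, n * ns_per_elem * (1 - 1/threads), exceeds
 * the fixed fork/join overhead. */
static double crossover_elements(double ns_per_elem, double overhead_ns,
                                 int threads) {
    double saved_per_elem = ns_per_elem * (1.0 - 1.0 / (double)threads);
    return overhead_ns / saved_per_elem;
}
```

With the section's numbers: 1800 / (11 x 0.75) ≈ 218 elements.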
9.5. GHC Native Parallelism vs OpenMP¶
Source: HsParCompare.hs
For the same compute-bound sinsum workload, how does Haskell's forkIO with
manual work splitting compare to OpenMP via safe FFI?
GHC NCG (default code generator)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 327.0 us | 148.0 us | 107.3 us | 42.9 us | 2.50x |
| 100K | 3283.6 us | 1549.8 us | 881.1 us | 375.8 us | 2.34x |
| 1M | 31922.8 us | 15417.0 us | 8637.3 us | 6626.1 us | 1.30x |
| 10M | 336020.5 us | 153998.0 us | 85324.7 us | 40187.1 us | 2.12x |
With the default NCG backend, OpenMP is consistently ~2x faster than parallel Haskell. Sequential Haskell is also ~2.2x slower than sequential C. The gap comes entirely from per-element code quality, not parallelism overhead — both achieve near-ideal scaling on 4 threads.
GHC LLVM backend (-fllvm)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 104 us | 103 us | 43 us | 28 us | 1.5x |
| 100K | 1109 us | 1101 us | 303 us | 278 us | 1.1x |
| 1M | 11156 us | 10998 us | 2972 us | 2866 us | 1.04x |
| 10M | 111751 us | 111112 us | 30924 us | 28747 us | 1.08x |
Compiling with -fllvm (LLVM 20.1) eliminates the gap entirely. Sequential
Haskell matches sequential C, and parallel Haskell reaches parity with OpenMP.
The 2x gap under NCG is purely a code generator limitation — GHC's inner loop
compiles to 17 instructions vs GCC/LLVM's 10. See
Appendix A.3
for the full assembly analysis.
9.6. Task Execution¶
Source: HsTaskDemo.hs, test_omp_tasks.c
Deferred task execution with work-stealing barriers (4 threads, best of 5):
| Tasks | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 1.5 ms | 0.4 ms | 3.81x |
| 500 | 7.7 ms | 2.5 ms | 3.12x |
| 1,000 | 15.6 ms | 5.0 ms | 3.12x |
| 5,000 | 78.1 ms | 19.9 ms | 3.93x |
| 10,000 | 155.9 ms | 42.6 ms | 3.66x |
Near-linear scaling (3.1-3.9x on 4 threads). Correctness verified against sequential reference with exact match.
9.7. Calling Convention Overhead¶
Source: HsCmmDemo.hs, omp_prims.cmm
| Convention | ns/call | Relative | Mechanism |
|---|---|---|---|
| foreign import prim (Cmm) | ~0 | — | Direct register read, GHC optimizes away |
| foreign import ccall unsafe | 3.1 | — | Save/restore STG registers |
| foreign import ccall safe | 89.8 | 29x vs unsafe | + suspendThread/resumeThread |
9.8. Batched Calls¶
Source: HsCmmBatch.hs, omp_batch.cmm
Amortizing the ~68ns safe FFI overhead by batching N C calls within a single
suspendThread/resumeThread cycle:
| Batch size | Standard safe | Cmm batched | Speedup |
|---|---|---|---|
| 1 | 98.3 ns | 97.9 ns | 1.0x |
| 2 | 100.0 ns | 50.3 ns | 2.0x |
| 5 | 97.3 ns | 20.8 ns | 4.7x |
| 10 | 97.4 ns | 12.4 ns | 7.8x |
| 20 | 102.6 ns | 7.6 ns | 13.4x |
| 50 | 104.6 ns | 4.7 ns | 22.2x |
| 100 | 97.8 ns | 3.7 ns | 26.4x |
At batch=100, per-call cost drops to 3.7 ns, approaching the ~2 ns cost of an
unsafe FFI call. The results follow the amortization model (68 + N × 2) / N:
the fixed suspend/resume cost is paid once per batch and split across N calls.
10. Implementation Timeline¶
The runtime was developed in 21 phases. Each phase is summarized below with a reference to the section containing full details.
| Phase | Description | Section |
|---|---|---|
| 1 | Stub pthread-based runtime validating GCC GOMP ABI compatibility | §4 |
| 2 | Replace pthreads with GHC Capabilities; hybrid C shim approach | §4 |
| 3 | Lock-free optimization: atomic generation counter, sense-reversing barriers | §5 |
| 4 | Haskell FFI interop via `foreign import ccall safe` | §6.1 |
| 5 | Concurrent Haskell green threads + OpenMP parallel regions | §6.3 |
| 6 | GC interaction testing — minimal impact on OpenMP latency | §6.4 |
| 7 | Dense matrix multiply (DGEMM) workload | §9.2 |
| 8 | Head-to-head comparison with native libgomp — performance parity | §9.2 |
| 9 | Bidirectional interop: OpenMP workers call Haskell via FunPtr | §6.5 |
| 10 | Cmm primitives via `foreign import prim` — zero-overhead FFI | §7.1 |
| 11 | `inline-cmm` quasiquoter integration | §7.1 |
| 12 | Batched safe calls amortizing 68ns FFI overhead | §7.2 |
| 13 | Parallelism crossover analysis — break-even at ~500 elements | §9.4 |
| 14 | GHC native parallelism vs OpenMP — parity with `-fllvm` | §9.5 |
| 15 | Deferred task execution with work-stealing barriers | §4.3 |
| 16 | Zero-copy FFI with pinned ByteArray — 19% inner loop speedup | §A.6 |
| 17 | Linear typed arrays for type-safe disjoint partitioning | §A.7 |
| 18 | Runtime improvements: guided scheduling, hybrid barriers, task pools | §4 |
| 19 | Shared memory demos: producer-consumer, synchronized, linear | §8 |
| 20 | Safety demos: overlap bugs, stencil ordering, linear type prevention | §8 |
| 21 | GHC spark parallelism via `parCombine` with `spark#`/`noDuplicate#` | §8 |
11. Notable Bugs and Fixes¶
11.1. Barrier Sense Mismatch Deadlock¶
Symptom: Program hangs when calling GOMP_parallel
from a forkIO thread at -N4. No output at all. Works
at -N1.
Root cause: Workers' local barrier sense variables
(start_sense, end_sense) persisted from previous
parallel regions (value 1), but spin_barrier_init() reset the
barrier's global sense to 0. On the next region:
- Workers flipped their local sense 1→0, saw `sense(0) == local_sense(0)`, and passed through immediately
- The master (on a new OS thread from `forkIO`) had a fresh local sense of 0, flipped it to 1, and could never complete the barrier
Fix: Reset all local sense variables to 0 at the start of each parallel region, matching the freshly initialized barriers.
11.2. False Parallel-For Regression¶
Symptom: At 4 threads, parallel for appeared 1.65x slower than native libgomp (6.7ms vs 4.1ms).
Root cause: Single-sample measurement on a laptop with
powersave CPU governor (i7-10750H at 46% clock). CPU boost state
varied between process invocations.
Fix: Changed to best-of-10 within each process. Controlled interleaved testing confirmed parity (3.85ms vs 3.91ms).
12. Limitations¶
| Limitation | Impact | Notes |
|---|---|---|
| Serialized nesting only | Low | Inner parallel regions execute with 1 thread. True nested parallelism (multiple active levels) is not supported. |
| Single global team | Low | No support for different thread counts in nested teams. |
| No target offloading | None | Not applicable to this project's scope. |
| No doacross loops | Low | GOMP_doacross_* not implemented. |
13. Related Work¶
BOLT (bolt-omp.org, Best Paper PACT '19) is the closest analogue to this project. BOLT is a full OpenMP runtime built on Argobots, a lightweight user-level threading library from Argonne National Laboratory. Where libgomp maps OpenMP threads to pthreads, BOLT maps them to Argobots user-level threads (ULTs) scheduled on execution streams (ES) — achieving efficient nested parallelism and fine-grained tasking that pthreads cannot.
The architectural parallel is direct:
| Concept | BOLT / Argobots | ghc-openmp / GHC RTS |
|---|---|---|
| OS-thread abstraction | Execution Stream (ES) | Capability |
| Lightweight work unit | ULT / Tasklet | Haskell green thread |
| OpenMP thread mapping | ULT on ES | OS thread pinned to Capability |
| Scheduler | Pluggable per-pool | GHC spark pool + spin-wait workers |
| Work stealing | Built-in | Phase 15 deferred tasks |
The key difference is motivation: BOLT starts from a purpose-built threading substrate (Argobots) designed for composing HPC runtimes (MPI + OpenMP + task libraries). ghc-openmp repurposes an existing language runtime that already provides green threads, garbage collection, and an FFI — trading Argobots' generality for seamless Haskell interoperation.
14. Conclusions¶
GHC's Runtime System can serve as a fully functional OpenMP runtime with zero measurable overhead compared to native libgomp. The implementation is a single ~1300-line C file using only public GHC RTS APIs — no GHC fork required.
The key architectural insights are:
- Capabilities as thread IDs: `cap->no` maps directly to `omp_get_thread_num()`.
- Workers without Capabilities: After RTS registration, worker threads release their Capabilities. They execute C code as plain OS threads, invisible to the GC.
- Reference-counted init: `hs_init_ghc()` is idempotent, enabling transparent use from both C and Haskell hosts.
- Lock-free synchronization is essential: The naive mutex+condvar implementation was 20–25x slower. Sense-reversing barriers and atomic generation counters brought it to parity.
- Bidirectional FFI works: OpenMP workers call Haskell functions via `FunPtr` with ~0.5 µs overhead per invocation (automatic `rts_lock`/`rts_unlock`), making it practical for coarse-grained callbacks.
This demonstrates that language runtimes can share threading infrastructure across FFI boundaries. A Haskell program can call OpenMP C code, with both sharing the same thread pool, the same CPU cores, and coexisting with GHC's garbage collector.
Beyond performance parity, unifying the runtimes enables a new programming model: Haskell and C code operating on the same data with type-safe guarantees. Linear tokens prove disjoint access at compile time, eliminating defensive synchronization. The type checker becomes a concurrency tool — data races are compile errors, not runtime surprises.
Appendix A.1: Implemented ABI Surface¶
All implementations in ghc_omp_runtime_rts.c
Core Parallel:
GOMP_parallel, GOMP_parallel_start, GOMP_parallel_end, GOMP_barrier
Synchronization:
GOMP_critical_start, GOMP_critical_end, GOMP_critical_name_start, GOMP_critical_name_end, GOMP_atomic_start, GOMP_atomic_end, GOMP_single_start, GOMP_single_copy_start, GOMP_single_copy_end, GOMP_ordered_start, GOMP_ordered_end
Worksharing Loops:
GOMP_loop_static_start, GOMP_loop_static_next, GOMP_loop_dynamic_start, GOMP_loop_dynamic_next, GOMP_loop_guided_start, GOMP_loop_guided_next, GOMP_loop_runtime_start, GOMP_loop_runtime_next, GOMP_loop_start, GOMP_loop_end, GOMP_loop_end_nowait, GOMP_loop_nonmonotonic_dynamic_start, GOMP_loop_nonmonotonic_dynamic_next, GOMP_loop_nonmonotonic_guided_start, GOMP_loop_nonmonotonic_guided_next, GOMP_parallel_loop_static, GOMP_parallel_loop_dynamic, GOMP_parallel_loop_guided, GOMP_parallel_loop_runtime, GOMP_parallel_loop_nonmonotonic_dynamic, GOMP_parallel_loop_nonmonotonic_guided
Tasks:
GOMP_task, GOMP_taskwait, GOMP_taskyield, GOMP_taskgroup_start, GOMP_taskgroup_end
Sections:
GOMP_sections_start, GOMP_sections_next, GOMP_sections_end, GOMP_sections_end_nowait, GOMP_parallel_sections
Cancellation & Teams:
GOMP_cancel, GOMP_cancellation_point, GOMP_barrier_cancel, GOMP_loop_end_cancel, GOMP_sections_end_cancel, GOMP_teams_reg
omp_* User API:
omp_get_num_threads, omp_get_thread_num, omp_get_max_threads, omp_get_num_procs, omp_set_num_threads, omp_in_parallel, omp_set_dynamic, omp_get_dynamic, omp_set_nested, omp_get_nested, omp_get_wtime, omp_get_wtick, omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock, omp_test_lock, omp_init_nest_lock, omp_destroy_nest_lock, omp_set_nest_lock, omp_unset_nest_lock, omp_test_nest_lock, omp_get_level, omp_get_active_level, omp_get_ancestor_thread_num, omp_get_team_size, omp_get_thread_limit, omp_set_max_active_levels, omp_get_max_active_levels, omp_get_supported_active_levels, omp_set_schedule, omp_get_schedule, omp_in_final, omp_get_cancellation, omp_get_proc_bind, omp_get_num_places, omp_get_place_num, omp_get_default_device, omp_set_default_device, omp_get_num_devices, omp_get_num_teams, omp_get_team_num, omp_is_initial_device, omp_get_initial_device, omp_get_max_task_priority
Appendix A.2: GOMP ABI Primer¶
GCC does not interpret OpenMP pragmas at runtime. Instead, the compiler
transforms each pragma into calls to GOMP_* functions at compile time.
Any library that exports these symbols can serve as the OpenMP runtime.
Outlined Functions¶
GCC extracts the body of a parallel region into a separate outlined
function. The original code is replaced by a call to GOMP_parallel:
// Source code:
#pragma omp parallel
{
do_work(shared_data);
}
// GCC transforms this to:
static void outlined_fn(void *data) {
struct shared *d = data;
do_work(d->shared_data);
}
GOMP_parallel(outlined_fn, &shared_data, num_threads, flags);
GOMP_parallel calls outlined_fn on the master thread and dispatches
it to worker threads. An implicit barrier at the end ensures all threads
complete before the master returns.
Worksharing Loops¶
#pragma omp parallel for combines a parallel region with work
distribution. GCC transforms the loop into a protocol:
// Source code:
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
a[i] = compute(i);
// Each thread executes:
long start, end;
if (GOMP_loop_static_start(0, n, 1, chunk, &start, &end)) {
do {
for (long i = start; i < end; i++)
a[i] = compute(i);
} while (GOMP_loop_static_next(&start, &end));
}
GOMP_loop_end();
The _start function assigns the first chunk to the calling thread.
_next returns subsequent chunks until the iteration space is exhausted.
_end synchronizes with an implicit barrier.
For dynamic and guided scheduling, the same protocol applies with
GOMP_loop_dynamic_start/_next or GOMP_loop_guided_start/_next.
The runtime decides chunk sizes: static divides evenly, dynamic uses
fixed chunks, guided uses exponentially decreasing chunks.
Reductions¶
OpenMP reductions use thread-local accumulators combined at the barrier:
// Source code:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
sum += a[i];
// Each thread gets a local copy of sum (initialized to 0.0).
// At the barrier, all local copies are combined with +.
GCC generates the local copies and the combining code. The runtime provides the barrier; the reduction logic is entirely compiler-generated.
Critical Sections and Atomics¶
GOMP_critical_start(); // acquire global mutex
shared_counter++;
GOMP_critical_end(); // release global mutex
GOMP_critical_name_start(&named_lock); // per-name mutex
named_resource++;
GOMP_critical_name_end(&named_lock);
Unnamed critical sections share a single global mutex. Named critical sections use per-name mutexes, allowing independent critical regions to execute concurrently.
Tasks¶
`GOMP_task` supports deferred execution. When `if_clause` is true and threads
are available, the runtime copies the task data (via `cpyfn` or `memcpy`) to
the heap and enqueues it; idle threads steal tasks from other threads' queues.
When `if_clause` is false, the task executes immediately (inlined).
`GOMP_taskwait` blocks until all child tasks complete.
Appendix A.3: NCG vs LLVM Code Generation¶
This appendix details why GHC's default native code generator (NCG) produces
a ~2x slower inner loop than GCC for sin()-heavy workloads, and how the
LLVM backend eliminates the gap. See Section 9.5
for the benchmark results.
What the Haskell code compiles to¶
GHC -O2 already fully unboxes the inner loop (sinDouble#, +##, *##
on Double#), and GCC does not vectorize sin() calls. Neither boxing nor
SIMD explains the gap — it is purely a code generator quality issue.
NCG: 17 instructions per iteration¶
cvtsi2sdq %r14, %xmm1 # i -> double
mulsd .Ln3Sj(%rip), %xmm1 # * 0.001
subq $8, %rsp # stack adjust (every iter!)
movsd %xmm0, %xmm2 # shuffle acc
movsd %xmm1, %xmm0 # move arg for sin()
movl $1, %eax # varargs ABI marker
movsd %xmm2, 72(%rsp) # spill acc
movq %rsi, %rbx # save STG register
call sin
addq $8, %rsp # restore stack
movsd 64(%rsp), %xmm1 # reload acc
addsd %xmm0, %xmm1 # acc += sin(...)
incq %r14 # i++
movsd %xmm1, %xmm0 # shuffle back
movq %rbx, %rsi # restore STG register
cmpq %rsi, %r14 # i < hi?
jl loop
The extra instructions come from:
- STG register save/restore (
movq %rsi, %rbx/movq %rbx, %rsi): NCG saves R2 around everysin()call, inside the loop - Per-iteration stack adjustment (
subq $8/addq $8): NCG adjusts the stack frame every iteration instead of once at function entry - Register shuffles (
movsd %xmm0, %xmm2,movsd %xmm1, %xmm0,movsd %xmm1, %xmm0): NCG's linear register allocator produces unnecessary moves - Varargs ABI marker (
movl $1, %eax): required by x86-64 SysV ABI for variadic functions, but GCC elides it when it knows the callee
LLVM: 10 instructions per iteration¶
movsd %xmm1, 16(%rsp) # spill acc (same as GCC)
xorps %xmm0, %xmm0
cvtsi2sd %r14, %xmm0 # i -> double
mulsd .LCPI20_0(%rip), %xmm0 # * 0.001
callq sin@PLT
movsd 16(%rsp), %xmm1 # reload acc
addsd %xmm0, %xmm1 # acc += sin(...)
incq %r14 # i++
cmpq %r14, %r15 # i < hi?
jne loop
LLVM hoists the STG register save/restore outside the loop, allocates the stack frame once, and eliminates all redundant shuffles. The resulting loop matches GCC instruction-for-instruction.
Summary¶
The 2x NCG gap is not a fundamental Haskell overhead. It is an artifact of GHC's native code generator producing suboptimal machine code for tight loops with C calls. The LLVM backend produces identical code quality to GCC, achieving parity on both sequential and parallel benchmarks.
Environment: GHC 9.10.3, NCG vs -fllvm with LLVM 20.1, GCC 15.2.0.
Appendix A.4: GHC RTS Internals¶
The GHC Runtime System is the execution environment for compiled Haskell programs. This appendix covers the concepts needed to understand how our OpenMP runtime integrates with it.
Capabilities¶
A Capability is GHC's fundamental execution unit. Each Capability consists of:
- One OS thread (the owner)
- One run queue of lightweight Haskell threads (TSOs)
- One spark pool for speculative parallelism (
par) - A private allocation area for the generational GC
The number of Capabilities is set by +RTS -N4 (4 Capabilities).
Each has a 0-indexed number (cap->no) that we map directly to
omp_get_thread_num().
TSOs (Thread State Objects)¶
A TSO represents a lightweight Haskell thread — what forkIO
creates. TSOs are much cheaper than OS threads (~1KB vs ~8MB stack).
Thousands of TSOs can be multiplexed onto a single Capability. The
Capability's scheduler picks TSOs from the run queue and executes them
in round-robin fashion, yielding on allocation (every ~4KB allocated).
Scheduler Loop¶
Each Capability runs a loop:
loop:
tso = pick from run queue (or steal a spark)
run tso until it yields/blocks/finishes
if tso blocked: move to blocked queue
if tso yielded: put back on run queue
goto loop
When a Capability has no work, it can steal sparks from other Capabilities or go idle. This is the same work-stealing mechanism that our OpenMP task implementation builds on.
RTS API for Embedding¶
These functions allow C code to interact with the RTS:
| Function | Purpose |
|---|---|
| `hs_init_ghc(&argc, &argv, conf)` | Boot the RTS. Reference-counted: safe to call when it is already running. |
| `rts_lock()` | Acquire a Capability. Returns a `Capability*`. Blocks until one is available. |
| `rts_unlock(cap)` | Release a Capability, making it available to Haskell threads or other callers. |
| `rts_setInCallCapability(i, 1)` | Pin the calling OS thread to Capability `i`. Subsequent `rts_lock()` calls always get Capability `i`. |
| `getNumCapabilities()` | Return the current number of Capabilities. |
| `getNumberOfProcessors()` | Return the CPU count. |
Our runtime uses these to create workers pinned to specific Capabilities.
After the initial rts_lock()/rts_unlock() registration, workers release
their Capabilities and become plain OS threads.
Safe vs Unsafe FFI¶
GHC provides two FFI calling conventions with different trade-offs:
Unsafe (foreign import ccall unsafe): The Haskell thread keeps
holding its Capability during the C call. Fast (~2ns overhead), but
blocks all other Haskell threads on that Capability. Suitable for
short, non-blocking C functions.
Safe (foreign import ccall safe): The Haskell thread releases
its Capability before calling C, and reacquires it on return. Slower
(~68ns overhead) but allows other Haskell threads to run. Required for
C functions that may block or run for a long time.
Internally, safe FFI calls suspendThread() (release Capability, return
a token) before the C function, and resumeThread(token) (reacquire
Capability) after. This is the mechanism our batched calls exploit
(Section 7.2).
Garbage Collection¶
GHC uses a stop-the-world generational garbage collector. When a GC is triggered:
- All Capabilities are synchronized (each thread reaches a safe point)
- GC runs, scanning all Capability-local allocation areas
- Capabilities are released and threads resume
Critically, GC only synchronizes threads that hold Capabilities. Our OpenMP workers do not hold Capabilities during parallel execution — they are invisible to the GC. This is why OpenMP compute kernels are not paused by Haskell garbage collection (Section 6.4).
STG Machine Registers¶
GHC compiles Haskell to STG (Spineless Tagless G-machine) code, which uses a set of virtual registers mapped to hardware registers:
| Register | x86-64 | Purpose |
|---|---|---|
| BaseReg | `%r13` | Pointer to the current Capability |
| Sp | `%rbp` | STG stack pointer |
| Hp | `%r12` | Heap allocation pointer |
| R1 | `%rbx` | First argument / return value |
| R2–R6 | `%r14`, `%rsi`, `%rdi`, `%r8`, `%r9` | Arguments |
| SpLim | `%r15` | Stack limit |
These registers are caller-saved with respect to C calls. Every
foreign import ccall must save them before and restore them after the
C function. This is the source of the NCG overhead analyzed in
Appendix A.3: the NCG
saves/restores these registers inside the loop, while the LLVM backend
hoists them outside.
Environment: NixOS, GHC 9.10.3, GCC 15.2.0, Intel i7-10750H (6C/12T). Source code: ghc-openmp repository. February 2026.
Appendix A.5: Sense-Reversing Barrier¶
Source: spin_barrier_wait, spin_barrier_wait_tasks
The runtime's barrier is a centralized sense-reversing barrier from Mellor-Crummey & Scott ("Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS 9(1), 1991).
Data Structure¶
typedef struct {
atomic_int count; /* threads remaining (decremented by arrivals) */
atomic_int sense; /* global sense flag (flipped by last arrival) */
int size; /* team size (reset value for count) */
} spin_barrier_t;
Each thread also maintains a thread-local int local_sense, initially 0.
Algorithm¶
- Arrive: The thread flips its `local_sense` (`1 - local_sense`), then atomically decrements `count` with `memory_order_acq_rel`.
- Last thread (decrement returns 1): Resets `count` to `size` (relaxed store), then flips the global `sense` to match `local_sense` (release store). This releases all waiting threads.
- Other threads: Spin-wait until the global `sense` matches their `local_sense` (acquire load).
The sense-reversing trick avoids the resetting race in naive barriers: because each phase uses the opposite sense value, threads that haven't yet left a barrier cannot be confused with threads entering the next one.
Hybrid Spin-Wait¶
Spinning uses a three-tier strategy controlled by g_spin_iters
(configurable via OMP_WAIT_POLICY):
- Spin with `pause`: For the first `g_spin_iters` iterations (~4000 by default), execute `_mm_pause()` to reduce pipeline contention and save power on x86.
- `sched_yield()`: After the spin threshold, yield the CPU to other threads. This avoids wasting cycles during longer waits.
- Condvar fallback: The worker pool's generation-counter wait (not the barrier itself) uses `pthread_cond_wait` for idle periods between parallel regions.
The OMP_WAIT_POLICY environment variable controls aggressiveness:
active sets 10000 spin iterations, passive sets 100.
Task-Stealing Variant¶
spin_barrier_wait_tasks
extends the barrier with work stealing during the spin-wait phase:
- While waiting: If `g_tasks_pending > 0`, attempt to steal and execute a task from another thread's queue. Reset the spin counter after productive work.
- Last thread: Before releasing the barrier, drain all remaining tasks (spin on `g_tasks_pending` until zero). This ensures no tasks are lost when GCC omits explicit `GOMP_barrier` calls after `#pragma omp single`.
The g_tasks_pending fast-path check (acquire load of a single atomic)
avoids the cost of scanning per-thread queues when no tasks exist. This
is critical for the common case where barriers are used without tasks.
Appendix A.6: Zero-Copy FFI with Pinned ByteArray¶
Source: HsZeroCopy.hs
The standard FFI pattern (allocaArray + peekElemOff/pokeElemOff) boxes
every element as CDouble and converts via realToFrac, adding overhead at
the Haskell↔OpenMP boundary:
-- Boxed: every element goes through CDouble
a <- peekElemOff pA (i * n + k) -- returns boxed CDouble
b <- peekElemOff pB (k * n + j) -- returns boxed CDouble
go (acc + realToFrac a * realToFrac b) (k + 1) -- 4 box/unbox ops
Using pinned ByteArray# with unboxed primops eliminates this overhead:
-- Unboxed: Double# throughout, no boxing
case readDoubleArray# mbaA (i *# n +# k) s of
(# s', a #) -> -- a :: Double# (unboxed)
case readDoubleArray# mbaB (k *# n +# j) s' of
(# s'', b #) -> -- b :: Double# (unboxed)
goK s'' i j (k +# 1#) (acc +## (a *## b))
The pinned ByteArray is passed to C via mutableByteArrayContents#, which
returns a raw Addr# — zero-copy, no marshalling. touch# keeps the
ByteArray alive during the C call.
Benchmark¶
Haskell sequential DGEMM inner loop, pinned ByteArray with unboxed primops vs standard boxed FFI:
| N | Boxed (ms) | Unboxed (ms) | Speedup |
|---|---|---|---|
| 256 | 57.3 | 53.4 | 1.07x |
| 512 | 457.9 | 384.6 | 1.19x |
The 19% improvement at N=512 comes from eliminating CDouble boxing in the
O(n³) inner loop. -ddump-simpl confirms the hot loop uses +##, *##, and
readDoubleArray# with no D# constructor.
Appendix A.7: Linear Typed Arrays¶
Source: Data/Array/Linear.hs, HsLinearDemo.hs
GHC's -XLinearTypes extension can enforce exclusive ownership
of mutable array regions at compile time. The design is inspired by
konn/linear-extra's Borrowable
SArray, which uses phantom-tagged tokens for zero-copy split/combine. A
self-contained ~200-line module (Data.Array.Linear) extracts the core pattern
and integrates it with the unboxed primops from
Appendix A.6
(readDoubleArray#/writeDoubleArray# instead of Storable).
The core idea: linearity is on tokens (RW s), not the array itself. You
need the right token to read/write, and split/combine tracks disjoint
ownership at the type level.
Design¶
-- Linear token: proves exclusive access to region s
data RW s where MkRW :: RW s
-- Array with phantom region (NOT linear — tokens enforce access)
data DArray s = DArray !Int !(MutableByteArray# RealWorld) !Int -- len, buf, offset
-- Split witness: proves l and r came from splitting s
data SlicesTo s l r where MkSlicesTo :: SlicesTo s l r
-- Operations consume and return tokens
unsafeRead :: RW s %1 -> DArray s -> Int -> (Double, RW s)
unsafeWrite :: RW s %1 -> DArray s -> Int -> Double -> RW s
split :: RW s %1 -> Int -> DArray s -> Slice s
combine :: SlicesTo s l r %1 -> RW l %1 -> RW r %1 -> RW s
The split operation is zero-copy — both halves share the same underlying
MutableByteArray#, just with different offset/length views. No allocation,
no copying, just arithmetic on the offset field.
Type-Safe Row-Partitioned DGEMM¶
case split rwC (half * n) arrC of
MkSlice st rwTop rwBot cTop cBot ->
let rwTop' = linearDgemm rwTop n 0 half arrA arrB cTop
rwBot' = linearDgemm rwBot n half half arrA arrB cBot
rwC' = combine st rwTop' rwBot'
in unsafeRead rwC' arrC 0 -- can only read after recombining
This is the Haskell-side type encoding of what OpenMP does at runtime with
#pragma omp parallel for schedule(static) — partitioning the output matrix
into disjoint row blocks. The type system statically guarantees:
- No two computations can write to the same rows
- The original array cannot be accessed while split
- All slices must be recombined before reading results
C FFI Integration¶
The `unsafeWithPtr` function passes pinned `ByteArray` data directly to C. It
threads through `IO` properly (unlike `runRW#`-based approaches that create
independent state threads), ensuring C side effects are visible to subsequent
Haskell reads.
Implementation Notes¶
- `unsafeDupablePerformIO` over `runRW#`: Element access uses `unsafeDupablePerformIO` with `{-# NOINLINE #-}` to ensure writes from one operation are visible to subsequent reads. `runRW#` creates independent state threads that GHC can reorder under `-O2`.
- Bang patterns for sequencing: In lazy contexts, token-producing operations must be forced with `!` to ensure their side effects execute.
- `parCombine` for GHC sparks: A parallel variant of `combine` that sparks the left token for parallel evaluation using `spark#`/`seq#`. Uses `unsafePerformIO` (which includes `noDuplicate#`) to prevent thunk duplication — essential when tokens are produced by destructive array operations.
- `unsafeCoerce#` bridges linear types with non-linear GHC primitives (the same approach as konn/linear-extra's `Unsafe.toLinear`).
- Self-contained: ~280 lines, no dependencies beyond `base` and `ghc-prim`.