ghc-openmp: GHC's Runtime System as an OpenMP Runtime¶
An OpenMP runtime backed by GHC RTS Capabilities
GitHub · API Reference (Haddock)
Contents¶
- Abstract
- Motivation
- Background
- Architecture
- Optimization: From 24x Slower to Parity
- Haskell Integration
- Low-Level Techniques
- Shared Memory Demos
- Benchmarks
- Implementation Timeline
- Notable Bugs and Fixes
- Limitations
- Related Work
- Conclusions
- Appendix A.1: Implemented ABI Surface
- Appendix A.2: GOMP ABI Primer
- Appendix A.3: NCG vs LLVM Code Generation
- Appendix A.4: GHC RTS Internals
- Appendix A.5: Sense-Reversing Barrier
- Appendix A.6: Zero-Copy FFI with Pinned ByteArray
- Appendix A.7: Linear Typed Arrays
1. Abstract¶
We implement a drop-in OpenMP runtime library that uses GHC's Runtime System
as its threading infrastructure. Standard C code compiled with
gcc -fopenmp runs on GHC Capabilities instead of libgomp's
pthreads. The runtime implements the GCC GOMP_* ABI and the
omp_* user API, supporting parallel regions, worksharing loops,
barriers, critical sections, tasks, and sections.
After lock-free optimization, the runtime achieves performance parity
with native libgomp on both microbenchmarks and real numerical workloads
(dense matrix multiplication). Haskell programs call OpenMP-parallelized C code
via FFI, with both runtimes sharing the same thread pool. OpenMP workers
call back into Haskell via FunPtr with automatic Capability
acquisition. GHC's stop-the-world garbage collector does not pause OpenMP
workers because they do not hold Capabilities. The culmination is
type-safe shared memory: using GHC's linear types, Haskell and OpenMP C
code operate on disjoint regions of the same array with compile-time
proof of safety and zero synchronization overhead.
2. Motivation¶
GHC's RTS has a mature, production-quality thread pool with per-Capability run queues, work-stealing spark pools, NUMA awareness, and sophisticated scheduling. OpenMP runtimes (libgomp, libomp) maintain their own, separate thread pools. When a Haskell program calls OpenMP-annotated C code via FFI, two independent thread pools compete for the same CPU cores.
If the OpenMP runtime used GHC's thread pool directly, we would get:
- Unified resource management — one thread pool, not two
- Seamless interop — Haskell green threads and OpenMP parallel regions coexist naturally
- GHC as a platform — C programs benefit from GHC's scheduler, and Haskell programs get access to OpenMP's parallel-for without reinventing it
- Type-safe shared memory — linear types can prove disjoint access at compile time, eliminating synchronization barriers when Haskell and C share data
This project investigates whether this is feasible and what the performance cost is.
3. Background¶
3.1. GHC RTS Capabilities¶
A Capability is GHC's central execution unit: one OS thread, one
run queue of lightweight Haskell threads (TSOs), and one work-stealing spark
pool. The number of Capabilities is set by +RTS -N. Each
Capability has a 0-indexed number (cap->no) that maps directly to
OpenMP's omp_get_thread_num().
Key RTS APIs for embedding:
hs_init_ghc(&argc, &argv, conf); // Boot the RTS
Capability *cap = rts_lock(); // Acquire a Capability
rts_unlock(cap); // Release it
rts_setInCallCapability(i, 1); // Pin OS thread to Capability i
uint32_t getNumCapabilities(void); // Current Capability count
uint32_t getNumberOfProcessors(void); // CPU count
hs_init_ghc() is reference-counted: calling it when the RTS is
already running (as in a Haskell host program) simply increments the counter
and returns. This is the key to transparent interop — our runtime
auto-detects whether it is being hosted by a C program or a Haskell program.
3.2. The libgomp ABI¶
Source: ghc_omp_runtime_rts.c
GCC transforms OpenMP pragmas into calls to GOMP_* functions.
For example:
#pragma omp parallel
{ body; }
// becomes:
void outlined_fn(void *data) { body; }
GOMP_parallel(outlined_fn, &data, num_threads, flags);
A minimum viable runtime needs only 9 symbols (GOMP_parallel,
GOMP_barrier, GOMP_critical_start/end,
GOMP_single_start, GOMP_task,
GOMP_taskwait, omp_get_num_threads,
omp_get_thread_num). Full OpenMP 4.5 coverage requires ~85
symbols. Our implementation provides ~97.
3.3. Cmm and foreign import prim¶
Source: omp_prims.cmm
Cmm (C minus minus) is GHC's low-level intermediate representation — a portable assembly language that sits between STG and native code. GHC compiles all Haskell to Cmm before generating machine code.
GHC provides three FFI calling conventions with different overhead:
| Convention | Mechanism | Overhead |
|---|---|---|
| foreign import ccall safe | Releases Capability, calls C, reacquires | ~68 ns |
| foreign import ccall unsafe | Saves STG registers, calls C, restores | ~2 ns |
| foreign import prim | Direct STG register passing, no boundary | ~0 ns |
The prim convention is the fastest: arguments pass directly in GHC's STG
registers (R1, R2, ...) with no calling convention switch. Functions written
in Cmm can access RTS internals like MyCapability() directly. GHC treats
prim calls as pure expressions and can optimize them away entirely
(loop-invariant code motion, common subexpression elimination).
The inline-cmm library lets you
embed Cmm code directly in Haskell modules via a [cmm| ... |] quasiquoter
(similar to inline-c for C code). It automatically generates the
foreign import prim declaration and compiles the Cmm via Template Haskell.
4. Architecture¶
Source: ghc_omp_runtime_rts.c
Figure 1: Runtime architecture. Workers are plain OS threads pinned to
GHC Capabilities. After rts_lock()/rts_unlock() init, they do NOT
hold Capabilities — invisible to GC.
Design decision: We chose a hybrid approach — a C shim calling GHC RTS APIs — over modifying GHC's RTS source directly or using
foreign export. This keeps the runtime as a single .c file with no GHC fork required.
4.1. Worker Pool Design¶
N-1 worker threads are created at initialization. Each is pinned to a GHC
Capability via rts_setInCallCapability(i, 1), performs one
rts_lock()/rts_unlock() cycle to register with the RTS, then enters
a spin-wait loop on an atomic generation counter.
The master thread (Capability 0) dispatches work by:
- Storing the function pointer and data
- Atomically incrementing the generation counter (release fence)
- Broadcasting a condvar (for sleeping workers)
- Participating in the start barrier, executing
fn(data), and hitting the end barrier
4.2. Synchronization Primitives¶
All barriers use a sense-reversing centralized barrier: each thread maintains a local sense flag. Threads atomically decrement a shared counter; the last thread flips the global sense, releasing all waiters. This is fully lock-free on the fast path.
Workers use a spin-then-sleep strategy: spin for ~4000
iterations (configurable via OMP_WAIT_POLICY: passive=100,
active=10000) on the generation counter (using _mm_pause), then
sched_yield() after the spin threshold, then fall back to a
pthread_cond_wait for power efficiency during idle periods.
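The wait loop can be sketched as follows — a minimal sketch with illustrative names (wait_for_generation and SPIN_LIMIT are not the runtime's actual identifiers), and with the final pthread_cond_wait tier elided:

```c
#include <stdatomic.h>
#include <sched.h>

#define SPIN_LIMIT 4000   /* OMP_WAIT_POLICY would tune this: passive=100, active=10000 */

/* Spin on the generation counter until it moves past my_gen, pausing the
 * pipeline while spinning, then yielding the core once the budget is spent.
 * (The real runtime additionally falls back to pthread_cond_wait.) */
static unsigned wait_for_generation(atomic_uint *gen, unsigned my_gen) {
    int spins = 0;
    unsigned g;
    while ((g = atomic_load_explicit(gen, memory_order_acquire)) == my_gen) {
        if (++spins < SPIN_LIMIT) {
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();    /* _mm_pause: reduce spin power/contention */
#endif
        } else {
            sched_yield();             /* be polite after the spin budget */
        }
    }
    return g;                          /* the new generation observed */
}
```

The acquire load pairs with the master's release increment (§5.1), so a worker that observes the new generation also sees the function pointer and data stored before it.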
Worksharing loops support static, dynamic, guided, and runtime scheduling.
Guided scheduling uses a CAS-based loop with exponentially-decreasing
chunk sizes (remaining / nthreads), giving large initial chunks and
progressively smaller ones for load balancing.
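The chunk-size arithmetic can be sketched as follows (next_guided_chunk is an illustrative name; the real runtime claims each chunk with a CAS loop on a shared iteration counter, which is elided here):

```c
/* Guided schedule: each grab takes remaining/nthreads iterations, clamped
 * below by chunk_min and above by what is actually left. */
static long next_guided_chunk(long remaining, int nthreads, long chunk_min) {
    long chunk = remaining / nthreads;
    if (chunk < chunk_min) chunk = chunk_min;
    if (chunk > remaining) chunk = remaining;   /* don't overrun the tail */
    return chunk;
}
```

For 1000 iterations on 4 threads this yields 250, 187, 140, ... — large chunks up front, small ones at the end for load balancing.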
Serialized nested parallelism: Inner parallel regions execute with 1
thread. Full omp_get_level(), omp_get_active_level(),
omp_get_ancestor_thread_num(), and omp_get_team_size() support with
per-thread nesting state up to 8 levels deep.
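A toy model of that per-thread nesting state — nest_state_t, enter_region, and ancestor_thread_num are illustrative names, not the runtime's actual identifiers:

```c
#define MAX_NEST 8
typedef struct {
    int level;              /* current nesting depth */
    int tnum[MAX_NEST];     /* thread num recorded at each level */
    int tsize[MAX_NEST];    /* team size recorded at each level */
} nest_state_t;

/* Push one level of nesting state on entry to a (possibly serialized) region. */
static void enter_region(nest_state_t *s, int thread_num, int team_size) {
    if (s->level < MAX_NEST) {
        s->tnum[s->level] = thread_num;
        s->tsize[s->level] = team_size;
    }
    s->level++;
}

/* Mirrors omp_get_ancestor_thread_num: -1 for out-of-range levels. */
static int ancestor_thread_num(const nest_state_t *s, int lvl) {
    if (lvl < 0 || lvl >= s->level || lvl >= MAX_NEST) return -1;
    return s->tnum[lvl];
}
```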
4.3. Task Queues and Work Stealing¶
Source: ghc_omp_runtime_rts.c, test_omp_tasks.c
OpenMP tasks (#pragma omp task) enable fork-join parallelism where one
thread creates work items and other threads steal them. Our runtime supports
deferred execution: tasks are queued to per-Capability work-stealing queues
and executed by idle threads waiting at barriers.
Each Capability has its own task queue protected by an atomic_flag spinlock,
with an atomic pending counter for fast-path bypass:
- GOMP_task: When if_clause is true and we're in a parallel region, copies data to the heap (via cpyfn or memcpy) and pushes to the local queue. Otherwise executes inline. Task descriptors are allocated from a pre-allocated pool (4096 entries) with per-Capability lock-free free lists.
- Work stealing: Idle threads first pop from their own queue, then steal from other threads' queues via a linear scan from a pseudo-random start.
- Barrier task stealing: spin_barrier_wait_tasks checks g_tasks_pending before attempting steals (avoiding expensive atomic operations when no tasks exist). The last thread arriving drains remaining tasks before releasing the barrier.
- End-of-parallel stealing: The pool's end-barrier uses the task-stealing variant, since GCC may omit explicit GOMP_barrier calls after #pragma omp single.
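A minimal model of one such queue — QCAP, task_queue_t, and the function names are illustrative, and the pending-counter fast path and descriptor pool are elided:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 64
typedef struct {
    void *items[QCAP];
    int head, tail;          /* thieves steal from head, the owner pops from tail */
    atomic_flag lock;        /* zero-initialized state is "clear" on mainstream ABIs */
} task_queue_t;

static void q_lock(task_queue_t *q) {
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                    /* spin: the critical sections below are tiny */
}
static void q_unlock(task_queue_t *q) {
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static bool q_push(task_queue_t *q, void *task) {
    q_lock(q);
    bool ok = (q->tail - q->head) < QCAP;
    if (ok) q->items[q->tail++ % QCAP] = task;
    q_unlock(q);
    return ok;
}

static void *q_pop(task_queue_t *q) {    /* owner: LIFO for cache locality */
    q_lock(q);
    void *t = NULL;
    if (q->tail > q->head) t = q->items[--q->tail % QCAP];
    q_unlock(q);
    return t;
}

static void *q_steal(task_queue_t *q) {  /* thief: FIFO from the other end */
    q_lock(q);
    void *t = NULL;
    if (q->tail > q->head) t = q->items[q->head++ % QCAP];
    q_unlock(q);
    return t;
}
```

Owners pop newest-first (hot caches); thieves take oldest-first, which tends to steal the largest remaining subtrees of work.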
Benchmark results are in Section 9.6.
5. Optimization: From 24x Slower to Parity¶
The Phase 2 runtime was functional but slow: fork/join took 24 us vs libgomp's 1 us. The bottleneck was mutex+condvar on every operation. We eliminated all locks from the hot path:
5.1. Lock-free Work Dispatch¶
// Master: store work, then release-fence generation increment
g_pool.fn = fn;
g_pool.data = data;
atomic_fetch_add(&g_pool.generation, 1, memory_order_release);
// Worker: spin on generation (acquire-fence)
while (atomic_load(&g_pool.generation, memory_order_acquire) == my_gen)
_mm_pause();
No mutex on the hot path. The condvar broadcast is only for workers that fell asleep after 4000 spin iterations.
5.2. Sense-Reversing Barrier¶
The sense-reversing centralized barrier follows Mellor-Crummey & Scott's algorithm ("Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS 9(1), 1991):
void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
*local_sense = 1 - *local_sense;
if (atomic_fetch_sub(&b->count, 1, memory_order_acq_rel) == 1) {
// Last thread: reset counter, flip global sense
atomic_store(&b->count, b->size, memory_order_relaxed);
atomic_store(&b->sense, *local_sense, memory_order_release);
} else {
// Spin until sense matches
while (atomic_load(&b->sense, memory_order_acquire) != *local_sense)
_mm_pause();
}
}
Pure atomic operations, no locks. The centralized design has O(N) wakeup but is optimal for small team sizes (typical OpenMP use).
5.3. Results¶
| Metric (4 threads) | Phase 2 | Phase 3 | Native libgomp |
|---|---|---|---|
| Fork/join | 24.35 us | 0.81 us | 0.97 us |
| Barrier | 7.01 us | 0.25 us | 0.51 us |
| Parallel for (1M sin) | 6.71 ms | 3.91 ms | 3.85 ms |
| Critical section | 0.39 ms | 0.38 ms | 0.92 ms |
After optimization, the RTS-backed runtime matches or beats native libgomp on all benchmarks.
With parity established, the runtime becomes a platform. The following sections build the capabilities — FFI integration, GC isolation, zero-copy sharing, linear types — that culminate in Haskell and OpenMP collaborating on shared data structures (§8).
6. Haskell Integration¶
This section covers the integration between Haskell and the OpenMP runtime: calling conventions, initialization, concurrent execution, garbage collection behavior, and bidirectional callbacks.
See also the Haddock API reference for the GHC.OpenMP module.
6.1. FFI Calling Convention¶
Source: HsMain.hs
Haskell calls OpenMP C code via foreign import ccall safe.
The safe keyword is critical: it tells GHC to release the
calling Capability before entering the foreign code, and reacquire it on
return. This means:
- Other Haskell green threads can run on the released Capability
- The C code enters GOMP_parallel, which dispatches to the worker pool — including potentially the Capability just released
- No deadlock: workers don't need to hold Capabilities to execute C compute kernels
6.2. RTS Initialization¶
When called from a Haskell host, hs_init_ghc() is already done
by GHC before main. Our runtime's ensure_rts() calls
hs_init_ghc() again, which simply increments the reference count
and returns. The runtime discovers the existing Capabilities via
getNumCapabilities() and spawns workers for Caps 1..N-1.
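The reference-counting behavior can be modeled in miniature — ensure_rts here is a toy stand-in with illustrative names (the real one calls hs_init_ghc, which performs its own counting internally):

```c
#include <stdatomic.h>

static atomic_int rts_refs = 0;   /* how many clients have requested the RTS */
static int boot_count = 0;        /* how many times we actually booted it */

/* Only the first caller boots; every later call just bumps the count.
 * In the real runtime the body of the if would be hs_init_ghc(...). */
static void ensure_rts(void) {
    if (atomic_fetch_add(&rts_refs, 1) == 0)
        boot_count++;
}
```

This is why the same runtime binary works both standalone (it boots the RTS itself) and hosted by a Haskell program (the RTS is already up, so the call is a no-op increment).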
6.3. Concurrent Execution¶
Source: HsConcurrent.hs
-- Haskell green thread: pure computation
_ <- forkIO $ do
let !result = haskellSinSum 1200000
putMVar hsDone result
-- OpenMP FFI call (safe: releases Capability)
_ <- forkIO $ do
result <- c_parallel_sinsum 12000000
putMVar ompDone result
-- Both run simultaneously!
Measured: sequential 68ms → concurrent 58ms, with 10ms of overlapping execution confirmed.
6.4. Garbage Collection Isolation¶
Source: HsGCStress.hs
A key concern: GHC's stop-the-world GC pauses all threads holding Capabilities. Would this stall OpenMP workers?
Answer: No. OpenMP workers do not hold Capabilities during parallel execution. After their initial
rts_lock()/rts_unlock() registration, they are plain OS threads spinning on atomic variables. GC only synchronizes Capability-holding threads — our workers are invisible.
Experimental Validation¶
We ran 500 OpenMP parallel regions (each ~400us) concurrently with:
| Scenario | p50 (us) | p99 (us) | max (us) |
|---|---|---|---|
| Baseline (OpenMP alone) | 314–478 | 636–658 | 692–783 |
| + allocation pressure (50K rounds) | 313–543 | 538–651 | 585–691 |
| + forced major GC (20 × performGC) | 315–556 | 549–744 | 574–2262 |
Allocation pressure has negligible impact (within noise). Forced major GCs produced one outlier spike of 2262us on one run and none on another. The spike correlates with the GHC RTS reporting a 1.6ms max GC pause — likely the OS thread making the FFI call had its Capability briefly paused at a region boundary.
GHC RTS statistics: 99.7% productivity, GC time <0.5% of elapsed.
6.5. Bidirectional Callbacks¶
Source: HsCallback.hs, omp_compute.c
The previous sections demonstrated Haskell calling OpenMP. OpenMP workers can also call back into Haskell from within a parallel region.
Mechanism¶
Haskell creates a FunPtr via
foreign import ccall "wrapper":
foreign import ccall "wrapper"
mkCallback :: (CInt -> IO CDouble)
-> IO (FunPtr (CInt -> IO CDouble))
sinCb <- mkCallback (\i -> return (sin (fromIntegral i * 0.001)))
GHC generates a C stub that wraps the Haskell closure with automatic Capability management:
// Generated wrapper (simplified):
CDouble wrapper(CInt arg) {
Capability *cap = rts_lock(); // acquire Capability
// ... evaluate Haskell closure ...
rts_unlock(cap); // release Capability
return result;
}
The C code calls this FunPtr from inside an OpenMP parallel for:
double parallel_reduce_callback(hs_callback_t callback, int n) {
double sum = 0.0;
#pragma omp parallel for reduction(+:sum) schedule(static)
for (int i = 0; i < n; i++)
sum += callback(i); // each worker calls into Haskell
return sum;
}
Correctness¶
All results verified against pure C and pure Haskell reference implementations:
| Test | Result | Status |
|---|---|---|
| parallel_map (1000 sin values) | Element-wise match to 1e-10 | OK |
| parallel_reduce (100K sin sum) | 1839.343386 (matches pure C) | OK |
| polynomial callback (10K) | 1109840.005000 (matches Haskell) | OK |
Performance¶
| Threads | Pure C (ms) | Callback (ms) | Overhead | Per-callback |
|---|---|---|---|---|
| 1 | 1.69 | 46.60 | 27.6x | ~0.47 us |
| 2 | 1.17 | 60.43 | 51.8x | ~0.60 us |
| 4 | 0.71 | 57.91 | 82.1x | ~0.58 us |
The per-callback cost of ~0.5us is the rts_lock()/rts_unlock()
round-trip. This is constant regardless of what the Haskell function does.
For callbacks that perform milliseconds of work (e.g., looking up a Haskell
data structure, evaluating a complex expression), the overhead is negligible.
For tight inner loops like 100K trivial sin() calls, pure C should be used
instead.
Practical guideline: Use Haskell callbacks when each invocation does ≥100us of work. Below that, the rts_lock/rts_unlock overhead dominates. Structure code so that OpenMP handles the hot numerical loop in C, and calls Haskell for complex logic at coarser granularity.
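The ≥100us guideline follows from simple arithmetic — a sketch, assuming the ~0.5us lock cost measured above (callback_overhead_fraction is an illustrative name):

```c
/* Fraction of each callback's wall time spent in rts_lock()/rts_unlock(),
 * given work_us microseconds of useful Haskell work per invocation. */
static double callback_overhead_fraction(double work_us) {
    const double lock_us = 0.5;   /* ~0.5 us round-trip, from Section 6.5 */
    return lock_us / (lock_us + work_us);
}
```

At 100us of work per callback the overhead is under 0.5%; at 0.5us of work it is 50%, which is the regime the benchmark table above measures.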
7. Low-Level Techniques¶
This section describes advanced techniques for reducing overhead at the Haskell-C boundary: zero-overhead Cmm primitives and batched FFI calls.
7.1. Cmm Primitives¶
Source: omp_prims.cmm, HsCmmDemo.hs
GHC provides three calling conventions for foreign code, each with different
overhead. We wrote a Cmm primitive that reads Capability_no(MyCapability())
— the same value as omp_get_thread_num() — to measure the overhead of each
tier (see Section 9.7).
The Cmm Primitive¶
The primitive is called from Haskell via a foreign import prim declaration (source: omp_prims.cmm).
The inline-cmm library provides a
[cmm| ... |] quasiquoter that eliminates the need for separate .cmm files
and manual foreign import prim declarations — Template Haskell handles
compilation and linking automatically, producing identical zero-overhead results.
Key Findings¶
Prim calls are truly free: GHC treats foreign import prim functions as
pure expressions. The Cmm function compiles to a single memory load from
BaseReg, which GHC can hoist out of loops entirely via LICM (loop-invariant
code motion). In a tight loop, 100M prim calls complete in <1ms.
The safe FFI tax is ~65ns: Each foreign import ccall safe call costs ~65ns
more than unsafe — the price of suspendThread()/resumeThread() which
release and reacquire the Capability. For OpenMP regions doing >1us of work,
this is negligible (<7%).
The callback overhead gap: The ~500ns per callback overhead from
Section 6.5 is ~7x larger than the raw safe FFI
cost (~68ns). The difference comes from rts_lock()/rts_unlock() performing
additional work beyond suspendThread()/resumeThread(): Task structure
allocation, global Capability search, and lock acquisition. A Cmm-level fast
path could potentially reduce this.
7.2. Batched Safe Calls¶
Source: omp_batch.cmm, HsCmmBatch.hs
The safe FFI tax of ~68ns per call comes from suspendThread()/resumeThread(),
which release and reacquire the Capability. For workloads that make many short
C calls, this overhead dominates.
By writing the suspend/resume manually in Cmm, we can batch N C calls within a single Capability release/reacquire cycle, amortizing the ~68ns overhead across all N calls.
The Batching Primitive¶
#include "Cmm.h"
cmm_batched_tid(W_ n) {
W_ tok; W_ result; W_ new_base; W_ stack;
W_ i; W_ t;
/* Save Sp to TSO — GC needs valid stack pointer */
stack = StgTSO_stackobj(CurrentTSO);
StgStack_sp(stack) = Sp;
(tok) = ccall suspendThread(BaseReg "ptr", 0);
result = 0; i = 0;
goto loop_check;
loop_body:
(t) = ccall omp_get_thread_num();
result = result + t;
i = i + 1;
loop_check:
if (i < n) goto loop_body;
(new_base) = ccall resumeThread(tok);
BaseReg = new_base;
/* Restore Sp — GC may have moved the stack */
stack = StgTSO_stackobj(CurrentTSO);
Sp = StgStack_sp(stack);
return (result);
}
Implementation Details¶
Three details were critical for correctness:
- Save/restore Sp: suspendThread releases the Capability, allowing GC to run. The GC needs a valid Sp in the TSO to scan the suspended thread's stack. Without this, the GC follows a stale stack pointer and crashes.
- No "ptr" on tok: The token from suspendThread is void*, not a GC-traceable pointer. Annotating it as "ptr" would tell GHC's Cmm compiler to treat it as a GC root, causing the GC to follow a non-heap pointer.
- State# threading: foreign import prim is pure by default — GHC can CSE or hoist the call. Threading State# RealWorld through the type signature makes GHC treat it as effectful while adding zero runtime cost (State# is erased at the Cmm level).
Benchmark results showing speedups up to 27x are in Section 9.8.
8. Shared Memory Demos¶
Source: HsSharedMem1.hs, HsSharedMem2.hs, HsSharedMem3.hs, HsSharedMem4.hs, HsSharedMem5.hs, omp_shared.c
Everything in this project builds to this: Haskell and OpenMP C code operating on the same data, concurrently, with type-level proof that their access patterns are safe. The unified runtime (§6) makes shared memory possible; zero-copy FFI (§A.6) makes it practical; linear types (§A.7) make it correct by construction.
Five demos show this progression — from sequential handoff, through
defensive synchronization, to compile-time proof of disjoint access,
safety guarantees, and finally pure Haskell parallelism via GHC sparks.
All use the same workload: element-wise
f(x) = sin(x) * cos(x) + sqrt(|x|), applied by both Haskell and C/OpenMP
to portions of a shared array.
Demo 1: The basic pattern works¶
Sequential handoff — Haskell fills, C transforms, Haskell reads:
arrIn <- newPinnedDoubles n -- pinned: not moved by GC
arrOut <- newPinnedDoubles n
forM_ [0..n-1] $ \i -> writeD arrIn i (fromIntegral i * 0.001)
c_transform_all (ptrOf arrIn) (ptrOf arrOut) (fromIntegral n)
s <- sumArray arrOut n -- read back in Haskell
No concurrent access, no synchronization. This establishes that the
plumbing works: a pinned ByteArray created in Haskell is directly
readable and writable by C/OpenMP code via mutableByteArrayContents#.
Demo 2: The problem — defensive synchronization¶
Haskell and C/OpenMP each process a disjoint half of the output array.
Without compile-time proof of disjointness, a GOMP_barrier() is needed
for memory visibility — even though the regions never actually overlap:
hsTransformRange arrIn arrOut 0 half -- Haskell: [0, half)
c_transform_range_barrier pIn pOut half (n - half) -- C: [half, n) + barrier
This "defensive synchronization" is the cost of not having type-level guarantees about disjoint access. The barrier is cheap (~0.3 us), but it represents a fundamental limitation: the programmer must manually reason about which regions are disjoint, and the compiler cannot help.
Demo 3: The solution — linear types eliminate barriers¶
Same partition, but using Data.Array.Linear's split/combine to prove
disjointness at the type level:
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = linearTransform rwL arrIn arrL -- Haskell: left half
!() = unsafePerformIO $ -- C: right half
unsafeWithPtr arrOut $ \pOut ->
unsafeWithPtr arrIn $ \pIn ->
c_transform_range pIn pOut
(fromIntegral half) (fromIntegral (n - half))
in combine st rwL' rwR -- no barrier
The type system enforces:
- rwL grants exclusive write access to the left half
- rwR grants exclusive access to the right half — consumed by combine
- No Haskell code can use rwR to access arrR while C is processing it
- combine is zero-cost (no allocation, no copying)
Demo 4: Safety — why barriers exist¶
Two examples showing that removing barriers (nowait) without proof of
disjointness silently introduces bugs.
Part A — Off-by-one overlap. Each partition writes [off..off+chunk+1)
instead of [off..off+chunk). Boundary elements are written by two
partitions. With nowait, this is a data race; with barriers, it is
deterministic but still wrong (double-write at boundaries). With linear
split, overlapping ranges are impossible — the type system rejects them:
| Variant | N=10K, P=4 | N=100K, P=16 | N=1M, P=64 |
|---|---|---|---|
| Disjoint (correct) | 0.00 | 0.00 | 0.00 |
| Overlap + barrier | 3.06 | 9.26 | 31.8 |
| Overlap + nowait | 3.06 | 9.26 | 31.8 |
Part B — Two-pass stencil. Pass 1 writes out[i] = f(in[i])
(independent per element); pass 2 reads neighbors
out[i] = avg(out[i-1..i+1]) (crosses partition boundaries). Without
a barrier between passes, pass 2 reads stale data. With linear types,
combine forces pass 1 to complete before pass 2 can access the parent
token needed for cross-boundary reads:
let rw1 = linearMultiPass1 rw arrIn arrOut 0 numParts -- partitioned pass 1
-- combine happened inside — rw1 is the parent token
in linearPass2 rw1 arrOut -- pass 2 needs parent
| Variant | N=10K | N=100K | N=1M |
|---|---|---|---|
| C nowait (worst of 10 runs) | 2.68e-2 | 1.09e-2 | 4.44e-3 |
| Haskell linear vs ref | 0.00 | 0.00 | 0.00 |
The C nowait version produces wrong results (stale reads at partition
boundaries). The Haskell linear version is both correct and barrier-free.
Demo 5: GHC spark parallelism¶
Demos 2–4 use C/OpenMP for the parallel half. Demo 5 shows that the same
split/combine pattern works for pure Haskell parallelism via GHC sparks.
parCombine replaces combine — it sparks the left partition and evaluates
the right on the current thread, using spark#/seq# with noDuplicate#
for safety:
parPartition rw arrIn arrOut base depth =
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = parPartition rwL arrIn arrL base (depth - 1)
rwR' = parPartition rwR arrIn arrR (base + size arrL) (depth - 1)
in parCombine st rwL' rwR' -- spark left, eval right
Spark scaling (N=1,000,000, 4 capabilities):
| Depth | Partitions | Sequential | Parallel | Speedup |
|---|---|---|---|---|
| 0 | 1 | 41.9 ms | 43.6 ms | 0.96x |
| 1 | 2 | 44.4 ms | 22.3 ms | 1.99x |
| 2 | 4 | 42.5 ms | 13.8 ms | 3.09x |
| 3 | 8 | 42.1 ms | 15.0 ms | 2.80x |
| 5 | 32 | 46.1 ms | 13.5 ms | 3.41x |
| 6 | 64 | 45.7 ms | 13.7 ms | 3.34x |
Near-ideal 2x at 2 partitions, ~3x at 4+ partitions on 4 cores. No C/OpenMP involved — parallelism is GHC's work-stealing scheduler.
Results (Demos 1–3, 4 threads, i7-10750H)¶
Barrier vs linear — iteration loop:
| N | Iters | With barrier | Linear | Saved |
|---|---|---|---|---|
| 10,000 | 1000 | 202.5 ms | 194.2 ms | 4.1% |
| 100,000 | 100 | 231.9 ms | 221.7 ms | 4.4% |
| 1,000,000 | 10 | 224.6 ms | 221.8 ms | 1.3% |
Partition scaling (N=1,000,000) — linear types:
| Partitions | Time (ms) |
|---|---|
| 2 | 39.0 |
| 4 | 39.8 |
| 8 | 38.9 |
| 16 | 38.3 |
| 32 | 38.4 |
Scaling is flat: zero barriers regardless of partition count, because
split/combine is pure arithmetic on offset/length views into the same
underlying buffer.
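The offset/length arithmetic can be sketched as follows (slice_t and halve_slice are illustrative names, not Data.Array.Linear's API):

```c
/* A slice is just an (offset, length) view into a shared buffer.
 * Splitting allocates nothing, and the two halves are adjacent and
 * disjoint by construction — they cover the parent exactly. */
typedef struct { long off, len; } slice_t;

static void halve_slice(slice_t s, slice_t *l, slice_t *r) {
    long mid = s.len / 2;
    l->off = s.off;        l->len = mid;
    r->off = s.off + mid;  r->len = s.len - mid;
}
```

combine is the inverse: forget the two views and recover the parent's (offset, length) — again pure arithmetic, which is why partition count has no effect on runtime.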
Interpretation¶
The modest 1–4% improvement in Demo 3 vs Demo 2 confirms that barrier elimination is not the point — the barrier was already fast. The point is correctness and composability:
-
Safety (Demo 4): The type checker rejects programs that access overlapping regions (Part A) and enforces ordering between passes that share data across partition boundaries (Part B). These are real bugs that
nowaitintroduces silently — linear types catch them at compile time. -
Zero-cost abstraction:
split/combineinvolves no allocation, copying, or runtime checks. The tokens (RW s) are erased at runtime. The generated code is identical to the unsafe version minus the barrier. -
Composable parallelism (Demo 5): The same
split/combinepattern works with GHC sparks (parCombine) for pure Haskell parallelism, achieving near-ideal scaling. The programmer writes the same code regardless of whether parallelism comes from OpenMP (C FFI) or GHC sparks — linear types guarantee safety in both cases.
9. Benchmarks¶
All benchmarks on Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (6C/12T), NixOS, GCC 15.2.0, GHC 9.10.3, powersave governor. Best-of-N timing to reduce CPU frequency variance.
9.1. Microbenchmarks¶
Source: bench_overhead.c
Fork/Join Overhead (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.464 | 0.033 | 14.1x faster * |
| 2 | 0.765 | 0.499 | 1.5x faster |
| 4 | 0.931 | 0.945 | 1.02 |
| 8 | 1.461 | 1.692 | 1.16 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Barrier Latency (us/iter)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.276 | 0.002 | 138.0x faster * |
| 2 | 0.242 | 0.137 | 1.8x faster |
| 4 | 0.248 | 0.27 | 1.09 |
| 8 | 0.434 | 0.47 | 1.08 |
* 1-thread: RTS uses a single-thread fast path (nthreads == 1 → direct function call, no barrier init). libgomp still performs full thread-pool setup.
Parallel For + Reduction (1M sin(), best of 10, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 15.973 | 15.388 | 0.96 |
| 2 | 7.453 | 7.641 | 1.03 |
| 4 | 3.777 | 3.879 | 1.03 |
| 8 | 3.507 | 3.5 | 1.00 |
Critical Section (1000 lock/unlock per thread, ms)¶
| Threads | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| 1 | 0.021 | 0.026 | 1.24 |
| 2 | 0.069 | 0.264 | 3.83 |
| 4 | 0.352 | 0.327 | 1.1x faster |
| 8 | 0.942 | 1.318 | 1.40 |
9.2. DGEMM¶
Source: bench_dgemm.c, omp_compute.c
Same naive triple-loop DGEMM compiled identically, linked against either native libgomp or our runtime. Checksums match exactly.
4 Threads¶
| N | Native (ms) | RTS (ms) | Ratio | GFLOPS (RTS) |
|---|---|---|---|---|
| 128 | 0.93 | 1.05 | 1.13x | 4.01 |
| 256 | 10.83 | 13.6 | 1.26x | 2.47 |
| 512 | 78.59 | 83.66 | 1.06x | 3.21 |
| 1024 | 670.05 | 654.66 | 0.98x | 3.28 |
Interleaved re-runs confirm the two runtimes trade leads: the difference is CPU frequency noise, not runtime overhead.
Scaling (RTS-backed, DGEMM 1024x1024)¶
| Threads | Time (ms) | GFLOPS | Speedup |
|---|---|---|---|
| 1 | 2965.94 | 0.72 | 1.0x |
| 2 | 1880.42 | 1.14 | 1.6x |
| 4 | 654.66 | 3.28 | 4.5x |
9.3. FFI Scaling¶
Source: HsMain.hs
Haskell calling parallel sinsum via safe FFI:
| Threads | Time (ms) | Speedup |
|---|---|---|
| 1 | 32.5 | 1.0x |
| 2 | 17.3 | 1.9x |
| 4 | 9.9 | 3.3x |
| 8 | 5.0 | 6.5x |
Near-linear scaling through the FFI boundary, confirming the runtime correctly parallelizes work dispatched from Haskell.
9.4. Parallelism Crossover¶
Source: HsCrossover.hs
When does OpenMP from Haskell beat sequential C? We measured sinsum (compute-bound, ~11ns per element) at various sizes with 4 threads.
| Elements | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 0.7 us | 2.8 us | 0.26x |
| 200 | 3.1 us | 2.9 us | 1.08x |
| 500 | 8.8 us | 5.8 us | 1.50x |
| 1000 | 10.2 us | 5.2 us | 1.96x |
| 5000 | 64.6 us | 21.7 us | 2.98x |
| 100000 | 1528.7 us | 385.5 us | 3.97x |
Parity is reached between 100 and 200 elements, and the speedup becomes robust (1.5x and up) from ~500 elements — above that, OpenMP parallel execution from Haskell beats sequential C called via unsafe FFI. The fixed overhead is ~1.8us (86ns safe FFI + 1712ns OpenMP fork/join).
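A back-of-envelope model reproduces the table: assuming ~11 ns/element of sequential work, ~1.8 us of fixed fork/join overhead, and ideal 4-way speedup, break-even falls near 220 elements — between the measured 100-element row (slower) and 200-element row (already faster). The function name is illustrative:

```c
/* Break-even problem size: parallel wins once the per-element time saved
 * by using `threads` cores, n * ns_per_elem * (1 - 1/threads), exceeds
 * the fixed fork/join overhead. */
static double crossover_elements(double ns_per_elem, double overhead_ns,
                                 int threads) {
    double saved_per_elem = ns_per_elem * (1.0 - 1.0 / (double)threads);
    return overhead_ns / saved_per_elem;
}
```

With the section's numbers: 1800 / (11 x 0.75) ≈ 218 elements.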
9.5. GHC Native Parallelism vs OpenMP¶
Source: HsParCompare.hs
For the same compute-bound sinsum workload, how does Haskell's forkIO with
manual work splitting compare to OpenMP via safe FFI?
GHC NCG (default code generator)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 327.0 us | 148.0 us | 107.3 us | 42.9 us | 2.50x |
| 100K | 3283.6 us | 1549.8 us | 881.1 us | 375.8 us | 2.34x |
| 1M | 31922.8 us | 15417.0 us | 8637.3 us | 6626.1 us | 1.30x |
| 10M | 336020.5 us | 153998.0 us | 85324.7 us | 40187.1 us | 2.12x |
With the default NCG backend, OpenMP is consistently ~2x faster than parallel Haskell. Sequential Haskell is also ~2.2x slower than sequential C. The gap comes entirely from per-element code quality, not parallelism overhead — both achieve near-ideal scaling on 4 threads.
GHC LLVM backend (-fllvm)¶
| Elements | Seq Haskell | Seq C | Par Haskell | Par OpenMP | Hs/OMP ratio |
|---|---|---|---|---|---|
| 10K | 104 us | 103 us | 43 us | 28 us | 1.5x |
| 100K | 1109 us | 1101 us | 303 us | 278 us | 1.1x |
| 1M | 11156 us | 10998 us | 2972 us | 2866 us | 1.04x |
| 10M | 111751 us | 111112 us | 30924 us | 28747 us | 1.08x |
Compiling with -fllvm (LLVM 20.1) eliminates the gap entirely. Sequential
Haskell matches sequential C, and parallel Haskell reaches parity with OpenMP.
The 2x gap under NCG is purely a code generator limitation — GHC's inner loop
compiles to 17 instructions vs GCC/LLVM's 10. See
Appendix A.3
for the full assembly analysis.
9.6. Task Execution¶
Source: HsTaskDemo.hs, test_omp_tasks.c
Deferred task execution with work-stealing barriers (4 threads, best of 5):
| Tasks | Sequential | Parallel | Speedup |
|---|---|---|---|
| 100 | 1.5 ms | 0.4 ms | 3.81x |
| 500 | 7.7 ms | 2.5 ms | 3.12x |
| 1,000 | 15.6 ms | 5.0 ms | 3.12x |
| 5,000 | 78.1 ms | 19.9 ms | 3.93x |
| 10,000 | 155.9 ms | 42.6 ms | 3.66x |
Near-linear scaling (3.1-3.9x on 4 threads). Correctness verified against sequential reference with exact match.
9.7. Calling Convention Overhead¶
Source: HsCmmDemo.hs, omp_prims.cmm
| Convention | ns/call | Relative | Mechanism |
|---|---|---|---|
| foreign import prim (Cmm) | ~0 | — | Direct register read, GHC optimizes away |
| foreign import ccall unsafe | 3.1 | — | Save/restore STG registers |
| foreign import ccall safe | 89.8 | 29x vs unsafe | + suspendThread/resumeThread |
9.8. Batched Calls¶
Source: HsCmmBatch.hs, omp_batch.cmm
Amortizing the ~68ns safe FFI overhead by batching N C calls within a single
suspendThread/resumeThread cycle:
| Batch size | Standard safe | Cmm batched | Speedup |
|---|---|---|---|
| 1 | 98.3 ns | 97.9 ns | 1.0x |
| 2 | 100.0 ns | 50.3 ns | 2.0x |
| 5 | 97.3 ns | 20.8 ns | 4.7x |
| 10 | 97.4 ns | 12.4 ns | 7.8x |
| 20 | 102.6 ns | 7.6 ns | 13.4x |
| 50 | 104.6 ns | 4.7 ns | 22.2x |
| 100 | 97.8 ns | 3.7 ns | 26.4x |
At batch=100, per-call cost drops to 3.7 ns, approaching the ~2 ns cost of an
unsafe FFI call. The results follow the amortization model (68 + N × 2) / N:
the fixed suspend/resume cost is paid once per batch and split across N calls.
10. Implementation Timeline¶
The runtime was developed in 21 phases. Each phase is summarized below with a reference to the section containing full details.
| Phase | Description | Section |
|---|---|---|
| 1 | Stub pthread-based runtime validating GCC GOMP ABI compatibility | §4 |
| 2 | Replace pthreads with GHC Capabilities; hybrid C shim approach | §4 |
| 3 | Lock-free optimization: atomic generation counter, sense-reversing barriers | §5 |
| 4 | Haskell FFI interop via `foreign import ccall safe` | §6.1 |
| 5 | Concurrent Haskell green threads + OpenMP parallel regions | §6.3 |
| 6 | GC interaction testing — minimal impact on OpenMP latency | §6.4 |
| 7 | Dense matrix multiply (DGEMM) workload | §9.2 |
| 8 | Head-to-head comparison with native libgomp — performance parity | §9.2 |
| 9 | Bidirectional interop: OpenMP workers call Haskell via FunPtr | §6.5 |
| 10 | Cmm primitives via `foreign import prim` — zero-overhead FFI | §7.1 |
| 11 | `inline-cmm` quasiquoter integration | §7.1 |
| 12 | Batched safe calls amortizing 68ns FFI overhead | §7.2 |
| 13 | Parallelism crossover analysis — break-even at ~500 elements | §9.4 |
| 14 | GHC native parallelism vs OpenMP — parity with `-fllvm` | §9.5 |
| 15 | Deferred task execution with work-stealing barriers | §4.3 |
| 16 | Zero-copy FFI with pinned ByteArray — 19% inner loop speedup | §A.6 |
| 17 | Linear typed arrays for type-safe disjoint partitioning | §A.7 |
| 18 | Runtime improvements: guided scheduling, hybrid barriers, task pools | §4 |
| 19 | Shared memory demos: producer-consumer, synchronized, linear | §8 |
| 20 | Safety demos: overlap bugs, stencil ordering, linear type prevention | §8 |
| 21 | GHC spark parallelism via `parCombine` with `spark#`/`noDuplicate#` | §8 |
11. Notable Bugs and Fixes¶
11.1. Barrier Sense Mismatch Deadlock¶
Symptom: Program hangs when calling GOMP_parallel
from a forkIO thread at -N4. No output at all. Works
at -N1.
Root cause: Workers' local barrier sense variables
(start_sense, end_sense) persisted from previous
parallel regions (value 1), but spin_barrier_init() reset the
barrier's global sense to 0. On the next region:
- Workers flipped their local sense 1→0, saw `sense(0) == local_sense(0)`, and passed through immediately
- The master (on a new OS thread from `forkIO`) had a fresh local sense of 0, flipped it to 1, and could never complete the barrier
Fix: Reset all local sense variables to 0 at the start of each parallel region, matching the freshly initialized barriers.
11.2. False Parallel-For Regression¶
Symptom: At 4 threads, parallel for appeared 1.65x slower than native libgomp (6.7ms vs 4.1ms).
Root cause: Single-sample measurement on a laptop with
powersave CPU governor (i7-10750H at 46% clock). CPU boost state
varied between process invocations.
Fix: Changed to best-of-10 within each process. Controlled interleaved testing confirmed parity (3.85ms vs 3.91ms).
12. Limitations¶
| Limitation | Impact | Notes |
|---|---|---|
| Serialized nesting only | Low | Inner parallel regions execute with 1 thread. True nested parallelism (multiple active levels) is not supported. |
| Single global team | Low | No support for different thread counts in nested teams. |
| No target offloading | None | Not applicable to this project's scope. |
| No doacross loops | Low | GOMP_doacross_* not implemented. |
13. Related Work¶
BOLT (bolt-omp.org, Best Paper PACT '19) is the closest analogue to this project. BOLT is a full OpenMP runtime built on Argobots, a lightweight user-level threading library from Argonne National Laboratory. Where libgomp maps OpenMP threads to pthreads, BOLT maps them to Argobots user-level threads (ULTs) scheduled on execution streams (ES) — achieving efficient nested parallelism and fine-grained tasking that pthreads cannot.
The architectural parallel is direct:
| Concept | BOLT / Argobots | ghc-openmp / GHC RTS |
|---|---|---|
| OS-thread abstraction | Execution Stream (ES) | Capability |
| Lightweight work unit | ULT / Tasklet | Haskell green thread |
| OpenMP thread mapping | ULT on ES | OS thread pinned to Capability |
| Scheduler | Pluggable per-pool | GHC spark pool + spin-wait workers |
| Work stealing | Built-in | Phase 15 deferred tasks |
The key difference is motivation: BOLT starts from a purpose-built threading substrate (Argobots) designed for composing HPC runtimes (MPI + OpenMP + task libraries). ghc-openmp repurposes an existing language runtime that already provides green threads, garbage collection, and an FFI — trading Argobots' generality for seamless Haskell interoperation.
14. Conclusions¶
GHC's Runtime System can serve as a fully functional OpenMP runtime with zero measurable overhead compared to native libgomp. The implementation is a single ~1300-line C file using only public GHC RTS APIs — no GHC fork required.
The key architectural insights are:
- Capabilities as thread IDs: `cap->no` maps directly to `omp_get_thread_num()`.
- Workers without Capabilities: After RTS registration, worker threads release their Capabilities. They execute C code as plain OS threads, invisible to the GC.
- Reference-counted init: `hs_init_ghc()` is idempotent, enabling transparent use from both C and Haskell hosts.
- Lock-free synchronization is essential: The naive mutex+condvar implementation was 20–25x slower. Sense-reversing barriers and atomic generation counters brought it to parity.
- Bidirectional FFI works: OpenMP workers call Haskell functions via `FunPtr` with ~0.5 µs overhead per invocation (automatic `rts_lock`/`rts_unlock`), making it practical for coarse-grained callbacks.
This demonstrates that language runtimes can share threading infrastructure across FFI boundaries. A Haskell program can call OpenMP C code, with both sharing the same thread pool, the same CPU cores, and coexisting with GHC's garbage collector.
Beyond performance parity, unifying the runtimes enables a new programming model: Haskell and C code operating on the same data with type-safe guarantees. Linear tokens prove disjoint access at compile time, eliminating defensive synchronization. The type checker becomes a concurrency tool — data races are compile errors, not runtime surprises.
Appendix A.1: Implemented ABI Surface¶
All implementations in ghc_omp_runtime_rts.c
Core Parallel:
GOMP_parallel, GOMP_parallel_start, GOMP_parallel_end, GOMP_barrier
Synchronization:
GOMP_critical_start, GOMP_critical_end, GOMP_critical_name_start, GOMP_critical_name_end, GOMP_atomic_start, GOMP_atomic_end, GOMP_single_start, GOMP_single_copy_start, GOMP_single_copy_end, GOMP_ordered_start, GOMP_ordered_end
Worksharing Loops:
GOMP_loop_static_start, GOMP_loop_static_next, GOMP_loop_dynamic_start, GOMP_loop_dynamic_next, GOMP_loop_guided_start, GOMP_loop_guided_next, GOMP_loop_runtime_start, GOMP_loop_runtime_next, GOMP_loop_start, GOMP_loop_end, GOMP_loop_end_nowait, GOMP_loop_nonmonotonic_dynamic_start, GOMP_loop_nonmonotonic_dynamic_next, GOMP_loop_nonmonotonic_guided_start, GOMP_loop_nonmonotonic_guided_next, GOMP_parallel_loop_static, GOMP_parallel_loop_dynamic, GOMP_parallel_loop_guided, GOMP_parallel_loop_runtime, GOMP_parallel_loop_nonmonotonic_dynamic, GOMP_parallel_loop_nonmonotonic_guided
Tasks:
GOMP_task, GOMP_taskwait, GOMP_taskyield, GOMP_taskgroup_start, GOMP_taskgroup_end
Sections:
GOMP_sections_start, GOMP_sections_next, GOMP_sections_end, GOMP_sections_end_nowait, GOMP_parallel_sections
Cancellation & Teams:
GOMP_cancel, GOMP_cancellation_point, GOMP_barrier_cancel, GOMP_loop_end_cancel, GOMP_sections_end_cancel, GOMP_teams_reg
omp_* User API:
omp_get_num_threads, omp_get_thread_num, omp_get_max_threads, omp_get_num_procs, omp_set_num_threads, omp_in_parallel, omp_set_dynamic, omp_get_dynamic, omp_set_nested, omp_get_nested, omp_get_wtime, omp_get_wtick, omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock, omp_test_lock, omp_init_nest_lock, omp_destroy_nest_lock, omp_set_nest_lock, omp_unset_nest_lock, omp_test_nest_lock, omp_get_level, omp_get_active_level, omp_get_ancestor_thread_num, omp_get_team_size, omp_get_thread_limit, omp_set_max_active_levels, omp_get_max_active_levels, omp_get_supported_active_levels, omp_set_schedule, omp_get_schedule, omp_in_final, omp_get_cancellation, omp_get_proc_bind, omp_get_num_places, omp_get_place_num, omp_get_default_device, omp_set_default_device, omp_get_num_devices, omp_get_num_teams, omp_get_team_num, omp_is_initial_device, omp_get_initial_device, omp_get_max_task_priority
Appendix A.2: GOMP ABI Primer¶
GCC does not interpret OpenMP pragmas at runtime. Instead, the compiler
transforms each pragma into calls to GOMP_* functions at compile time.
Any library that exports these symbols can serve as the OpenMP runtime.
Outlined Functions¶
GCC extracts the body of a parallel region into a separate outlined
function. The original code is replaced by a call to GOMP_parallel:
// Source code:
#pragma omp parallel
{
do_work(shared_data);
}
// GCC transforms this to:
static void outlined_fn(void *data) {
struct shared *d = data;
do_work(d->shared_data);
}
GOMP_parallel(outlined_fn, &shared_data, num_threads, flags);
GOMP_parallel calls outlined_fn on the master thread and dispatches
it to worker threads. An implicit barrier at the end ensures all threads
complete before the master returns.
Worksharing Loops¶
#pragma omp parallel for combines a parallel region with work
distribution. GCC transforms the loop into a protocol:
// Source code:
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
a[i] = compute(i);
// Each thread executes:
long start, end;
if (GOMP_loop_static_start(0, n, 1, chunk, &start, &end)) {
do {
for (long i = start; i < end; i++)
a[i] = compute(i);
} while (GOMP_loop_static_next(&start, &end));
}
GOMP_loop_end();
The _start function assigns the first chunk to the calling thread.
_next returns subsequent chunks until the iteration space is exhausted.
_end synchronizes with an implicit barrier.
For dynamic and guided scheduling, the same protocol applies with
GOMP_loop_dynamic_start/_next or GOMP_loop_guided_start/_next.
The runtime decides chunk sizes: static divides evenly, dynamic uses
fixed chunks, guided uses exponentially decreasing chunks.
Reductions¶
OpenMP reductions use thread-local accumulators combined at the barrier:
// Source code:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
sum += a[i];
// Each thread gets a local copy of sum (initialized to 0.0).
// At the barrier, all local copies are combined with +.
GCC generates the local copies and the combining code. The runtime provides the barrier; the reduction logic is entirely compiler-generated.
Critical Sections and Atomics¶
GOMP_critical_start(); // acquire global mutex
shared_counter++;
GOMP_critical_end(); // release global mutex
GOMP_critical_name_start(&named_lock); // per-name mutex
named_resource++;
GOMP_critical_name_end(&named_lock);
Unnamed critical sections share a single global mutex. Named critical sections use per-name mutexes, allowing independent critical regions to execute concurrently.
Tasks¶
`GOMP_task` supports deferred execution. When `if_clause` is true and threads
are available, the runtime copies the task data (via `cpyfn` or `memcpy`) to
the heap and enqueues it; idle threads steal tasks from other threads' queues.
When `if_clause` is false, the task executes immediately (inlined).
`GOMP_taskwait` blocks until all child tasks complete.
Appendix A.3: NCG vs LLVM Code Generation¶
This appendix details why GHC's default native code generator (NCG) produces
a ~2x slower inner loop than GCC for sin()-heavy workloads, and how the
LLVM backend eliminates the gap. See Section 9.5
for the benchmark results.
What the Haskell code compiles to¶
GHC -O2 already fully unboxes the inner loop (sinDouble#, +##, *##
on Double#), and GCC does not vectorize sin() calls. Neither boxing nor
SIMD explains the gap — it is purely a code generator quality issue.
NCG: 17 instructions per iteration¶
cvtsi2sdq %r14, %xmm1 # i -> double
mulsd .Ln3Sj(%rip), %xmm1 # * 0.001
subq $8, %rsp # stack adjust (every iter!)
movsd %xmm0, %xmm2 # shuffle acc
movsd %xmm1, %xmm0 # move arg for sin()
movl $1, %eax # varargs ABI marker
movsd %xmm2, 72(%rsp) # spill acc
movq %rsi, %rbx # save STG register
call sin
addq $8, %rsp # restore stack
movsd 64(%rsp), %xmm1 # reload acc
addsd %xmm0, %xmm1 # acc += sin(...)
incq %r14 # i++
movsd %xmm1, %xmm0 # shuffle back
movq %rbx, %rsi # restore STG register
cmpq %rsi, %r14 # i < hi?
jl loop
The extra instructions come from:
- STG register save/restore (
movq %rsi, %rbx/movq %rbx, %rsi): NCG saves R2 around everysin()call, inside the loop - Per-iteration stack adjustment (
subq $8/addq $8): NCG adjusts the stack frame every iteration instead of once at function entry - Register shuffles (
movsd %xmm0, %xmm2,movsd %xmm1, %xmm0,movsd %xmm1, %xmm0): NCG's linear register allocator produces unnecessary moves - Varargs ABI marker (
movl $1, %eax): required by x86-64 SysV ABI for variadic functions, but GCC elides it when it knows the callee
LLVM: 10 instructions per iteration¶
movsd %xmm1, 16(%rsp) # spill acc (same as GCC)
xorps %xmm0, %xmm0
cvtsi2sd %r14, %xmm0 # i -> double
mulsd .LCPI20_0(%rip), %xmm0 # * 0.001
callq sin@PLT
movsd 16(%rsp), %xmm1 # reload acc
addsd %xmm0, %xmm1 # acc += sin(...)
incq %r14 # i++
cmpq %r14, %r15 # i < hi?
jne loop
LLVM hoists the STG register save/restore outside the loop, allocates the stack frame once, and eliminates all redundant shuffles. The resulting loop matches GCC instruction-for-instruction.
Summary¶
The 2x NCG gap is not a fundamental Haskell overhead. It is an artifact of GHC's native code generator producing suboptimal machine code for tight loops with C calls. The LLVM backend produces identical code quality to GCC, achieving parity on both sequential and parallel benchmarks.
Environment: GHC 9.10.3, NCG vs -fllvm with LLVM 20.1, GCC 15.2.0.
Appendix A.4: GHC RTS Internals¶
The GHC Runtime System is the execution environment for compiled Haskell programs. This appendix covers the concepts needed to understand how our OpenMP runtime integrates with it.
Capabilities¶
A Capability is GHC's fundamental execution unit. Each Capability consists of:
- One OS thread (the owner)
- One run queue of lightweight Haskell threads (TSOs)
- One spark pool for speculative parallelism (
par) - A private allocation area for the generational GC
The number of Capabilities is set by +RTS -N4 (4 Capabilities).
Each has a 0-indexed number (cap->no) that we map directly to
omp_get_thread_num().
TSOs (Thread State Objects)¶
A TSO represents a lightweight Haskell thread — what forkIO
creates. TSOs are much cheaper than OS threads (~1KB vs ~8MB stack).
Thousands of TSOs can be multiplexed onto a single Capability. The
Capability's scheduler picks TSOs from the run queue and executes them
in round-robin fashion, yielding on allocation (every ~4KB allocated).
Scheduler Loop¶
Each Capability runs a loop:
loop:
tso = pick from run queue (or steal a spark)
run tso until it yields/blocks/finishes
if tso blocked: move to blocked queue
if tso yielded: put back on run queue
goto loop
When a Capability has no work, it can steal sparks from other Capabilities or go idle. This is the same work-stealing mechanism that our OpenMP task implementation builds on.
RTS API for Embedding¶
These functions allow C code to interact with the RTS:
| Function | Purpose |
|---|---|
| `hs_init_ghc(&argc, &argv, conf)` | Boot the RTS. Reference-counted: safe to call when it is already running. |
| `rts_lock()` | Acquire a Capability. Returns a `Capability*`. Blocks until one is available. |
| `rts_unlock(cap)` | Release a Capability, making it available to Haskell threads or other callers. |
| `rts_setInCallCapability(i, 1)` | Pin the calling OS thread to Capability `i`. Subsequent `rts_lock()` calls always get Capability `i`. |
| `getNumCapabilities()` | Return the current number of Capabilities. |
| `getNumberOfProcessors()` | Return the CPU count. |
Our runtime uses these to create workers pinned to specific Capabilities.
After the initial rts_lock()/rts_unlock() registration, workers release
their Capabilities and become plain OS threads.
Safe vs Unsafe FFI¶
GHC provides two FFI calling conventions with different trade-offs:
Unsafe (foreign import ccall unsafe): The Haskell thread keeps
holding its Capability during the C call. Fast (~2ns overhead), but
blocks all other Haskell threads on that Capability. Suitable for
short, non-blocking C functions.
Safe (foreign import ccall safe): The Haskell thread releases
its Capability before calling C, and reacquires it on return. Slower
(~68ns overhead) but allows other Haskell threads to run. Required for
C functions that may block or run for a long time.
Internally, safe FFI calls suspendThread() (release Capability, return
a token) before the C function, and resumeThread(token) (reacquire
Capability) after. This is the mechanism our batched calls exploit
(Section 7.2).
Garbage Collection¶
GHC uses a stop-the-world generational garbage collector. When a GC is triggered:
- All Capabilities are synchronized (each thread reaches a safe point)
- GC runs, scanning all Capability-local allocation areas
- Capabilities are released and threads resume
Critically, GC only synchronizes threads that hold Capabilities. Our OpenMP workers do not hold Capabilities during parallel execution — they are invisible to the GC. This is why OpenMP compute kernels are not paused by Haskell garbage collection (Section 6.4).
STG Machine Registers¶
GHC compiles Haskell to STG (Spineless Tagless G-machine) code, which uses a set of virtual registers mapped to hardware registers:
| Register | x86-64 | Purpose |
|---|---|---|
| BaseReg | `%r13` | Pointer to the current Capability |
| Sp | `%rbp` | STG stack pointer |
| Hp | `%r12` | Heap allocation pointer |
| R1 | `%rbx` | First argument / return value |
| R2–R6 | `%r14`, `%rsi`, `%rdi`, `%r8`, `%r9` | Arguments |
| SpLim | `%r15` | Stack limit |
These registers are caller-saved with respect to C calls. Every
foreign import ccall must save them before and restore them after the
C function. This is the source of the NCG overhead analyzed in
Appendix A.3: the NCG
saves/restores these registers inside the loop, while the LLVM backend
hoists them outside.
Environment: NixOS, GHC 9.10.3, GCC 15.2.0, Intel i7-10750H (6C/12T). Source code: ghc-openmp repository. February 2026.
Appendix A.5: Sense-Reversing Barrier¶
Source: spin_barrier_wait, spin_barrier_wait_tasks
The runtime's barrier is a centralized sense-reversing barrier from Mellor-Crummey & Scott ("Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS 9(1), 1991).
Data Structure¶
typedef struct {
atomic_int count; /* threads remaining (decremented by arrivals) */
atomic_int sense; /* global sense flag (flipped by last arrival) */
int size; /* team size (reset value for count) */
} spin_barrier_t;
Each thread also maintains a thread-local int local_sense, initially 0.
Algorithm¶
- Arrive: The thread flips its `local_sense` (`1 - local_sense`), then atomically decrements `count` with `memory_order_acq_rel`.
- Last thread (decrement returns 1): Resets `count` to `size` (relaxed store), then flips the global `sense` to match `local_sense` (release store). This releases all waiting threads.
- Other threads: Spin-wait until the global `sense` matches their `local_sense` (acquire load).
The sense-reversing trick avoids the resetting race in naive barriers: because each phase uses the opposite sense value, threads that haven't yet left a barrier cannot be confused with threads entering the next one.
Hybrid Spin-Wait¶
Spinning uses a three-tier strategy controlled by g_spin_iters
(configurable via OMP_WAIT_POLICY):
- Spin with `pause`: For the first `g_spin_iters` iterations (~4000 by default), execute `_mm_pause()` to reduce pipeline contention and save power on x86.
- `sched_yield()`: After the spin threshold, yield the CPU to other threads. This avoids wasting cycles during longer waits.
- Condvar fallback: The worker pool's generation-counter wait (not the barrier itself) uses `pthread_cond_wait` for idle periods between parallel regions.
The OMP_WAIT_POLICY environment variable controls aggressiveness:
active sets 10000 spin iterations, passive sets 100.
Task-Stealing Variant¶
spin_barrier_wait_tasks
extends the barrier with work stealing during the spin-wait phase:
- While waiting: If `g_tasks_pending > 0`, attempt to steal and execute a task from another thread's queue. Reset the spin counter after productive work.
- Last thread: Before releasing the barrier, drain all remaining tasks (spin on `g_tasks_pending` until zero). This ensures no tasks are lost when GCC omits explicit `GOMP_barrier` calls after `#pragma omp single`.
The g_tasks_pending fast-path check (acquire load of a single atomic)
avoids the cost of scanning per-thread queues when no tasks exist. This
is critical for the common case where barriers are used without tasks.
Appendix A.6: Zero-Copy FFI with Pinned ByteArray¶
Source: HsZeroCopy.hs
The standard FFI pattern (allocaArray + peekElemOff/pokeElemOff) boxes
every element as CDouble and converts via realToFrac, adding overhead at
the Haskell↔OpenMP boundary:
-- Boxed: every element goes through CDouble
a <- peekElemOff pA (i * n + k) -- returns boxed CDouble
b <- peekElemOff pB (k * n + j) -- returns boxed CDouble
go (acc + realToFrac a * realToFrac b) (k + 1) -- 4 box/unbox ops
Using pinned ByteArray# with unboxed primops eliminates this overhead:
-- Unboxed: Double# throughout, no boxing
case readDoubleArray# mbaA (i *# n +# k) s of
(# s', a #) -> -- a :: Double# (unboxed)
case readDoubleArray# mbaB (k *# n +# j) s' of
(# s'', b #) -> -- b :: Double# (unboxed)
goK s'' i j (k +# 1#) (acc +## (a *## b))
The pinned ByteArray is passed to C via mutableByteArrayContents#, which
returns a raw Addr# — zero-copy, no marshalling. touch# keeps the
ByteArray alive during the C call.
Benchmark¶
Haskell sequential DGEMM inner loop, pinned ByteArray with unboxed primops vs standard boxed FFI:
| N | Boxed (ms) | Unboxed (ms) | Speedup |
|---|---|---|---|
| 256 | 57.3 | 53.4 | 1.07x |
| 512 | 457.9 | 384.6 | 1.19x |
The 19% improvement at N=512 comes from eliminating CDouble boxing in the
O(n³) inner loop. -ddump-simpl confirms the hot loop uses +##, *##, and
readDoubleArray# with no D# constructor.
Appendix A.7: Linear Typed Arrays¶
Source: Data/Array/Linear.hs, HsLinearDemo.hs
GHC's -XLinearTypes extension can enforce exclusive ownership
of mutable array regions at compile time. The design is inspired by
konn/linear-extra's Borrowable
SArray, which uses phantom-tagged tokens for zero-copy split/combine. A
self-contained ~200-line module (Data.Array.Linear) extracts the core pattern
and integrates it with the unboxed primops from
Appendix A.6
(readDoubleArray#/writeDoubleArray# instead of Storable).
The core idea: linearity is on tokens (RW s), not the array itself. You
need the right token to read/write, and split/combine tracks disjoint
ownership at the type level.
Design¶
-- Linear token: proves exclusive access to region s
data RW s where MkRW :: RW s
-- Array with phantom region (NOT linear — tokens enforce access)
data DArray s = DArray !Int !(MutableByteArray# RealWorld) !Int -- len, buf, offset
-- Split witness: proves l and r came from splitting s
data SlicesTo s l r where MkSlicesTo :: SlicesTo s l r
-- Operations consume and return tokens
unsafeRead :: RW s %1 -> DArray s -> Int -> (Double, RW s)
unsafeWrite :: RW s %1 -> DArray s -> Int -> Double -> RW s
split :: RW s %1 -> Int -> DArray s -> Slice s
combine :: SlicesTo s l r %1 -> RW l %1 -> RW r %1 -> RW s
The split operation is zero-copy — both halves share the same underlying
MutableByteArray#, just with different offset/length views. No allocation,
no copying, just arithmetic on the offset field.
Type-Safe Row-Partitioned DGEMM¶
case split rwC (half * n) arrC of
MkSlice st rwTop rwBot cTop cBot ->
let rwTop' = linearDgemm rwTop n 0 half arrA arrB cTop
rwBot' = linearDgemm rwBot n half half arrA arrB cBot
rwC' = combine st rwTop' rwBot'
in unsafeRead rwC' arrC 0 -- can only read after recombining
This is the Haskell-side type encoding of what OpenMP does at runtime with
#pragma omp parallel for schedule(static) — partitioning the output matrix
into disjoint row blocks. The type system statically guarantees:
- No two computations can write to the same rows
- The original array cannot be accessed while split
- All slices must be recombined before reading results
C FFI Integration¶
The `unsafeWithPtr` function passes pinned `ByteArray` data directly to C. It
threads through `IO` properly (unlike `runRW#`-based approaches that create
independent state threads), ensuring C side effects are visible to subsequent
Haskell reads.
Implementation Notes¶
- `unsafeDupablePerformIO` over `runRW#`: Element access uses `unsafeDupablePerformIO` with `{-# NOINLINE #-}` to ensure writes from one operation are visible to subsequent reads. `runRW#` creates independent state threads that GHC can reorder under `-O2`.
- Bang patterns for sequencing: In lazy contexts, token-producing operations must be forced with `!` to ensure their side effects execute.
- `parCombine` for GHC sparks: A parallel variant of `combine` that sparks the left token for parallel evaluation using `spark#`/`seq#`. Uses `unsafePerformIO` (which includes `noDuplicate#`) to prevent thunk duplication — essential when tokens are produced by destructive array operations.
- `unsafeCoerce#` bridges linear types with non-linear GHC primitives (the same approach as konn/linear-extra's `Unsafe.toLinear`).
- Self-contained: ~280 lines, no dependencies beyond `base` and `ghc-prim`.