3. Background

3.1. GHC RTS Capabilities

A Capability is GHC's central execution unit: it is driven by one OS thread at a time and owns one run queue of lightweight Haskell threads (TSOs) and one work-stealing spark pool. The number of Capabilities is set with the +RTS -N flag (for example, +RTS -N4 creates four). Each Capability has a 0-indexed number (cap->no), which maps directly onto the value returned by OpenMP's omp_get_thread_num().

Key RTS APIs for embedding:

hs_init_ghc(&argc, &argv, conf);       // Boot the RTS
Capability *cap = rts_lock();           // Acquire a Capability
rts_unlock(cap);                        // Release it
rts_setInCallCapability(i, 1);          // Route this OS thread's calls into Haskell to Capability i (1 = also set CPU affinity)
uint32_t getNumCapabilities(void);      // Current Capability count
uint32_t getNumberOfProcessors(void);   // CPU count

hs_init_ghc() is reference-counted: calling it when the RTS is already running (as in a Haskell host program) simply increments the counter and returns. This is the key to transparent interop: our runtime makes the same initialization call in every case, which in effect auto-detects whether it is being hosted by a C program (the call boots the RTS) or a Haskell program (the call only bumps the counter).
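The reference-counting behaviour described above can be modelled in a few lines of plain C. This is only a sketch of the pattern, not GHC's code: model_init, model_exit, and both counters are hypothetical names, the model is not thread-safe, and it assumes that shutdown mirrors start-up by decrementing the same counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: none of these names exist in the GHC RTS. */
static int model_init_count = 0;  /* callers currently holding the runtime */
static int model_boot_count = 0;  /* times the expensive boot actually ran */

/* Returns true if this call booted the runtime, false if it was already up. */
static bool model_init(void)
{
    if (model_init_count++ > 0) {
        return false;             /* already running: only count the new user */
    }
    model_boot_count++;           /* first caller: do the real start-up work */
    return true;
}

/* Returns true if this call was the last one and shut the runtime down. */
static bool model_exit(void)
{
    assert(model_init_count > 0); /* exit without a matching init is a bug */
    return --model_init_count == 0;
}
```

In this model, a C host corresponds to the library's model_init() returning true (it booted the runtime), while a Haskell host corresponds to it returning false (the runtime was already counted once before the library was loaded).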

3.2. The libgomp ABI

Source: ghc_omp_runtime_rts.c

GCC transforms OpenMP pragmas into calls to GOMP_* functions. For example:

#pragma omp parallel
{ body; }

// becomes:
void outlined_fn(void *data) { body; }
GOMP_parallel(outlined_fn, &data, num_threads, flags);

A minimum viable runtime needs only 9 symbols (GOMP_parallel, GOMP_barrier, GOMP_critical_start/end, GOMP_single_start, GOMP_task, GOMP_taskwait, omp_get_num_threads, omp_get_thread_num). Full OpenMP 4.5 coverage requires ~85 symbols. Our implementation provides ~97.
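To make the outlining contract concrete, here is a deliberately tiny toy runtime covering three of those entry points. It is a sketch under loud assumptions: every toy_* name is hypothetical, toy_parallel only borrows the parameter list of GOMP_parallel, and team members run one after another on the calling thread instead of concurrently, so a barrier inside the region could not work here.

```c
#include <assert.h>

#define TOY_MAX_THREADS 8

/* Hypothetical toy runtime state: the toy_* names are illustrative only. */
static unsigned toy_thread_num = 0;  /* id of the team member currently running */
static unsigned toy_team_size  = 1;  /* size of the current team */

static int toy_get_thread_num(void)  { return (int)toy_thread_num; }
static int toy_get_num_threads(void) { return (int)toy_team_size; }

/* Same parameter list as GOMP_parallel; flags are ignored.
 * ASSUMPTION: members run sequentially, which a real runtime would not do. */
static void toy_parallel(void (*fn)(void *), void *data,
                         unsigned num_threads, unsigned flags)
{
    (void)flags;
    if (num_threads == 0 || num_threads > TOY_MAX_THREADS)
        num_threads = TOY_MAX_THREADS;   /* 0 means "runtime chooses" */
    toy_team_size = num_threads;
    for (unsigned i = 0; i < num_threads; i++) {
        toy_thread_num = i;              /* each member sees its own id */
        fn(data);                        /* run the outlined region body */
    }
    toy_thread_num = 0;                  /* back to the serial part */
    toy_team_size  = 1;
}

/* Shared variables of the region, passed by pointer as the compiler would. */
typedef struct { int *hits; int *team; } RegionData;

/* Outlined body of a region that does:
 *   hits[omp_get_thread_num()]++; team = omp_get_num_threads(); */
static void outlined_fn(void *data)
{
    RegionData *d = data;
    d->hits[toy_get_thread_num()] += 1;
    *d->team = toy_get_num_threads();
}

/* Runs the region with n members; returns how many distinct ids ran it once. */
static int demo_region(unsigned n)
{
    int hits[TOY_MAX_THREADS] = {0};
    int team = 0;
    RegionData d = { hits, &team };
    toy_parallel(outlined_fn, &d, n, 0);
    int distinct = 0;
    for (int i = 0; i < TOY_MAX_THREADS; i++)
        if (hits[i] == 1) distinct++;
    assert(distinct == team);            /* every member ran exactly once */
    return distinct;
}
```

Calling demo_region(4) runs the outlined body four times with thread numbers 0 through 3 and returns 4. The point of the sketch is the shape of the interface: the region body becomes a function of one void pointer, and the runtime decides how many times, and with which thread identity, to invoke it.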

3.3. Cmm and foreign import prim

Source: omp_prims.cmm

Cmm (C minus minus) is GHC's low-level intermediate representation — a portable assembly language that sits between STG and native code. GHC compiles all Haskell to Cmm before generating machine code.

GHC offers three ways to call foreign code (ccall in its safe and unsafe variants, plus prim), each with a different per-call overhead:

| Convention | Mechanism | Overhead per call |
|---|---|---|
| foreign import ccall safe | Releases Capability, calls C, reacquires | ~68 ns |
| foreign import ccall unsafe | Saves STG registers, calls C, restores | ~2 ns |
| foreign import prim | Direct STG register passing, no boundary | ~0 ns |

The prim convention is the fastest: arguments pass directly in GHC's STG registers (R1, R2, ...) with no calling convention switch. Functions written in Cmm can access RTS internals like MyCapability() directly. When a prim import is declared with a pure type (no IO or State# token), GHC treats calls to it as pure expressions and can hoist, share, or eliminate them entirely (loop-invariant code motion, common subexpression elimination).
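To put the per-call figures from the table in perspective, the following back-of-the-envelope helper multiplies them out. The constants are copied from the table; the function name and the one-million-call scenario are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Per-call overheads in nanoseconds, copied from the table above. */
enum { SAFE_NS = 68, UNSAFE_NS = 2, PRIM_NS = 0 };

/* Total boundary-crossing cost, in microseconds, for `calls` foreign calls.
 * Hypothetical helper: it models overhead only, not the work done per call. */
static double ffi_overhead_us(long calls, int per_call_ns)
{
    return (double)calls * per_call_ns / 1000.0;
}
```

For one million calls this gives 68,000 µs (68 ms) for safe, 2,000 µs (2 ms) for unsafe, and 0 for prim, which is why the choice of convention matters most for small functions called in hot loops.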

The inline-cmm library lets you embed Cmm code directly in Haskell modules via a [cmm| ... |] quasiquoter (similar to inline-c for C code). It automatically generates the foreign import prim declaration and compiles the Cmm via Template Haskell.