6. Haskell Integration¶
This section covers the integration between Haskell and the OpenMP runtime: calling conventions, initialization, concurrent execution, garbage collection behavior, and bidirectional callbacks.
See also the Haddock API reference for the GHC.OpenMP module.
6.1. FFI Calling Convention¶
Source: HsMain.hs
Haskell calls OpenMP C code via foreign import ccall safe:
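The actual import from HsMain.hs is not reproduced here; the following is a minimal runnable sketch of the convention, using libm's `sin` as a stand-in for the OpenMP-backed entry point (which in this project is something like `c_parallel_sinsum`):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble (..))

-- Stand-in for the real import, which targets the OpenMP entry point.
-- The "safe" keyword is what releases the Capability around the call.
foreign import ccall safe "math.h sin"
  c_sin :: CDouble -> IO CDouble

main :: IO ()
main = do
  r <- c_sin 0
  print r
```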
The safe keyword is critical: it tells GHC to release the
calling Capability before entering the foreign code, and reacquire it on
return. This means:
- Other Haskell green threads can run on the released Capability
- The C code enters GOMP_parallel, which dispatches to the worker pool, potentially including the Capability just released
- No deadlock: workers don't need to hold Capabilities to execute C compute kernels
6.2. RTS Initialization¶
When called from a Haskell host, RTS initialization has already been
performed by GHC before main runs. Our runtime's ensure_rts() calls
hs_init_ghc() again, which simply increments a reference count
and returns. The runtime then discovers the existing Capabilities via
getNumCapabilities() and spawns workers for Caps 1..N-1.
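The same discovery can be observed from the Haskell side; a minimal sketch using the Haskell-level counterpart of the RTS call (the C-side ensure_rts itself is not shown):

```haskell
import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
  -- The C runtime reads the same value through the RTS API and
  -- spawns workers for Caps 1..n-1; Cap 0 stays with the caller.
  n <- getNumCapabilities
  putStrLn ("Capabilities: " ++ show n)
```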
6.3. Concurrent Execution¶
Source: HsConcurrent.hs
hsDone  <- newEmptyMVar
ompDone <- newEmptyMVar
-- Haskell green thread: pure computation
_ <- forkIO $ do
  let !result = haskellSinSum 1200000
  putMVar hsDone result
-- OpenMP FFI call (safe: releases the Capability)
_ <- forkIO $ do
  result <- c_parallel_sinsum 12000000
  putMVar ompDone result
-- Both run simultaneously!
hsResult  <- takeMVar hsDone
ompResult <- takeMVar ompDone
Measured: sequential 68ms → concurrent 58ms, with 10ms of overlapping execution confirmed.
6.4. Garbage Collection Isolation¶
Source: HsGCStress.hs
A key concern: GHC's stop-the-world GC pauses all threads holding Capabilities. Would this stall OpenMP workers?
Answer: No. OpenMP workers do not hold Capabilities during parallel execution. After their initial
rts_lock()/rts_unlock() registration, they are plain OS threads spinning on atomic variables. GC only synchronizes Capability-holding threads; our workers are invisible to it.
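The isolation can be exercised in miniature with a self-contained sketch: one thread loops over safe foreign calls (libm's `sin` standing in for an OpenMP kernel) while the main thread forces major GCs, as in the validation runs below:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_, void)
import Foreign.C.Types (CDouble (..))
import System.Mem (performGC)

-- Safe call: the Capability is released for the duration of the call.
foreign import ccall safe "math.h sin"
  c_sin :: CDouble -> IO CDouble

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    -- While inside c_sin this thread holds no Capability,
    -- so the stop-the-world GC does not wait for it.
    forM_ [1 .. 10000 :: Int] $ \i -> void (c_sin (fromIntegral i))
    putMVar done ()
  replicateM_ 20 performGC   -- forced major GCs, as in the experiment
  takeMVar done
  putStrLn "ok"
```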
Experimental Validation¶
We ran 500 OpenMP parallel regions (each ~400us) concurrently with increasing levels of Haskell GC pressure:
| Scenario | p50 (us) | p99 (us) | max (us) |
|---|---|---|---|
| Baseline (OpenMP alone) | 314–478 | 636–658 | 692–783 |
| + allocation pressure (50K rounds) | 313–543 | 538–651 | 585–691 |
| + forced major GC (20 × performGC) | 315–556 | 549–744 | 574–2262 |
Allocation pressure has negligible impact (within noise). Forced major GCs produced one outlier spike of 2262us on one run and none on another. The spike correlates with the GHC RTS reporting a 1.6ms max GC pause — likely the OS thread making the FFI call had its Capability briefly paused at a region boundary.
GHC RTS statistics: 99.7% productivity, GC time <0.5% of elapsed.
6.5. Bidirectional Callbacks¶
Source: HsCallback.hs, omp_compute.c
The previous sections demonstrated Haskell calling OpenMP. OpenMP workers can also call back into Haskell from within a parallel region.
Mechanism¶
Haskell creates a FunPtr via
foreign import ccall "wrapper":
foreign import ccall "wrapper"
mkCallback :: (CInt -> IO CDouble)
-> IO (FunPtr (CInt -> IO CDouble))
sinCb <- mkCallback (\i -> return (sin (fromIntegral i * 0.001)))
GHC generates a C stub that wraps the Haskell closure with automatic Capability management:
// Generated wrapper (simplified):
CDouble wrapper(CInt arg) {
    Capability *cap = rts_lock();            // acquire a Capability
    CDouble result = /* evaluate the Haskell closure */;
    rts_unlock(cap);                         // release it
    return result;
}
The C code calls this FunPtr from inside an OpenMP parallel for:
double parallel_reduce_callback(hs_callback_t callback, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++)
        sum += callback(i);  // each worker calls into Haskell
    return sum;
}
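The round trip can be exercised without OpenMP at all: create the FunPtr with "wrapper", call it through a "dynamic" import (a hypothetical `callCb`, standing in for the C worker's indirect call), and release it. A self-contained sketch:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble (..), CInt (..))
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

foreign import ccall "wrapper"
  mkCallback :: (CInt -> IO CDouble) -> IO (FunPtr (CInt -> IO CDouble))

-- Stand-in for the C side: "dynamic" turns a FunPtr back into a
-- callable function, just as C invokes it through the pointer.
foreign import ccall "dynamic"
  callCb :: FunPtr (CInt -> IO CDouble) -> CInt -> IO CDouble

main :: IO ()
main = do
  cb <- mkCallback (\i -> return (sin (fromIntegral i * 0.001)))
  -- Every invocation goes through the generated rts_lock/rts_unlock stub.
  s <- sum <$> mapM (callCb cb) [0 .. 999]
  freeHaskellFunPtr cb          -- FunPtrs must be freed explicitly
  print (s > 0)                 -- True
```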
Correctness¶
All results verified against pure C and pure Haskell reference implementations:
| Test | Result | Status |
|---|---|---|
| parallel_map (1000 sin values) | Element-wise match to 1e-10 | OK |
| parallel_reduce (100K sin sum) | 1839.343386 (matches pure C) | OK |
| polynomial callback (10K) | 1109840.005000 (matches Haskell) | OK |
Performance¶
| Threads | Pure C (ms) | Callback (ms) | Overhead | Per-callback |
|---|---|---|---|---|
| 1 | 1.69 | 46.60 | 27.6x | ~0.47 us |
| 2 | 1.17 | 60.43 | 51.8x | ~0.60 us |
| 4 | 0.71 | 57.91 | 82.1x | ~0.58 us |
The per-callback cost of ~0.5us is the rts_lock()/rts_unlock()
round-trip. This is constant regardless of what the Haskell function does.
For callbacks that perform milliseconds of work (e.g., looking up a Haskell
data structure, evaluating a complex expression), the overhead is negligible.
For tight inner loops like 100K trivial sin() calls, pure C should be used
instead.
Practical guideline: Use Haskell callbacks when each invocation does ≥100us of work. Below that, the
rts_lock/rts_unlock overhead dominates. Structure code so that OpenMP handles the hot numerical loop in C, and calls Haskell for complex logic at coarser granularity.