8. Shared Memory Demos¶
Source: HsSharedMem1.hs, HsSharedMem2.hs, HsSharedMem3.hs, HsSharedMem4.hs, HsSharedMem5.hs, omp_shared.c
Everything in this project builds to this: Haskell and OpenMP C code operating on the same data, concurrently, with type-level proof that their access patterns are safe. The unified runtime (§6) makes shared memory possible; zero-copy FFI (§A.6) makes it practical; linear types (§A.7) make it correct by construction.
Five demos show this progression — from sequential handoff, through
defensive synchronization, to compile-time proof of disjoint access,
safety guarantees, and finally pure Haskell parallelism via GHC sparks.
All use the same workload: element-wise
f(x) = sin(x) * cos(x) + sqrt(|x|), applied by both Haskell and C/OpenMP
to portions of a shared array.
Demo 1: The basic pattern works¶
Sequential handoff — Haskell fills, C transforms, Haskell reads:
arrIn <- newPinnedDoubles n -- pinned: not moved by GC
arrOut <- newPinnedDoubles n
forM_ [0..n-1] $ \i -> writeD arrIn i (fromIntegral i * 0.001)
c_transform_all (ptrOf arrIn) (ptrOf arrOut) (fromIntegral n)
s <- sumArray arrOut n -- read back in Haskell
No concurrent access, no synchronization. This establishes that the
plumbing works: a pinned ByteArray created in Haskell is directly
readable and writable by C/OpenMP code via mutableByteArrayContents#.
Demo 2: The problem — defensive synchronization¶
Haskell and C/OpenMP each process a disjoint half of the output array.
Without compile-time proof of disjointness, a GOMP_barrier() is needed
for memory visibility — even though the regions never actually overlap:
hsTransformRange arrIn arrOut 0 half -- Haskell: [0, half)
c_transform_range_barrier pIn pOut half (n - half) -- C: [half, n) + barrier
This "defensive synchronization" is the cost of not having type-level guarantees about disjoint access. The barrier is cheap (~0.3 us), but it represents a fundamental limitation: the programmer must manually reason about which regions are disjoint, and the compiler cannot help.
Demo 3: The solution — linear types eliminate barriers¶
Same partition, but using Data.Array.Linear's split/combine to prove
disjointness at the type level:
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = linearTransform rwL arrIn arrL -- Haskell: left half
!() = unsafePerformIO $ -- C: right half
unsafeWithPtr arrOut $ \pOut ->
unsafeWithPtr arrIn $ \pIn ->
c_transform_range pIn pOut
(fromIntegral half) (fromIntegral (n - half))
in combine st rwL' rwR -- no barrier
The type system enforces:
rwLgrants exclusive write access to the left halfrwRgrants exclusive access to the right half — consumed bycombine- No Haskell code can use
rwRto accessarrRwhile C is processing it combineis zero-cost (no allocation, no copying)
Demo 4: Safety — why barriers exist¶
Two examples showing that removing barriers (nowait) without proof of
disjointness silently introduces bugs.
Part A — Off-by-one overlap. Each partition writes [off..off+chunk+1)
instead of [off..off+chunk). Boundary elements are written by two
partitions. With nowait, this is a data race; with barriers, it is
deterministic but still wrong (double-write at boundaries). With linear
split, overlapping ranges are impossible — the type system rejects them:
| Variant | N=10K, P=4 | N=100K, P=16 | N=1M, P=64 |
|---|---|---|---|
| Disjoint (correct) | 0.00 | 0.00 | 0.00 |
| Overlap + barrier | 3.06 | 9.26 | 31.8 |
| Overlap + nowait | 3.06 | 9.26 | 31.8 |
Part B — Two-pass stencil. Pass 1 writes out[i] = f(in[i])
(independent per element); pass 2 reads neighbors
out[i] = avg(out[i-1..i+1]) (crosses partition boundaries). Without
a barrier between passes, pass 2 reads stale data. With linear types,
combine forces pass 1 to complete before pass 2 can access the parent
token needed for cross-boundary reads:
let rw1 = linearMultiPass1 rw arrIn arrOut 0 numParts -- partitioned pass 1
-- combine happened inside — rw1 is the parent token
in linearPass2 rw1 arrOut -- pass 2 needs parent
| Variant | N=10K | N=100K | N=1M |
|---|---|---|---|
| C nowait (worst of 10 runs) | 2.68e-2 | 1.09e-2 | 4.44e-3 |
| Haskell linear vs ref | 0.00 | 0.00 | 0.00 |
The C nowait version produces wrong results (stale reads at partition
boundaries). The Haskell linear version is both correct and barrier-free.
Demo 5: GHC spark parallelism¶
Demos 2–4 use C/OpenMP for the parallel half. Demo 5 shows that the same
split/combine pattern works for pure Haskell parallelism via GHC sparks.
parCombine replaces combine — it sparks the left partition and evaluates
the right on the current thread, using spark#/seq# with noDuplicate#
for safety:
parPartition rw arrIn arrOut base depth =
case halve rw arrOut of
MkSlice st rwL rwR arrL arrR ->
let rwL' = parPartition rwL arrIn arrL base (depth - 1)
rwR' = parPartition rwR arrIn arrR (base + size arrL) (depth - 1)
in parCombine st rwL' rwR' -- spark left, eval right
Spark scaling (N=1,000,000, 4 capabilities):
| Depth | Partitions | Sequential | Parallel | Speedup |
|---|---|---|---|---|
| 0 | 1 | 41.9 ms | 43.6 ms | 0.96x |
| 1 | 2 | 44.4 ms | 22.3 ms | 1.99x |
| 2 | 4 | 42.5 ms | 13.8 ms | 3.09x |
| 3 | 8 | 42.1 ms | 15.0 ms | 2.80x |
| 5 | 32 | 46.1 ms | 13.5 ms | 3.41x |
| 6 | 64 | 45.7 ms | 13.7 ms | 3.34x |
Near-ideal 2x at 2 partitions, ~3x at 4+ partitions on 4 cores. No C/OpenMP involved — parallelism is GHC's work-stealing scheduler.
Results (Demos 1–3, 4 threads, i7-10750H)¶
Barrier vs linear — iteration loop:
| N | Iters | With barrier | Linear | Saved |
|---|---|---|---|---|
| 10,000 | 1000 | 202.5 ms | 194.2 ms | 4.1% |
| 100,000 | 100 | 231.9 ms | 221.7 ms | 4.4% |
| 1,000,000 | 10 | 224.6 ms | 221.8 ms | 1.3% |
Partition scaling (N=1,000,000) — linear types:
| Partitions | Time (ms) |
|---|---|
| 2 | 39.0 |
| 4 | 39.8 |
| 8 | 38.9 |
| 16 | 38.3 |
| 32 | 38.4 |
Scaling is flat: zero barriers regardless of partition count, because
split/combine is pure arithmetic on offset/length views into the same
underlying buffer.
Interpretation¶
The modest 1–4% improvement in Demo 3 vs Demo 2 confirms that barrier elimination is not the point — the barrier was already fast. The point is correctness and composability:
-
Safety (Demo 4): The type checker rejects programs that access overlapping regions (Part A) and enforces ordering between passes that share data across partition boundaries (Part B). These are real bugs that
nowaitintroduces silently — linear types catch them at compile time. -
Zero-cost abstraction:
split/combineinvolves no allocation, copying, or runtime checks. The tokens (RW s) are erased at runtime. The generated code is identical to the unsafe version minus the barrier. -
Composable parallelism (Demo 5): The same
split/combinepattern works with GHC sparks (parCombine) for pure Haskell parallelism, achieving near-ideal scaling. The programmer writes the same code regardless of whether parallelism comes from OpenMP (C FFI) or GHC sparks — linear types guarantee safety in both cases.