Skip to content

Appendix A.6: Zero-Copy FFI with Pinned ByteArray

Source: HsZeroCopy.hs

The standard FFI pattern (allocaArray + peekElemOff/pokeElemOff) boxes every element as CDouble and converts via realToFrac, adding overhead at the Haskell↔OpenMP boundary:

-- Boxed: every element goes through CDouble
a <- peekElemOff pA (i * n + k)    -- returns boxed CDouble
b <- peekElemOff pB (k * n + j)    -- returns boxed CDouble
go (acc + realToFrac a * realToFrac b) (k + 1)  -- 4 box/unbox ops

Using pinned ByteArray# with unboxed primops eliminates this overhead:

-- Unboxed: Double# throughout, no boxing
case readDoubleArray# mbaA (i *# n +# k) s of
    (# s', a #) ->     -- a :: Double# (unboxed)
        case readDoubleArray# mbaB (k *# n +# j) s' of
            (# s'', b #) ->     -- b :: Double# (unboxed)
                goK s'' i j (k +# 1#) (acc +## (a *## b))

The pinned ByteArray is passed to C via mutableByteArrayContents#, which returns a raw Addr# — zero-copy, no marshalling. touch# keeps the ByteArray alive during the C call.

Benchmark

Haskell sequential DGEMM inner loop, pinned ByteArray with unboxed primops vs standard boxed FFI:

N Boxed (ms) Unboxed (ms) Speedup
256 57.3 53.4 1.07x
512 457.9 384.6 1.19x

The 19% improvement at N=512 comes from eliminating CDouble boxing in the O(n³) inner loop. -ddump-simpl confirms the hot loop uses +##, *##, and readDoubleArray# with no D# constructor.