
5. Optimization: From 24x Slower to Parity

The Phase 2 runtime was functional but slow: a fork/join cycle took 24 µs against libgomp's 1 µs. The bottleneck was a mutex+condvar handshake on every operation. We eliminated all locks from the hot path:

5.1. Lock-free Work Dispatch

// Master: store the work, then publish it with a release increment
// of the generation counter
g_pool.fn = fn;
g_pool.data = data;
atomic_fetch_add(&g_pool.generation, 1, memory_order_release);

// Worker: spin on the generation counter with an acquire load
while (atomic_load(&g_pool.generation, memory_order_acquire) == my_gen)
    _mm_pause();

No mutex on the hot path. The condvar broadcast is only for workers that fell asleep after 4000 spin iterations.

5.2. Sense-Reversing Barrier

The sense-reversing centralized barrier follows Mellor-Crummey & Scott's algorithm ("Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS 9(1), 1991):

void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
    *local_sense = 1 - *local_sense;
    if (atomic_fetch_sub(&b->count, 1, memory_order_acq_rel) == 1) {
        // Last thread: reset counter, flip global sense
        atomic_store(&b->count, b->size, memory_order_relaxed);
        atomic_store(&b->sense, *local_sense, memory_order_release);
    } else {
        // Spin until sense matches
        while (atomic_load(&b->sense, memory_order_acquire) != *local_sense)
            _mm_pause();
    }
}

Pure atomic operations, no locks. The centralized design has O(N) wakeup but is optimal for small team sizes (typical OpenMP use).

5.3. Results

Metric (4 threads)      Phase 2     Phase 3    Native libgomp
Fork/join               24.35 µs    0.81 µs    0.97 µs
Barrier                 7.01 µs     0.25 µs    0.51 µs
Parallel for (1M sin)   6.71 ms     3.91 ms    3.85 ms
Critical section        0.39 ms     0.38 ms    0.92 ms

After optimization, the RTS-backed runtime matches or beats native libgomp on all benchmarks.

Optimization Journey: Phase 2 → Phase 3

With parity established, the runtime becomes a platform. The following sections build the capabilities (FFI integration, GC isolation, zero-copy sharing, linear types) that culminate in Haskell and OpenMP collaborating on shared data structures (§8).