Why Not Process-Level Snapshots?¶
The original ghc-fastboot design envisioned process-level snapshots: dump the entire process memory (heap, stack, RTS state) and restore it via mmap, bypassing hs_init() entirely. This was implemented (capture.c / restore.c) and worked for trivial cases but was ultimately abandoned.
The RTS Shutdown Tax¶
GHC's RTS interval timer (the -V flag, default 20ms tick) adds ~12ms to hs_exit(), the shutdown path. For a program with 2.6ms of actual work, the measured wall-clock time is 14.6ms, giving the false impression that startup is slow. The overhead is deceptive: benchmarking without -V0 makes every optimization look ineffective. The fix is -with-rtsopts=-V0, which disables the timer and brings hs_exit() down to ~0.1ms.
This observation was critical to the project. What appeared to be a "startup problem" was partly a "shutdown problem" that masked the real performance. With -V0, closure-level freeze/thaw alone achieves 2.6ms total — making process-level snapshots unnecessary for the startup use case.
Fundamental Obstacles¶
Process-level snapshots face obstacles that cannot practically be worked around:
1. TLS / Stack Canaries¶
glibc uses PTR_MANGLE to encrypt function pointers stored in jmp_buf and thread-local storage. The mangling key is per-process, randomized at execve() time, and stored in TLS at a fixed %fs offset. A restored process has a different key — every glibc function that uses longjmp, atexit, or signal handlers will crash or silently corrupt state.
2. RTS Internal State¶
The RTS maintains thread-local Capability structures, signal handler registrations, timer state, and I/O manager state. Restoring these correctly requires intimate knowledge of every RTS subsystem; the result is fragile and version-dependent.
3. Kernel State¶
File descriptors, signal dispositions, process IDs, memory mappings visible to the kernel — none of these survive a snapshot/restore across execve() boundaries.
4. Diminishing Returns¶
With -V0, hs_init() + hs_exit() costs ~1.6ms. Process-level snapshots can't reduce this below ~0.5ms (the absolute minimum for execve() + dynamic linker + stack setup). The potential savings are at most ~1.1ms — not worth the complexity.
Closure-Level Wins¶
Closure-level freeze/thaw sidesteps all of these: it operates entirely within the Haskell heap, above the RTS. No kernel state, no TLS, no glibc internals. The RTS initializes normally, then the frozen data is mapped in. Clean separation.
Measurement Pitfalls¶
The -V0 Lesson¶
GHC's RTS interval timer (controlled by the -V flag) defaults to a 20ms tick used for context switching and profiling. Timer setup and teardown during hs_init() / hs_exit() adds 10-12ms of overhead, purely shutdown cost, not startup. Without -V0, a program that takes 2.6ms to run appears to take 14.6ms.
This led to a wasted week investigating "slow startup" that was actually slow shutdown. The fix is trivial: -with-rtsopts=-V0 in the cabal file for all benchmark executables. But the lesson generalizes: always isolate what you're measuring. hyperfine -N (no shell intermediary) measures wall-clock time, which includes process lifecycle overhead that may have nothing to do with the code under test.
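A minimal sketch of the cabal stanza, with a hypothetical executable name and module:

```cabal
executable bench-startup
  main-is:       Main.hs
  build-depends: base
  -- Disable the RTS interval timer; otherwise hs_exit() pays the
  -- ~12ms shutdown tax and every measurement is inflated.
  ghc-options:   -with-rtsopts=-V0
```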
Benchmark Methodology¶
All performance numbers in this project use:
- nix build (not cabal run, which rebuilds and adds overhead)
- -with-rtsopts=-V0 on all benchmark executables
- hyperfine -N --warmup 3 for sub-5ms measurements
- perf stat -e page-faults to distinguish I/O from computation
- Internal clock_gettime(CLOCK_MONOTONIC) timestamps for phase-level breakdown