Shared Anonymous Pages

Shared anonymous pages on Linux conflate two desirable types of sharing: aliasing memory within a process and aliasing memory between processes. Both are useful, but as far as I can tell it is not currently possible to choose only one or the other. Unfortunately, fixing this is hard.

First, for orientation, the heap of a program on Linux is backed by what is known as “private anonymous pages”. The heap is private in the sense that its contents are not shared with other processes, including the parent and any children. Memory used in a program’s heap is anonymous in the sense that it doesn’t correspond to a file on disk somewhere. And a page is the granularity at which the hardware manages memory: on x86 systems, and by default on most ARM systems, a page is 4096 bytes (4 KiB). All of the heap memory you use is rounded up to units of pages.
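To make the rounding concrete, here is a minimal sketch that asks the kernel for the page size and rounds a byte count up to whole pages. The helper name round_up_to_pages is mine, purely for illustration.

```c
#include <stddef.h>
#include <unistd.h>

/* Round a byte count up to a whole number of pages.
 * round_up_to_pages is an illustrative name, not a libc function. */
static size_t round_up_to_pages(size_t nbytes) {
    /* Ask the running kernel rather than hard-coding 4096. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (nbytes + page - 1) / page * page;
}
```

On a system with 4096-byte pages, a request for 1 byte and a request for 4096 bytes both end up consuming one full page.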

Private anonymous pages are what make copy-on-write after fork(2) work. On fork, the page table entries for private anonymous pages are marked read-only in both the parent and the child, and the reference count for each anonymous page is increased by one, but no physical memory is duplicated during the call to fork. Reads from these pages behave as expected, but a write to either process's heap triggers a hardware fault (because the page table entries are marked read-only, and this is enforced by the CPU’s MMU). In response to the fault the CPU invokes Linux’s page fault handler. In the handler the OS allocates a new physical page, copies the contents of the original page to the new page, updates the page table in the faulting program to point to the new page (and marks it read/write), and finally decreases the reference count on the original anonymous page. Once the page fault handler returns, the hardware re-executes the write instruction (which now succeeds) and execution continues.
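The effect of copy-on-write is easy to observe from userspace. The sketch below maps one private anonymous page explicitly with mmap(2) (rather than relying on malloc), forks, and has the child write to the page. The function name is hypothetical; it returns 1 when the parent still sees its original contents.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child's write to a private anonymous page is invisible
 * to the parent, 0 if it leaked through, -1 on error. */
static int private_write_is_isolated(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    strcpy(p, "parent");

    pid_t pid = fork();
    if (pid < 0) { munmap(p, len); return -1; }
    if (pid == 0) {
        /* This write faults; the kernel gives the child its own copy. */
        strcpy(p, "child");
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0) { munmap(p, len); return -1; }

    /* The parent's page was never touched by the child's write. */
    int isolated = (strcmp(p, "parent") == 0);
    munmap(p, len);
    return isolated;
}
```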

Part of the reason this works is that Linux knows that writes to ‘private and anonymous’ pages imply copy-on-write behavior.

Shared anonymous pages are more exotic, but useful in a number of scenarios. Like regular pages in the heap, they are anonymous and don’t correspond to files on disk. The physical memory backing shared pages, however, can be mapped read/write in multiple places. They can be used to communicate between forked child processes and a parent coordinator (PostgreSQL does this). But they can also be used to alias a piece of physical memory at multiple virtual addresses within a single process.
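For contrast with the private case, here is the same kind of experiment with MAP_SHARED. This is a minimal sketch with a hypothetical function name: the child writes to a shared anonymous page and the parent reads the child's value back.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child's write to a shared anonymous page is visible
 * to the parent, 0 if not, -1 on error. */
static int shared_write_is_visible(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    strcpy(p, "parent");

    pid_t pid = fork();
    if (pid < 0) { munmap(p, len); return -1; }
    if (pid == 0) {
        /* No copy is made: this lands in the same physical page. */
        strcpy(p, "child");
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0) { munmap(p, len); return -1; }

    int visible = (strcmp(p, "child") == 0);
    munmap(p, len);
    return visible;
}
```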

This aliasing within a process is what I’m currently interested in, and where shared anonymous pages come up short.

Aliasing within a process works great, and is achievable through creative use of either mmap(2) or mremap(2). The problem comes when you fork(2). Writing to these addresses in the child affects the contents in the parent (and vice versa). We want the memory in the child to be unrelated to the memory in the parent while still taking advantage of the traditional copy-on-write fork optimization. Unfortunately, this doesn’t appear possible in Linux today.
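Here is one way to set up such an alias, sketched with the mremap(2) route: calling mremap with an old_size of zero on a shared mapping creates a second mapping of the same pages. The function names are mine. The second function demonstrates the fork problem by having the child write through one alias while the parent reads through the other.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one shared anonymous page at two virtual addresses.
 * Returns 0 on success, -1 on error. */
static int map_aliased_page(char **a, char **b, size_t *len) {
    *len = (size_t)sysconf(_SC_PAGESIZE);
    *a = mmap(NULL, *len, PROT_READ | PROT_WRITE,
              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (*a == MAP_FAILED) return -1;
    /* old_size == 0 on a shared mapping: map the same pages again. */
    *b = mremap(*a, 0, *len, MREMAP_MAYMOVE);
    if (*b == MAP_FAILED) { munmap(*a, *len); return -1; }
    return 0;
}

/* Returns 1 if a write made by a forked child through alias a shows up
 * in the parent through alias b, 0 if not, -1 on error. */
static int child_write_reaches_parent(void) {
    char *a, *b;
    size_t len;
    if (map_aliased_page(&a, &b, &len) != 0) return -1;
    a[0] = 'P';

    pid_t pid = fork();
    if (pid == 0) { a[0] = 'C'; _exit(0); }
    int leaked = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        leaked = (b[0] == 'C');   /* parent reads via the other alias */
    munmap(a, len);
    munmap(b, len);
    return leaked;
}
```

Within a single process this is exactly what I want: a store through a is immediately visible through b. The second function shows the part I don't want, since the child's store ends up in the parent too.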

Instead, we are forced to work around this in userspace with pthread_atfork(3). We can register a handler to run in the child, after fork, where we eagerly duplicate the shared physical memory and re-establish the shared mappings.
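A rough sketch of that workaround follows, assuming a single aliased region tracked in globals (alias_a, alias_b and alias_len are hypothetical bookkeeping; a real program would track every region). The child handler copies the contents aside, swaps fresh shared pages in at the first address with MAP_FIXED, and then re-points the second alias with mremap. On glibc older than 2.34 this likely needs to be linked with -pthread.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one aliased region. */
static char *alias_a, *alias_b;
static size_t alias_len;

/* Runs in the child after fork: give the child its own physical pages. */
static void unshare_in_child(void) {
    if (alias_a == NULL) return;
    char *tmp = mmap(NULL, alias_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED) _exit(127);
    memcpy(tmp, alias_a, alias_len);           /* eager copy, no COW */
    /* Replace the inherited mapping with fresh shared pages in place. */
    if (mmap(alias_a, alias_len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(127);
    memcpy(alias_a, tmp, alias_len);
    munmap(tmp, alias_len);
    /* Re-point the second alias at the new pages. */
    if (mremap(alias_a, 0, alias_len, MREMAP_MAYMOVE | MREMAP_FIXED,
               alias_b) == MAP_FAILED)
        _exit(127);
}

/* Create the aliased region and register the handler. 0 on success. */
static int setup_aliased_region(void) {
    alias_len = (size_t)sysconf(_SC_PAGESIZE);
    char *a = mmap(NULL, alias_len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED) return -1;
    char *b = mremap(a, 0, alias_len, MREMAP_MAYMOVE);
    if (b == MAP_FAILED) { munmap(a, alias_len); return -1; }
    alias_a = a;
    alias_b = b;
    return pthread_atfork(NULL, NULL, unshare_in_child);
}

/* Returns 1 if the child keeps working aliases while its write stays
 * out of the parent, 0 otherwise, -1 on error. */
static int child_write_is_isolated(void) {
    alias_a[0] = 'P';
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        alias_a[0] = 'C';
        _exit(alias_b[0] == 'C' ? 0 : 1);  /* aliasing intact in child? */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0
           && alias_b[0] == 'P';
}
```

The cost is the one described above: every aliased region is copied eagerly at fork time, whether or not the child ever writes to it.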

As far as I can tell, it would take two things for Linux to fix this at the OS level. First, a new flag for mmap or madvise to denote that “shared” anonymous pages should only be shared within the process. Second, an update to the logic around writes to COW’ed pages. Instead of assuming a 1:1 mapping between VMAs and processes, a copy-on-write fault should update every VMA in the writing process that maps the page.

Rob Landley has a great FAQ that was helpful when researching this.