I'm looking for comments and opinions on switching to software write barriers. Short explanation below.
gencgc currently uses mprotect tricks to detect writes to boxed pages. That's not exactly the most common use case, and it doesn't seem to be that well supported by modern OSes. As has been noted before, handling the fault and mprotecting the page back to write access is expensive enough that it makes sense to do so for more than one OS page at a time; moreover, contiguous pages with different protection levels get separate mapping entries, and with large enough heaps we hit the per-process mapping limit on some OSes.
Most other runtimes instrument writes instead. As a few papers from the '90s note, as execution speeds grow and interrupts become more expensive, the situations in which it is beneficial to mprotect instead of instrumenting become rarer and rarer. Another potentially significant downside is that increasing the protection granularity makes GC more expensive. Generally speaking, mprotect tricks also make SBCL behave less like the C programs modern OSes are optimised for, so we don't benefit (as well) from, e.g., transparent merging of 4k pages into 2M huge pages.
Obviously, the downside of instrumentation is that it means more code (and additional *writes*), which slows down every write, instead of only the first write to each protected page.
So, I've got code for x86-64. The instrumentation patch only changes ~160 LOC (and another 60ish in the GC), and I'm fairly confident it's correct. It doesn't seem to have any effect on self-build times, and I can craft semi-realistic microbenchmarks on which it's 5-20x faster than mprotect. Conversely, streaming writes to boxed vectors are slowed down by a bit more than 50% in the best case (still faster in the worst case, right after a GC). Basically, the only programs I can think of so far that'd significantly suffer from a switch to software write barriers are those that hammer only a few boxed pages, and write to them thousands of times between GCs...
Software write barriers are still a trade-off, and they'll change SBCL's performance profile rather dramatically in some corner cases.
On another note: we currently pre-mark as written any allocation region that is pinned and in which at least one write was registered before the GC. As far as I can tell, this protects against the case in which we register a write via page protection but receive SIG_STOP_FOR_GC before the write itself happens. I don't know why we mark *the whole allocation region* as written, since that's really lossy with large objects (e.g. simple vectors). Any idea how to handle this? The simplest thing I can think of would be to only mark as written those pinned pages that were already marked before the GC, but that still fails (less badly) on vectors that are pinned for a long time.