The attached patch is intended to improve GENCGC performance, and mostly
seems to succeed in doing that. Unless there are objections, I'd like to
commit most of it for 0.9.8 (i.e. probably next weekend). Testing
on non-Linux or low-memory x86 systems would be appreciated. Other
comments are of course also welcome.
* Instead of zeroing memory by remapping memory with munmap/mmap at
GC time, pages are just marked as needing zeroing and zeroed with
memset when they're added to a new allocation region. This reduces
GC latency both for the common and worst cases (~30% improvement
for both on an Athlon X2 with kernel 2.6.14, ~5% average/~15%
worst-case on a PIII/2.6.10).
It also improves the performance of the whole Lisp system
noticeably (up to 45% on some CL-BENCH tests on x86-64). Attached
are CL-BENCH results for a Pentium 120, Pentium III, Duron,
Athlon, P IV, and Athlon X2. (Thanks to Hannu Koivisto and Peter
De Wachter for some of these results). As a summary these results
show mostly improvements on all platforms, with few significant
regressions. (See the end of this message for some instructions on
how to decipher the results).
* To keep the memory footprint down, clear the pages by remapping after
major GCs (arbitrarily defined as a collection of generation 2 or older).
The memory freed by a minor GC is just going to get used again immediately,
so releasing it back to the OS would make little sense.
The RSS of a vanilla SBCL and a modified one are very similar for
things like acting as an SBCL host compiler.
Anecdotally the new version causes no more thrashing on a
low-memory system than a vanilla one, though I haven't really
* Supply hand-coded assembly routines for zeroing memory instead of
relying on the libc memset() which seems to be suboptimal on a lot of
  * On x86-64 use SSE2 (MOVNTDQ).
  * On x86 use either SSE2 (MOVNTDQ), MMX (MOVNTQ) or REP STOSL depending
    on CPUID flags.
The extra complexity introduced here is quite manageable, since we're
only using these routines for zeroing page-aligned blocks of memory.
Separate results for this change are included in the CL-BENCH
reports. As a summary, this is very beneficial for the SSE2
systems and the PIII, quite good for the P120, and terrible for
the Duron and the Athlon. Since the x86 results were mixed, this
part is probably not something to commit in the 0.9.8 timeframe.
* Shrink generation_size_t and reorganize struct page a bit to shrink
the page-table (25% reduction on x86, 33% on x86-64). Reduces memory
use and improves performance. This change is not included in any
of the CL-BENCH reports.
* Make MAP-ALLOCATED-OBJECTS page-table aware, so that the non-zero free
pages don't confuse ROOM. As a bonus the results from ROOM are also
more accurate now, instead of reporting each free page as consisting
of a large number of conses.
* On BSDs GENCGC always used memset instead of mmap tricks, apparently
due to some bugs in swap space handling on some ancient FreeBSD version.
Get rid of this irregularity, and do the same thing on all platforms.
I'm not sure of the effect this will have on performance on BSDs.
* Add a GENCGC mode (#define READ_PROTECT_FREE_PAGES) for catching attempts
to read unallocated pages.
* Genesify the GENCGC page size
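To make the first item above concrete, here is a minimal C sketch of the
deferred-zeroing idea: the GC only flags a freed page, and the memset
happens when the page is handed to a new allocation region. This is an
illustration, not the code in the patch. The names page_sketch,
need_zeroing, free_page_at_gc and add_page_to_region, the 4096-byte page
size and the static heap array are all assumptions made up for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumed page size, for illustration only */
#define N_PAGES 8       /* tiny fake heap */

/* Hypothetical per-page bookkeeping; not the real GENCGC struct page. */
struct page_sketch {
    unsigned allocated : 1;
    unsigned need_zeroing : 1;
};

static unsigned char heap[N_PAGES][PAGE_BYTES];
static struct page_sketch page_table[N_PAGES];

/* GC time: free the page without touching its contents.
 * The stale data stays in place; we only remember that it is dirty. */
static void free_page_at_gc(size_t i)
{
    assert(i < N_PAGES);
    page_table[i].allocated = 0;
    page_table[i].need_zeroing = 1;
}

/* Allocation time: zero the page only if it was flagged, then hand it
 * out. Pages that were never used are already zero and skip the memset. */
static unsigned char *add_page_to_region(size_t i)
{
    assert(i < N_PAGES);
    if (page_table[i].need_zeroing) {
        memset(heap[i], 0, PAGE_BYTES);
        page_table[i].need_zeroing = 0;
    }
    page_table[i].allocated = 1;
    return heap[i];
}
```

In this sketch the cost of zeroing moves from the collector to the
allocator, which is where the GC latency improvement described above
would come from. The memset call is also the single place where a
different zeroing routine could be substituted.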
CL-BENCH reader's guide:
Each file contains five columns:
* Benchmark name
* Absolute run-time for vanilla SBCL (on some boxes the tests were
run with different iteration counts, so the absolute values of different
reports are not comparable)
* Relative run-time (to the vanilla results) for vanilla SBCL
* Relative run-time (to the vanilla results) for memset-using SBCL
* Relative run-time (to the vanilla results) for SBCL using
Measurements are reported as xxx|yyy, where xxx is the result and yyy
is the standard error.
I'll make some pretty pictures of the results available later.