From: Luís O. <lu...@gm...> - 2014-05-09 10:04:00
|
On Thu, May 8, 2014 at 11:36 PM, Vsevolod Dyomkin <vse...@gm...> wrote: > We develop a CPU-heavy application, and in our production setup and > experiments there's a huge loss in CPU utilization because of GC pauses with > the increase of the number of processor cores (for instance, going from 8 to > 16 cores, basically, doesn't increase throughput of the application, because > all of the gain is lost in GC). So this topic became quite important to me. A concurrent GC executes simultaneously with your program. A parallel GC utilizes multiple threads when performing its work. Some GCs are both parallel and concurrent. SBCL's is neither, as you well know. As Martin pointed out, if you're interested in throughput, you want a parallel stop-the-world GC. Concurrent GCs trade some throughput for decreased latency, i.e. they will not stop your program (our at least pauses will be very brief) but both the GC and the program will run a little bit slower. > Has anyone explored this direction before? Yes, once upon a time I tried to introduce parallelism into SBCL's GENCGC, following the architecture described in this paper: <http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf>. In that paper the authors describe how they've parallelized a stop-the-world copying GC that's pretty similar to SBCL's. > What are the main obstacles > (besides the obvious need to devote a hefty amount of time to it) in > implementing one of the existing concurrent GC approaches in SBCL? The main obstacle for me, besides a general lack of knowledge/guidance, was that SBCL's copying code is much more complicated that GHC's. Whereas GHC's GC has a single function that copies some sort of hunk from old to new space, SBCL's has various copying functions for dozens of object types, some with parallelism-unfriendly optimization tricks. It took a while to nudge all of those copiers into apparent correctness. In hindsight, it was a big mistake not to simplify the copiers before making them concurrent. The other challenges are getting a balanced distribution of work across the GC threads and, possibly, solving the page table bottleneck described in the paper. I definitely bumped into the former. I very likely bumped into the latter as well, but I didn't know how to measure it. Some years later, I came across Simon Marlow -- the main author of the aforementioned paper -- at a GC workshop and he told me he had used a thread profiler tool from Intel, whose name I can't recall. In the end, the bottom line was: I reproduced the 20-30% hit on single-thread GC time when the synchronization mechanism was enabled; running with two threads recovered that overhead and on some benchmarks even yielded a little bit of speedup compared to the original GC; but, there was little to no speedup beyond 2 threads. :-( > Maybe, someone wanted to approach it, but didn't have motivation? Well, motivation was pretty low when I was stuck chasing heisenbugs. :-) > I'd be grateful for any pointers and can provide more detail here or in the > personal communication if needed. I can try and dig out my thesis and (outdated) code if you're interested. -- Luís Oliveira http://kerno.org/~luis/ |