Many thanks to everyone who shared their opinions — those were very helpful in re-thinking over the problem.

Starting from the end...
On Sat, May 10, 2014 at 3:53 PM, Thomas Burdick <thomas@burdick.fr> wrote:

This isn't really a reply to your question, but possibly the question behind it. For CPU heavy tasks, I've always used multiple processes. It's a balance between the concurrent GCs, and reduced GC pressure, on the one hand, and the cost of IPC and forking on the other. But if you can arrange your worker processes to do a significant chunk of work and avoid going into a global GC, it's probably well worth the cost of forking.

This is, actually, a way we're contemplating in the short term, because anyway we have a distributed application, and beyond an administrative overhead we don't have any other significant cost of having 4 processes instead of 1 (the memory overhead is acceptable, because anyway we won't be utilizing the full memory capacity).

Still, I think, this is not the best solution in the long run.

Regarding parallel vs concurrent GC.

I agree with your points that increasing throughput can be better achieved with a parallel GC by reducing GC time proportionally to the number of worker threads (ideally). Additionaly, having a way to control the number of threads assigned to doing that work will really help in tuning it.

Reducing consing is always a good advice, for sure, although it's not the path I'd prefer to take for our application, because it is constantly evolving, so we don't have a stable version that is worth spending a month optimizing.

As for the question, why I think GC is the bottleneck here, I can't share the numbers as we don't have any specially-tailored benchmark. Our conclusion comes from 2 sources:
- studying GC logs
- playing with the sizes of generations: setting them to the maximal possible size (4Gb, as far as I understand) has significantly changed th load profile of the application. Actually, the throughput increased almost twice. The additional problem intoruced, though, was long GC pauses. It showed that moving from frequent short collections to seldom long collections was beneficial. And the problem became especially visible with the increase in the number of threads, probably, because each thread generates quite a lot of garbage, and the increase had pushed the application running in the mode of frequent small collections to some tipping point, when GC is run very often and that intoduces too much context switching. All that pointed me into thinking of concurrent GC, although I agree now that reducing GC pauses for big collections with a parallel one should be the most efficient approach for our load profile.

Luís, I'd be grateful if you could publish your parallelization experiments. At least we can take a look at the amount of work needed there.

Thanks again,