From: Jeremy C. <jj...@gm...> - 2014-04-30 23:02:26
Hi,

I believe I can avoid "GC overhead limit exceeded" errors by limiting the amount of sorting of intermediate result sets that is needed. What are realistic sizing guidelines?

===

The detail in my case is as follows:

I generate SPARQL queries in response to the user specifying their intent on our advanced search UI. We always give them results back in 'pages' of, say, 20 items, and they can then go through page by page. We allow the user to sort these results (in fact they are always sorted) by clicking on column headings. These clicks control the ORDER BY modifier; the paging is done with the OFFSET and LIMIT modifiers (a sketch of the query shape is in the P.S. below).

Obviously they sometimes ask queries that are not so well thought out, and there are thousands of results (realistically, other aspects of our design cap the number at an order of magnitude of 1,000,000).

I am doing scale testing. If I have a large dataset (a quarter of a billion triples, which is large for me) and a result set of 150,000 items, then an appropriately sized machine (30 GB of RAM, journal on SSD, 20 GB Java heap; an AWS c3.4xlarge) does fine, with the results coming back in a few seconds. OTOH, if I have a machine that is too small (3.75 GB of RAM, 2 GB Java heap, journal on an EBS Provisioned IOPS volume but without EBS optimization), then while 'easier' queries are fine, the same 150,000-result query (taking the first 20) takes two minutes, which is unacceptable.

Clearly the right thing to do is to buy a bigger machine in the cases where we have the larger data sizes. However, from time to time we may find that we have under-provisioned, so I am considering putting a LIMIT 50000 on an unsorted subquery and then sorting only those first 50,000 entries (this is on the small machine; see the second sketch in the P.S.). I believe this will work in terms of avoiding the "GC overhead limit exceeded" message. (Our team has a strong prejudice against Java OOMEs: we have a collective preference to treat OOMEs as fatal, requiring a restart, and are somewhat suspicious of bigdata's attempt to continue despite OOMEs.)

My question: the number 50000 for a 2 GB heap size is pretty arbitrary, but may work. What are reasonable policies for limiting these on-heap sorts by Java heap size, and if, say, we have a 30 GB machine, what is a good way to divide the memory up? I am thinking of a table maybe like:

    Memory size (GB)   Java heap (GB)   LIMIT
    1.7                0.96              21000
    3.75               2.6               56875
    7.5                5.6              122500
    30                 23.6             516250

which is allowing 32,000 bytes per item being sorted in memory (i.e. LIMIT is roughly 70% of the Java heap divided by 32,000 bytes per item) - which seems enormous! I observe that on disk a triple is about 200 bytes, and I remember that in Jena a triple in memory is about 2,000 bytes; I guess a sort item here may be more than a single triple …

Any thoughts?

Jeremy
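
P.S. In case the shapes help: a generated paging query looks roughly like the following. The ex: predicates and triple patterns are placeholders I made up for this message, not our real schema; the real WHERE clause is generated from the search UI.

    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?item ?label ?modified
    WHERE {
      ?item a ex:Record ;
            rdfs:label ?label ;
            ex:modified ?modified .
      # further constraints generated from the user's search intent go here
    }
    ORDER BY ASC(?label)   # column heading the user clicked
    OFFSET 40              # page 3, at 20 items per page
    LIMIT 20               # one page of results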
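
And the capped variant I am considering wraps the same (unsorted) pattern in a subquery with its own LIMIT, so that the ORDER BY only ever sees at most 50,000 solutions. Again a sketch with placeholder predicates:

    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?item ?label ?modified
    WHERE {
      {
        SELECT ?item ?label ?modified
        WHERE {
          ?item a ex:Record ;
                rdfs:label ?label ;
                ex:modified ?modified .
          # same generated constraints, evaluated without any ordering
        }
        LIMIT 50000        # cap the intermediate result set before sorting
      }
    }
    ORDER BY ASC(?label)
    OFFSET 40
    LIMIT 20

The obvious trade-off is that any solutions beyond those 50,000 are silently dropped, which is what I meant by sorting only the first 50,000 entries.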