Re: [Bigdata-developers] Current Revision of Blazegraph: Journal consumes extremely much disk space

I don't see how the small slot optimisation can result in more waste 
with larger allocators.

It is simply a mechanism to avoid rapid re-allocation of the small slot 
allocators to attempt to improve write elision on recycled slots.

In the latest Allocator dump, there are a lot of 64 byte allocators.  
Unlike the larger allocators (128 and greater) a large proportion of the 
64 byte slots will be used for long literal values (note that the mean 
allocation is only 27 bytes).
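
As a rough illustration of what that mean implies, here is a minimal sketch. It is not RWStore code: `slot_waste` is a made-up helper, and it assumes every 64 byte slot in use holds exactly the mean 27 byte allocation, which the real distribution will not.

```python
def slot_waste(slot_size: int, mean_allocation: float) -> float:
    """Fraction of each allocated slot that is left unused on average."""
    if mean_allocation > slot_size:
        raise ValueError("mean allocation cannot exceed the slot size")
    return (slot_size - mean_allocation) / slot_size


# Figures from the allocator dump discussed above: 64 byte slots, 27 byte mean.
waste = slot_waste(64, 27)
print(f"{waste:.1%} of each 64 byte slot is unused on average")  # 57.8%
```

So under that assumption more than half of every 64 byte slot is padding, before any effect from recruiting additional allocators.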

Counterintuitively, there may well be a case for excluding the 64 byte 
allocators from the "small slot optimisation".  So "small slot" NOT 
"smallest slot" ;-)

- Martyn

On 24/04/2015 00:18, Bryan Thompson wrote:
> I've updated the ticket.  I've also copied my main conclusions inline below.
>
> I think that the issue here is the use of the small slot optimization
> without proper configuration of the indices in order to target small
> allocation slots for at least one of the indices.  The small slot
> optimization changes the allocation policy in two ways.
>
> 1. It has a strong preference to use only empty 8k pages for small
> allocations (as configured, for allocations less than 1k).  This allows us
> to coalesce writes by combining them onto the same page.
> 2. It has a preference to use allocation blocks that are relatively empty
> for small slots.
>
> As a consequence, the small slot optimization MAY recruit more allocators
> in order to have allocators for small slots that have good sparsity.
>
> The main goal of the small slot optimization is to optimize for indices
> that have very scattered IO patterns.  The indices that exhibit this the
> most are the OSP and OCSP indices.  In many cases even batched updates will
> modify no more than a single tuple per page on these indices.  However, in
> your configuration (and in mine when I enabled the small slot optimization
> without adjusting the branching factors), the O(C)SP indices were not
> created with a small branching factor, so the small slot allocation could
> not be put to any good effect. However, it did have a negative effect -- by
> recruiting more allocators.  If you want to use the small slot
> optimization, make sure that at least the O(C)SP index has a relatively
> small branching factor giving an effective slot size of 256 bytes or less
> on average.
>
> I suggest that you retest w/o the small slot optimization and with group
> commit still enabled.
>
> I've asked Martyn to look over the allocators from the small slot
> optimization run and think about whether we can make this policy a little
> more adaptive when the branching factors are not really tuned properly and
> too many allocators with too much wasted space are allocated as a result.
> Basically, how to avoid file bloat from misconfiguration.
>
> Thanks,
> Bryan
>
> ----
> Bryan Thompson
> Chief Scientist & Founder
> SYSTAP, LLC
> 4501 Tower Road
> Greensboro, NC 27410
> br...@sy...
> http://blazegraph.com
> http://blog.bigdata.com <http://bigdata.com>
> http://mapgraph.io
>
> Blazegraph™ <http://www.blazegraph.com/> is our ultra high-performance
> graph database that supports both RDF/SPARQL and Tinkerpop/Blueprints
> APIs.  MapGraph™ <http://www.systap.com/mapgraph> is our disruptive new
> technology to use GPUs to accelerate data-parallel graph analytics.
>
> CONFIDENTIALITY NOTICE:  This email and its contents and attachments are
> for the sole use of the intended recipient(s) and are confidential or
> proprietary to SYSTAP. Any unauthorized review, use, disclosure,
> dissemination or copying of this email or its contents or attachments is
> prohibited. If you have received this communication in error, please notify
> the sender by reply email and permanently delete all copies of the email
> and its contents and attachments.
>
> On Thu, Apr 23, 2015 at 9:36 AM, Andreas Kahl <ka...@bs...> wrote:
>
>> Ok, I can redo the test with smallSlots + groupCommit enabled, and run
>> http://localhost:8080/bigdata/status?dumpJournal&dumpPages after some
>> minutes. (I cannot run it on the fully loaded dataset because my disk is
>> not sufficient for the resulting Journal).
>>
>> By the way: Please find attached my custom Vocabulary classes. They are
>> just one of my many attempts to improve IO performance on rotating disks.
>>
>> Best Regards
>> Andreas
>>
>>>>> Bryan Thompson <br...@sy...> 23.04.15 15.31 Uhr >>>
>> I just noticed that you have the full text index enabled as well.  I have
>> not been enabling that.
>>
>> I would like to see the output from this command on the fully loaded data
>> sets.
>>
>> http://localhost:8080/bigdata/status?dumpJournal&dumpPages
>>
>> This will let us see if any specific index is taking up a very large number of
>> pages.  It will also tell us the distribution over the page sizes for each
>> index.
>>
>> Bryan
>>
>> On Thu, Apr 23, 2015 at 8:54 AM, Andreas Kahl <ka...@bs...>
>> wrote:
>>
>>> Bryan,
>>>
>>> in the meantime, I could successfully load the file into an 18GB journal
>>> after disabling groupCommit (I simply commented out the line in
>>> RWStore.properties).
>>> I can try again with groupCommit enabled, but smallSlotOptimization
>>> disabled.
>>>
>>> Best Regards
>>> Andreas
>>>
>>>>>> Bryan Thompson <br...@sy...> 23.04.2015 13:24 >>>
>>> Andreas,
>>>
>>> I was not able to replicate your result.  Unfortunately I navigated away
>>> from the browser page in which I had submitted the request, so it loaded
>>> all the data but failed to commit.  However, the resulting file is only
>>> 16GB.
>>>
>>> I will redo this run and verify that the journal after the commit has this
>>> same size on the disk.
>>>
>>> I was only assuming that this was related to group commit because of your
>>> original message.  Perhaps I misinterpreted your message. This is simply
>>> about 1.5.1 (with group commit) vs 1.4.0.
>>>
>>> Perhaps the issue is related to the small slot optimization?  Maybe in
>>> combination with group commit?
>>>
>>> > com.bigdata.rwstore.RWStore.smallSlotType=1024
>>>
>>> I could not replicate your properties exactly because you are using a
>>> non-standard vocabulary class.  Therefore I simply deleted the default
>>> namespace (in quads mode) and recreated it with the defaults in triples
>>> mode.  The small slot optimization and other parameters were not enabled in
>>> my run.
>>>
>>> Perhaps you could try to replicate my experience and I will enable the
>>> small slots optimization?
>>>
>>> Thanks,
>>> Bryan
>>>
>>> On Thu, Apr 23, 2015 at 1:51 AM, Andreas Kahl <ka...@bs...>
>>> wrote:
>>>
>>>> Bryan & Martyn,
>>>>
>>>> Thank you very much for investigating the issue. I assume from the ticket
>>>> that the error will vanish if I disable groupCommit. I will do so for the
>>>> time being.
>>>>
>>>> Although there is already extensive information in Bryan's ticket, please
>>>> find attached my logs and DumpJournal outputs:
>>>> - dumpJournal.html contains a dump from the 67GB journal after Blazegraph
>>>>   ran into "No space left on device"
>>>> - dumpJournalWithTraceEnabled.html is the same dump for a running query
>>>>   when the journal was at about 14GB
>>>> - queryStatus.html is just the status page showing my query
>>>> - catalina.out.gz contains the trace outputs from starting Tomcat until I
>>>>   killed the curl running the SPARQL Update by Ctrl-C
>>>> - loadGnd.log.gz is Blazegraph's output when loading the data
>>>>
>>>> Best Regards
>>>> Andreas
>>>>
>>>>
>>>>
>>>>>>> Bryan Thompson <br...@sy...> 22.04.15 20.56 Uhr >>>
>>>> See http://trac.bigdata.com/ticket/1206.  This is still in the
>>>> investigation stage.
>>>>
>>>> Thanks,
>>>> Bryan
>>>>
>>>> On Wed, Apr 22, 2015 at 5:37 AM, Andreas Kahl <ka...@bs...>
>>>> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I just updated to the current revision (f4c63e5) of Blazegraph from
>>>>> Git and tried to load a dataset into the updated webapp. With Bigdata 1.4.0
>>>>> this resulted in a journal of ~18GB. Now the process was cancelled because
>>>>> the disk was full: the journal was beyond 50GB for the same file with the
>>>>> same settings.
>>>>> The only exception was that I activated GroupCommit.
>>>>>
>>>>> The dataset can be downloaded here:
>>>>> http://datendienst.dnb.de/cgi-bin/mabit.pl?cmd=fetch&userID=opendata&pass=opendata&mabheft=GND.rdf.gz
>>>>> Please find the settings used to load the file below.
>>>>>
>>>>> Do I have a misconfiguration, or is there a bug eating all disk space?
>>>>>
>>>>> Best regards
>>>>> Andreas
>>>>>
>>>>> Namespace-Properties:
>>>>> curl -H "Accept: text/plain" http://localhost:8080/bigdata/namespace/gnd/properties
>>>>> #Wed Apr 22 11:35:31 CEST 2015
>>>>> com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=700
>>>>> com.bigdata.relation.container=gnd
>>>>> com.bigdata.rwstore.RWStore.smallSlotType=1024
>>>>> com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
>>>>> com.bigdata.journal.AbstractJournal.file=/var/lib/bigdata/bigdata.jnl
>>>>> com.bigdata.rdf.store.AbstractTripleStore.vocabu.textIndex=true
>>>>> com.bigdata.btree.BTree.branchingFactor=700
>>>>> com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
>>>>> com.bigdata.rdf.sail.isolatableIndices=false
>>>>> com.bigdata.service.AbstractTransactionService.minReleaseAge=1
>>>>> com.bigdata.rdf.sail.bufferCapacity=2000
>>>>> com.bigdata.rdf.sail.truthMaintenance=false
>>>>> com.bigdata.rdf.sail.namespace=gnd
>>>>> com.bigdata.relation.class=com.bigdata.rdf.store.LocalTripleStore
>>>>> com.bigdata.rdf.store.AbstractTripleStore.quads=false
>>>>> com.bigdata.journal.AbstractJournal.writeCacheBufferCount=500
>>>>> com.bigdata.search.FullTextIndex.fieldsEnabled=false
>>>>> com.bigdata.relation.namespace=gndity=10000
>>>>> com.bigdata.rdf.sail.BigdataSail.bufferCapacity=2000
>>>>> com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bigdata-developers mailing list
>>>>> Big...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/bigdata-developers