From: Bryan T. <br...@sy...> - 2015-05-14 12:02:28
|
Andreas,

We have moved to JIRA. This is now http://jira.blazegraph.com/browse/BLZG-201.

Could you please attach your vocabulary file (Java code) to the ticket? Martyn is trying to replicate the problem with a subset of your data, using part of the GND RDF data (1G of RDF statements). He wrote:

But I was only able to do this by commenting out the vocabulary line

#com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass=de.bsb_muenchen.bigdata.vocab.B3KatVocabulary

As I see it this is the main difference between my load and theirs. How can I get this class to recreate the load more accurately?

Thanks,
Bryan

----
Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
4501 Tower Road
Greensboro, NC 27410
br...@sy...
http://blazegraph.com
http://blog.bigdata.com <http://bigdata.com>
http://mapgraph.io

Blazegraph™ <http://www.blazegraph.com/> is our ultra high-performance graph database that supports both RDF/SPARQL and Tinkerpop/Blueprints APIs. MapGraph™ <http://www.systap.com/mapgraph> is our disruptive new technology to use GPUs to accelerate data-parallel graph analytics.

CONFIDENTIALITY NOTICE: This email and its contents and attachments are for the sole use of the intended recipient(s) and are confidential or proprietary to SYSTAP. Any unauthorized review, use, disclosure, dissemination or copying of this email or its contents or attachments is prohibited. If you have received this communication in error, please notify the sender by reply email and permanently delete all copies of the email and its contents and attachments.

On Mon, Apr 27, 2015 at 8:38 AM, Andreas Kahl <ka...@bs...> wrote:

> Hello Bryan & Martyn,
>
> Sorry for the long delay. I have now run dumpJournal&dumpPages twice:
> 1. A dump taken while the SPARQL LOAD was running with groupCommit and
>    smallSlotOptimization enabled (the run that cannot finish due to disk
>    space)
> 2. A dump taken after the whole file was successfully loaded because I
>    disabled groupCommit (I could also use groupCommit and disable
>    smallSlots)
>
> I will do what I can to help you test and track down the problem.
> For me it is not too much trouble to work with the knowledge that I
> can only activate one of the two features at a time.
>
> Best Regards
> Andreas
>
> P.S. I also followed your advice to increase
> com.bigdata.rdf.sail.bufferCapacity, as you can see from the settings of
> run No. 2:
>
> triples:/tmp # curl -H "Accept: text/plain" http://localhost:8080/bigdata/namespace/gnd/properties
> #Mon Apr 27 14:26:37 CEST 2015
> com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=700
> com.bigdata.relation.container=gnd
> com.bigdata.rwstore.RWStore.smallSlotType=1024
> com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
> com.bigdata.journal.AbstractJournal.file=/var/lib/bigdata/bigdata.jnl
> com.bigdata.journal.AbstractJournal.initialExtent=209715200
> com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass=de.bsb_muenchen.bigdata.vocab.B3KatVocabulary
> com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
> com.bigdata.btree.BTree.branchingFactor=700
> com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
> com.bigdata.rdf.sail.isolatableIndices=false
> com.bigdata.service.AbstractTransactionService.minReleaseAge=1
> com.bigdata.rdf.sail.bufferCapacity=200000
> com.bigdata.rdf.sail.truthMaintenance=false
> com.bigdata.rdf.sail.namespace=gnd
> com.bigdata.relation.class=com.bigdata.rdf.store.LocalTripleStore
> com.bigdata.rdf.store.AbstractTripleStore.quads=false
> com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000
> com.bigdata.search.FullTextIndex.fieldsEnabled=false
> com.bigdata.relation.namespace=gnd
> com.bigdata.btree.writeRetentionQueue.capacity=10000
> com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
>
> >>> Bryan Thompson <br...@sy...> 24.04.15 18.45 Uhr >>>
>
> Martyn and I discussed this in some depth today. We've reopened the
> ticket to:
>
> a. gain more understanding of the interaction of the small slot
>    optimization and group commit.
> b. verify correct reporting by the allocators in dumpJournal.
> c. modify the small slots optimization allocator policy to make it less
>    susceptible to mis-configuration.
>
> In the data as loaded, the OSP index was 66% blob slots (greater than 8k).
> For the small slot optimization to be effective the O(C)SP index should
> target a page size of 64-256 bytes.
>
> (c) should minimize or remove the negative impact of the small slot
> optimization in such cases.
>
> Thanks,
> Bryan
>
> On Fri, Apr 24, 2015 at 8:35 AM, Martyn Cutcher <ma...@sy...> wrote:
>
> > I don't see how the small slot optimisation can result in more waste
> > with larger allocators.
> >
> > It is simply a mechanism to avoid rapid re-allocation of the small
> > slot [...] allocator dump, there are a lot of 64 byte allocators.
> > Unlike the larger allocators (128 and greater), a large proportion of
> > the 64 byte slots will be used for long literal values (note that the
> > mean allocation is only 27 bytes).
> >
> > Counterintuitively, there may well be a case for excluding the 64 byte
> > allocators from the "small slot optimisation". So "small slot" NOT
> > "smallest slot" ;-)
> >
> > - Martyn
> >
> > On 24/04/2015 00:18, Bryan Thompson wrote:
> >
> > I've updated the ticket. I've also copied my main conclusions inline
> > below.
> >
> > I think that the issue here is the use of the small slot optimization
> > without proper configuration of the indices in order to target small
> > allocation slots for at least one of the indices. The small slot
> > optimization changes the allocation policy in two ways.
> >
> > 1. It has a strong preference to use only empty 8k pages for small
> >    allocations (as configured, for allocations less than 1k). This
> >    allows us to coalesce writes by combining them onto the same page.
> > 2. It has a preference to use allocation blocks that are relatively
> >    empty for small slots.
> >
> > As a consequence, the small slot optimization MAY recruit more
> > allocators in order to have allocators for small slots that have good
> > sparsity.
> >
> > The main goal of the small slot optimization is to optimize for indices
> > that have very scattered IO patterns. The indices that exhibit this the
> > most are the OSP and OCSP indices. In many cases even batched updates
> > will modify no more than a single tuple per page on this index. However,
> > in your configuration (and in mine when I enabled the small slot
> > optimization without adjusting the branching factors), the O(C)SP
> > indices were not created with a small branching factor, so the small
> > slot allocation could not be put to any good effect. It did, however,
> > have a negative effect: it recruited more allocators. If you want to use
> > the small slot optimization, make sure that at least the O(C)SP index
> > has a relatively small branching factor, giving an effective slot size
> > of 256 bytes or less on average.
> >
> > I suggest that you retest w/o the small slot optimization and with
> > group commit still enabled.
> >
> > I've asked Martyn to look over the allocators from the small slot
> > optimization run and think about whether we can make this policy a
> > little more adaptive when the branching factors are not really tuned
> > properly and too many allocators with too much wasted space are
> > allocated as a result. Basically, how to avoid file bloat from
> > misconfiguration.
> >
> > Thanks,
> > Bryan
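
The thresholds mentioned in this part of the thread (blob slots above 8k, an effective slot size of 256 bytes or less) can be turned into a quick sanity check. The following is only a rough sketch: it assumes the per-page sizes of one index have already been copied by hand out of the dumpPages output, and the function names and the sample values are hypothetical, not part of Blazegraph.

```python
from typing import Dict, Iterable

BLOB_THRESHOLD = 8 * 1024  # "blob slots (greater than 8k)", as stated in the thread
SMALL_SLOT_TARGET = 256    # "effective slot size of 256 bytes or less on average"


def page_size_profile(page_sizes: Iterable[int]) -> Dict[str, float]:
    """Summarise the page sizes (in bytes) recorded for a single index."""
    sizes = list(page_sizes)
    if not sizes:
        raise ValueError("no page sizes given")
    total = len(sizes)
    blob_pages = sum(1 for size in sizes if size > BLOB_THRESHOLD)
    small_pages = sum(1 for size in sizes if size <= SMALL_SLOT_TARGET)
    return {
        "mean_bytes": sum(sizes) / total,
        "blob_fraction": blob_pages / total,
        "small_fraction": small_pages / total,
    }


def suits_small_slots(profile: Dict[str, float]) -> bool:
    """True when the mean page size is at or below the 256 byte target."""
    return profile["mean_bytes"] <= SMALL_SLOT_TARGET


# Hypothetical sample: two small pages and four blob-sized pages.
sample = [200, 240, 9000, 12000, 16000, 10000]
profile = page_size_profile(sample)
print(profile)
print(suits_small_slots(profile))
```

An index dominated by blob-sized pages, like the hypothetical sample, fails the check. That is the situation described above, where the optimization recruits extra allocators without any benefit.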
> > On Thu, Apr 23, 2015 at 9:36 AM, Andreas Kahl <ka...@bs...> wrote:
> >
> > Ok, I can redo the test with smallSlots + groupCommit enabled, and run
> > http://localhost:8080/bigdata/status?dumpJournal&dumpPages after some
> > minutes. (I cannot run it on the fully loaded [...] just one of my many
> > attempts to improve IO performance on rotating disks.
> >
> > Best Regards
> > Andreas
> >
> > >>> Bryan Thompson <br...@sy...> 23.04.15 15.31 Uhr >>>
> >
> > I just noticed that you have the full text index enabled as well. I
> > have not been enabling that.
> >
> > I would like to see the output from this command on the fully loaded
> > data sets.
> > http://localhost:8080/bigdata/status?dumpJournal&dumpPages
> >
> > This will let us see if any specific index is taking up a very large
> > number of pages. It will also tell us the distribution over the page
> > sizes for each index.
> >
> > Bryan
> >
> > On Thu, Apr 23, 2015 at 8:54 AM, Andreas Kahl <ka...@bs...> wrote:
> >
> > Bryan,
> >
> > In the meantime, I could successfully load the file into an 18GB
> > journal after disabling groupCommit (I simply commented out the line in
> > RWStore.properties).
> > I can try again with groupCommit enabled, but smallSlotOptimization
> > disabled.
> >
> > Best Regards
> > Andreas
> >
> > >>> Bryan Thompson <br...@sy...> 23.04.2015 13:24 >>>
> >
> > Andreas,
> >
> > I was not able to replicate your result. Unfortunately I navigated away
> > from the browser page in which I had submitted the request, so it
> > loaded all the data but failed to commit. However, the resulting file
> > is only 16GB.
> >
> > I will redo this run and verify that the journal after the commit has
> > this same size on the disk.
> >
> > I was only assuming that this was related to group commit because of
> > your original message. Perhaps I misinterpreted your message. This is
> > simply about 1.5.1 (with group commit) vs 1.4.0.
> >
> > Perhaps the issue is related to the small slot optimization? Maybe in
> > combination with group commit?
> >
> > > com.bigdata.rwstore.RWStore.smallSlotType=1024
> >
> > I could not replicate your properties exactly because you are using a
> > non-standard vocabulary class. Therefore I simply deleted the default
> > namespace (in quads mode) and recreated it with the defaults in triples
> > mode. The small slot optimization and other parameters were not enabled
> > in my run.
> >
> > Perhaps you could try to replicate my experience and I will enable the
> > small slots optimization?
> >
> > Thanks,
> > Bryan
> >
> > On Thu, Apr 23, 2015 at 1:51 AM, Andreas Kahl <ka...@bs...> wrote:
> >
> > Bryan & Martyn,
> >
> > Thank you very much for investigating the issue. I assume from the
> > ticket that the error will vanish if I disable groupCommit. I will do
> > so for the meantime.
> >
> > Although there is already extensive information in Bryan's ticket,
> > please find attached my logs and DumpJournal outputs:
> > - dumpJournal.html contains a dump from the 67GB journal after
> >   Blazegraph ran into "No space left on device"
> > - dumpJournalWithTraceEnabled.html is the same dump for a running query
> >   when the journal was at about 14GB
> > - queryStatus.html is just the status page showing my query
> > - catalina.out.gz contains the trace outputs from starting Tomcat until
> >   I killed the curl running the SPARQL Update by Ctrl-C
> > - loadGnd.log.gz is Blazegraph's output when loading the data
> >
> > Best Regards
> > Andreas
> >
> > >>> Bryan Thompson <br...@sy...> 22.04.15 20.56 Uhr >>>
> >
> > See http://trac.bigdata.com/ticket/1206. This is still in the
> > investigation stage.
> >
> > Thanks,
> > Bryan
> >
> > On Wed, Apr 22, 2015 at 5:37 AM, Andreas Kahl <ka...@bs...> wrote:
> >
> > Hello everyone,
> >
> > I recently updated to the current revision (f4c63e5) of Blazegraph from
> > Git and tried to load a dataset into the updated Webapp. With Bigdata
> > 1.4.0 this resulted in a journal of ~18GB. Now the process was
> > cancelled because the disk was full: the journal was beyond 50GB for
> > the same file with the same settings.
> > The only exception was that I activated GroupCommit.
> >
> > The dataset can be downloaded here:
> > http://datendienst.dnb.de/cgi-bin/mabit.pl?cmd=fetch&userID=opendata&pass=opendata&mabheft=GND.rdf.gz
> > Please find the settings used to load the file below.
> >
> > Do I have a misconfiguration, or is there a bug eating all disk space?
> >
> > Best regards
> > Andreas
> >
> > Namespace-Properties:
> > curl -H "Accept: text/plain" http://localhost:8080/bigdata/namespace/gnd/properties
> > #Wed Apr 22 11:35:31 CEST 2015
> > com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=700
> > com.bigdata.relation.container=gnd
> > com.bigdata.rwstore.RWStore.smallSlotType=1024
> > com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
> > com.bigdata.journal.AbstractJournal.file=/var/lib/bigdata/bigdata.jnl
> > com.bigdata.rdf.store.AbstractTripleStore.vocabu[...].textIndex=true
> > com.bigdata.btree.BTree.branchingFactor=7[...]ionService.minReleaseAge=1
> > com.bigdata.rdf.sail.bufferCapacity=2000
> > com.bigdata.rdf.sail.truthMaintenance=false
> > com.bigdata.rdf.sail.namespace=gnd
> > com.bigdata.relation.class=com.bigdata.rdf.store.LocalTripleStore
> > com.bigdata.rdf.store.AbstractTripleStore.quads=false
> > com.bigdata.journal.AbstractJournal.writeCacheBufferCount=500
> > com.bigdata.search.FullTextIndex.fieldsEnabled=false
> > com.bigdata.relation.namespace=gnd[...]ity=10000
> > com.bigdata.rdf.sail.BigdataSail.bufferCapacity=2000
> > com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
> >
> > _______________________________________________
> > Bigdata-developers mailing list
> > Bigdata-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/bigdata-developers
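
Two namespace property dumps appear in this thread (Apr 22 and Apr 27), and it is easy to miss which settings changed between the runs. The sketch below shows one way to diff two such dumps. It assumes the text returned by the `/properties` request has been saved as plain `key=value` lines. The helper names are hypothetical, and it handles only the simple `key=value` form, not the full Java properties syntax. The values in the example are taken from the two listings in the thread.

```python
from typing import Dict, Optional, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and comments."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props


def diff_properties(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


run1 = parse_properties("""\
#Wed Apr 22 11:35:31 CEST 2015
com.bigdata.rwstore.RWStore.smallSlotType=1024
com.bigdata.rdf.sail.bufferCapacity=2000
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=500
""")

run2 = parse_properties("""\
#Mon Apr 27 14:26:37 CEST 2015
com.bigdata.rwstore.RWStore.smallSlotType=1024
com.bigdata.rdf.sail.bufferCapacity=200000
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000
""")

changes = diff_properties(run1, run2)
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

Keys that exist in only one of the dumps show up with `None` on the other side, which helps when a line was lost or truncated in one of the listings.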
|