From: Brad B. <be...@bl...> - 2016-04-14 16:53:22
Andreas,

In addition to the DumpJournal technique, which gives you a sense of the actual inlining performance when loading instance data, you can also validate your vocabulary in a unit test. If you look at TestPubChemVocabInlineUris.java [1], you will see an example of a unit test that creates a namespace using a custom vocabulary and inline URI handler, then validates that the intended URIs are being inlined.

Our recommendation would be to first build your vocabulary and use the unit test, then load the data and use DumpJournal to see if there may be additional inlining opportunities.

There is also a "latent" new feature in com.bigdata.rdf.util.VocabBuilder [2], which Michael updated as part of 2.1.0. You can run this over your instance data as

    java -cp blazegraph.jar com.bigdata.rdf.util.VocabBuilder /path/to/fileordir /path/to/fileordir ...

It will then generate a Java file containing a custom vocabulary starting point. We still need to add parallelized reads to shorten processing on large data sets, but it will generate a Vocabulary with inlined URIs for the highest-frequency URIs in your data. You can then augment this with a custom inline URI handler, as you have done.

We have a backlogged blog post / wiki update on this feature.

Thanks, --Brad

[1] https://github.com/blazegraph/database/blob/master/vocabularies/src/test/java/com/blazegraph/vocab/pubchem/TestPubchemVocabInlineUris.java
[2] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata-rdf/src/java/com/bigdata/rdf/util/VocabBuilder.java

On Thu, Apr 14, 2016 at 9:46 AM, Andreas Kahl <ka...@bs...> wrote:

> Bryan,
>
> Thanks for the info. DumpJournal (with -pages because I want to tune
> branching factors) is already running, but it will take some time as the
> journal is 82GB.
> As soon as I have the Html-output I will have a look at the numbers you
> mentioned.
>
> Best Regards
> Andreas
>
> >>> Bryan Thompson <br...@sy...> 14.04.2016 15:28 >>>
> Use DumpJournal (w/o -pages). Look at the number of entries in the TERM2ID
> and BLOBS indices. This will tell you how many RDF Values were NOT inlined.
>
> If you want to figure out how many were inlined, look at the number of
> statements in one of the statement indices. Multiply by 3 (or 4 for quads)
> and then subtract the number of entries in (TERM2ID + BLOBS). That is the
> number of inline IVs.
>
> You are probably after the distinct number of non-inlined IVs. This is not
> so easy to find. However, just the size of the TERM2ID and BLOBS indices is
> a very good indication of whether or not things are being inlined.
>
> Thanks,
> Bryan
>
> ----
> Bryan Thompson
> Chief Scientist & Founder
> Blazegraph
> e: br...@bl...
> w: http://blazegraph.com
>
> Blazegraph products help to solve the Graph Cache Thrash to achieve large
> scale processing for graph and predictive analytics. Blazegraph is the
> creator of the industry’s first GPU-accelerated high-performance database
> for large graphs, has been named as one of the “10 Companies and
> Technologies to Watch in 2016” <http://insideanalysis.com/2016/01/20535/>.
>
> Blazegraph Database <https://www.blazegraph.com/> is our ultra-high
> performance graph database that supports both RDF/SPARQL and
> Tinkerpop/Blueprints APIs. Blazegraph GPU
> <https://www.blazegraph.com/product/gpu-accelerated/> and Blazegraph DAS
> <https://www.blazegraph.com/product/gpu-accelerated/> are disruptive new
> technologies that use GPUs to enable extreme scaling that is thousands of
> times faster and 40 times more affordable than CPU-based solutions.
>
> CONFIDENTIALITY NOTICE: This email and its contents and attachments are
> for the sole use of the intended recipient(s) and are confidential or
> proprietary to SYSTAP, LLC DBA Blazegraph.
> Any unauthorized review, use, disclosure, dissemination or copying of this
> email or its contents or attachments is prohibited. If you have received
> this communication in error, please notify the sender by reply email and
> permanently delete all copies of the email and its contents and attachments.
>
> On Thu, Apr 14, 2016 at 9:25 AM, Andreas Kahl <ka...@bs...> wrote:
>
>> Hello everyone,
>>
>> how can I determine which portion of URIs in my journal were successfully
>> inlined?
>>
>> From your example PubChem I derived my own InlineUriFactory (attached).
>> This is the config used:
>> <entry key="com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass">de.bsb_muenchen.bigdata.vocab.B3KatVocabulary</entry>
>> <entry key="com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory">de.bsb_muenchen.bigdata.vocab.B3KatInlineUriFactory</entry>
>>
>> All mentioned classes are attached.
>>
>> Thanks & Best Regards
>> Andreas
>>
>> ------------------------------------------------------------------------------
>> Find and fix application performance issues faster with Applications
>> Manager
>> Applications Manager provides deep performance insights into multiple
>> tiers of your business applications. It resolves application problems
>> quickly and reduces your MTTR. Get your free trial!
>> https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
>> _______________________________________________
>> Bigdata-developers mailing list
>> Big...@li...
>> https://lists.sourceforge.net/lists/listinfo/bigdata-developers

--
_______________
Brad Bebee
CEO Blazegraph
e: be...@bl...
m: 202.642.7961
w: www.blazegraph.com
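
For reference, the arithmetic Bryan describes in the quoted message (statements times 3, or 4 for quads, minus the entries in TERM2ID + BLOBS) can be written down as a small helper. This is only a sketch: the class name, the method name, and the sample numbers are made up for illustration, nothing here is part of the Blazegraph API, and the entry counts have to be read by hand from the DumpJournal output. As Bryan notes, the result is an indication of how much is being inlined, not a count of distinct values.

```java
/**
 * Rough estimate of inline IVs from DumpJournal entry counts.
 * Hypothetical helper for illustration only; not part of Blazegraph.
 */
final class InlineEstimate {

    private InlineEstimate() {
    }

    /**
     * @param statements     entries in one statement index (read from DumpJournal)
     * @param arity          3 for a triple store, 4 for a quad store
     * @param term2idEntries entries in the TERM2ID index
     * @param blobsEntries   entries in the BLOBS index
     * @return statements * arity - (TERM2ID + BLOBS)
     */
    static long inlineIvCount(long statements, int arity,
                              long term2idEntries, long blobsEntries) {
        if (arity != 3 && arity != 4) {
            throw new IllegalArgumentException("arity must be 3 (triples) or 4 (quads)");
        }
        // Total number of value positions across all statements.
        long positions = Math.multiplyExact(statements, (long) arity);
        // Subtract the values that were NOT inlined.
        return positions - (term2idEntries + blobsEntries);
    }

    public static void main(String[] args) {
        // Made-up numbers, not taken from any real journal.
        long inline = inlineIvCount(1_000_000L, 3, 150_000L, 5_000L);
        System.out.println("Estimated inline IVs: " + inline); // prints 2845000
    }
}
```

If TERM2ID and BLOBS are small relative to the statement count, most values are being inlined; if they are large, the vocabulary or inline URI handler is probably missing some patterns.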