BSBM 3.1

meDavid
2011-10-26
2014-02-19
  • Bryan Thompson
    2011-10-26

    We are working on it right now. Bryan

     

  • Anonymous
    2011-11-09

    Hello,

       We're very interested in support for large literals, since storing large text as a literal is quite common in many datasets, but after the (slow) load of such datasets, Bigdata's performance degrades.

       What can we expect in the next release? How are big literals managed internally, and what does 'big' mean in terms of text size? For large literals, how important is the difference between a hard disk and an SSD? Any relevant technical information is appreciated (e.g. Sesame vs. Bigdata on large literals).

    Thank you for this great piece of software.

     
  • Bryan Thompson
    2011-11-09

    You can get this right now in the TERMS_REFACTOR_BRANCH.  This is the basis for our 1.1 release, which should be out this month.  We are wrapping up support for a merge join pattern right now and then have a few minor issues left to close out before a release.

    Large literals/URIs are stored in a "BLOBS" index in the TERMS_REFACTOR_BRANCH and in the 1.1 release.  That index uses a key based on the type of the RDF Value, the hash code of the RDF Value, and a collision counter, which gives a fixed-length 8-byte key.  The value stored under that key may be very large: it is written as a raw record on the Journal, so literals of several megabytes are certainly practical.  Pushing gigabyte-sized literals through the JVM heap might cause memory-management problems under Java, but the hard limit is that a Literal must be less than int32 bytes.
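    As a rough sketch of that key construction (the field widths and type tags below are assumptions for illustration, not the actual Bigdata layout), the fixed-length key can be packed like this:

```java
import java.nio.ByteBuffer;

public class BlobKeySketch {

    // Illustrative layout only: 1-byte value-type tag + 4-byte hash code
    // + 2-byte collision counter + 1 byte of padding = a fixed 8-byte key.
    // The real Bigdata key layout may differ.
    static byte[] makeKey(byte valueType, int hashCode, short collisionCounter) {
        final ByteBuffer buf = ByteBuffer.allocate(8);
        buf.put(valueType);             // URI vs. Literal vs. BNode tag
        buf.putInt(hashCode);           // hash code of the RDF Value
        buf.putShort(collisionCounter); // disambiguates hash collisions
        buf.put((byte) 0);              // padding to the fixed key length
        return buf.array();
    }

    public static void main(String[] args) {
        final byte[] key = makeKey((byte) 1, "a very large literal".hashCode(), (short) 0);
        System.out.println(key.length); // always 8, regardless of literal size
    }
}
```

    Because the key length is fixed, the B+Tree never has to hold the (possibly huge) literal in its nodes; the key resolves to a raw record on the Journal instead.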

    Try loading the BSBM data set under the TERMS_REFACTOR_BRANCH.  Performance should be fine.

    The RWStore is happy with SAS (Serial Attached SCSI) or SSD.  Scaling is limited with SATA disks since the SATA controllers are not smart enough to reorder the writes.

     

  • Anonymous
    2011-11-12

    Thank you for your reply.

    One last question about literals. It's common to store large text literals as XMLLiterals because the text is structured. If the store has no full-text indexing, storing them as plain strings or as XMLLiterals doesn't matter.

      When full-text indexing is enabled (Lucene based, right?), how are XMLLiterals indexed? To avoid indexing them as plain text, XML pre-processing is required to extract the content before indexing. Is this supported by Bigdata? If not, I think it would be trivial to add as a feature.

      And when using typed literals like XMLLiterals, is it possible to use a language tag too? I mean, to specify the language of the indexed text that is encoded in the XML.

    Thanks again

     
  • Bryan Thompson
    2011-11-12

    Bigdata has a native free text index (based on our own B+Tree package, not Lucene).  Some 3rd-party extensions provide Lucene support, but none of them have been ported to the 1.1 release yet.

    XML Literals are not handled specially right now.  This should not be too difficult, but it is not a scheduled feature.  You would want to modify BigdataRDFFullTextIndex at around line 290 to support this:
    {{{
                index(buffer, termId, 0/* fieldId */, languageCode,
                        new StringReader(text));
    }}}

    In addition to replacing the simple StringReader, you would need to modify the check at line 260, which currently skips all datatype literals:
    {{{
                if (!indexDatatypeLiterals && lit.getDatatype() != null) {

                    // do not index datatype literals in this manner.
                    continue;

                }
    }}}

    Then, when the appropriate datatype is observed, run a SAX parser over the XML Literal and feed its text content into the free text indexer.
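    A minimal sketch of that extraction step, using the JDK's built-in SAX parser to pull the character content out of an XML literal so it can be handed to the free text indexer via a StringReader (the class and method names here are illustrative, not part of Bigdata):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlLiteralText {

    /** Extract the character content of an XML literal as plain text. */
    static String extractText(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        // Collect text nodes, separating them with a space
                        // so adjacent words are not fused together.
                        sb.append(ch, start, length).append(' ');
                    }
                });
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // The extracted text could then be wrapped in a StringReader and
        // passed to the index(...) call shown above.
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```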

    We should probably make finer distinctions than datatype vs. no datatype, since xsd:string should be indexed.  The things we typically do not want to index are the xsd numerics, since you can get those for free by going directly to the lexicon indices.
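    That policy could be captured in a small predicate along these lines (the class and method names are illustrative; only the XSD datatype URIs are standard):

```java
import java.util.Set;

public class IndexableDatatypes {

    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Numeric datatypes that the lexicon indices already cover, so there
    // is no need to put them into the free text index.
    private static final Set<String> NUMERIC = Set.of(
            XSD + "int", XSD + "integer", XSD + "long", XSD + "short",
            XSD + "byte", XSD + "decimal", XSD + "float", XSD + "double");

    /** Should a literal with this datatype go into the free text index? */
    static boolean shouldIndex(String datatypeUri) {
        if (datatypeUri == null) {
            return true;                       // plain literals
        }
        if (datatypeUri.equals(XSD + "string")) {
            return true;                       // xsd:string is still text
        }
        return !NUMERIC.contains(datatypeUri); // skip the xsd numerics
    }
}
```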

    If you want to post some code which does the right thing for you, I can take a look at integrating it.

    Thanks,
    Bryan

     
  • meDavid
    2012-03-14

    Any word on the results? I couldn't find them anywhere on the net.

     
  • meDavid
    2012-03-20

    Why is query 5 never executed? The previous (reduced 3.0) run excluded that query because it affected performance too much, I believe due to the large literals. Shouldn't this be resolved?

    (See also unapproved comment on http://www.bigdata.com/bigdata/blog/?p=412#comment-5024 )

     
  • Bryan Thompson
    2012-03-20

    Query 5 has little to do with large literals.  The BLOBS index was introduced to address the presence of large literals in the indices during load.  Query 5 appears to be mainly related to the elimination of intermediate variables through an appropriate rewrite of the query into sub-selects and hash joins.  We'll get back to it eventually, but meanwhile we use the reduced query mix for benchmarking.

    The real benchmark is always your own application.  And to get the most out of the database, we offer consulting for people on their applications.