Usage of InverseFunctionalProperties in Bigdata

Help
2014-03-06
  • Antoni Mylka

    Antoni Mylka - 2014-03-06

    Hi,

    We're trying to load a dataset into bigdata: a single SPARQL UPDATE query with about 32 thousand triples. I stopped the evaluation after 19 hours because bigdata had created a 365-gigabyte temp file and run out of disk space. I suspect that the problem lies with our data, but it's rather fundamental to the architecture of our system, so I'd be very grateful for a clear confirmation before I start rearchitecting stuff.

    The data makes extensive use of an InverseFunctionalProperty. The dataset contains about 10 thousand resources, and for each resource there is one triple with an inverse functional predicate. Within the dataset there are 72 distinct values of that predicate used by more than one resource; the largest groups have more than a thousand resources.

    Assuming that N resources with the same value of an inverse functional predicate translate to 2(N-1)^2 owl:sameAs triples, I did a little calculation: my 32 thousand triples would translate to about 11.2 million owl:sameAs triples.
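
    For reference, the entailment behind this estimate is the inverse functional property rule; a sketch of the count (the exact constant depends on whether the reasoner materializes both directions and reflexive sameAs triples, but it is quadratic in the group size either way):

    \[
    x_1 \,p\, y \;\wedge\; x_2 \,p\, y \;\Longrightarrow\; x_1 \;\texttt{owl:sameAs}\; x_2
    \qquad (p \text{ inverse functional})
    \]
    \[
    N \text{ subjects sharing one value of } p \;\Rightarrow\; N(N-1) \approx N^2 \text{ sameAs triples.}
    \]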

    AFAIU the real issue is that we use "indirect links". The dataset is about JIRA issues. We model the issues as one class and issue resolutions as another class. But we don't say:

    abox:PROJECT-234 tbox:issueResolution abox:Fixed .
    

    instead we say:

    abox:PROJECT-234 tbox:issueResolution <urn:somethingRandom> .
    <urn:somethingRandom> tbox:id <urn:someid> .
    abox:Fixed tbox:id <urn:someid> .
    

    ... where tbox:id is an inverse functional predicate.

    If there are about 800 Fixed issues and each one points at a different random resolution, then each Fixed issue will get 801 issueResolution links. Each issue has 7 properties modelled in this way: status, resolution, priority, type, assignee, reporter and project. AFAIU this places the total number of triples to infer somewhere in the tens of millions.
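
    Spelling out the arithmetic for one group (a back-of-the-envelope sketch using the numbers above):

    \[
    \underbrace{(800 + 1)}_{\text{sameAs clique: random resolutions} \,\cup\, \{\texttt{abox:Fixed}\}} \times\; 800 \text{ issues} \;=\; 640{,}800 \text{ issueResolution triples}
    \]

    for the Fixed group alone; seven such properties across all the value groups push the total into the tens of millions.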

    Can it work at all? Can bigdata cope with a dataset where 32 thousand explicit triples yield, say, 50 million inferred ones, given enough processing power? Is the size of the temp file normal? We use rev 6933, which is quite old; would an upgrade or any magic config switch help?

     
    • Bryan Thompson

      Bryan Thompson - 2014-03-06

      Without commenting on the data model, you need to use database-at-once closure rather than incremental truth maintenance to avoid the large temporary file. Database-at-once closure is exposed by the BigdataSail, but not at the SPARQL UPDATE layer. There is an open ticket to extend SPARQL UPDATE to allow control of this feature. If you would like to fund that, please let us know. It would allow you to keep using the REST API while specifying when to use incremental truth maintenance, when to use database-at-once closure, and when to load data without truth maintenance.

      Bryan

      • Bryan Thompson

        Bryan Thompson - 2014-03-06

        Yes, bigdata can handle it using the database-at-once closure approach for the initial bulk load. That is not to say that this is the best way to model the problem...

  • Antoni Mylka

    Antoni Mylka - 2014-03-06

    Thanks for the answer. Allow me to summarize the way I understand the situation:

    1. Under default settings, datasets where thousands of resources share the same value of an inverse functional property simply cannot be loaded. Nothing significant will change in this respect after an upgrade from rev 6933 to the current one, or when running with 32G of heap instead of 6G.
    2. I can add the com.bigdata.rdf.sail.truthMaintenance=false line to the .properties file used by the bigdata webapp. My load then goes through NanoSparqlServer in about 5 seconds. There are no inferred triples, though, and I can't manually trigger the rebuild of the inference closure through the REST API; from the POV of my app it's as if I had disabled inference completely.
    3. When you expose that feature through the REST API, I will be able to turn truth maintenance on and off remotely. Turning it on will regenerate whatever data structures are necessary; turning it off will delete them. In my app I will be able to turn it off for large bulk-load operations. When it's off, there will be some feature to manually rebuild the inference closure, and rebuilding the closure without truth maintenance will be much faster than loading my data with truth maintenance enabled. In that mode the database will be essentially read-only, as after every update all inferred statements will "disappear" and only reappear once the closure is rebuilt.
    4. I can stay with the current functionality and change the modelling: rely less on inverse functional predicates and make sure that groups of sameAs resources are small, say below 10.
    5. I can also stay with the current functionality and leave the data as it is, but disable inference completely and simulate it with more complex SPARQL queries (see the sketch after this list). Loads will be fast, but queries will get slower and more complex.
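
    For option 5, a hypothetical sketch of the explicit join that would replace the sameAs inference (the expanded tbox: namespace URI and the tbox:Resolution class are my assumptions; the thread never spells them out):

        import org.openrdf.query.QueryLanguage;
        import org.openrdf.query.TupleQueryResult;
        import org.openrdf.repository.RepositoryConnection;

        // Option 5: simulate the inverse-functional link with an explicit
        // join on tbox:id, so no materialized sameAs triples are needed.
        public final class IndirectLinkQuery {

            // The namespace URI and the tbox:Resolution class are hypothetical.
            static final String QUERY =
                  "PREFIX tbox: <http://example.org/tbox#>\n"
                + "SELECT ?issue ?resolution WHERE {\n"
                + "  ?issue tbox:issueResolution ?r .\n"  // link to the random intermediate
                + "  ?r tbox:id ?id .\n"                  // its inverse functional id
                + "  ?resolution tbox:id ?id .\n"         // the canonical resource with the same id
                + "  ?resolution a tbox:Resolution .\n"   // restrict to the canonical class
                + "}";

            public static void print(final RepositoryConnection cxn) throws Exception {
                final TupleQueryResult result =
                        cxn.prepareTupleQuery(QueryLanguage.SPARQL, QUERY).evaluate();
                try {
                    while (result.hasNext()) {
                        System.out.println(result.next());
                    }
                } finally {
                    result.close();
                }
            }
        }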
     
    • Bryan Thompson

      Bryan Thompson - 2014-03-06

      The easy solution for you is to use the computeClosure() method on BigdataSail.BigdataSailConnection. That method and removeAllEntailments() are all you need to manage the entailments.

      computeClosure

      public void computeClosure()
      throws org.openrdf.sail.SailException

      Computes the closure of the triple store for RDF(S)+ entailments.

      This computes the closure of the database. This can be used if you do NOT enable truth maintenance and choose instead to load up all of your data first and then compute the closure of the database. Note that some rules may be computed by eager closure while others are computed at query time.

      Note: If there are already entailments in the database AND you have retracted statements since the last time the closure was computed, then you MUST delete all entailments from the database before re-computing the closure.

      Note: This method does NOT commit the database. See AbstractTripleStore.commit() (http://www.bigdata.com/docs/api/com/bigdata/rdf/store/AbstractTripleStore.html#commit%28%29) and getTripleStore() (http://www.bigdata.com/docs/api/com/bigdata/rdf/sail/BigdataSail.BigdataSailConnection.html#getTripleStore%28%29).

      Throws:
      org.openrdf.sail.SailException

      See Also:
      removeAllEntailments() (http://www.bigdata.com/docs/api/com/bigdata/rdf/sail/BigdataSail.BigdataSailConnection.html#removeAllEntailments%28%29)
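
      Putting the pieces together, a minimal load-then-closure sketch at the Sail level (the file name, base URI and surrounding wiring are illustrative assumptions; only the truthMaintenance property key, computeClosure() and removeAllEntailments() come from this thread, and the exact wiring may differ in rev 6933):

          import java.io.File;
          import java.util.Properties;

          import org.openrdf.repository.sail.SailRepositoryConnection;
          import org.openrdf.rio.RDFFormat;

          import com.bigdata.rdf.sail.BigdataSail;
          import com.bigdata.rdf.sail.BigdataSailRepository;

          // Sketch: bulk load with truth maintenance disabled, then one
          // database-at-once closure over the loaded data.
          public class BulkLoadThenClosure {

              public static void main(final String[] args) throws Exception {
                  final Properties props = new Properties();
                  // Same property key quoted earlier in this thread.
                  props.setProperty("com.bigdata.rdf.sail.truthMaintenance", "false");

                  final BigdataSail sail = new BigdataSail(props);
                  final BigdataSailRepository repo = new BigdataSailRepository(sail);
                  repo.initialize();

                  final SailRepositoryConnection cxn = repo.getConnection();
                  try {
                      // 1. Load the data; no entailments are computed at this point.
                      cxn.setAutoCommit(false);
                      cxn.add(new File("issues.ttl"), "http://example.org/", RDFFormat.TURTLE);
                      cxn.commit();

                      // 2. Database-at-once closure, per the javadoc above. If any
                      //    statements were retracted since the last closure, call
                      //    removeAllEntailments() first.
                      final BigdataSail.BigdataSailConnection sailCxn =
                              (BigdataSail.BigdataSailConnection) cxn.getSailConnection();
                      sailCxn.computeClosure();
                      cxn.commit(); // computeClosure() does NOT commit the database.
                  } finally {
                      cxn.close();
                      repo.shutDown();
                  }
              }
          }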

      The relevant ticket to expose this capability through SPARQL UPDATE is

      Ticket 595: Manage truth maintenance in SPARQL UPDATE (http://trac.bigdata.com/ticket/595)

      That ticket would allow you to continue to operate over the REST interface while managing the truth maintenance behavior and the add/drop of the entailments through the more efficient database-at-once closure.

      Incremental truth maintenance IS more efficient for many cases, but it is never more efficient for bulk load.

      Thanks,
      Bryan

