Without commenting on the data model, you need to use database-at-once closure rather than incremental truth maintenance to avoid the large temporary file. Database-at-once closure is exposed by the BigdataSail, but not at the SPARQL UPDATE layer. There is an open ticket to extend SPARQL UPDATE to allow control of this feature; if you would like to fund that work, please let us know. It would allow you to continue to use the REST API while specifying, per request, whether to use incremental truth maintenance, database-at-once closure, or no truth maintenance at all when loading data.
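
In the meantime, going through the Sail API directly looks roughly like the sketch below. This is only a sketch: I am writing the method names from memory against the API of that era, and bigdata.properties / dataset.ttl are placeholder names, so double-check everything against the javadoc for your revision.

    import java.io.File;
    import java.util.Properties;

    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.rio.RDFFormat;

    import com.bigdata.rdf.sail.BigdataSail;
    import com.bigdata.rdf.sail.BigdataSailRepository;

    public class DatabaseAtOnceClosure {

        public static void main(final String[] args) throws Exception {

            // Load your usual bigdata configuration (placeholder path).
            final Properties props = new Properties();
            props.load(DatabaseAtOnceClosure.class
                    .getResourceAsStream("/bigdata.properties"));

            // Turn OFF incremental truth maintenance for this load.  The
            // KB must still be configured for inference (axioms,
            // vocabulary); only the incremental maintenance is disabled.
            props.setProperty(BigdataSail.Options.TRUTH_MAINTENANCE, "false");

            final BigdataSail sail = new BigdataSail(props);
            final BigdataSailRepository repo = new BigdataSailRepository(sail);
            repo.initialize();

            final RepositoryConnection cxn = repo.getConnection();
            try {
                cxn.setAutoCommit(false);

                // Load the explicit statements; nothing is inferred yet.
                cxn.add(new File("dataset.ttl"), null /* baseURI */,
                        RDFFormat.TURTLE);
                cxn.commit();

                // Compute the closure of the entire database in one batch
                // (a null focus store means database-at-once closure).
                sail.getDatabase().getInferenceEngine().computeClosure(null);

            } finally {
                cxn.close();
                repo.shutDown();
            }
        }
    }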

Bryan

On Mar 6, 2014, at 6:57 AM, "Antoni Mylka" <mylka@users.sf.net> wrote:

Hi,

We're trying to load a dataset into bigdata: a single SPARQL UPDATE query with about 32 thousand triples. I stopped the evaluation after 19 hours because bigdata had created a temp file of 365 gigabytes and run out of disk space. I suspect the problem lies with our data, but the pattern in question is fundamental to the architecture of our system, so I'd be very grateful for a clear confirmation before I start rearchitecting stuff.

The data makes extensive use of an InverseFunctionalProperty. The dataset contains about 10 thousand resources, and for each resource there is one triple with an inverse functional predicate. Within the dataset there are 72 distinct values of that predicate that are used by more than one resource; the largest groups have more than a thousand resources.

Assuming that N resources with the same value of an inverse functional predicate translate to N(N-1) owl:sameAs triples (one for each ordered pair), I did a little calculation and my 32 thousand triples would translate to about 11.2 million owl:sameAs triples.
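
(For scale: a single group of 1,000 resources sharing one value already accounts for 1,000 × 999 ≈ 1 million sameAs triples, so the few largest groups dominate the total.)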

AFAIU the real issue is that we use "indirect links". The dataset is about JIRA issues: we model the issues as one class and issue resolutions as another. But we don't say:

abox:PROJECT-234 tbox:issueResolution abox:Fixed .

instead we say:

abox:PROJECT-234 tbox:issueResolution <urn:somethingRandom> .
<urn:somethingRandom> tbox:id <urn:someid> .
abox:Fixed tbox:id <urn:someid> .

... where tbox:id is an inverse functional predicate.
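
To make the inference explicit: because tbox:id is inverse functional and both resources carry the same tbox:id value, a reasoner derives

<urn:somethingRandom> owl:sameAs abox:Fixed .
abox:Fixed owl:sameAs <urn:somethingRandom> .

and sameAs substitution then copies every tbox:issueResolution triple pointing at <urn:somethingRandom> onto abox:Fixed and onto every other member of the same equivalence class.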

If there are about 800 Fixed issues and each one points at a different random resolution node, then each Fixed issue ends up with 801 issueResolution links. Each issue has 7 properties modelled in this way: status, resolution, priority, type, assignee, reporter and project. AFAIU this places the total number of triples to infer somewhere in the tens of millions.
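
Spelling the arithmetic out: 800 issues × 801 links is already about 640 thousand issueResolution triples for that single resolution value, on top of the roughly 800 × 800 sameAs triples within the equivalence class itself. Repeated across 7 such properties and all their values, tens of millions looks like the right order of magnitude.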

Can it work at all? Can bigdata cope with a dataset where 32 thousand explicit triples yield, say, 50 million inferred ones, given enough processing power? Is the size of the temp file normal? We use rev 6933, which is quite old; would an upgrade or any magic config switch help?

