From: Bryan T. <br...@sy...> - 2015-04-14 18:13:07
Jeremy,

Thank you for that question ;-). First, I agree: we should improve the status information for UPDATE queries.

SPARQL UPDATE is handled by AST2BOpUpdate. The DELETE / INSERT WHERE pattern is handled at ~ line 570. I have inlined this bit of the code below, but I also suggest reviewing it in place. As you can see from the comment, this requires us to run the WHERE clause once. See below for more thoughts on this.

/*
 * DELETE + INSERT.
 *
 * Note: The semantics of DELETE + INSERT are that the WHERE
 * clause is executed once. The solutions to that need to be fed
 * once through the DELETE clause. After the DELETE clause has
 * been processed for all solutions to the WHERE clause, the
 * INSERT clause is then processed. So, we need to materialize
 * the WHERE clause results when both the DELETE clause and the
 * INSERT clause are present.
 *
 * FIXME For large intermediate results, we would be much better
 * off putting the data onto an HTree (or, better yet, a chain
 * of blocks) and processing the bindings as IVs rather than
 * materializing them as RDF Values (and even for small data
 * sets, we would be better off avoiding materialization of the
 * RDF Values and using an ASTConstructIterator which builds
 * ISPOs using IVs rather than Values).
 *
 * Note: Unlike operations against a graph, we do NOT perform
 * truth maintenance for updates against solution sets,
 * therefore we could get by nicely with operations on
 * IBindingSet[]s without RDF Value materialization.
 *
 * @see https://sourceforge.net/apps/trac/bigdata/ticket/524
 * (SPARQL Cache)
 */

There is also a Blazegraph-specific REST API UPDATE request [1] that accepts a query identifying the statements to be removed and a request entity indicating the statements to be added. This API method uses a streaming approach. Unfortunately, that streaming approach is not compatible with the group commit isolation semantics introduced in 1.5.1. So, if group commit is enabled for the endpoint, the query results are fully materialized before we delete anything. If group commit is NOT enabled, there is still a scalable code path that incrementally deletes statements as they are materialized by the query.

So, per the comment block above, we could improve performance for DELETE/INSERT WHERE (the general case, where both the DELETE and the INSERT clauses are specified). It would be helpful to understand why the update process slowed down for you. I suspect that the JVM heap may have gone into overdrive if the materialized result set was large enough. In that case, simply writing it onto a SolutionSetStream or HTree might be enough to provide better scaling ergonomics.

The SPARQL UPDATE handling currently operates at the SailConnection level. This means that we are materializing the RDF Values in the result set. This is not strictly necessary and incurs additional overhead from both dictionary lookups and the added heap impact of RDF Values over IVs. Originally I tried to write the code at a lower level, but got bitten several times by the specific semantics of SPARQL UPDATE and the SailConnection. The way it is currently organized makes it significantly easier to be correct, but it misses some opportunities for efficiency. The thing is, we need to watch out for the preconditions which would allow those efficiencies, e.g., whether truth maintenance is enabled for the backing triple store instance.
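For illustration only (this is not our internal code path), here is a minimal sketch of your CONSTRUCT / DROP / LOAD rewrite expressed against the Sesame RepositoryConnection API, which Blazegraph implements. The repository handle (e.g., a BigdataSailRepository), graph URI, and query string are assumed placeholders, and it assumes the Sesame 2.7-style begin()/commit() transaction calls:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.query.GraphQuery;
import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFWriter;
import org.openrdf.rio.Rio;

public class RenameByReload {

    /**
     * Stream the rewritten statements to a temporary N-Triples file, then
     * replace the contents of the target graph with that file. Each step
     * is its own separately committed operation; the CONSTRUCT solutions
     * are spooled to disk as they arrive instead of being buffered as
     * materialized RDF Values on the JVM heap.
     */
    public static void renameGraph(final Repository repo, final String abox,
            final String constructQuery) throws Exception {

        final File tmp = File.createTempFile("rename", ".nt");
        final RepositoryConnection con = repo.getConnection();
        try {
            final ValueFactory vf = con.getValueFactory();
            final URI graph = vf.createURI(abox);

            // (1) CONSTRUCT: stream the renamed statements to disk.
            final GraphQuery gq = con.prepareGraphQuery(
                    QueryLanguage.SPARQL, constructQuery);
            final OutputStream os = new FileOutputStream(tmp);
            try {
                final RDFWriter writer = Rio.createWriter(RDFFormat.NTRIPLES, os);
                gq.evaluate(writer); // statements written incrementally
            } finally {
                os.close();
            }

            // (2) DROP GRAPH <abox>: one atomic, isolated operation.
            con.begin();
            con.clear(graph);
            con.commit();

            // (3) LOAD the file INTO GRAPH <abox>: a second atomic operation.
            con.begin();
            con.add(tmp, null/* baseURI */, RDFFormat.NTRIPLES, graph);
            con.commit();
        } finally {
            con.close();
            tmp.delete();
        }
    }
}

The point of the pattern is that step (1) streams solutions to disk as they are produced, so the full result set is never held on the JVM heap as materialized RDF Values. The price, as you note below, is that the three steps are separately committed operations rather than one atomic UPDATE.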
Thanks,
Bryan

[1] http://wiki.blazegraph.com/wiki/index.php/NanoSparqlServer#UPDATE_.28DELETE_statements_selected_by_a_QUERY_plus_INSERT_statements_from_Request_Body_using_PUT.29

----
Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
http://blazegraph.com

On Tue, Apr 14, 2015 at 12:58 PM, Jeremy J Carroll <jj...@sy...> wrote:

> I found a CONSTRUCT and LOAD much more performant than a DELETE/INSERT,
> and was wondering why, and whether there is anything new (to me) about
> the blazegraph architecture that I should understand.
>
> =====
>
> I had a graph for which I wished to rename almost all URIs.
> The graph had about 3M triples.
> I was working in AWS.
>
> I constructed a temporary graph with a rename mapping and then tried
> the following update query:
>
> DELETE {
>   GRAPH <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
> }
> INSERT {
>   GRAPH <%(abox)s> {
>     ?newS ?newP ?newO
>   }
> }
> WHERE {
>   graph <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldS <x-eg:replaced-by> ?newS
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldP <x-eg:replaced-by> ?newP
>   }
>   {
>     GRAPH <x-eg:temporary-graph> {
>       ?oldO <x-eg:replaced-by> ?newO
>     }
>   } UNION {
>     graph <%(abox)s> {
>       ?oldS ?oldP ?oldO
>     }
>     FILTER ( isLiteral(?oldO) )
>     BIND ( ?oldO as ?newO )
>   }
> }
>
> where <%(abox)s> is a variable.
>
> At the point where we perform this query we have exclusive access to
> the blazegraph process.
>
> It took over 4 hours, with approx. the first hour showing some change
> in the query execution stats, and then the last 3 hours showing no
> change in the stats (the status page in the NSS display is not very
> useful with these update queries). After 4 hours I got bored. Cancel
> did not work. So I killed blazegraph and restarted.
>
> I then rewrote the code as follows.
>
> I wrote a construct query:
>
> CONSTRUCT {
>   ?newS ?newP ?newO
> }
> WHERE {
>   graph <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldS <x-eg:replaced-by> ?newS
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldP <x-eg:replaced-by> ?newP
>   }
>   {
>     GRAPH <x-eg:temporary-graph> {
>       ?oldO <x-eg:replaced-by> ?newO
>     }
>   } UNION {
>     graph <%(abox)s> {
>       ?oldS ?oldP ?oldO
>     }
>     FILTER ( isLiteral(?oldO) )
>     BIND ( ?oldO as ?newO )
>   }
> }
>
> This created a temporary file.
>
> I replaced the DELETE part with
>
> DROP GRAPH <%(abox)s>
>
> and the INSERT with
>
> LOAD <file://%(tmpfile)s> INTO GRAPH <%(abox)s>
>
> ====
>
> The rewritten code took only a few minutes (less than 5 in total).
> I was expecting some improvement, but not as much as I saw.
> My understanding is that each of the three operations is atomic and
> isolated, but I lost the guarantee linking the three (which I did not
> need, since I had an exclusive lock at a higher level).
>
> Was it the atomicity that cost so much?
>
> Jeremy