From: Bryan T. <br...@sy...> - 2015-04-14 18:13:07
Jeremy,

Thank you for that question ;-). First, I agree: we should improve the status information for UPDATE queries.

SPARQL UPDATE is handled by AST2BOpUpdate. The DELETE / INSERT WHERE pattern is handled at ~ line 570. I have inlined this bit of the code below, but I also suggest reviewing it in place. As you can see from the comment, this requires us to run the WHERE clause once. See below for more thoughts on this.

/*
 * DELETE + INSERT.
 *
 * Note: The semantics of DELETE + INSERT are that the WHERE
 * clause is executed once. The solutions to that need to be fed
 * once through the DELETE clause. After the DELETE clause has
 * been processed for all solutions to the WHERE clause, the
 * INSERT clause is then processed. So, we need to materialize
 * the WHERE clause results when both the DELETE clause and the
 * INSERT clause are present.
 *
 * FIXME For large intermediate results, we would be much better
 * off putting the data onto an HTree (or, better yet, a chain
 * of blocks) and processing the bindings as IVs rather than
 * materializing them as RDF Values (and even for small data
 * sets, we would be better off avoiding materialization of the
 * RDF Values and using an ASTConstructIterator which builds
 * ISPOs using IVs rather than Values).
 *
 * Note: Unlike operations against a graph, we do NOT perform
 * truth maintenance for updates against solution sets,
 * therefore we could get by nicely with operations on
 * IBindingSet[]s without RDF Value materialization.
 *
 * @see https://sourceforge.net/apps/trac/bigdata/ticket/524
 * (SPARQL Cache)
 */

There is also a Blazegraph-specific REST API UPDATE request [1] that accepts a query identifying the statements to be removed and a request entity indicating the statements to be added. This API method uses a streaming approach. Unfortunately, that streaming approach is not compatible with the group commit isolation semantics introduced in 1.5.1. So, if group commit is enabled for the endpoint, the query results are fully materialized before we delete anything. If group commit is NOT enabled, there is still a scalable code path that incrementally deletes statements as they are materialized by the query.

So, per the comment block above, we could improve performance for DELETE/INSERT WHERE (the general case, where both the DELETE and the INSERT clauses are specified). It would be helpful to understand why the update process slowed down for you. I suspect that the JVM heap may have gone into overdrive if the materialized result set was large enough. In that case, simply writing it onto a SolutionSetStream or HTree might be enough to provide better scaling ergonomics.

The SPARQL UPDATE handling currently operates at the SailConnection level. This means that we are materializing the RDF Values in the result set. This is not strictly necessary and incurs additional overhead from both dictionary lookups and the added heap impact of RDF Values over IVs. Originally I tried to write the code at a lower level, but got bitten several times by the specific semantics of SPARQL UPDATE and the SailConnection. The way it is currently organized makes it significantly easier to be correct, but it misses some opportunities for efficiency. The thing is, we need to watch out for the preconditions which would allow those efficiencies, e.g., whether truth maintenance is enabled for the backing triple store instance.
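For illustration only (this is not our internal code path), here is a minimal sketch of your CONSTRUCT / DROP / LOAD rewrite expressed against the Sesame RepositoryConnection API, which Blazegraph implements. The repository handle (e.g., a BigdataSailRepository), graph URI, and query string are assumed placeholders, and it assumes the Sesame 2.7-style begin()/commit() transaction calls:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.query.GraphQuery;
import org.openrdf.query.QueryLanguage;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFWriter;
import org.openrdf.rio.Rio;

public class RenameByReload {

    /**
     * Stream the rewritten statements to a temporary N-Triples file, then
     * replace the contents of the target graph with that file. Each step
     * is its own separately committed operation; the CONSTRUCT solutions
     * are spooled to disk as they arrive instead of being buffered as
     * materialized RDF Values on the JVM heap.
     */
    public static void renameGraph(final Repository repo, final String abox,
            final String constructQuery) throws Exception {

        final File tmp = File.createTempFile("rename", ".nt");
        final RepositoryConnection con = repo.getConnection();
        try {
            final ValueFactory vf = con.getValueFactory();
            final URI graph = vf.createURI(abox);

            // (1) CONSTRUCT: stream the renamed statements to disk.
            final GraphQuery gq = con.prepareGraphQuery(
                    QueryLanguage.SPARQL, constructQuery);
            final OutputStream os = new FileOutputStream(tmp);
            try {
                final RDFWriter writer = Rio.createWriter(RDFFormat.NTRIPLES, os);
                gq.evaluate(writer); // statements written incrementally
            } finally {
                os.close();
            }

            // (2) DROP GRAPH <abox>: one atomic, isolated operation.
            con.begin();
            con.clear(graph);
            con.commit();

            // (3) LOAD the file INTO GRAPH <abox>: a second atomic operation.
            con.begin();
            con.add(tmp, null/* baseURI */, RDFFormat.NTRIPLES, graph);
            con.commit();
        } finally {
            con.close();
            tmp.delete();
        }
    }
}

The point of the pattern is that step (1) streams solutions to disk as they are produced, so the full result set is never held on the JVM heap as materialized RDF Values. The price, as you note below, is that the three steps are separately committed operations rather than one atomic UPDATE.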
Thanks,
Bryan

[1] http://wiki.blazegraph.com/wiki/index.php/NanoSparqlServer#UPDATE_.28DELETE_statements_selected_by_a_QUERY_plus_INSERT_statements_from_Request_Body_using_PUT.29

----
Bryan Thompson
Chief Scientist & Founder
SYSTAP, LLC
http://blazegraph.com

On Tue, Apr 14, 2015 at 12:58 PM, Jeremy J Carroll <jj...@sy...> wrote:

> I found a CONSTRUCT and LOAD much more performant than a DELETE/INSERT,
> and was wondering why, and whether there is anything new (to me) about
> the blazegraph architecture that I should understand.
>
> =====
>
> I had a graph for which I wished to rename almost all URIs.
> The graph had about 3M triples.
> I was working in AWS.
>
> I constructed a temporary graph with a rename mapping and then tried
> the following update query:
>
> DELETE {
>   GRAPH <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
> }
> INSERT {
>   GRAPH <%(abox)s> {
>     ?newS ?newP ?newO
>   }
> }
> WHERE {
>   graph <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldS <x-eg:replaced-by> ?newS
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldP <x-eg:replaced-by> ?newP
>   }
>   {
>     GRAPH <x-eg:temporary-graph> {
>       ?oldO <x-eg:replaced-by> ?newO
>     }
>   } UNION {
>     graph <%(abox)s> {
>       ?oldS ?oldP ?oldO
>     }
>     FILTER ( isLiteral(?oldO) )
>     BIND ( ?oldO as ?newO )
>   }
> }
>
> where <%(abox)s> is a variable.
>
> At the point where we perform this query we have exclusive access to
> the blazegraph process.
>
> It took over 4 hours, with approx. the first hour showing some change
> in the query execution stats, and then the last 3 hours showing no
> change in the stats (the status page in the NSS display is not very
> useful with these update queries). After 4 hours I got bored. Cancel
> did not work. So I killed blazegraph and restarted.
>
> I then rewrote the code as follows.
>
> I wrote a construct query:
>
> CONSTRUCT {
>   ?newS ?newP ?newO
> }
> WHERE {
>   graph <%(abox)s> {
>     ?oldS ?oldP ?oldO
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldS <x-eg:replaced-by> ?newS
>   }
>   GRAPH <x-eg:temporary-graph> {
>     ?oldP <x-eg:replaced-by> ?newP
>   }
>   {
>     GRAPH <x-eg:temporary-graph> {
>       ?oldO <x-eg:replaced-by> ?newO
>     }
>   } UNION {
>     graph <%(abox)s> {
>       ?oldS ?oldP ?oldO
>     }
>     FILTER ( isLiteral(?oldO) )
>     BIND ( ?oldO as ?newO )
>   }
> }
>
> This created a temporary file.
>
> I replaced the DELETE part with
>
> DROP GRAPH <%(abox)s>
>
> and the INSERT with
>
> LOAD <file://%(tmpfile)s> INTO GRAPH <%(abox)s>
>
> ====
>
> The rewritten code took only a few minutes (less than 5 in total).
> I was expecting some improvement, but not as much as I saw.
> My understanding is that each of the three operations is atomic and
> isolated, but I lost the guarantee linking the three (which I did not
> need, since I had an exclusive lock at a higher level).
>
> Was it the atomicity that cost so much?
>
> Jeremy