This list is closed, nobody may subscribe to it.
From: Antoni M. <ant...@ba...> - 2013-09-19 15:46:27
|
Hi, I think I found a bug. I downloaded the latest bigdata.war release from SourceForge, deployed it on the latest Tomcat release with the out-of-the-box configuration. Then I went to localhost:8080/bigdata and ran this INSERT query:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX umbel: <http://umbel.org/umbel/>
    INSERT DATA {
      <http://example/book1> skos:narrower <http://example/chapter1> .
      <http://example/book1> umbel:isRelatedToClass <http://example/book> .
    }

And then I tried this SELECT:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX umbel: <http://umbel.org/umbel/>
    SELECT ?p ?x WHERE {
      { ?concept skos:narrower ?x BIND (skos:narrower AS ?p) . }
      UNION
      { ?concept umbel:isRelatedToClass ?x BIND (umbel:isRelatedToClass AS ?p) . }
      FILTER (?concept IN (<http://example/book1>))
    }

I would expect this:

    ?p                      ?x
    skos:narrower           <http://example/chapter1>
    umbel:isRelatedToClass  <http://example/book>

But when I run this query many times, I always get either the first row or the second, but never both. This looks to me like a race condition somewhere in the code that handles BIND or UNION. Two questions:

1. Is this a bug? What should the behavior be?
2. I can rephrase this query to say { ?concept ?p ?x FILTER (?p IN (skos:narrower, umbel:isRelatedToClass)) . }, which works OK. Is the variant with BIND likely to perform better (when it works)? Could anyone confirm?

-- 
Antoni Myłka
Software Engineer
basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
http://www.basis06.ch - source of smart business
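For reference, the result set that SPARQL semantics call for here can be worked out by hand. The sketch below is plain Python, not bigdata or any RDF library: the triple list and the `union_with_bind` helper are hypothetical stand-ins used only to spell out what each UNION branch should contribute.

```python
# Illustrative stand-ins for the two predicates in the query above.
SKOS_NARROWER = "http://www.w3.org/2004/02/skos/core#narrower"
UMBEL_RELATED = "http://umbel.org/umbel/isRelatedToClass"

# The two triples inserted by the INSERT DATA above, as (s, p, o) tuples.
triples = [
    ("http://example/book1", SKOS_NARROWER, "http://example/chapter1"),
    ("http://example/book1", UMBEL_RELATED, "http://example/book"),
]

def union_with_bind(triples):
    """Each UNION branch matches one predicate and BINDs ?p to it;
    the FILTER keeps only solutions where ?concept = <http://example/book1>.
    The result is the concatenation of both branches' solutions."""
    results = []
    for branch_pred in (SKOS_NARROWER, UMBEL_RELATED):
        for s, p, o in triples:
            if p == branch_pred and s == "http://example/book1":
                results.append((branch_pred, o))  # one (?p, ?x) row
    return results

rows = union_with_bind(triples)
# Per UNION semantics, BOTH rows must come back:
#   (skos:narrower,          <http://example/chapter1>)
#   (umbel:isRelatedToClass, <http://example/book>)
```

If a conforming engine returns only one of the two rows on any run, that is a bug, which is consistent with the race-condition suspicion above.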
|
From: Jeremy C. <jj...@gm...> - 2013-09-12 17:33:46
|
This is my current list of bigdata-related work …

To-do:
- 732: CBD options, one-line fix
- 736: MIN - produce test case and initial exploration
- 739: BIND and optional path: test case and initial exploration
- 740: performance NSPIN, revisit
- 725: FILTER EXISTS - not really sure on next steps …
- 737: Class Cast Exception, do I need to do anything here?

(review earlier e-mails of such lists and see if I have forgotten something :) )

However, I have burned my bigdata-related time budget (on 740) for this week, and probably next week too, and need to get back to other non-bigdata work items. When I do get back to bigdata I will pick up on 732 as an easy win, and 736 and 739 as easy to move forward.

Jeremy
|
From: Bryan T. <br...@sy...> - 2013-09-12 13:25:04
|
I need to move the following interfaces from com.bigdata.striterator into cutthecrap.utils.striterator.
- com.bigdata.striterator.ICloseable
- com.bigdata.striterator.ICloseableIterator
This is to support the compilation of the CTC striterator package as a distinct module. Right now it depends on the com.bigdata.striterator package. I need to break that dependency.
This will touch a large number of files. However, it should be straightforward to reconcile any conflicts that result. Just fix up the imports for those interfaces.
Thanks,
Bryan
|
|
From: Bryan T. <br...@sy...> - 2013-09-12 12:34:42
|
Jeremy,
The code is not specifically optimized for a single or dual core CPU. If you are trying to tune performance for that situation, then I would recommend looking at the following properties:
- NSPIN - I think that this is a red herring, but who knows. My thought is that you are adjusting the likelihood of a context switch when changing this value. I would suggest working with the parameters discussed below and obtaining the stack frames for slow producers and consumers in order to understand what parts of the query are the bottleneck in your use case.
- CHUNK_CAPACITY - This is the size of a vectored chunk of solutions. The default is 100. Query performance can be improved for some queries by increasing this value. However, if you have a highly concurrent workload, then a larger value will increase the heap pressure and the GC time and result in lower throughput. Try 1,000 or 10,000. The larger the value, the fewer times any given operator will execute; therefore this can affect context switching. Larger values will tend to cause each operator to execute once and will therefore tend to increase the latency to the first result, but may decrease the latency to the last result.
    IChunkedIterator:: // This will affect iterator patterns.
    int DEFAULT_CHUNK_SIZE = 100;
    BufferAnnotations:: // This will affect query operators.
    int DEFAULT_CHUNK_SIZE = 100;
Some other relevant configuration options are defined on PipelineOp.Annotations. I can answer questions about the other options as you become oriented to this part of the code.
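The chunk-capacity tradeoff described above can be made concrete with a small sketch. This is illustrative plain Python, not bigdata code; the solution counts are hypothetical, and `chunks` simply stands in for how a vectored operator consumes its input one chunk at a time.

```python
def chunks(solutions, chunk_capacity):
    """Vector a stream of solutions into chunks of at most chunk_capacity,
    the way a vectored query operator would consume them."""
    for i in range(0, len(solutions), chunk_capacity):
        yield solutions[i:i + chunk_capacity]

# Hypothetical workload of 10,000 intermediate solutions.
solutions = list(range(10_000))

# Each chunk corresponds to one operator invocation, so a larger capacity
# means fewer invocations (less context switching) at the cost of a longer
# wait before the first chunk is emitted.
invocations_default = sum(1 for _ in chunks(solutions, 100))     # default capacity
invocations_large = sum(1 for _ in chunks(solutions, 10_000))    # one big chunk
```

With the default capacity of 100, the operator runs 100 times over this workload; at 10,000 it runs once, which is the "execute once, later first result" behavior described above.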
I am reassigning https://sourceforge.net/apps/trac/bigdata/ticket/740 to you. Please see my comments there.
Thanks,
Bryan
On 9/11/13 11:05 PM, "Jeremy J Carroll" <jj...@sy...<mailto:jj...@sy...>> wrote:
Since the typical scenario is multiple queries, multiple operators, and multiple operation execution phases all running in parallel, there is generally work available to be done somewhere.
Yes - the improvement in the multi client scenario in the report is less than in the single client scenario, but still pretty impressive.
I am of course thinking about the Syapse system, where each deployment may have a relatively small number of users (e.g. 10 or 20), only one or two of whom may be active at any one time. So we may have a server with, say, a dual-core processor with hyper-threading, actually serving just one person.
A different usage scenario for Syapse is a batch job with one enormous query.
While this may differ from the typical bigdata user, I don't think it is totally abnormal.
|
|
From: Jeremy J C. <jj...@sy...> - 2013-09-12 03:27:18
|
> Since the typical scenario is multiple queries, multiple operators, and multiple operation execution phases all running in parallel, there is generally work available to be done somewhere.

Yes - the improvement in the multi-client scenario in the report is less than in the single-client scenario, but still pretty impressive.

I am of course thinking about the Syapse system, where each deployment may have a relatively small number of users (e.g. 10 or 20), only one or two of whom may be active at any one time. So we may have a server with, say, a dual-core processor with hyper-threading, actually serving just one person.

A different usage scenario for Syapse is a batch job with one enormous query.

While this may differ from the typical bigdata user, I don't think it is totally abnormal.
|
From: Bryan T. <br...@sy...> - 2013-09-12 00:57:38
|
It might be low, but if I recall, when it falls out of that spin it is really just falling into another loop.

I just took a peek at the code. The whole thing is wrapped by a while(true). It checks for an asynchronous close. If there is nothing, then it drops into a non-blocking poll() in the NSPIN loop. Then it will drop into a blocking poll() with a timeout. If all of that fails, it is going to wind up reentering from the top of the loop.

When you play with NSPIN, you are playing with how long the CPU will spin on that thread looking for something from the producer. When you ramp that value up, it spins longer. If that results in higher throughput, then this may be a tradeoff point where less context switching is occurring and the net yield is better throughput.

However, normally the producer is dropping chunks of something (solutions, IVs, Values) onto the BlockingBuffer. If it hits the poll() with the timeout, then my expectation is that it will wake up a bit later and find that there is some work to be done. Since the typical scenario is multiple queries, multiple operators, and multiple operation execution phases all running in parallel, there is generally work available to be done somewhere.
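The spin-then-poll consumer loop described above can be sketched as follows. This is an illustrative Python stand-in (using `queue.Queue` in place of the BlockingBuffer), not the Java code; the `closed` callback and chunk strings are hypothetical.

```python
import queue

NSPIN = 100  # number of non-blocking polls before falling back to a timed wait

def consume(buf, closed):
    """Sketch of the _hasNext() loop: check for an asynchronous close, spin
    with non-blocking polls, then fall back to a blocking poll with a timeout,
    and re-enter from the top of the loop if all of that comes up empty."""
    out = []
    while True:
        if closed() and buf.empty():        # asynchronous close, nothing left
            return out
        chunk = None
        for _ in range(NSPIN):              # spin phase: cheap non-blocking polls
            try:
                chunk = buf.get_nowait()
                break
            except queue.Empty:
                pass
        if chunk is None:
            try:
                chunk = buf.get(timeout=0.01)  # blocking poll with a timeout
            except queue.Empty:
                continue                       # re-enter from the top of the loop
        out.append(chunk)

# Hypothetical producer has already dropped two chunks and closed the buffer.
buf = queue.Queue()
for c in ("chunk-1", "chunk-2"):
    buf.put(c)
result = consume(buf, closed=lambda: True)
```

Raising NSPIN stretches the spin phase, trading CPU burn for fewer context switches before the timed wait, which is the tradeoff discussed above.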
Try getting those stack frames and also see what's going into / out of the buffer. There is a log timeout that you can mess with if you want to see when the producer is slow (the consumer is blocking). You can enable that with

    private static final boolean producerConsumerWarnings = false;

But, again, this is typically because of a bad join in the plan. For example, you might be spinning waiting for the final solutions while the join is doing too much work and the work is getting eliminated by a filter.

If you set the log @ INFO it should grab those stack frames. It will log them automatically in _hasNext() if the logTimeout is exceeded and the logger is at INFO or finer.
Bryan
On 9/11/13 8:35 PM, "Jeremy J Carroll" <jj...@sy...> wrote:
>I will try and work out how to get you something more concrete tomorrow.
>
>I thought your 100 looked somewhat low for a spin lock, since I
>remembered being surprised at how high
>java.util.concurrent.locks.AbstractQueuedLongSynchronizer.spinForTimeoutThreshold
>is (1000 ns, maybe 2000 spins) … then suck it and see pushed the number
>higher.
>
>Jeremy J Carroll
>Principal Architect
>Syapse, Inc.
>
>
>
>On Sep 11, 2013, at 5:00 PM, Bryan Thompson <br...@sy...> wrote:
>
>> Can you obfuscate the data and provide queries so we can reproduce this
>>workload? That would make it easier to have some understanding of the
>>problem. But that is not really a ticket we can work as such. The
>>problem needs to be reproducible. Alternatively, can you reproduce a
>>beneficial effect by mucking around with NSPIN on a known benchmark?
>>E.g., BSBM. Right now, I suspect the configuration and query plans.
>>
>> Anything slow with only 57000 quads is going to be a bad join resulting
>>in an imbalance in the consumers and producers and possibly spamming the
>>heap. Take out each query from the mix in a process of elimination to
>>identify the culprits or just look at each query plan by hand - NSS has
>>an explain page for doing this.
>>
>> I think that nspin is a red herring. Look at the time for each query.
>>Which ones are running slowly? Look at their query plans.
>>
>> There is an implementation of a runtime query optimizer that is not yet
>>integrated into the SPARQL layer. If you are feeling ambitious you can
>>code up the triple patterns and use that to see how it orders the joins
>>based on the estimated cardinality from sampling cut-off join paths.
>>JoinGraph is the entry point. This implements the ROX approach to chain
>>sampling with some minor variations.
>>
>> Bryan
>>
>> On Sep 11, 2013, at 7:44 PM, "Jeremy J Carroll" <jj...@sy...> wrote:
>>
>>> I have written up the performance issue as trac740
>>>
>>> I am coming from observing the code essentially as a black box; maybe
>>>someone who understands the code better might care to review my write
>>>up and the recommended response.
>>>
>>>
>>> Jeremy J Carroll
>>> Principal Architect
>>> Syapse, Inc.
>>>
>>>
>>>
>>>
>>>
>
|
|
From: Jeremy J C. <jj...@sy...> - 2013-09-12 00:36:00
|
I will try and work out how to get you something more concrete tomorrow.

I thought your 100 looked somewhat low for a spin lock, since I remembered being surprised at how high java.util.concurrent.locks.AbstractQueuedLongSynchronizer.spinForTimeoutThreshold is (1000 ns, maybe 2000 spins) … then suck it and see pushed the number higher.

Jeremy J Carroll
Principal Architect
Syapse, Inc.

On Sep 11, 2013, at 5:00 PM, Bryan Thompson <br...@sy...> wrote:

> Can you obfuscate the data and provide queries so we can reproduce this workload? That would make it easier to have some understanding of the problem. But that is not really a ticket we can work as such. The problem needs to be reproducible. Alternatively, can you reproduce a beneficial effect by mucking around with NSPIN on a known benchmark? E.g., BSBM. Right now, I suspect the configuration and query plans.
>
> Anything slow with only 57000 quads is going to be a bad join resulting in an imbalance in the consumers and producers and possibly spamming the heap. Take out each query from the mix in a process of elimination to identify the culprits, or just look at each query plan by hand - NSS has an explain page for doing this.
>
> I think that nspin is a red herring. Look at the time for each query. Which ones are running slowly? Look at their query plans.
>
> There is an implementation of a runtime query optimizer that is not yet integrated into the SPARQL layer. If you are feeling ambitious you can code up the triple patterns and use that to see how it orders the joins based on the estimated cardinality from sampling cut-off join paths. JoinGraph is the entry point. This implements the ROX approach to chain sampling with some minor variations.
>
> Bryan
>
> On Sep 11, 2013, at 7:44 PM, "Jeremy J Carroll" <jj...@sy...> wrote:
>
>> I have written up the performance issue as trac740
>>
>> I am coming from observing the code essentially as a black box; maybe someone who understands the code better might care to review my write-up and the recommended response.
>>
>> Jeremy J Carroll
>> Principal Architect
>> Syapse, Inc.
|
From: Bryan T. <br...@sy...> - 2013-09-12 00:18:51
|
There are some options that you can enable in that class to collect stack frames when the blocking buffer is allocated. That generally will tell you who the producer is. This might be a static field or perhaps is triggered automatically at a suitable log level - I am not in front of the code. The consumer is whoever is calling hasNext() on the iterator. There are a lot of possible use patterns:
- materializing RDF Values from IVs when projecting out the results of a query;
- asynchronous iterator patterns on access paths are used when the key-range scan has a high cardinality;
- operators now consume what amounts to a list of chunks and produce chunks that are then dropped into a Deque for the target downstream operator;
- there are a variety of asynchronous chunked iterator patterns (including value materialization) that use the blocking buffer.

If you enable those stack frame grabs, then you can figure out which buffer is spinning while waiting on the producer and who that producer is. A lot of this is also visible in the explain page for a query in terms of the number of solutions read from access paths, the number of solutions in and out of an operator, etc. You can basically see how the intermediate cardinality changes as the solutions flow through the operators. You can also see the number of times each operator is invoked and the total time used by each operator.

B

On Sep 11, 2013, at 8:03 PM, "Jeremy J Carroll" <jj...@sy...> wrote:

> I was being somewhat naive … and I tried a lot of values, and 100000 did seem to be better (e.g. 20%) than both 30000 and 300000, as well as a lot better than 100. I am unclear who the producer and consumers were, since I didn't try and understand the code to that point … maybe you could suggest further drilling. My tests definitely show that there is an unnecessary performance hole somewhere related to this!
>
> On Sep 11, 2013, at 4:45 PM, Bryan Thompson <br...@sy...> wrote:
>
>> That said, things that point at the blocking buffer class generally have a root cause in a slow producer or a slow consumer. I have never seen the blocking buffer itself at fault. Running the spin lock counter up and getting better performance is just playing games with the expected latency of arrival in the queue.
|
From: Jeremy J C. <jj...@sy...> - 2013-09-12 00:03:23
|
I was being somewhat naive … and I tried a lot of values, and 100000 did seem to be better (e.g. 20%) than both 30000 and 300000, as well as a lot better than 100. I am unclear who the producer and consumers were, since I didn't try and understand the code to that point … maybe you could suggest further drilling. My tests definitely show that there is an unnecessary performance hole somewhere related to this!

On Sep 11, 2013, at 4:45 PM, Bryan Thompson <br...@sy...> wrote:

> That said, things that point at the blocking buffer class generally have a root cause in a slow producer or a slow consumer. I have never seen the blocking buffer itself at fault. Running the spin lock counter up and getting better performance is just playing games with the expected latency of arrival in the queue.
|
From: Bryan T. <br...@sy...> - 2013-09-12 00:01:14
|
Can you obfuscate the data and provide queries so we can reproduce this workload? That would make it easier to have some understanding of the problem. But that is not really a ticket we can work as such. The problem needs to be reproducible. Alternatively, can you reproduce a beneficial effect by mucking around with NSPIN on a known benchmark? E.g., BSBM. Right now, I suspect the configuration and query plans.

Anything slow with only 57000 quads is going to be a bad join resulting in an imbalance in the consumers and producers and possibly spamming the heap. Take out each query from the mix in a process of elimination to identify the culprits, or just look at each query plan by hand - NSS has an explain page for doing this.

I think that nspin is a red herring. Look at the time for each query. Which ones are running slowly? Look at their query plans.

There is an implementation of a runtime query optimizer that is not yet integrated into the SPARQL layer. If you are feeling ambitious you can code up the triple patterns and use that to see how it orders the joins based on the estimated cardinality from sampling cut-off join paths. JoinGraph is the entry point. This implements the ROX approach to chain sampling with some minor variations.

Bryan

On Sep 11, 2013, at 7:44 PM, "Jeremy J Carroll" <jj...@sy...> wrote:

> I have written up the performance issue as trac740
>
> I am coming from observing the code essentially as a black box; maybe someone who understands the code better might care to review my write-up and the recommended response.
>
> Jeremy J Carroll
> Principal Architect
> Syapse, Inc.
|
From: Bryan T. <br...@sy...> - 2013-09-11 23:45:51
|
BlockingBuffer should be replaced by a Deque at some point, using a poison pill pattern for the producer to indicate that no more data is available. It originally supported asynchronous streaming iterators on a cluster. Now all such operations rely on chunked processing on the query engine. The class is used extensively. It has not been pulled out because of the extensive use and the lack of evidence that it is actually limiting performance.

That said, things that point at the blocking buffer class generally have a root cause in a slow producer or a slow consumer. I have never seen the blocking buffer itself at fault. Running the spin lock counter up and getting better performance is just playing games with the expected latency of arrival in the queue. Though I am surprised that spinning that long ever helps.

Bryan

On Sep 11, 2013, at 5:39 PM, "Jeremy J Carroll" <jj...@sy...> wrote:

> I am still working on it. The profiler led me into looking at com.bigdata.relation.accesspath.BlockingBuffer.NSPIN, which I duplicated into two variables, one for each use, and changed the one for com.bigdata.relation.accesspath.BlockingBuffer.BlockingIterator._hasNext(long) from 100 to 100000 with good effect on my machine. My tests speeded up from 0m14.018s to 0m4.607s, but … back on a different box, this had no effect. My machine is a Mac with SSD and quad core with HT; the different box was an AWS large, I think, with dual core.
>
> With both values set to 1000 (both in com.bigdata.relation.accesspath.BlockingBuffer.BlockingIterator._hasNext(long) and com.bigdata.relation.accesspath.BlockingBuffer.add(E, long, TimeUnit)) I fell into a total hole on the quad core with SSD machine. The first run of the tests took over 2m. On start up the tests are always a bit slower, maybe taking twice as long, but 2m was totally overboard!
>
> I have also spent some time looking at the call to randomUUID() for the query op … but don't seem to be able to make much progress on it. bigdata/src/java/com/bigdata/bop/engine/QueryEngine.java line 1039
>
> I note that BlockingBuffer is used for multiple purposes and wonder whether different numbers in different places might make sense … or maybe this should be configurable.
>
> Jeremy J Carroll
> Principal Architect
> Syapse, Inc.
>
> On Sep 11, 2013, at 1:22 PM, Bryan Thompson <br...@sy...> wrote:
>
>> Jeremy,
>>
>> Did you get any further with this?
>>
>> Thanks,
>> Bryan
>>
>> From: Bryan Thompson <br...@sy...>
>> Date: Tuesday, September 10, 2013 5:36 PM
>> To: Jeremy Carroll <jj...@sy...>
>> Cc: Big...@li...
>> Subject: Re: [Bigdata-developers] performance question
>>
>> I have never tried limiting the backend to a single core. Bigdata uses threads to schedule IOs (it does not yet have a dependency on the AIO features in Java 7). So it will always use multiple threads. It will execute queries with plenty of parallelism:
>> - Different queries run concurrently
>> - Each query can run multiple operators concurrently, depending on when intermediate solutions become available for the operators
>> - Each operator in a query can execute concurrently if there is enough data in the queue for that operator and the implementation of the operator supports parallelism.
>>
>> We run benchmarks with concurrent query and there are no known thread contention hot spots.
>>
>> What are you using to connect to bigdata? If you are trying to query the unisolated connection or executing updates, then those operations will be serialized. Any read-only view of the database is completely non-blocking (outside of contention when there is a need to load a page after a page miss). There is a thread pool for the NSS that determines the maximum number of concurrent queries that it will allow.
>>
>> You can look at the /status html page of the NSS and see the queries that are actively running – there is a hyperlink on the page for this.
>> Thanks,
>> Bryan
>>
>> From: Jeremy Carroll <jj...@sy...>
>> Date: Tuesday, September 10, 2013 5:24 PM
>> To: Bryan Thompson <br...@sy...>
>> Cc: Big...@li...
>> Subject: Re: [Bigdata-developers] performance question
>>
>> On Sep 10, 2013, at 1:08 PM, Bryan Thompson <br...@sy...> wrote:
>>
>>> I would look at the performance of each query individually and see if any of them is an obvious outlier with a bad query plan.
>>
>> the simplistic test environment prints a '.' after each query, and this comes at a fairly steady rate … so I don't think it is query related.
>>
>> Thanks for the various suggestions, I am working through them, and a couple of my own or my colleagues' ….
>> I will update here when I have a solution.
>>
>> The "parallelizing the load" is the difference between the following two shell commands; nosetests is a Python test harness, which runs all the tests in each file passed to it.
>>
>> # run the client tests (11 queries) 6 times over, for 66 queries, one after the other
>> time nosetests python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py
>>
>> # run the client tests (11 queries) from 6 different parallel sub-shells, making 66 concurrent queries
>> for c in 1 2 3 4 5 6
>> do
>> time nosetests python/syapse/apps/search/tests/syql_test.py &
>> done
>>
>> Even with my OS limiting my h/w to one core, the parallel query has comparable performance compared with our previous solution, which does not parallelize well.
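The Deque-plus-poison-pill replacement mentioned at the top of this message can be sketched as follows. This is an illustrative Python stand-in (a thread-safe `queue.Queue` in place of the Java Deque), not the bigdata implementation; the item values and helper names are hypothetical.

```python
import queue
import threading

POISON = object()  # unique sentinel: the producer's "no more data" signal

def producer(q, items):
    """Drop each item on the queue, then the poison pill instead of
    relying on an asynchronous close flag."""
    for item in items:
        q.put(item)
    q.put(POISON)

def consume_all(q):
    """Drain the queue until the poison pill arrives. No spin loop or
    timed poll is needed: a plain blocking take suffices, because the
    sentinel unambiguously marks end-of-stream."""
    out = []
    while True:
        item = q.get()          # plain blocking take
        if item is POISON:
            return out
        out.append(item)

q = queue.Queue()
t = threading.Thread(target=producer, args=(q, ["a", "b", "c"]))
t.start()
result = consume_all(q)
t.join()
```

The design point is that end-of-stream becomes an ordinary queue element, so the consumer never has to guess (via spin counts or poll timeouts) whether the producer is merely slow or actually finished.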
|
From: Jeremy J C. <jj...@sy...> - 2013-09-11 23:44:11
|
I have written up the performance issue as trac740.

I am coming from observing the code essentially as a black box; maybe someone who understands the code better might care to review my write-up and the recommended response.

Jeremy J Carroll
Principal Architect
Syapse, Inc.
|
From: Jeremy J C. <jj...@sy...> - 2013-09-11 21:39:42
|
I am still working on it. The profiler led me into looking at com.bigdata.relation.accesspath.BlockingBuffer.NSPIN, which I duplicated into two variables, one for each use, and changed the one for com.bigdata.relation.accesspath.BlockingBuffer.BlockingIterator._hasNext(long) from 100 to 100000 with good effect on my machine. My tests speeded up from 0m14.018s to 0m4.607s, but … back on a different box, this had no effect. My machine is a Mac with SSD and quad core with HT; the different box was an AWS large, I think, with dual core.

With both values set to 1000 (both in com.bigdata.relation.accesspath.BlockingBuffer.BlockingIterator._hasNext(long) and com.bigdata.relation.accesspath.BlockingBuffer.add(E, long, TimeUnit)) I fell into a total hole on the quad core with SSD machine. The first run of the tests took over 2m. On start up the tests are always a bit slower, maybe taking twice as long, but 2m was totally overboard!

I have also spent some time looking at the call to randomUUID() for the query op … but don't seem to be able to make much progress on it. bigdata/src/java/com/bigdata/bop/engine/QueryEngine.java line 1039

I note that BlockingBuffer is used for multiple purposes and wonder whether different numbers in different places might make sense … or maybe this should be configurable.

Jeremy J Carroll
Principal Architect
Syapse, Inc.

On Sep 11, 2013, at 1:22 PM, Bryan Thompson <br...@sy...> wrote:

> Jeremy,
>
> Did you get any further with this?
>
> Thanks,
> Bryan
>
> From: Bryan Thompson <br...@sy...>
> Date: Tuesday, September 10, 2013 5:36 PM
> To: Jeremy Carroll <jj...@sy...>
> Cc: Big...@li...
> Subject: Re: [Bigdata-developers] performance question
>
> I have never tried limiting the backend to a single core. Bigdata uses threads to schedule IOs (it does not yet have a dependency on the AIO features in Java 7). So it will always use multiple threads. It will execute queries with plenty of parallelism:
> - Different queries run concurrently
> - Each query can run multiple operators concurrently, depending on when intermediate solutions become available for the operators
> - Each operator in a query can execute concurrently if there is enough data in the queue for that operator and the implementation of the operator supports parallelism.
>
> We run benchmarks with concurrent query and there are no known thread contention hot spots.
>
> What are you using to connect to bigdata? If you are trying to query the unisolated connection or executing updates, then those operations will be serialized. Any read-only view of the database is completely non-blocking (outside of contention when there is a need to load a page after a page miss). There is a thread pool for the NSS that determines the maximum number of concurrent queries that it will allow.
>
> You can look at the /status html page of the NSS and see the queries that are actively running – there is a hyperlink on the page for this.
>
> Thanks,
> Bryan
>
> From: Jeremy Carroll <jj...@sy...>
> Date: Tuesday, September 10, 2013 5:24 PM
> To: Bryan Thompson <br...@sy...>
> Cc: Big...@li...
> Subject: Re: [Bigdata-developers] performance question
>
> On Sep 10, 2013, at 1:08 PM, Bryan Thompson <br...@sy...> wrote:
>
>> I would look at the performance of each query individually and see if any of them is an obvious outlier with a bad query plan.
>
> the simplistic test environment prints a '.' after each query, and this comes at a fairly steady rate … so I don't think it is query related.
>
> Thanks for the various suggestions, I am working through them, and a couple of my own or my colleagues' ….
> I will update here when I have a solution.
>
> The "parallelizing the load" is the difference between the following two shell commands; nosetests is a Python test harness, which runs all the tests in each file passed to it.
>
> # run the client tests (11 queries) 6 times over, for 66 queries, one after the other
> time nosetests python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py
>
> # run the client tests (11 queries) from 6 different parallel sub-shells, making 66 concurrent queries
> for c in 1 2 3 4 5 6
> do
> time nosetests python/syapse/apps/search/tests/syql_test.py &
> done
>
> Even with my OS limiting my h/w to one core, the parallel query has comparable performance compared with our previous solution, which does not parallelize well.
|
From: Bryan T. <br...@sy...> - 2013-09-11 20:23:29
|
Jeremy,

Did you get any further with this?

Thanks,
Bryan

From: Bryan Thompson <br...@sy...>
Date: Tuesday, September 10, 2013 5:36 PM
To: Jeremy Carroll <jj...@sy...>
Cc: Big...@li...
Subject: Re: [Bigdata-developers] performance question

I have never tried limiting the backend to a single core. Bigdata uses threads to schedule IOs (it does not yet have a dependency on the AIO features in Java 7). So it will always use multiple threads. It will execute queries with plenty of parallelism:
- Different queries run concurrently
- Each query can run multiple operators concurrently, depending on when intermediate solutions become available for the operators
- Each operator in a query can execute concurrently if there is enough data in the queue for that operator and the implementation of the operator supports parallelism.

We run benchmarks with concurrent query and there are no known thread contention hot spots.

What are you using to connect to bigdata? If you are trying to query the unisolated connection or executing updates, then those operations will be serialized. Any read-only view of the database is completely non-blocking (outside of contention when there is a need to load a page after a page miss). There is a thread pool for the NSS that determines the maximum number of concurrent queries that it will allow.

You can look at the /status html page of the NSS and see the queries that are actively running – there is a hyperlink on the page for this.
Thanks, Bryan From: Jeremy Carroll <jj...@sy...<mailto:jj...@sy...>> Date: Tuesday, September 10, 2013 5:24 PM To: Bryan Thompson <br...@sy...<mailto:br...@sy...>> Cc: "Big...@li...<mailto:Big...@li...>" <Big...@li...<mailto:Big...@li...>> Subject: Re: [Bigdata-developers] performance question On Sep 10, 2013, at 1:08 PM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote: I would look at the performance of each query individually and see if any of them is an obvious outlier with a bad query plan. the simplistic test environment prints a '.' after each query, and this come at a fairly steady rate … so I don't think it is query related. Thanks for the various suggestions, I am working through them, and a couple of my own or my colleagues …. I will update here when I have a solution. The "parallelizing the load" is the difference between the following two shell commands: nosetests is a python test harness, which runs all the tests in each file passed to it. # run the client tests (11 queries) 6 times over, for 66 queries, one after the other time nosetests python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py and # run the client tests (11 queries) from 6 different parallel sub-shells, making 66 concurrent queries for c in 1 2 3 4 5 6 do time nosetests python/syapse/apps/search/tests/syql_test.py & done Even with my OS limiting my h/w to one core, the parallel query has comparable performance compared with our previous solution which does not parallelize well. |
|
From: Bryan T. <br...@sy...> - 2013-09-10 21:37:07
|
I have never tried limiting the backend to a single core. Bigdata uses threads to schedule IOs (it does not yet have a dependency on the AIO features in Java 7), so it will always use multiple threads. It will execute queries with plenty of parallelism:

* Different queries run concurrently.
* Each query can run multiple operators concurrently, depending on when intermediate solutions become available for the operators.
* Each operator in a query can execute concurrently if there is enough data in the queue for that operator and the implementation of the operator supports parallelism.

We run benchmarks with concurrent query and there are no known thread contention hot spots.

What are you using to connect to bigdata? If you are trying to query the unisolated connection or executing updates, then those operations will be serialized. Any read-only view of the database is completely non-blocking (outside of contention when there is a need to load a page after a page miss).

There is a thread pool for the NSS that determines the maximum number of concurrent queries that it will allow. You can look at the /status html page of the NSS and see the queries that are actively running – there is a hyperlink on the page for this.

Thanks,
Bryan |
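Bryan's point above, that read-only views run concurrently while unisolated/update operations serialize, can be illustrated with a small Python sketch. The `read_query` and `update` functions here are hypothetical stand-ins, not the bigdata API; the serialization is modeled with an ordinary lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins (not the bigdata API): reads share a snapshot and
# never block each other; writes go through a single lock, so they serialize.
write_lock = threading.Lock()

def read_query(i):
    # Read-only view: no lock taken, many can run at once.
    time.sleep(0.05)          # simulate query work
    return "read-%d" % i

def update(i):
    # Unisolated/update operation: one at a time.
    with write_lock:
        time.sleep(0.05)      # simulate mutation work
    return "write-%d" % i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    reads = list(pool.map(read_query, range(8)))
read_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    writes = list(pool.map(update, range(8)))
write_time = time.perf_counter() - start

# 8 concurrent reads finish in roughly one sleep; 8 serialized writes
# take roughly eight sleeps, so write_time is much larger.
print(read_time, write_time)
```

This is why a workload that interleaves updates with queries can look serialized even though the query engine itself is highly parallel.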
|
From: Jeremy J C. <jj...@sy...> - 2013-09-10 21:24:41
|
On Sep 10, 2013, at 1:08 PM, Bryan Thompson <br...@sy...> wrote:

> I would look at the performance of each query individually and see if any
> of them is an obvious outlier with a bad query plan.

The simplistic test environment prints a '.' after each query, and these come at a fairly steady rate, so I don't think it is query related.

Thanks for the various suggestions; I am working through them, plus a couple of my own or my colleagues'. I will update here when I have a solution.

The "parallelizing the load" is the difference between the following two shell commands. (nosetests is a Python test harness, which runs all the tests in each file passed to it.)

# run the client tests (11 queries) 6 times over, for 66 queries, one after the other
time nosetests python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py python/syapse/apps/search/tests/syql_test.py

and

# run the client tests (11 queries) from 6 different parallel sub-shells, making 66 concurrent queries
for c in 1 2 3 4 5 6
do
  time nosetests python/syapse/apps/search/tests/syql_test.py &
done

Even with my OS limiting my h/w to one core, the parallel query has comparable performance compared with our previous solution, which does not parallelize well. |
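The sequential-versus-parallel comparison above can be mimicked in pure Python. `run_suite` below is a hypothetical stand-in for one nosetests invocation, modeling the client as mostly waiting on the server (which is why parallel sub-shells help even on one core):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_suite(n_queries=11, per_query=0.01):
    # Stand-in for one nosetests run: the client mostly waits on the
    # server, so each "query" is modeled as a short sleep.
    for _ in range(n_queries):
        time.sleep(per_query)
    return n_queries

# Sequential: 6 runs back to back (like one long nosetests command line).
start = time.perf_counter()
sequential_total = sum(run_suite() for _ in range(6))
sequential_time = time.perf_counter() - start

# Parallel: 6 runs in concurrent threads (like six backgrounded sub-shells).
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    parallel_total = sum(pool.map(lambda _: run_suite(), range(6)))
parallel_time = time.perf_counter() - start

# Both do 66 "queries", but the parallel wall time is far lower because
# the waits overlap instead of accumulating.
print(sequential_time, parallel_time)
```

Because the waits dominate, the speedup comes from overlapping latency, not from extra CPU cores.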
|
From: Bryan T. <br...@sy...> - 2013-09-10 20:09:51
|
The time in individual threads is aggregated by yourkit. Thread.run() is an idle thread unless there are called methods from run(); i.e., it should be ignored.

You talk about parallelizing the load (near the end) and about a query workload (up front). What is your workload? I am confused about that.

I would look at the performance of each query individually and see if any of them is an obvious outlier with a bad query plan.

Make sure that the JVM has reasonable options, e.g., -server, and assign it about 1/2 of the RAM up to 4G on your machine. With JDK 7 and the G1 garbage collector you may be able to give it more RAM, but start in a safe zone.

Thanks,
Bryan |
|
From: Jeremy J C. <jj...@sy...> - 2013-09-10 19:06:01
|
I am doing a performance comparison between a bigdata-based solution and our previous solution, and I am getting *very* confused.

My question is: what time is being used by bigdata which is not being measured as either user or sys time when running bigdata?

The task is as follows:

I have 11 queries that can be answered by both systems, and from a user point of view are identical. I ask the suite of 11 queries 6 times over. In the bigdata set-up, I am using bigdata as a SPARQL endpoint, and the queries are passed over HTTP.

I am currently just doing this on my Mac (Mountain Lion, with SSD).

The wall time to run the queries is approx 30 seconds; however, the CPU time (both user and sys) recorded against the client and the server is a lot less, with about 1 second in the client and 5 seconds in the server. I am having difficulty finding where the time is going - over 20 seconds is simply missing.

By running bigdata in the debugger and adding System.nanoTime() calls before and after QueryServlet.doQuery(), I have convinced myself that the issue is server side, not client side, and also not networking related.

When running inside yourkit, with the settings set to wall-time, the time seems to be explained in the following cryptic line:

java.lang.Thread.run() 88804ms Time, 84928ms Own Time

i.e. the vast bulk of the run-time (approximately three times the experienced time of 30 seconds) is accounted for in the Thread.run() method doing who knows what (waiting for thread scheduling?).

I am getting very similar results with either of the following changes:
- use ramdisk rather than the SSD
- use only 1 cpu without hyper-threading, instead of the quad core with hyper-threading that my machine comes with

(i.e. the actual execution time is the same with or without extra cores!)

===

I am continuing with testing; my next tests will be:
- parallelize the load and see if the quad core machine does better
- try on a linux box in AWS

Any thoughts would be appreciated.

===

I am making extensive use of named graphs, with the select queries starting with approx 40 FROM NAMED and FROM clauses; otherwise I don't think there is anything particularly funky about my queries.

Jeremy J Carroll
Principal Architect
Syapse, Inc. |
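The "missing time" described above, wall time far exceeding user+sys CPU time, is the classic signature of threads waiting (on I/O, locks, or scheduling) rather than computing. A minimal, illustrative Python sketch of the distinction (not bigdata code):

```python
import time

wall_start = time.perf_counter()      # wall-clock time
cpu_start = time.process_time()       # user+sys CPU time of this process

time.sleep(0.2)                       # waiting: consumes wall time only
sum(i * i for i in range(200_000))    # computing: consumes both

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# The sleep shows up in wall time but not CPU time, so wall >> cpu,
# exactly the pattern of a server that is blocked rather than busy.
print(wall, cpu)
```

When a profiler attributes the gap to Thread.run() "Own Time", that is usually the same thing: threads parked in a pool or blocked on a queue, not executing user code.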
|
From: Jeremy J C. <jj...@sy...> - 2013-09-04 20:47:59
|
I have added a fix for this issue to the new 1.3.0 branch, and reassigned it to Mike for verification.
As far as I could tell, the problem was that the code assumed that at least one end of each property path was an unbound variable.
I added code to test for the end being a constant or a pre-bound variable, where the value is the required one.
Jeremy J Carroll
Principal Architect
Syapse, Inc.
|
|
From: Jeremy J C. <jj...@sy...> - 2013-09-04 17:25:15
|
My initial commit on bug 734 dropped through the cracks, and is on the 1.2.0 branch - I have added a further commit this morning.
So far, all I have done is add tests that are not linked into the main test suite.
I have found a workaround that will work for me for now, so I will return to this later.
Please let me know if you would like me to consolidate the work done so far and add it to the new dev branch.
Thanks.
FYGI the bug concerns the following query:
SELECT ?A
WHERE {
  ?A rdf:type / rdfs:subClassOf *
       <os:ClassA> ;
     rdf:value ?B .
  ?B rdf:type / rdfs:subClassOf *
       <os:ClassB>
}
As in the title - the problem comes from having two paths.
Jeremy J Carroll
Principal Architect
Syapse, Inc.
|
|
From: Bryan T. <br...@sy...> - 2013-09-04 11:37:42
|
I've posted a ticket on trac [1]. Please suggest an approach that would allow us to reconcile the existing reporting of mutation results with the human-readable representation for SPARQL UPDATE.

Thanks,
Bryan

[1] https://sourceforge.net/apps/trac/bigdata/ticket/735 |
|
From: Eugen F <feu...@ya...> - 2013-09-04 11:12:03
|
It would be useful to have an XML response to the REST SPARQL UPDATE endpoint so the data can be parsed by the caller. This has been discussed here: https://sourceforge.net/projects/bigdata/forums/forum/676946/topic/8664308 |
|
From: Bryan T. <br...@sy...> - 2013-09-02 22:25:11
|
I have updated the links from the blog that point at CI for the development branch and at the HA test suite detailed results. See below; these are also linked from the developers section on the blog.

CI results (currently a tarball – I will look at restoring navigable results):
http://www.bigdata.com/hudson-release-1.3.0/lastSuccessful/archive/BIGDATA_RELEASE_1_3_0/ant-build/classes/test/test-results/report.tgz

HA test suite – detailed results:
http://www.bigdata.com/hudson-release-1.3.0/lastSuccessful/archive/BIGDATA_RELEASE_1_3_0/ant-build/classes/test/test-results/HAtest-report.tgz

Thanks,
Bryan |
|
From: Bryan T. <br...@sy...> - 2013-09-02 17:53:50
|
All changes through r7381 in the development branch (branches/BIGDATA_RELEASE_1_2_0) have been captured in the HA development branch (branches/READ_CACHE2). At this stage, HA is feature complete and we are (and have been) in QA for an HA release.

Owing to difficulties performing a reintegration merge back to the development branch (SF SVN does not support this due to the SVN version on SF), I have instead elected to create a new development branch for 1.3.0. The new development branch is:

branches/BIGDATA_RELEASE_1_3_0

Please switch over immediately. We will do a 1.3.0 release from this branch.

Thanks,
Bryan

PS: branches/BIGDATA_RELEASE_1_2_0 should only be used for bug fixes that would go into a 1.2.4 maintenance release. |
|
From: Bryan T. <br...@sy...> - 2013-08-31 10:57:39
|
All,

I plan to reconcile the BIGDATA_RELEASE_1_2_0 development branch with the HA development branches (READ_CACHE/READ_CACHE2) and then bring back the changes to the BIGDATA_RELEASE_1_2_0 development branch. I will probably get this done over the weekend.

Thanks,
Bryan |