From: Bryan T. <br...@bl...> - 2016-08-26 02:53:10
Edgar,

The upcoming 2.1.4 / 2.2.0 releases both include the ability to place the intermediate solutions on either the native or the managed heap. The default patterns are either/or. However, it is possible to configure any pattern. For example, the pattern could use native memory if there are more than X solutions on the managed heap or GC time is above some threshold. This is a useful tool for managing the heap/performance tradeoff, because much of the memory burden of running queries is the managed object heap. If you allow large heaps, then make sure you have the memory available for those heaps and that swapping does not occur (swappiness is zero, etc.).

With respect to the timings you cite below, performance of the analytic mode is strongly dependent on whether the query has a significant memory burden or can benefit from increased parallelism (especially for distinct-solutions filters, which are only concurrent in the non-analytic mode). The analytic mode is really designed for queries with larger hash joins. The new ability to put the intermediate solutions in native memory addresses the memory burden from in-flight intermediate solutions. As indicated above, the decision about managed vs. native heap can be made dynamically by overriding the default policy, and it is orthogonal to the choice of analytic or non-analytic joins.

I also suggest that you look at count(*) or explain versions of queries when reporting timings. Often the query engine runs faster than the client can drain the solutions. Currently those solutions dwell on the managed object heap until they are drained by the client. We will address this aspect in a subsequent release. However, we have observed that 50% of the evaluation time for queries with modest output cardinality (10,000 rows) can be waiting on the client to drain the solutions. If there are concurrent high-output-cardinality queries, then the GC pressure arising from a slow client can slow down overall evaluation.
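To make the threshold idea concrete, a policy of that shape ("use native memory once the managed heap holds more than X solutions, or once GC time crosses a threshold") could look roughly like the sketch below. All class, method, and threshold names here are illustrative inventions for this email, not the Blazegraph API:

```java
// Hypothetical sketch (not the Blazegraph API): decide whether to buffer
// intermediate solutions on the managed or the native heap, based on a
// solution-count threshold and observed cumulative GC time.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class BufferPolicySketch {
    enum Heap { MANAGED, NATIVE }

    static final long MAX_MANAGED_SOLUTIONS = 100_000; // illustrative "X"
    static final long MAX_GC_TIME_MILLIS = 500;        // illustrative threshold

    // Spill to native memory when either threshold is exceeded, to relieve
    // GC pressure on the managed object heap.
    static Heap chooseHeap(long solutionsOnManagedHeap, long gcTimeMillis) {
        if (solutionsOnManagedHeap > MAX_MANAGED_SOLUTIONS
                || gcTimeMillis > MAX_GC_TIME_MILLIS) {
            return Heap.NATIVE;
        }
        return Heap.MANAGED;
    }

    // Cumulative GC time across all collectors, as one way to observe GC load.
    static long totalGcTimeMillis() {
        long t = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            t += Math.max(0, gc.getCollectionTime());
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(chooseHeap(10_000, totalGcTimeMillis()));
        System.out.println(chooseHeap(500_000, totalGcTimeMillis()));
    }
}
```

The point of the either/or default is that this decision is per-query and dynamic; overriding the policy lets you trade managed-heap pressure against the cost of native buffering.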
You could also look at increasing the operator-level parallelism. The main place where the analytic mode is slower is a distinct-solutions filter. For a quads-mode application, we use a distinct-solutions filter implicitly for each default-graph triple pattern in order to enforce the RDF merge semantics. However, some applications (including, I believe, yours) ensure that the same triple does not appear in more than one named graph. In such cases you can disable this distinct-solutions filter on the quads-mode default-graph access paths and enjoy improved within-query parallelism as a result.

As a general guideline, you can hide latency under concurrency. If you are getting results which are not consistent with this, then the system is probably at some extreme. This could be limited within-query parallelism, swapping, exceeding the viable disk bandwidth, etc. I am not sure what the limiting factor is for your queries, but I would suspect any of: a slow client draining results, the distinct-solutions filter for the quads access path, swapping, etc.

Thanks,
Bryan

On Thursday, August 25, 2016, Edgar Rodriguez-Diaz <ed...@sy...> wrote:

> Hi,
>
> We’ve been experimenting with analytic mode using a dataset with ~12M
> quads.
> While running a particular query, a bit complex, it produces the
> following running times consistently:
>
> Blazegraph version: 2.0.1
>
> Using Java Heap
> Mem Configurations tried: (Xms4g, Xmx8g)
>
> Concurrency Level | Approx avg. execution time
> ------------------|---------------------------
> 1                 | 3.5
> 3                 | 5.0
> 5                 | 5.5
> 9                 | 8
> 10                | 9
>
> Using Analytic Mode
> Mem Configurations tried:
> (Xms2g, Xmx2g, -XX:MaxDirectMemorySize=6g),
> (Xms4g, Xmx4g, -XX:MaxDirectMemorySize=4g)
>
> Concurrency Level | Approx avg. execution time
> ------------------|---------------------------
> 1                 | 3.09
> 3                 | 5.95
> 5                 | 10.58
> 9                 | 19.24
> 10                | 22.87
>
> On levels of concurrency > 3, queries are highly penalized on
> performance, being at least 2x slower. I know that there may be an
> overhead on performance for BG doing its own memory management, but
> 2x slower queries on relatively low levels of concurrency seems a bit
> too high.
>
> So, the questions are:
> Is the previous outcome something expected or an exception? If it's an
> exception I could follow up with a bug report.
> What can be expected on the performance of concurrent queries while in
> analytic mode?
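PS: to illustrate what the implicit distinct-solutions filter on the quads-mode default graph is doing (names below are illustrative, not Blazegraph internals): the default graph is the RDF merge of the named graphs, so a triple that appears in several named graphs must be reported only once. That deduplication is exactly the step you can skip when the application guarantees each triple lives in a single named graph:

```java
// Illustrative sketch (not Blazegraph internals): projecting quads down to
// default-graph triples requires dropping duplicate triples that occur in
// more than one named graph -- the "distinct solutions" step. If each triple
// occurs in exactly one named graph, the step is a pure overhead.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DefaultGraphSketch {
    record Quad(String s, String p, String o, String g) {}

    static List<String> defaultGraphTriples(List<Quad> quads) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (Quad q : quads) {
            String triple = q.s() + " " + q.p() + " " + q.o();
            if (seen.add(triple)) { // the distinct-solutions filter
                out.add(triple);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Quad> quads = List.of(
            new Quad(":a", ":p", ":b", ":g1"),
            new Quad(":a", ":p", ":b", ":g2"), // same triple, different graph
            new Quad(":c", ":p", ":d", ":g1"));
        System.out.println(defaultGraphTriples(quads).size()); // prints 2
    }
}
```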