This list is closed; nobody may subscribe to it.
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010 |  | 19 | 8 | 25 | 16 | 77 | 131 | 76 | 30 | 7 | 3 |  |
| 2011 |  |  |  |  | 2 | 2 | 16 | 3 | 1 |  | 7 | 7 |
| 2012 | 10 | 1 | 8 | 6 | 1 | 3 | 1 |  | 1 |  | 8 | 2 |
| 2013 | 5 | 12 | 2 | 1 | 1 | 1 | 22 | 50 | 31 | 64 | 83 | 28 |
| 2014 | 31 | 18 | 27 | 39 | 45 | 15 | 6 | 27 | 6 | 67 | 70 | 1 |
| 2015 | 3 | 18 | 22 | 121 | 42 | 17 | 8 | 11 | 26 | 15 | 66 | 38 |
| 2016 | 14 | 59 | 28 | 44 | 21 | 12 | 9 | 11 | 4 | 2 | 1 |  |
| 2017 | 20 | 7 | 4 | 18 | 7 | 3 | 13 | 2 | 4 | 9 | 2 | 5 |
| 2018 |  |  |  | 2 |  |  |  |  |  |  |  |  |
| 2019 |  |  | 1 |  |  |  |  |  |  |  |  |  |
|
From: Brian M. <btm...@gm...> - 2010-09-03 19:12:21
|
On Fri, Sep 3, 2010 at 12:04 PM, Bryan Thompson <br...@sy...> wrote:
> While I do not disagree with your comments about the usefulness and utility
> of jini in non-distributed situations, would it be possible to move this
> method into another class so that the bigdata "core" packages do not have a
> jini dependency?
Rather than moving the method to another class, you should probably simply remove it. It was included as a convenience, and I'm pretty sure no one is using it. Note that you'll also want to remove the com.sun.jini.logging.Levels class as well.
BrianM
|
|
From: Bryan T. <br...@sy...> - 2010-09-03 16:53:26
|
Brian,
Mike applied the change to build.xml to include those jini jars because he was getting a ClassNotFoundException from the existing Sesame Server deployment task. I think it is likely that the need to add those jars goes back to the changes made to use NicUtil to obtain the address of the local host, and to NicUtil's dependency on Jini/River's Configuration object. I found the following issues which relate to this [1,2] and of course [3].
Bryan
[1] https://sourceforge.net/apps/trac/bigdata/ticket/99
[2] https://sourceforge.net/apps/trac/bigdata/ticket/126
[3] https://sourceforge.net/apps/trac/bigdata/ticket/153
________________________________
From: Brian Murphy [mailto:btm...@gm...]
Sent: Friday, September 03, 2010 12:13 PM
To: big...@li...
Subject: Re: [Bigdata-developers] [Bigdata-commit] SF.net SVN: bigdata:[3499] trunk/build.xml
On Fri, Sep 3, 2010 at 12:03 PM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote:
So it might be this method in NicUtil which is dragging in the jini dependency?
I wouldn't have expected that. But then again, I don't know exactly what was being done. As I understand it, the change was to build.xml, so I assumed that file was manually edited, not automatically updated. And if it was automatically updated, I would have expected jsk-platform.jar and jsk-lib.jar to have been added.
BrianM
|
|
From: Brian M. <btm...@gm...> - 2010-09-03 16:13:25
|
On Fri, Sep 3, 2010 at 12:03 PM, Bryan Thompson <br...@sy...> wrote:
> So it might be this method in NicUtil which is dragging in the jini
> dependency?
I wouldn't have expected that. But then again, I don't know exactly what was being done. As I understand it, the change was to build.xml, so I assumed that file was manually edited, not automatically updated. And if it was automatically updated, I would have expected jsk-platform.jar and jsk-lib.jar to have been added.
BrianM
|
|
From: Bryan T. <br...@sy...> - 2010-09-03 16:05:26
|
Brian,
So it might be this method (below) in NicUtil which is dragging in the jini dependency? While I do not disagree with your comments about the usefulness and utility of jini in non-distributed situations, would it be possible to move this method into another class so that the bigdata "core" packages do not have a jini dependency?
Thanks,
Bryan
/**
 * Three-argument version of <code>getInetAddress</code> that retrieves
 * the desired interface name from the given <code>Configuration</code>
 * parameter.
 */
public static InetAddress getInetAddress(Configuration config,
                                         String componentName,
                                         String nicNameEntry)
{
    String nicName = "NoNetworkInterfaceName";
    try {
        nicName = (String) Config.getNonNullEntry(config,
                                                  componentName,
                                                  nicNameEntry,
                                                  String.class,
                                                  "eth0");
    } catch (ConfigurationException e) {
        jiniConfigLogger.log(WARNING, e
                             + " - [componentName=" + componentName
                             + ", nicNameEntry=" + nicNameEntry + "]");
        utilLogger.log(Level.WARN, e
                       + " - [componentName=" + componentName
                       + ", nicNameEntry=" + nicNameEntry + "]");
        e.printStackTrace();
        return null;
    }
    return getInetAddress(nicName, 0, null, false);
}
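For context, a minimal, hypothetical sketch (not taken from the bigdata codebase) of the Jini/River Configuration lookup pattern that the method above relies on; the component name, entry name, and configuration source below are illustrative assumptions, and running it requires the River platform jar (jsk-platform.jar) on the classpath:

import net.jini.config.Configuration;
import net.jini.config.ConfigurationException;
import net.jini.config.ConfigurationProvider;

public class JiniConfigEntryExample {
    public static void main(String[] args) {
        try {
            // args is expected to name a Jini configuration source, e.g. a
            // server.config file (hypothetical; any valid .config file works).
            Configuration config = ConfigurationProvider.getInstance(args);
            // Resolve a String entry with a default value; this is the same
            // pattern the Config.getNonNullEntry(...) call above depends on.
            String nicName = (String) config.getEntry(
                    "com.example.component",    // hypothetical component name
                    "networkInterfaceName",     // hypothetical entry name
                    String.class,
                    "eth0");                    // default, as in the method above
            System.out.println("Resolved NIC name: " + nicName);
        } catch (ConfigurationException e) {
            e.printStackTrace();
        }
    }
}

The point is simply that any caller of the three-argument overload has to supply a net.jini.config.Configuration, which is what pulls the jini classes into the bigdata core.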
________________________________
From: Brian Murphy [mailto:btm...@gm...]
Sent: Friday, September 03, 2010 11:53 AM
To: big...@li...
Subject: Re: [Bigdata-developers] [Bigdata-commit] SF.net SVN: bigdata:[3499] trunk/build.xml
On Fri, Sep 3, 2010 at 8:04 AM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote:
I was not aware that they were deprecated. Are they being used anywhere else?
Yes. It looks like they're referenced in a couple of places;
specifically the ServerStarter.config file under bigdata-rdf
and the eclipse-specific .classpath file.
Can we just remove them from the set of jars bundled with bigdata, substituting the jars which you identify below?
Sure.
Also, see [1] which is an issue to remove the jini/river dependency for the standalone database deployment.
Even if you're not executing scaleout services, the jini
platform and utility jars are useful to include; and actually
may be required in some cases. For example, even in
standalone, I believe configuration currently references
the jini configuration classes. Additionally, NicUtil references
some of the jini utility and/or configuration classes.
That said, there is no need to include any of the jini
service-specific jar files when you're not running those
services; that is, reggie.jar, etc.
I hope this helps,
BrianM
|
|
From: Bryan T. <br...@sy...> - 2010-09-03 16:03:47
|
Brian,
So it might be this method in NicUtil which is dragging in the jini dependency?
________________________________
From: Brian Murphy [mailto:btm...@gm...]
Sent: Friday, September 03, 2010 11:53 AM
To: big...@li...
Subject: Re: [Bigdata-developers] [Bigdata-commit] SF.net SVN: bigdata:[3499] trunk/build.xml
On Fri, Sep 3, 2010 at 8:04 AM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote:
I was not aware that they were deprecated. Are they being used anywhere else?
Yes. It looks like they're referenced in a couple of places; specifically the ServerStarter.config file under bigdata-rdf and the eclipse-specific .classpath file.
Can we just remove them from the set of jars bundled with bigdata, substituting the jars which you identify below?
Sure.
Also, see [1] which is an issue to remove the jini/river dependency for the standalone database deployment.
Even if you're not executing scaleout services, the jini platform and utility jars are useful to include; and actually may be required in some cases. For example, even in standalone, I believe configuration currently references the jini configuration classes. Additionally, NicUtil references some of the jini utility and/or configuration classes.
That said, there is no need to include any of the jini service-specific jar files when you're not running those services; that is, reggie.jar, etc.
I hope this helps,
BrianM
|
|
From: Brian M. <btm...@gm...> - 2010-09-03 15:52:46
|
On Fri, Sep 3, 2010 at 8:04 AM, Bryan Thompson <br...@sy...> wrote:
> I was not aware that they were deprecated. Are they being used anywhere
> else?
Yes. It looks like they're referenced in a couple of places; specifically the ServerStarter.config file under bigdata-rdf and the eclipse-specific .classpath file.
> Can we just remove them from the set of jars bundled with bigdata,
> substituting the jars which you identify below?
Sure.
> Also, see [1] which is an issue to remove the jini/river dependency for the
> standalone database deployment.
Even if you're not executing scaleout services, the jini platform and utility jars are useful to include; and actually may be required in some cases. For example, even in standalone, I believe configuration currently references the jini configuration classes. Additionally, NicUtil references some of the jini utility and/or configuration classes.
That said, there is no need to include any of the jini service-specific jar files when you're not running those services; that is, reggie.jar, etc.
I hope this helps,
BrianM
|
|
From: husdon <no...@no...> - 2010-09-03 12:41:43
|
See <http://localhost/job/BigData/changes> |
|
From: Bryan T. <br...@sy...> - 2010-09-03 12:05:29
|
Brian,
Thanks for that. I was not aware that they were deprecated. Are they being used anywhere else? Can we just remove them from the set of jars bundled with bigdata, substituting the jars which you identify below?
Also, see [1] which is an issue to remove the jini/river dependency for the standalone database deployment.
Bryan
[1] https://sourceforge.net/apps/trac/bigdata/ticket/153
________________________________
From: Brian Murphy [mailto:btm...@gm...]
Sent: Friday, September 03, 2010 7:54 AM
To: big...@li...
Subject: Re: [Bigdata-developers] [Bigdata-commit] SF.net SVN: bigdata:[3499] trunk/build.xml
On Thu, Sep 2, 2010 at 7:13 PM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote:
Can you tell me why we have this dependency now on the jini jars for the Sesame Server install? Are they being dragged in by some utility class?
> + <fileset dir="${bigdata.dir}/bigdata-jini/lib/jini/lib">
> +     <include name="jini-core.jar" />
> +     <include name="jini-ext.jar" />
> + </fileset>
Just a reminder -- for what it's worth -- that even if there is a valid reason for depending on jini jar files above (or anywhere else in the codebase), jini-core.jar and jini-ext.jar are not the jars that anyone should be depending on. Those jars (as well as sun-util.jar and possibly a couple of others) were deprecated in the 2.x release that bigdata is using because the packaging and deployment model was changed to support more common install/upgrade strategies. They were included in that release to provide users with a 'gentler' conversion path by avoiding breaking any scripts those users might have written that depend on the old jars.
The jars that should be used are jsk-platform.jar, jsk-lib.jar and jsk-resources.jar. That said, using jini-core.jar and jini-ext.jar is not going to break anything, but I would recommend moving away from them, as the apache river project will probably eventually remove them in the future.
BrianM
jini-core.jar
|
|
From: Brian M. <btm...@gm...> - 2010-09-03 11:54:35
|
On Thu, Sep 2, 2010 at 7:13 PM, Bryan Thompson <br...@sy...> wrote:
>
> Can you tell me why we have this dependency now on the jini jars for the
> Sesame Server install? Are they being dragged in by some utility class?
>
> > + <fileset dir="${bigdata.dir}/bigdata-jini/lib/jini/lib">
> > + <include name="jini-core.jar" />
> > + <include name="jini-ext.jar" />
> > + </fileset>
Just a reminder -- for what it's worth -- that even if there is
a valid reason for depending on jini jar files above (or anywhere
else in the codebase), jini-core.jar and jini-ext.jar are not the
jars that anyone should be depending on. Those jars (as well
as sun-util.jar and possibly a couple of others) were deprecated
in the 2.x release that bigdata is using because the packaging
and deployment model was changed to support more common
install/upgrade strategies. They were included in that release to
provide users with a 'gentler' conversion path by avoiding breaking
any scripts those users might have written that depend on the
old jars.
The jars that should be used are jsk-platform.jar, jsk-lib.jar and
jsk-resources.jar. That said, using jini-core.jar and jini-ext.jar
is not going to break anything, but I would recommend moving
away from them, as the apache river project will probably
eventually remove them in the future.
BrianM
jini-core.jar
|
|
From: Bryan T. <br...@sy...> - 2010-09-02 23:13:39
|
Mike,
Can you tell me why we have this dependency now on the jini jars for the Sesame Server install? Are they being dragged in by some utility class?
Thanks,
Bryan
> -----Original Message-----
> From: mrp...@us...
> [mailto:mrp...@us...]
> Sent: Thursday, September 02, 2010 6:24 PM
> To: big...@li...
> Subject: [Bigdata-commit] SF.net SVN: bigdata:[3499] trunk/build.xml
>
> Revision: 3499
> http://bigdata.svn.sourceforge.net/bigdata/?rev=3499&view=rev
> Author: mrpersonick
> Date: 2010-09-02 22:24:22 +0000 (Thu, 02 Sep 2010)
>
> Log Message:
> -----------
> added jini jars to Sesame Server install
>
> Modified Paths:
> --------------
> trunk/build.xml
>
> Modified: trunk/build.xml
> ===================================================================
> --- trunk/build.xml 2010-09-02 20:42:43 UTC (rev 3498)
> +++ trunk/build.xml 2010-09-02 22:24:22 UTC (rev 3499)
> @@ -1992,6 +1992,10 @@
>          <fileset dir="${bigdata.dir}/bigdata/lib">
>              <include name="**/*.jar" />
>          </fileset>
> +        <fileset dir="${bigdata.dir}/bigdata-jini/lib/jini/lib">
> +            <include name="jini-core.jar" />
> +            <include name="jini-ext.jar" />
> +        </fileset>
>      </copy>
>
>      <!-- copy resources to Workbench webapp. -->
>
> _______________________________________________
> Bigdata-commit mailing list
> Big...@li...
> https://lists.sourceforge.net/lists/listinfo/bigdata-commit
|
|
From: husdon <no...@no...> - 2010-09-02 23:08:05
|
See <http://localhost/job/BigData/changes> |
|
From: husdon <no...@no...> - 2010-09-01 14:16:36
|
See <http://localhost/job/BigData/changes> |
|
From: Mike P. <mi...@sy...> - 2010-08-31 14:34:41
|
David,
What you noticed is exactly correct - there is no data specified to be loaded during the standard load phase of the dataset tests. What those tests are supposed to do is test the DatasetRepository interface, which allows data to be loaded during the query itself, specified by a named graph. This query-time loading is what is broken. If you go back and hand-modify the test configurations to load the data via the normal load phase, then the queries are answered correctly.
Thanks,
Mike
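As an illustration of the query-time loading being described here, the following is a minimal, hypothetical sketch (not part of BigdataSparqlTest or the Sesame TCK) using the Sesame DatasetRepository wrapper; the graph URL is a placeholder and the MemoryStore backend is used only to keep the sketch self-contained, whereas the dataset tests wrap a BigdataSailRepository instead:

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.dataset.DatasetRepository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

public class DatasetQueryExample {
    public static void main(String[] args) throws Exception {
        // The dataset tests wrap a BigdataSailRepository; a MemoryStore-backed
        // SailRepository is used here only to keep the sketch self-contained.
        Repository repo = new DatasetRepository(new SailRepository(new MemoryStore()));
        repo.initialize();
        RepositoryConnection cxn = repo.getConnection();
        try {
            // No explicit load is performed. The FROM clause names a graph
            // (hypothetical URL) that DatasetRepository fetches and loads at
            // query time; this is the behaviour the dataset tests exercise.
            String query = "SELECT ?s ?p ?o "
                    + "FROM <http://example.org/data.rdf> "
                    + "WHERE { ?s ?p ?o }";
            TupleQueryResult result =
                    cxn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            try {
                while (result.hasNext()) {
                    System.out.println(result.next());
                }
            } finally {
                result.close();
            }
        } finally {
            cxn.close();
        }
    }
}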
________________________________
From: dav...@no... [mailto:dav...@no...]
Sent: Tuesday, August 31, 2010 4:17 AM
To: Mike Personick; Bri...@no...; big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
On the subject of the dataset tests...
I had a quick look at those a few weeks ago because they are also run as part of the 'rdf.sail' unit test suite. It seemed that the test manifest did not include the graphs to be loaded. So, unless we derive this information from elsewhere, I would expect all the dataset tests that should return a non-empty result set to fail.
David
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 9:05 PM
To: Levine Brian (Nokia-MS/Boston); 'big...@li...'
Subject: Re: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
Also, since you're in the scale-out quads query mindset, have you given any more thought to a scale-out quads query performance benchmark test that we can use in the refactor? Like a quads version of BSBM or something?
Thanks,
Mike
________________________________
From: Bri...@no... <Bri...@no...>
To: Mike Personick; big...@li... <big...@li...>
Sent: Mon Aug 30 14:34:39 2010
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Thanks for the detailed response Mike!
=b
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 2:54 PM
To: Levine Brian (Nokia-MS/Boston); big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
The dataset test cases use the class "org.openrdf.repository.dataset.DatasetRepository" as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We've never bothered to track down the cause of this because it's never been important to anyone and we don't recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that "succeed" are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The "iterator is not progressing" warning is an obscure concurrency issue deep in the bigdata core that is on Bryan's plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK, that is the only way you would be able to get it to pass. However this is not what you'd want to do in production - there you'd want to segregate your reads and writes, do your writes against the unisolated indices and do your reads against read-only views.
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration (one directory per config). Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven't looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I've raised prefixed by "Systap:", I'd appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can't handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame's SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
* SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I'm not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
* SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn't spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
* BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g. "01" vs "1"). Since inlining stores the canonical form of the literal, these comparisons fail. It's debatable whether we'll care about this in production.
* Four configurations (property settings) were tested. For each config, I've included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline test excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
* These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5DS, 1 infra). Test results were identical except for execution times.
* During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
* The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an "un-isolated scale-out query" was seen for each query which appeared to be benign. Comments in the code indicate that this is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we'd have a read-only store and therefore a read-only connection, which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: <dav...@no...> - 2010-08-31 10:17:45
|
On the subject of the dataset tests…
I had a quick look at those a few weeks ago because they are also run as part of the ‘rdf.sail’ unit test suite. It seemed that the test manifest did not include the graphs to be loaded. So, unless we derive this information from elsewhere, I would expect all the dataset tests that should return a non-empty result set to fail.
David
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 9:05 PM
To: Levine Brian (Nokia-MS/Boston); 'big...@li...'
Subject: Re: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
Also, since you're in the scale-out quads query mindset, have you given any more thought to a scale-out quads query performance benchmark test that we can use in the refactor? Like a quads version of BSBM or something?
Thanks,
Mike
________________________________
From: Bri...@no... <Bri...@no...>
To: Mike Personick; big...@li... <big...@li...>
Sent: Mon Aug 30 14:34:39 2010
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Thanks for the detailed response Mike!
=b
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 2:54 PM
To: Levine Brian (Nokia-MS/Boston); big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
The dataset test cases use the class “org.openrdf.repository.dataset.DatasetRepository” as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We’ve never bothered to track down the cause of this because it’s never been important to anyone and we don’t recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that “succeed” are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The “iterator is not progressing” warning is an obscure concurrency issue deep in the bigdata core that is on Bryan’s plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK, that is the only way you would be able to get it to pass. However this is not what you’d want to do in production – there you’d want to segregate your reads and writes, do your writes against the unisolated indices and do your reads against read-only views.
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration—one directory per config. Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven’t looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I’ve raised prefixed by “Systap:”, I’d appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can’t handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame’s SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
• SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I’m not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
• SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn’t spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
• BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g. “01” vs “1”). Since inlining stores the canonical form of the literal, these comparisons fail. It’s debatable whether we’ll care about this in production.
• Four configurations (property settings) were tested. For each config, I’ve included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline test excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
• These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5DS, 1 infra). Test results were identical except for execution times.
• During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
• The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an “un-isolated scale-out query” was seen for each query which appeared to be benign. Comments in the code indicate that this is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we’d have a read-only store and therefore a read-only connection, which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: Mike P. <mi...@sy...> - 2010-08-30 20:06:07
|
Brian,
Also, since you're in the scale-out quads query mindset, have you given any more thought to a scale-out quads query performance benchmark test that we can use in the refactor? Like a quads version of BSBM or something?
Thanks,
Mike
________________________________
From: Bri...@no... <Bri...@no...>
To: Mike Personick; big...@li... <big...@li...>
Sent: Mon Aug 30 14:34:39 2010
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Thanks for the detailed response Mike!
=b
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 2:54 PM
To: Levine Brian (Nokia-MS/Boston); big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
The dataset test cases use the class “org.openrdf.repository.dataset.DatasetRepository” as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We’ve never bothered to track down the cause of this because it’s never been important to anyone and we don’t recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that “succeed” are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The “iterator is not progressing” warning is an obscure concurrency issue deep in the bigdata core that is on Bryan’s plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK, that is the only way you would be able to get it to pass. However this is not what you’d want to do in production – there you’d want to segregate your reads and writes, do your writes against the unisolated indices and do your reads against read-only views.
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration—one directory per config. Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven’t looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I’ve raised prefixed by “Systap:”, I’d appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can’t handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame’s SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
· SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I’m not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
· SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn’t spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
· BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g. “01” vs “1”). Since inlining stores the canonical form of the literal, these comparisons fail. It’s debatable whether we’ll care about this in production.
· Four configurations (property settings) were tested. For each config, I’ve included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline test excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
· These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5DS, 1 infra). Test results were identical except for execution times.
· During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
· The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an “un-isolated scale-out query” was seen for each query which appeared to be benign. Comments in the code indicate that this is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we’d have a read-only store and therefore a read-only connection, which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: Mike P. <mi...@sy...> - 2010-08-30 19:56:03
|
Brian,
Thinking more on it, I'm a little surprised the TCK worked on the scale-out system at all. To support quads query right now we use an expander pattern, which I'm sure we've mentioned to you before. Basically the expander takes a triple pattern and expands the results you'd get using its normal access path based on the dataset (default and named graphs) in use in the query. This implementation was written for the single-server version of bigdata, and I thought we'd even made the code conditional on not being run in scale-out mode since it is so inefficient for scale-out. I'll have to talk this over with Bryan when he gets back tonight.
Thanks,
Mike
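As a purely conceptual illustration of the expander idea described above (this is not bigdata's actual expander code, and every class and method name below is a hypothetical stand-in), the pattern amounts to running one access-path lookup per graph in the query's dataset and unioning the results:

import java.util.ArrayList;
import java.util.List;

public class ExpanderSketch {

    /** Hypothetical stand-in for a statement in a quad store. */
    public static class Quad {
        public final String s, p, o, g;
        public Quad(String s, String p, String o, String g) {
            this.s = s; this.p = p; this.o = o; this.g = g;
        }
        @Override
        public String toString() {
            return "(" + s + " " + p + " " + o + " " + g + ")";
        }
    }

    /** Hypothetical per-graph access path: match one triple pattern in one graph. */
    public interface AccessPath {
        List<Quad> match(String s, String p, String o, String graph);
    }

    /**
     * "Expand" a triple pattern over the graphs in the query's dataset:
     * one access-path lookup per graph, with the results unioned. Cheap on
     * a single server but, per the concern raised in the message above,
     * inefficient in scale-out.
     */
    public static List<Quad> expand(AccessPath ap, String s, String p, String o,
                                    List<String> datasetGraphs) {
        List<Quad> results = new ArrayList<Quad>();
        for (String g : datasetGraphs) {
            results.addAll(ap.match(s, p, o, g));
        }
        return results;
    }
}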
________________________________
From: Bri...@no... <Bri...@no...>
To: Mike Personick; big...@li... <big...@li...>
Sent: Mon Aug 30 14:34:39 2010
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Thanks for the detailed response Mike!
=b
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 2:54 PM
To: Levine Brian (Nokia-MS/Boston); big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
The dataset test cases use the class “org.openrdf.repository.dataset.DatasetRepository” as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We’ve never bothered to track down the cause of this because it’s never been important to anyone and we don’t recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that “succeed” are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The “iterator is not progressing” warning is an obscure concurrency issue deep in the bigdata core that is on Bryan’s plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK, that is the only way you would be able to get it to pass. However this is not what you’d want to do in production – there you’d want to segregate your reads and writes, do your writes against the unisolated indices and do your reads against read-only views.
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration—one directory per config. Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven’t looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I’ve raised prefixed by “Systap:”, I’d appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can’t handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame’s SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
· SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I’m not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
· SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn’t spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
· BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g. “01” vs “1”). Since inlining stores the canonical form of the literal, these comparisons fail. It’s debatable whether we’ll care about this in production.
· Four configurations (property settings) were tested. For each config, I’ve included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline test excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
· These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5DS, 1 infra). Test results were identical except for execution times.
· During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
· The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an “un-isolated scale-out query” was seen for each query which appeared to be benign. Comments in the code indicate that this is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we’d have a read-only store and therefore a read-only connection, which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: <Bri...@no...> - 2010-08-30 19:35:01
|
Thanks for the detailed response Mike!
=b
From: ext Mike Personick [mailto:mi...@sy...]
Sent: Monday, August 30, 2010 2:54 PM
To: Levine Brian (Nokia-MS/Boston); big...@li...
Subject: RE: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Brian,
The dataset test cases use the class "org.openrdf.repository.dataset.DatasetRepository" as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We've never bothered to track down the cause of this because it's never been important to anyone and we don't recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that "succeed" are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The "iterator is not progressing" warning is an obscure concurrency issue deep in the bigdata core that is on Bryan's plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK, that is the only way you would be able to get it to pass. However this is not what you'd want to do in production - there you'd want to segregate your reads and writes, do your writes against the unisolated indices and do your reads against read-only views.
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration (one directory per config). Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven't looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I've raised prefixed by "Systap:", I'd appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can't handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame's SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
* SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I'm not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
* SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn't spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
* BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g. "01" vs "1"). Since inlining stores the canonical form of the literal, these comparisons fail. It's debatable whether we'll care about this in production.
* Four configurations (property settings) were tested. For each config, I've included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline test excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
* These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5DS, 1 infra). Test results were identical except for execution times.
* During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
* The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an "un-isolated scale-out query" was seen for each query which appeared to be benign. Comments in the code indicate that this is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we'd have a read-only store and therefore a read-only connection, which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: Mike P. <mi...@sy...> - 2010-08-30 18:54:43
|
Brian,
The dataset test cases use the class "org.openrdf.repository.dataset.DatasetRepository" as a wrapper around the BigdataSailRepository. The DatasetRepository class allows you to load data into the repository from a query itself rather than invoking explicit load operations and then running a query afterwards. There is some incompatibility between the Sesame DatasetRepository class and the BigdataSailRepository that causes the query-time data load to never occur. We've never bothered to track down the cause of this because it's never been important to anyone and we don't recommend using the Sesame DatasetRepository wrapper class for concurrency reasons. If you look closely at those dataset test cases you will see that the ones that "succeed" are the ones where success happens to mean no results. Also if you load the data in manually via a load operation and then perform the same query everything works fine. We can take a closer look at this during the quads query refactor if this feature is important to you.
The "Iterator is not progressing" warning is an obscure concurrency issue deep in the bigdata core that is on Bryan's plate to eventually track down and eliminate. This has been around for a while and has not caused any actual errors that I am aware of.
The ISOLATABLE_INDICES issue has to do with full read/write transactions, which as you know are not supported in scale-out. You are doing the right thing running the database in unisolated mode for the TCK; that is the only way you would be able to get it to pass. However, this is not what you'd want to do in production - there you'd want to segregate your reads and writes: do your writes against the unisolated indices and do your reads against read-only views.
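A rough sketch of that read/write segregation, assuming the same locator-based wiring as in the test harness (the constants and call shapes are illustrative, not a canonical recipe):

import com.bigdata.journal.IIndexManager;
import com.bigdata.journal.ITx;
import com.bigdata.rdf.store.AbstractTripleStore;

public class ReadWriteSegregationSketch {

    void views(IIndexManager indexManager, String namespace) {
        // Writes: the unisolated view of the triple store.
        AbstractTripleStore writeView = (AbstractTripleStore) indexManager
                .getResourceLocator().locate(namespace, ITx.UNISOLATED);

        // Reads: a read-only view, e.g. read-committed or a specific commit time.
        AbstractTripleStore readView = (AbstractTripleStore) indexManager
                .getResourceLocator().locate(namespace, ITx.READ_COMMITTED);

        // Loads go through writeView; queries run against readView, which avoids
        // the un-isolated scale-out query path for reads.
    }
}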
Thanks,
Mike
________________________________
From: Bri...@no... [mailto:Bri...@no...]
Sent: Monday, August 30, 2010 11:44 AM
To: big...@li...
Subject: [Bigdata-developers] Results of SPARQL compliance tests on bigdata scale-out
Hi all,
Last week I ran the Sesame SPARQL compliance tests against bigdata scale-out. This email includes the results of those tests. The attached TAR file contains maven surefire output (TXT, XML and HTML) for each configuration (one directory per config). Also included in each directory is a file called failed.txt which lists the tests that failed for that configuration. I haven't looked at the failure cases in detail yet.
Systap folks, if you could comment on these results in general and especially the issues I've raised prefixed by "Systap:", I'd appreciate it.
-brian
Summary:
The quads-ascii-noinline configuration most accurately duplicates the configuration of the scale-up compliance test. A total of 13 tests reported errors in this configuration. Of those, 8 tests were dataset tests which bigdata apparently can't handle. These tests are purposely filtered out in the scale-up compliance test. Note that these same 8 tests do indeed fail in the scale-up tests when that filter is disabled. The remaining scale-out errors are in 5 graph-based compliance tests.
Systap: Could you provide some background on the dataset test case problem? What is the limitation in bigdata that causes this functionality to be unsupported?
Description of the test:
The BigdataSparqlTest is a JUnit test included with bigdata. It extends Sesame's SPARQLQueryTest which is an implementation of the W3C RDF compliance tests. This test is meant to run against a scale-up instance of bigdata.
To test scale-out, we cloned BigdataSparqlTest and created ScaleoutSparqlTest. This is essentially identical to BigdataSparqlTest except for the following:
* SPARQLQueryTest tears down and creates a new repository for each test in the suite. This was too expensive for the scale-out test (and I'm not sure that it even works programmatically) and so a single repository was created and used for an entire test run. However, a new repository (different namespace) was used for each of the 4 configurations.
* SPARQLQueryTest also closes and reopens the RepositoryConnection for each test. This started flooding the log with Zookeeper errors (I didn't spend a lot of time researching why this happens). So ScaleoutSparqlTest maintains a single connection and does a connection.clear() between each test to remove all statements.
* BigdataSparqlTest identifies a number of test cases that fail due to the recent changes for inlining of literals. For the scale-up case, inlining is programmatically disabled when one of these test cases is encountered. This is done by setting an override property when the repository is created. Since ScaleoutSparqlTest does not create a new repository for each run, this was not possible and those test cases were allowed to run (and fail). Note that one of the test configurations (see below) disables inlining for all test cases to ensure that these test cases pass when inlining is disabled (which they do).
These failures are a result of doing lexical comparisons of literals (e.g., "01" vs. "1"). Since inlining stores the canonical form of the literal, these comparisons fail. It's debatable whether we'll care about this in production.
* Four configurations (property settings) were tested. For each config, I've included the number of tests that failed. In each case, 8 of the failures are dataset-related test cases:
o Quads enabled, ASCII collator, inline tests excluded: 13 errors
o Quads enabled, ICU collator, inline tests excluded: 14 errors. Note: using the ICU collator causes the normalization-01 test to fail.
o Quads enabled, ICU collator, inline tests included: 29 errors
o Quads disabled, ASCII collator, inline tests included: 32 errors
The first two configurations are probably the most relevant for our purposes.
Additional notes/questions:
* These tests were run against a scale-out instance running on a single machine (a/k/a single-node cluster) and against an 8-node cluster (2 CS, 5 DS, 1 infra). Test results were identical except for execution times.
* During the course of these tests, there were many of the following warnings written to the log:
WARN : com.bigdata.relation.accesspath.BlockingBuffer$BlockingIterator._hasNext(BlockingBuffer.java:1920): Iterator is not progressing: ntries=298, elapsed=2003ms : BlockingIterator{ open=true, futureIsDone=true, bufferIsOpen=false, nextE=false}
This occurred on both the 1-node and 8-node configurations.
Systap: Does this indicate a performance issue? A configuration issue? Is this at all related to issuing un-isolated scale-out queries (see next item)?
* The repository connection was created using a SAIL that wrapped an AbstractTripleStore created with ITx.UNISOLATED. Consequently, a warning regarding issuing an "un-isolated scale-out query" was seen for each query; this appeared to be benign. Comments in the code indicate that it is indeed benign, but not recommended.
Systap: Not sure what the proper workaround is for this. If we instantiated an AbstractTripleStore with a known commit timestamp, we'd have a read-only store and therefore a read-only connection which would not allow the test to write out the test triples. Setting the ISOLATABLE_INDICES override property is not supported for scale-out due to a cast of an IIndexManager to Journal in BigdataSail.
Brian Levine
Principal Software Engineer
Services/Ovi Cloud
Nokia
|
|
From: Bryan T. <br...@sy...> - 2010-08-28 00:03:18
|
Fred,
One thing that I would like to introduce is a deadline for a query. Internally, this would translate into its position within a priority queue when scheduling the tasks to consume chunks of intermediate results for concurrently executing queries. However, there are several dimensions here. It is not just latency vs. throughput, but also the nature of the queries. Highly selective queries run very quickly and you can run a heavy mixture of such queries concurrently. High volume queries take longer to execute and the variance in their latency is generally less critical. One thought is that high volume queries could be drawn from a different queue where the order was based solely on arrival time. We should also impose a separate restriction on the number of high volume queries which can be executed concurrently on the cluster in order to manage contention for disk and memory (result sets are buffered in memory).
With HA and shard affinity, we can route queries based on shard affinity (the idea that some nodes in an HA group are preferred for reading on some shards, even though those nodes all have the same shards on the disk). I see shard affinity as a disk cache multiplier since some nodes will never see queries for some shards. For high volume queries we will be using the multi-block IO access paths. Those are based on streaming sequential disk reads rather than disk seek operations. We could decide to dedicate one node out of a logical group to handle high volume queries, which would leave the other nodes in the HA group to handle low latency queries. This would also have the advantage that the node running high volume queries would not have a lot of drive head chatter so it would be able to deliver higher sustained disk transfer rates.
Bryan
> -----Original Message-----
> From: Fred Oliver [mailto:fko...@gm...]
> Sent: Friday, August 27, 2010 3:17 PM
> To: Bryan Thompson
> Cc: Mike Personick; big...@li...
> Subject: Re: [Bigdata-developers] choosing the index for
> testing fully bound access paths based on index locality
>
> In the locality vs. concurrency trade-off, how might you
> balance the need for speed on single queries with the need
> for speed on multiple queries in parallel? Latency vs. throughput?
>
> Fred
|
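A minimal sketch of the two-queue arrangement described above (illustrative only - this is not bigdata's actual scheduler, and the class and field names are made up for the example):

import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.Semaphore;

public class QuerySchedulerSketch {

    static class ChunkTask implements Runnable {
        final long deadlineNanos;                 // absolute deadline of the owning query
        ChunkTask(long deadlineNanos) { this.deadlineNanos = deadlineNanos; }
        public void run() { /* consume one chunk of intermediate results */ }
    }

    // Selective (low latency) queries: earliest deadline first.
    final PriorityBlockingQueue<ChunkTask> selectiveQueue =
            new PriorityBlockingQueue<ChunkTask>(64, new Comparator<ChunkTask>() {
                public int compare(ChunkTask a, ChunkTask b) {
                    return a.deadlineNanos < b.deadlineNanos ? -1
                            : a.deadlineNanos > b.deadlineNanos ? 1 : 0;
                }
            });

    // High volume queries: plain arrival order, plus a cap on concurrent execution
    // to manage contention for disk and memory.
    final BlockingQueue<ChunkTask> highVolumeQueue = new LinkedBlockingQueue<ChunkTask>();
    final Semaphore highVolumePermits = new Semaphore(2); // e.g. at most two heavy queries at once
}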
|
From: Fred O. <fko...@gm...> - 2010-08-27 19:17:29
|
In the locality vs. concurrency trade-off, how might you balance the need for speed on single queries with the need for speed on multiple queries in parallel? Latency vs. throughput? Fred |
|
From: Bryan T. <br...@sy...> - 2010-08-27 18:57:51
|
Scale-out introduces other wrinkles, but for this specific query the query volume will be low since the subject and predicate are bound in the first triple pattern, so locality should trump in scale-out for this query as well.
If the query were less selective, for example if the subject in the first triple pattern were unbound, then I think that the outcome would be less clear cut. However, unselective queries like that are better executed using multi-block IO on the second triple pattern with a hash join against the intermediate results from the first triple pattern. So, a different kettle of fish altogether.
Would you file an issue for this?
Thanks,
Bryan
> -----Original Message-----
> From: Mike Personick
> Sent: Friday, August 27, 2010 2:53 PM
> To: Bryan Thompson
> Cc: big...@li...
> Subject: RE: choosing the index for testing fully bound access
> paths based on index locality
>
> You would definitely get better locality by using POS instead
> of SPO in that case since the P and O remain constant for
> every asBound version of the (?c, d, e) tail. However, just
> playing devil's advocate, we've had the argument before over
> whether locality or concurrency wins in scale-out. For
> example, recall the discussion we had over forward or reverse
> properties (excerpted below). I argued that using forward
> properties would always win because you'd have better
> locality. You argued that the locality would be offset by
> increased parallelism from using reverse properties. How is
> this argument any different? Is it because we are talking
> about only one tail instead of many tails?
>
>
> ...
>
> The case where you were gathering multiple properties for A
> instead of just one is a little different:
>
> Forward:
>
> select ?a,?b,?c,?d
> where {
> ?a type A .
> ?a AtoB ?b .
> ?a AtoC ?c .
> ?a AtoD ?d .
> }
>
> Reverse:
>
> select ?a,?b,?c,?d
> where {
> ?a type A .
> ?b BtoA ?a .
> ?c CtoA ?a .
> ?d DtoA ?a .
> }
>
> In the case of forward you'd visit the POS index and then all
> the properties for a given ?a binding would be in the same
> place for the remainder of the tails (same SPO shard). For
> reverse you'd visit the POS index and then you'd have a set
> of bound PO tuples that would need to be visited using POS
> shards. Again, unlikely they'd be in the same place, which
> means network movement to complete the join. I assert that
> forward will always beat reverse in this case because you
> have all the data you need for tails 2-4 locally. Bryan
> asserts that reverse can overcome the lack of locality via
> concurrency.
>
> -----Original Message-----
> From: Bryan Thompson
> Sent: Friday, August 27, 2010 12:42 PM
> To: Mike Personick
> Cc: big...@li...
> Subject: choosing the index for testing fully bound access
> paths based on index locality
>
> Mike,
>
> David and I talked through a scenario where we might benefit
> by choosing a non-SPO index when testing a fully bound
> triple. Consider:
>
> query :- (a b ?c) AND (?c, d, e).
>
> Assuming that (a b ?c) is more selective, we would normally
> choose the SPO index for both access paths. However, the POS
> index will have better locality for (?c d e), so perhaps we
> would do better by sending the binding sets to that index?
>
> The guiding principle would be, "when fully bound, choose the
> index with better locality based on the variable(s) in the
> triple pattern."
>
> If you think this makes sense, let's file an issue for this.
> I would have to review LUBM/BSBM to be certain, but I would
> not be surprised if both of those benchmarks included queries
> which had the same characteristic. As the number of
> variables in the second triple pattern increases, there will
> be less locality in the index so this might work better for
> one unbound triple pattern than for two unbound triple patterns.
>
> Thanks,
> Bryan
>
|
|
From: Mike P. <mi...@sy...> - 2010-08-27 18:53:44
|
You would definitely get better locality by using POS instead of SPO in that case since the P and O remain constant for every asBound version of the (?c, d, e) tail. However, just playing devil's advocate, we've had the argument before over whether locality or concurrency wins in scale-out. For example, recall the discussion we had over forward or reverse properties (excerpted below). I argued that using forward properties would always win because you'd have better locality. You argued that the locality would be offset by increased parallelism from using reverse properties. How is this argument any different? Is it because we are talking about only one tail instead of many tails?
...
The case where you were gathering multiple properties for A instead of just one is a little different:
Forward:
select ?a,?b,?c,?d
where {
?a type A .
?a AtoB ?b .
?a AtoC ?c .
?a AtoD ?d .
}
Reverse:
select ?a,?b,?c,?d
where {
?a type A .
?b BtoA ?a .
?c CtoA ?a .
?d DtoA ?a .
}
In the case of forward you'd visit the POS index and then all the properties for a given ?a binding would be in the same place for the remainder of the tails (same SPO shard). For reverse you'd visit the POS index and then you'd have a set of bound PO tuples that would need to be visited using POS shards. Again, unlikely they'd be in the same place, which means network movement to complete the join. I assert that forward will always beat reverse in this case because you have all the data you need for tails 2-4 locally. Bryan asserts that reverse can overcome the lack of locality via concurrency.
-----Original Message-----
From: Bryan Thompson
Sent: Friday, August 27, 2010 12:42 PM
To: Mike Personick
Cc: big...@li...
Subject: choosing the index for testing fully bound access paths based on index locality
Mike,
David and I talked through a scenario where we might benefit by choosing a non-SPO index when testing a fully bound triple. Consider:
query :- (a b ?c) AND (?c, d, e).
Assuming that (a b ?c) is more selective, we would normally choose the SPO index for both access paths. However, the POS index will have better locality for (?c d e), so perhaps we would do better by sending the binding sets to that index?
The guiding principle would be, "when fully bound, choose the index with better locality based on the variable(s) in the triple pattern."
If you think this makes sense, let's file an issue for this. I would have to review LUBM/BSBM to be certain, but I would not be surprised if both of those benchmarks included queries which had the same characteristic. As the number of variables in the second triple pattern increases, there will be less locality in the index, so this might work better for one unbound triple pattern than for two unbound triple patterns.
Thanks,
Bryan
|
|
From: Bryan T. <br...@sy...> - 2010-08-27 18:42:22
|
Mike, David and I talked through a scenario where we might benefit by choosing a non-SPO index when testing a fully bound triple. Consider: query :- (a b ?c) AND (?c, d, e). Assuming that (a b ?c) is more selective, we would normally choose the SPO index for both access paths. However, the POS index will have better locality for (?c d e), so perhaps we would do better by sending the binding sets to that index? The guiding principle would be, "when fully bound, choose the index with better locality based on the variable(s) in the triple pattern." If you think this makes sense, let's file an issue for this. I would have to review LUBM/BSBM to be certain, but I would not be surprised if both of those benchmarks included queries which had the same characteristic. As the number of variables in the second triple pattern increases, there will be less locality in the index, so this might work better for one unbound triple pattern than for two unbound triple patterns. Thanks, Bryan |
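A small sketch of that guiding principle (illustrative only; this is not bigdata's actual key-order selection code, and the names are made up for the example):

public class IndexChoiceSketch {

    enum KeyOrder { SPO, POS, OSP }   // made-up enum for the example

    // For a fully bound probe produced by binding the pattern's variable(s), prefer the
    // index whose key prefix is built from the components that were constant in the
    // pattern, so consecutive asBound probes share a prefix (same shard, nearby pages).
    static KeyOrder chooseForFullyBoundProbe(boolean sWasVar, boolean pWasVar, boolean oWasVar) {
        if (sWasVar && !pWasVar && !oWasVar) return KeyOrder.POS; // (?c d e): P,O constant
        if (pWasVar && !sWasVar && !oWasVar) return KeyOrder.OSP; // (a ?p e): O,S constant
        return KeyOrder.SPO;                                      // (a b ?c) or default
    }
}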
|
From: <Bri...@no...> - 2010-08-25 13:03:37
|
Sounds like a great idea to me. +1
-b
-----Original Message-----
From: ext Bryan Thompson [mailto:br...@sy...]
Sent: Wednesday, August 25, 2010 5:49 AM
To: big...@li...
Subject: [Bigdata-developers] breaking out the bulk loader configuration from the main bigdata configuration file.
All,
I would like to decouple the bulk loader configuration from the main bigdata configuration file. This will greatly simplify the main configuration file and make it possible to have sample configuration files for different bulk loader tasks.
We wind up provisioning the specific triple or quad store instance when running the bulk loader for the first time against the namespace for that triple/quads store. For purely historical reasons, the bulk loader is configured by two component sections in the bigdata configuration file:
- lubm : This is where we are setting the properties which will govern the triple/quad store.
- com.bigdata.rdf.load.MappedRDFDataLoadMaster : This is where we describe the bulk load job.
The MappedRDFDataLoadMaster section also uses some back references into fields defined in the lubm section, but the entire lubm section could be folded into the MappedRDFDataLoadMaster section.
At present, there are the following back references from this section into the rest of the configuration file:
bigdata.dataServiceCount : It seems that we should simply run with all logical data services found in jini/zookeeper.
bigdata.clientServiceCount : It seems to me that the #of client services could default to all unless overridden.
There is also a relatively complex declaration of the services templates which is used to describe which services must be running as a precondition for the bulk loader job. I propose that this should be either folded into the bulk loader code or that these preconditions be abolished, as they basically assert that the configured system must be running (see bigdataCluster.config#1848).
awaitServicesTimeout = 10000;
servicesTemplates = new ServicesTemplate[] {...}
And at bigdataCluster.config#1893, a template is established which says that the bulk loader will use dedicated client service nodes (rather than running the distributed bulk load job on the data service nodes, which can also host distributed job execution).
clientsTemplate = new ServicesTemplate(...);
I like to use distinct client service nodes because the bulk loader tends to be memory hungry when it is buffering data for the shards in memory before writing on the data services. Running the distributed job on the data services adds to the burden of the data service nodes and we still need to buffer the data and then scatter it to the appropriate shards. It is possible to reduce the memory demand of the bulk loader (by adjusting the queue capacity and chunk size used by the asynchronous write pipeline), and I have done this in the bigdataStandalone.config file. For this reason, the triple/quads store configurations are not "one size fits all" and I will get into this in a follow-on email which addresses performance tuning properties used in the configuration files.
There are also some optional properties which really should be turned off unless you are engaged in forensics:
indexDumpDir = new File("@NAS@/"+jobName+"-indexDumps");
indexDumpNamespace = lubm.namespace;
Based on this, it seems that we could isolate the bulk loader configuration relatively easily into its own configuration file. That configuration file would only need to know a bare minimum of things:
- jini groups and locators.
- zookeeper quorum IPs and ports.
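A minimal sketch of what such a standalone file might look like, using the same Jini configuration syntax as bigdataCluster.config (the component and entry names below are illustrative placeholders, not the final names):

/* Sketch of a standalone bulk loader configuration file. Component and entry
 * names are placeholders except where they already appear in bigdataCluster.config;
 * only discovery information is duplicated from the main configuration. */

com.bigdata.service.jini.JiniClient {
    // jini discovery: groups and/or explicit lookup locators.
    groups = new String[] { "bigdata" };
    locators = new net.jini.core.discovery.LookupLocator[] {};
}

org.apache.zookeeper.ZooKeeper {
    // zookeeper quorum (host:port list).
    servers = "zk1:2181,zk2:2181,zk3:2181";
}

com.bigdata.rdf.load.MappedRDFDataLoadMaster {
    jobName = "bulkLoad1";
    namespace = "kb";  // triple/quad store namespace to provision on first run
    // The triple/quad store properties formerly held in the "lubm" section would
    // be folded in here, along with the data source declarations for the job.
}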
Thanks,
Bryan
|
|
From: Bryan T. <br...@sy...> - 2010-08-25 12:16:31
|
Brian,
Ok. My interest ties in with my other email to the list to break out the bulk loader configuration from the main configuration file [1]. I'd like to be able to configure the bulk loader based on the minimum amount of shared state. This is not a blocking issue as I could proceed on this issue by replicating the zookeeper client configuration information into a bulk loader configuration file for now and then slim down the configuration further after you finish that wrapper.
Let me know if I can help on the smart proxy patterns for the CS/DS. The work that I am currently doing on scale-out query evaluation will help to take some of the data out of the RMI messaging but it will not introduce any changes to the CS/DS public APIs. What I am doing is extending the ResourceService exposed by the DS to move index segment files around to also allow the interchange of ByteBuffers containing intermediate query results. In combination with a few other things, this is going to tremendously simplify how we handle join evaluation for scale-out.
Thanks,
Bryan
[1] https://sourceforge.net/apps/trac/bigdata/ticket/148
________________________________
From: Brian Murphy [mailto:btm...@gm...]
Sent: Wednesday, August 25, 2010 8:03 AM
To: big...@li...
Subject: Re: [Bigdata-developers] zookeeper discovery
On Wed, Aug 25, 2010 at 5:49 AM, Bryan Thompson <br...@sy...<mailto:br...@sy...>> wrote:
You had been looking at wrapping zookeeper for discovery from jini. What is the status of that effort?
The wrapper class is complete and checked in to my development branch, but only manually tested. Currently, it can be started from pstart or from the boot manager process added to that branch. It still has to be modified to be startable from the ServicesManagerService, but more importantly, a client wrapper needs to be written. I was planning on doing the remaining work after the smart proxy work is complete for each of the other services, of which the data service and the client service remain.
BrianM |