From: Wolfgang M. <wol...@ex...> - 2010-09-01 09:00:04
|
> A stored node in eXist is currently represented by a pair
> <DocumentImpl doc, long nodeId>. This will change to
> <int docId, long nodeId>.

Sorry, it is <DocumentImpl doc, NodeId nodeId> and will become <int docId, NodeId nodeId>, where NodeId is essentially a binary encoded hierarchical identifier.

Wolfgang |
From: Jason S. <js...@in...> - 2010-09-01 14:06:01
|
The problem with deadlocks (the one I ran into) doesn't actually have anything to do with the collection hierarchy. They are simply caused by taking two locks out of order. The fact that ReentrantReadWriteLock appears to be using a hierarchy (it actually isn't) is a red herring.

If I understand this correctly, the new design will need to manage, potentially, hundreds of thousands of locks that can be taken in arbitrary order. If that's true, there are still going to be deadlocks. And since this is such fantastically granular locking, finding the deadlocks will be very slow (I think it's an n^2 algorithm - at least Java's deadlock detection appears to be n^2). There are only three ways to avoid deadlocks.

THE THREE WAYS TO PREVENT DEADLOCKS

1) Use a single global mutex to lock. A single mutex cannot deadlock.
2) If you have more than one lock, guarantee that all locks are taken in the same order every time.
3) Don't use locking at all. Use a scheme that avoids the need for locking.

Otherwise, if you are using mutex locks, you will have deadlocks. They are, unfortunately, unavoidable if the design doesn't fall into one of these 3 categories.

Option 3 is actually feasible. This is the Clojure approach to the world - all data structures are read only, and you "mutate" something by reconstructing a new copy with the changes (sharing the old data wherever possible).

I am assuming, going forward, that the structures in the database are intended to be mutable, and they need to be protected by locks that can be taken in arbitrary order. If that is correct, then deadlocks are going to occur.

Does this make sense, or am I out in left field? :-)

-----Original Message-----
From: Wolfgang Meier [mailto:wol...@ex...]
Sent: Wednesday, September 01, 2010 2:53 AM
To: Jason Smith
Cc: Adam Retter; Paul Ryan; eXist development; Michael J. Pelikan; Todd Gochenour
Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries. 
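The lock-ordering discipline in point 2 can be sketched in Java roughly as follows. The names here (OrderedLockManager, lockAll) are invented for illustration and are not eXist code: every thread sorts the document ids it needs before acquiring any locks, so no two threads can ever hold the same pair of locks in opposite order.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of option 2: a single global lock order (ascending document id).
// Two threads that both need docs {A, B} will always lock min(A,B) first,
// so the circular wait required for a deadlock cannot arise.
class OrderedLockManager {
    private final Map<Integer, ReentrantLock> docLocks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(int docId) {
        return docLocks.computeIfAbsent(docId, id -> new ReentrantLock());
    }

    /** Acquires all locks in ascending docId order; returns the order used. */
    public int[] lockAll(int... docIds) {
        int[] order = docIds.clone();
        Arrays.sort(order);                      // the one canonical global order
        for (int id : order) lockFor(id).lock();
        return order;
    }

    public void unlockAll(int... docIds) {
        for (int id : docIds) lockFor(id).unlock();
    }
}
```

The caveat, as the thread goes on to discuss, is that you must know the full lock set up front; XQuery makes that hard.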
> If you remove collection locks, does that mean you are planning to
> lock at the resource level?

Yes. In the current design, a resource does physically belong to a collection. To read or write a resource, you have to access the collection object first, then acquire a lock on the resource. Removing this direct dependency between collection and resource will have a number of benefits:

1) resource updates will become faster and scale better: indexes are currently organized by collection, which introduces a dependency between collection size and update speed (this has already been dropped for the structural index in trunk). The larger the collection, the slower your updates. Removing this dependency will increase scalability. You will be able to index or reindex a single resource without touching or locking the rest of the collection.

2) queries will consume less memory: right now, a query needs to load all required collections plus the internal metadata (name, permissions, owner...) for all resources at the start. This is slow and takes a lot of space. If resources are decoupled from collections, a query will just need the document id plus the lock for every resource. We no longer have to retrieve the actual document object. Instead, a lock manager maintains a simple map of documentId -> lock and the query uses the documentId only. A stored node in eXist is currently represented by a pair <DocumentImpl doc, long nodeId>. This will change to <int docId, long nodeId>.

3) the transaction log will become much easier to maintain. Right now we have to make sure that transactional integrity is preserved for both dom.dbx and collections.dbx at the same time, which introduces a number of problems. Decoupling them simplifies the transaction log since both indexes become independent and can be maintained independently.

4) the main deadlock issue, which is caused by the hierarchy of collection and resource locks, will disappear. If the collection is just a virtual entity, you only need to lock it if you modify its metadata (name, owner, permissions). Writing or reading a resource will not require a lock on the collection anymore since the resources are just loosely assigned to a collection.

5) if collections are entirely virtual, you can assign a resource to more than one collection. On the other hand, it may become possible for a resource to not be a member of any collection at all (in which case it is handled as a direct child of the root collection?).

> Wouldn't that potentially lead to an awful lot of locks being taken?
> Am I misreading the idea?

You won't need to take more locks than you do right now (rather less, since the collection locks disappear).

Related to this discussion is the question of dirty reads: currently the query engine does allow dirty reads to some extent. I'm not yet sure if we should keep it like that (and just handle them more transparently) or disallow them completely.

I'll try to come up with some graphic or mind map to explain the overall picture.

Wolfgang |
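Wolfgang's "map of documentId -> lock" could look something like this minimal sketch. The class and method names are assumptions for illustration, not the actual eXist lock manager; the point is that a query holds only an int document id and asks the manager for the lock, instead of loading a DocumentImpl first.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: lazily-populated map from documentId to its read/write lock.
// Queries never need the document object just to take the lock.
class DocumentLockManager {
    private final Map<Integer, ReadWriteLock> locks = new ConcurrentHashMap<>();

    /** Always returns the same lock instance for a given document id. */
    public ReadWriteLock getLock(int docId) {
        return locks.computeIfAbsent(docId, id -> new ReentrantReadWriteLock());
    }

    public int size() {
        return locks.size();
    }
}
```

A reader would then do `getLock(docId).readLock().lock()` around its access, with no collection lock involved.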
From: Wolfgang M. <wol...@ex...> - 2010-09-01 15:23:40
|
> The problem with deadlocks (the one I ran into) doesn't actually have anything to do with
> the collection hierarchy. They are simply caused by taking two locks out of order.

Seems we need a bunch of low-level test cases first (there are some higher level tests I used to debug similar issues). Sure, thread T1 can take a lock on doc A, then doc B, while thread T2 locks B then A (see XQuery below). This case should indeed be handled by the deadlock detection. However, in most cases eXist does take locks in order (due to the fact that collections are always ordered the same).

> If I understand this correctly, the new design will need to manage, potentially, hundreds
> of thousands of locks that can be taken in arbitrary order.

Most traditional databases use an even finer granularity (single pages, records, tuples). The new design would in no way be different from the current situation. The query optimizer can effectively limit the number of resources to be locked. I'm convinced that investing time into improving the optimizer is the best way to achieve quick improvements.

By redesigning the storage backend, you can increase performance by 50% or maybe 70% if things go well. Improvements to the query optimizer often result in performance wins up to 500% and more. I just did it again for some full text queries (not committed yet).

> 2) If you have more than one lock, guarantee that all locks are taken in the same order every time.

Given the possibilities of XQuery, this is going to be difficult to guarantee, e.g.:

let $doc := request:get-parameter("doc", ())
let $a := doc($doc)/root
let $b := doc($a//link/@href/string())
return
    (: do something, even read-only, with $a and $b :)

Since $b is determined by querying $a, you don't know in advance where it will point to. If there's a circular link between $a and $b, you deadlock. Allowing dirty reads helps a bit here.

> 3) Don't use locking at all. Use a scheme that avoids the need for locking.

There are (relational) databases which avoid locking, e.g. by using a shadow page concept on low-level storage and making the transaction fail if a conflict occurs when it is committed. This requires a different design on all levels though - up to the user who has to live with the fact that transactions can fail and need to be redone.

> Option 3 is actually feasible. This is the Clojure approach to the world - all data
> structures are read only, and you "mutate" something by reconstructing a new copy with
> the changes (sharing the old data wherever possible).

Yes, it is feasible, see above. But very difficult to implement.

> I am assuming, going forward, that the structures in the database are intended to be
> mutable, and they need to be protected by locks that can be taken in arbitrary order. If
> that is correct, then deadlocks are going to occur.

Correct. See above.

Wolfgang |
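The commit-time conflict scheme described above (readers always see an immutable snapshot; a transaction whose snapshot went stale fails and must be redone) can be modelled in miniature with a single compare-and-set. This is a toy illustration of the idea, not a shadow-page implementation; all names are invented.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Sketch: the whole database state is one immutable snapshot. A writer
// derives a new snapshot and commits it atomically; if another writer
// committed in between, the CAS fails and the transaction is redone.
class SnapshotStore<T> {
    private final AtomicReference<T> current;

    SnapshotStore(T initial) {
        current = new AtomicReference<>(initial);
    }

    public T read() {
        return current.get();               // readers never block
    }

    /** Returns true on commit, false if a conflicting commit won the race. */
    public boolean commit(T expectedSnapshot, T newSnapshot) {
        return current.compareAndSet(expectedSnapshot, newSnapshot);
    }

    /** Retry loop: redo the transaction until it commits. */
    public T update(UnaryOperator<T> tx) {
        while (true) {
            T snap = current.get();
            T next = tx.apply(snap);
            if (current.compareAndSet(snap, next)) {
                return next;
            }
        }
    }
}
```

No locks means no deadlocks, but, as noted, the user must live with transactions that can fail and be redone.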
From: Dmitriy S. <sha...@gm...> - 2010-11-05 05:45:45
|
On Wed, Sep 1, 2010 at 8:23 PM, Wolfgang Meier <wol...@ex...> wrote:

> By redesigning the storage backend, you can increase performance by
> 50% or maybe 70% if things go well. Improvements to the query
> optimizer often result in performance wins up to 500% and more. I just
> did it again for some full text queries (not committed yet).

Is it possible to analyze XQuery scripts and produce an index configuration? The normal practice is to store scripts in the db, so it should be possible to analyze the XPath expressions they contain and derive index settings; at the same time, this would require the statistics index. Wolfgang, is the statistics index stable?

--
Dmitriy Shabanov |
From: Jason S. <js...@in...> - 2010-09-01 17:09:34
|
If used from the front-end APIs, eXist does not deadlock on typical XQueries. I'm not sure what you mean by "collections are always ordered the same" though. I can reference fn:collection("...") in multiple places in a single XPath, and these are not guaranteed to execute in the same order from one XQuery to the next. fn:collection("...") must acquire a collection lock at some point, correct?

Also, keep in mind that deadlock detection in eXist is currently broken, and I don't think it can be fixed. I'm working on a write-up.

Other databases do have high concurrency. I don't know about XML databases though. I've worked with one in particular that is commercial. It has a collection-based locking model, and it throws deadlock errors occasionally. When this happens, you roll back and start over. And this other database appears to have some concurrency limitations as well, just from some of the testing we've done (very preliminary).

So this is what is driving the conversation - how do we get to granular, concurrent access without killing ourselves with deadlocking? A small number of intention locks is manageable. Too few results in limited concurrency. Too many results in a really difficult deadlock problem (detection is expensive). Any solution that does not need locking at all should get CAREFUL consideration!!! :-) IMHO

-----Original Message-----
From: Wolfgang Meier [mailto:wol...@ex...]
Sent: Wednesday, September 01, 2010 9:24 AM
To: Jason Smith
Cc: Adam Retter; Paul Ryan; eXist development; Michael J. Pelikan; Todd Gochenour
Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

> The problem with deadlocks (the one I ran into) doesn't actually have anything to do with
> the collection hierarchy. They are simply caused by taking two locks out of order.

Seems we need a bunch of low-level test cases first (there are some higher level tests I used to debug similar issues). 
Sure, thread T1 can take a lock on doc A, then doc B, while thread T2 locks B then A (see XQuery below). This case should indeed be handled by the deadlock detection. However, in most cases eXist does take locks in order (due to the fact that collections are always ordered the same). > If I understand this correctly, the new design will need to manage, potentially, hundreds > of thousands of locks that can be taken in arbitrary order. Most traditional databases use an even finer granularity (single pages, records, tuples). The new design would in no way be different from the current situation. The query optimizer can effectively limit the number of resources to be locked. I'm convinced that investing time into improving the optimizer is the best way to achieve quick improvements. By redesigning the storage backend, you can increase performance by 50% or maybe 70% if things go well. Improvements to the query optimizer often result in performance wins up to 500% and more. I just did it again for some full text queries (not committed yet). > 2) If you have more than one lock, guarantee that all locks are taken in the same order every time. Given the possibilities of XQuery, this is going to be difficult to guarantee,e.g.: let $doc := request:get-parameter("doc", ()) let $a := doc($doc)/root let $b := doc($doc//link/@href/string()) return (: do something, even read-only, with $a and $b :) Since $b is determined by querying $a, you don't know in advance where it will point to. If there's a circular link between $a and $b, you deadlock. Allowing dirty reads helps a bit here. > 3) Don't use locking at all. Use a scheme that avoids the need for locking. There are (relational) databases which avoid locking, e.g. by using a shadow page concept on low-level storage and make the transaction fail if a conflict occurs when it is committed. 
This requires a different design on all levels though - up to the user who has to live with the fact that transactions can fail and need to be redone. > Option 2 is actually feasible. This is the Clojure approach to the world - all data > structures are read only, and you "mutate" something by reconstructing a new copy with > the changes (sharing the old data wherever possible). Yes, it is feasible, see above. But very difficult to implement. > I am assuming, going forward, that the structures in the database are intended to be > mutable, and they need to be protected by locks that can be taken in arbitrary order. If > that is correct, then deadlocks are going to occur. Correct. See above. Wolfgang |
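For reference, the deadlock detection Jason calls expensive is usually framed as cycle detection in a wait-for graph: an edge T1 -> T2 means thread T1 is waiting for a lock held by T2, and a cycle means deadlock. A naive per-thread scan like the sketch below is roughly quadratic in the number of threads, which is the cost being discussed; the names are illustrative, not eXist's detector.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of wait-for-graph deadlock detection: for each thread, walk the
// graph and see whether it can reach itself again.
class WaitForGraph {
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    /** Records that 'waiter' is blocked on a lock held by 'holder'. */
    public void addEdge(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    /** True if some thread can reach itself through wait-for edges. */
    public boolean hasDeadlock() {
        for (String start : waitsFor.keySet()) {
            Set<String> seen = new HashSet<>();
            ArrayDeque<String> stack = new ArrayDeque<>(waitsFor.get(start));
            while (!stack.isEmpty()) {
                String t = stack.pop();
                if (t.equals(start)) {
                    return true;            // cycle back to start: deadlock
                }
                if (seen.add(t)) {
                    stack.addAll(waitsFor.getOrDefault(t, Collections.emptySet()));
                }
            }
        }
        return false;
    }
}
```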
From: Wolfgang M. <wol...@ex...> - 2010-09-01 19:29:44
|
> Wow steady on there! Thats a bit early! how about 3pm UTC?

I have an unexpected appointment tomorrow. Could we move the conference to 6pm UTC? If not, I'll need to see how I can plug in.

Wolfgang |
From: Loren C. <lor...@gm...> - 2010-09-01 19:34:06
|
My only time constraint is 12:00 - 13:00 CDT (UTC - 5). I have a lunch meeting.

Loren

On Sep 1, 2010, at 02:29 PM, Wolfgang Meier wrote:

>> Wow steady on there! Thats a bit early! how about 3pm UTC?
>
> I have an unexpected appointment tomorrow. Could we move the
> conference to 6pm UTC? If not, I'll need to see how I can plug in.
>
> Wolfgang |
From: Dmitriy S. <sha...@gm...> - 2010-09-02 06:30:07
Attachments:
smime.p7s
|
On Wed, 2010-09-01 at 21:29 +0200, Wolfgang Meier wrote:

> > Wow steady on there! Thats a bit early! how about 3pm UTC?
>
> I have an unexpected appointment tomorrow. Could we move the
> conference to 6pm UTC? If not, I'll need to see how I can plug in.

Fine for me.

--
Cheers,
Dmitriy Shabanov |
From: Adam R. <ad...@ex...> - 2010-09-02 11:23:38
|
On 1 September 2010 18:09, Jason Smith <js...@in...> wrote:

> If used from the front-end APIs, eXist does not deadlock on typical XQueries. I'm not sure what you mean by "collections are always ordered the same" though. I can reference fn:collection("...") in multiple places in a single XPath, and these are not guaranteed to execute in the same order from one XQuery to the next. Fn:collection("...") must acquire a collection lock at some point, correct?
>
> Also, keep in mind that deadlock detection in eXist is currently broken, and I don't think it can be fixed. I'm working on a write-up.
>
> Other databases do have high concurrency. I don't know about XML databases though. I've worked with one in particular that is commercial. It has a collection-based locking model, and it throws deadlock errors occasionally. When this happens, you rollback and start over. And this other database appears to have some concurrency limitations as well, just from some of the testing we've done (very preliminary).
>
> So this is what is driving the conversation - how do we get to granular, concurrent access without killing ourselves with deadlocking? A small number of intention locks is manageable. Too few results in limited concurrency. Too many results in a really difficult deadlock problem (detection is expensive). Any solution that does not need locking at all should get CAREFUL consideration!!! :-) IMHO

Is there not a whole class of lock free concurrent algorithms though? For example CAS (Compare and Swap) is available in Java I believe. I know that Scala exploits this for their parallel programming constructs.

http://en.wikipedia.org/wiki/Compare-and-swap
http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/package-summary.html

> -----Original Message-----
> From: Wolfgang Meier [mailto:wol...@ex...]
> Sent: Wednesday, September 01, 2010 9:24 AM
> To: Jason Smith
> Cc: Adam Retter; Paul Ryan; eXist development; Michael J.
Pelikan; Todd Gochenour > Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries. > >> The problem with deadlocks (the one I ran into) doesn't actually have anything to do with >> the collection hierarchy. They are simply caused by taking two locks out of order. > > Seems we need a bunch of low-level test cases first (there are some > higher level tests I used to debug similar issues). Sure, thread T1 > can take a lock on doc A, then doc B, while thread T2 locks B then A > (see XQuery below). This case should indeed be handled by the deadlock > detection. However, in most cases eXist does take locks in order (due > to the fact that collections are always ordered the same). > >> If I understand this correctly, the new design will need to manage, potentially, hundreds >> of thousands of locks that can be taken in arbitrary order. > > Most traditional databases use an even finer granularity (single > pages, records, tuples). The new design would in no way be different > from the current situation. The query optimizer can effectively limit > the number of resources to be locked. I'm convinced that investing > time into improving the optimizer is the best way to achieve quick > improvements. > > By redesigning the storage backend, you can increase performance by > 50% or maybe 70% if things go well. Improvements to the query > optimizer often result in performance wins up to 500% and more. I just > did it again for some full text queries (not committed yet). > >> 2) If you have more than one lock, guarantee that all locks are taken in the same order every time. > > Given the possibilities of XQuery, this is going to be difficult to > guarantee,e.g.: > > let $doc := request:get-parameter("doc", ()) > let $a := doc($doc)/root > let $b := doc($doc//link/@href/string()) > return > (: do something, even read-only, with $a and $b :) > > Since $b is determined by querying $a, you don't know in advance where > it will point to. 
If there's a circular link between $a and $b, you > deadlock. Allowing dirty reads helps a bit here. > >> 3) Don't use locking at all. Use a scheme that avoids the need for locking. > > There are (relational) databases which avoid locking, e.g. by using a > shadow page concept on low-level storage and make the transaction fail > if a conflict occurs when it is committed. This requires a different > design on all levels though - up to the user who has to live with the > fact that transactions can fail and need to be redone. > >> Option 2 is actually feasible. This is the Clojure approach to the world - all data >> structures are read only, and you "mutate" something by reconstructing a new copy with >> the changes (sharing the old data wherever possible). > > Yes, it is feasible, see above. But very difficult to implement. > >> I am assuming, going forward, that the structures in the database are intended to be >> mutable, and they need to be protected by locks that can be taken in arbitrary order. If >> that is correct, then deadlocks are going to occur. > > Correct. See above. > > Wolfgang > -- Adam Retter eXist Developer { United Kingdom } ad...@ex... irc://irc.freenode.net/existdb |
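A minimal example of the CAS loop Adam refers to, using the java.util.concurrent.atomic package: there is no mutex to deadlock on, and a thread that loses the race simply retries. (AtomicLong already offers incrementAndGet; the explicit loop is written out here only to show the compare-and-swap pattern.)

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a lock-free counter: read the current value, compute the next
// one, and publish it only if nobody else got there first.
class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        while (true) {
            long cur = value.get();
            long next = cur + 1;
            if (value.compareAndSet(cur, next)) {
                return next;                // our update won; done
            }
            // another thread committed first: retry with the fresh value
        }
    }

    public long get() {
        return value.get();
    }
}
```

As Wolfgang points out in his reply, this only composes into a database design if transaction management, caching and recovery are rethought around it.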
From: Peter C. <pet...@me...> - 2010-09-02 12:18:01
|
On 1 September 2010 18:09, Jason Smith <js...@in...> wrote:

> Other databases do have high concurrency. I don't know about XML databases
> though. I've worked with one in particular that is commercial. It has a
> collection-based locking model, and it throws deadlock errors occasionally.
> When this happens, you rollback and start over.

I think I just had an allergic reaction to a pronoun :-). Let's just take the pronoun out of that paragraph and be clear about what we're talking about here. In particular, what is "you"?

In order to be able to start over, "you" must be able to roll back to a point that is well-defined. This generally requires transaction boundaries to be controllable from "you"; if not, "you" cannot determine the point from which to start over. In general, the only sensible "you" for this purpose is the calling application.

Does eXist have transaction boundaries that are controllable from the calling application? If not, what other values of "you" might make sense for this approach?

- Peter |
From: Jason S. <js...@in...> - 2010-09-02 13:01:35
|
When a deadlock occurs, the part of the code that is responsible for starting and controlling the transaction must catch the deadlock exception, roll back to the previous state, and restart the operation. One thread in the deadlock proceeds; the other thread in the deadlock must yield and restart. Currently, eXist is not fully transactional, so there are conditions on restarting.

From: pet...@go... [mailto:pet...@go...] On Behalf Of Peter Crowther
Sent: Thursday, September 02, 2010 6:18 AM
To: Jason Smith
Cc: eXist development
Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

On 1 September 2010 18:09, Jason Smith <js...@in...<mailto:js...@in...>> wrote:

Other databases do have high concurrency. I don't know about XML databases though. I've worked with one in particular that is commercial. It has a collection-based locking model, and it throws deadlock errors occasionally. When this happens, you rollback and start over.

I think I just had an allergic reaction to a pronoun :-). Let's just take the pronoun out of that paragraph and be clear about what we're talking about here. In particular, what is "you"?

In order to be able to start over, "you" must be able to roll back to a point that is well-defined. This generally requires transaction boundaries to be controllable from "you"; if not, "you" cannot determine the point from which to start over. In general, the only sensible "you" for this purpose is the calling application.

Does eXist have transaction boundaries that are controllable from the calling application? If not, what other values of "you" might make sense for this approach?

- Peter |
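The catch-rollback-restart pattern described above could be wrapped along these lines. DeadlockException, runWithRetry and the retry limit are assumptions made for illustration, not eXist's actual exception types or API.

```java
import java.util.function.Supplier;

// Sketch: the code that owns the transaction catches the deadlock error
// (the losing thread's lock requests were aborted and rolled back) and
// simply redoes the whole operation from the beginning.
class TransactionRunner {
    static class DeadlockException extends RuntimeException {}

    /** Runs the transaction, retrying on deadlock up to maxAttempts times. */
    public static <T> T runWithRetry(Supplier<T> transaction, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();   // begin .. work .. commit inside
            } catch (DeadlockException e) {
                last = e;                   // rolled back: yield and restart
            }
        }
        throw last;                          // gave up after maxAttempts
    }
}
```

This only works if, as Peter notes, there is a well-defined point to roll back to, i.e. transaction boundaries the caller controls.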
From: Wolfgang M. <wol...@ex...> - 2010-09-02 12:41:04
|
> Does eXist have transaction boundaries that are controllable from the
> calling application? If not, what other values of "you" might make sense
> for this approach?

If we decouple resources from the collection, it would actually become much easier to make transactions user-controllable and support full rollbacks. The redo/undo methods are already there, but due to how indexes are designed, we cannot use them except for recovery. The proposed redesign would make this all much simpler.

Wolfgang |
From: Jason S. <js...@in...> - 2010-09-04 02:40:17
|
We talked on Thursday about a new API that would be more powerful than existing APIs and provide a wrapper for the internal API.

After thinking about this some more, I'd like to consider the XMLDB-API model. There is a lot that is right about the XMLDB wrapper. It has pretty good collection and resource facilities, and it can be used to access large data sets in a "sequence"-like manner.

I think where XMLDB falls down is around standardization. The XMLDB standard was designed in the days when an XML database could do XPath 1.0. It can't fully support the XQuery standard without significant changes. It can't be extended to do some of the things that eXist does specifically. XMLDB as a standard seems kind of dead at this point anyway.

Just one example - this email is getting too long... I can't use XMLDB to return a Sequence of xs:dateTime, which I then pass as a parameter to another XQuery. XMLDB only deals with string output, and the set of input types is very limited.

So right now, I am thinking about an API that would aspire to replace the XMLDB API in eXist. I think it should also hide (wrap) the current locking and transaction mechanisms (since these are bound to change in the next year or two). It would not be based on any standard, but would be designed to provide maximally simple and powerful access to as much of the internal API as possible, and to be easy to extend as new features become available.

If we do it right, we could also make this a cross-platform API (supporting languages other than Java), kind of like the REST API today, but targeted to language APIs.

Does this sound like a good path to start down? Different ideas?

BTW, I know there is a JSR that covers an XML database API somewhere out there, or one was in work... I looked at it a long time ago, and it did not seem all that applicable. Is there any interest in that API, and if so, can someone point me to a link? I seem to be having trouble finding it again.

Thanks!

-Jason S. |
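A typed replacement for XMLDB's string-only results might have a shape like the following. Every name here (TypedDatabase, query, FakeDatabase) is hypothetical, invented purely to illustrate the point about returning a sequence of xs:dateTime as real values rather than strings; it is not a proposal for concrete eXist classes.

```java
import java.time.OffsetDateTime;
import java.util.List;

// Hypothetical API shape: a query returns a typed sequence whose items
// can be passed straight into another query, instead of strings.
interface TypedDatabase {
    <T> List<T> query(String xquery, Class<T> itemType);
}

// A stand-in implementation that "evaluates" a canned query, just to
// show how the typed result would be consumed.
class FakeDatabase implements TypedDatabase {
    @Override
    @SuppressWarnings("unchecked")
    public <T> List<T> query(String xquery, Class<T> itemType) {
        if (itemType == OffsetDateTime.class) {
            // xs:dateTime comes back as a date-time value, not a string
            return (List<T>) List.of(OffsetDateTime.parse("2010-09-04T02:40:17Z"));
        }
        return List.of();
    }
}
```

The same interface could be given a remote implementation, which is the remote/local point raised later in the thread.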
From: Wolfgang M. <wol...@ex...> - 2010-09-02 12:43:31
|
> Is there not a whole class of lock free concurrent algorithms though?

As I said many times before: you cannot see locking in isolation. It is tightly connected with transaction management, caching, log sequence numbers, checkpoints and other concepts. Any change to locking has to take all those other aspects into account.

Wolfgang |
From: Jason S. <js...@in...> - 2010-09-02 13:12:39
|
Adam,

Wolfgang is correct. The locking mechanism needed (useful) is tied to the database design. In some designs, you don't need locks at all. The current eXist design allows live writes (mutation), so you need locking to limit access. The actual locking mechanism, however, could change significantly based on the underlying database implementation.

-----Original Message-----
From: Wolfgang Meier [mailto:wol...@ex...]
Sent: Thursday, September 02, 2010 6:43 AM
To: Adam Retter
Cc: Jason Smith; Paul Ryan; eXist development; Michael J. Pelikan; Todd Gochenour
Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

> Is there not a whole class of lock free concurrent algorithms though?

As I said many times before: you cannot see locking in isolation. It is tightly connected with transaction management, caching, log sequence numbers, checkpoints and other concepts. Any change to locking has to take all those other aspects into account.

Wolfgang |
From: Jason S. <js...@in...> - 2010-09-07 16:00:20
|
I was thinking about remote/local too. That is certainly something that could be done with just a little planning.

Where is the best documentation on the Fluent API? I guess I should take a look. I didn't use it originally because when I started this, Fluent was in its infancy... :-)

-----Original Message-----
From: Eugene Marcotte [mailto:eu...@em...]
Sent: Tuesday, September 07, 2010 4:17 AM
To: Jason Smith
Subject: Re: [Exist-development] Moving towards a new API

Hi,

JSR 225 was the "XQuery API for Java (XQJ)" project.
http://www.jcp.org/en/jsr/detail?id=225

I looked at it a couple months ago and thought it was mostly too much like JDBC to be a useful XMLDB-API replacement.

I really do like the XMLDB-API's approach, especially because it works both remote and local. If it was a tad more up-to-date -- e.g. more like the fluent API -- it'd be even better!

Eugene (an exist-devel lurker)

On 9/3/2010 10:38 PM, Jason Smith wrote:
> We talked on Thursday about a new API that would be more powerful than existing APIs and provide a wrapper for the internal API.
>
> After thinking about this some more, I'd like to consider the XMLDB-API model. There is a lot that is right about the XMLDB wrapper. It has pretty good collection and resource facilities, and it can be used to access large data sets in a "sequence"-like manner.
>
> I think where XMLDB falls down is around standardization. The XMLDB standard was designed in the days when an XML database could do XPath 1.0. It can't fully support the XQuery standard without significant changes. It can't be extended to do some of the things that eXist does specifically. XMLDB as a standard seems kind of dead at this point anyway.
>
> Just one example - this email is getting too long... I can't use XMLDB to return a Sequence of xs:dateTime, which I then pass as a parameter to another XQuery. XMLDB only deals with string output, and the set of input types is very limited. 
> > > > So right now, I am thinking about an API that would aspire to replace the XMLDB API in Exist. I think it should also hide (wrap) the current locking and transaction mechanisms (since these are bound to change in the next year or two). > > It would not be based on any standard, but would be designed to provide maximally simple and powerful access to as much of the internal API as possible, and to be easy to extend as new features become available. > > If we do it right, we could also make this a cross-platform API (supporting languages other than Java), kind of like the REST API today, but targeted to language APIs. > > Does this sound like a good path to start down? Different ideas? > > BTW, I know there is a JSR that covers an XML database API somewhere out there, or one was in work... I looked at it a long time ago, and it did not seem all that applicable. Is there any interest in that API, and if so, can someone point me to a link? I seem to be having trouble finding it again. > > Thanks! > > -Jason S. > > > ------------------------------------------------------------------------------ > This SF.net Dev2Dev email is sponsored by: > > Show off your parallel programming skills. > Enter the Intel(R) Threading Challenge 2010. > http://p.sf.net/sfu/intel-thread-sfd > _______________________________________________ > Exist-development mailing list > Exi...@li... > https://lists.sourceforge.net/lists/listinfo/exist-development |
From: Adam R. <ad...@ex...> - 2010-09-13 22:17:21
|
On 7 September 2010 17:00, Jason Smith <js...@in...> wrote:

> I was thinking about remote/local too. That is certainly something that could be done with just a little planning.
>
> Where is the best documentation on the Fluent API? I guess I should take a look. I didn't use it originally because when I started this, Fluent was in its infancy... :-)
>
> -----Original Message-----
> From: Eugene Marcotte [mailto:eu...@em...]
> Sent: Tuesday, September 07, 2010 4:17 AM
> To: Jason Smith
> Subject: Re: [Exist-development] Moving towards a new API
>
> JSR 225 was the "XQuery API for Java (XQJ)" project.
> http://www.jcp.org/en/jsr/detail?id=225
>
> I looked at it a couple months ago and thought it was mostly too much
> like JDBC to be a useful XMLDB-API replacement.

Likewise, I agree that it is heavily JDBC influenced; however, there is someone I am aware of working on an implementation of this for eXist-db.

> I really do like the XMLDB-API's approach, especially because it works
> both remote and local. If it was a tad more up-to-date -- e.g. more like
> the fluent API it'd be even better!

I was never really a fan of XMLDB-API. I do really like the Fluent API, but it is local only, so you would need to abstract it and add a remote implementation. If you are interested in the Fluent API then you should contact Piotr Kaminski, the original author - pi...@id....

Whatever you decide, you should certainly consider the aspect of streaming in any API that you enhance or develop. This is important when you have large datasets or long running queries.

> Eugene (an exist-devel lurker)
>
> On 9/3/2010 10:38 PM, Jason Smith wrote:
>> We talked on Thursday about a new API that would be more powerful than existing APIs and provide a wrapper for the internal API.
>>
>> After thinking about this some more, I'd like to consider the XMLDB-API model. There is a lot that is right about the XMLDB wrapper. 
It has pretty good collection and resource facilities, and it can be used to access large data sets in a "sequence"-like manner. >> >> I think where XMLDB falls down is around standardization. The XMLDB standard was designed in the days when an XML database could do XPath 1.0. It can't fully support the XQuery standard without significant changes. It can't be extended to do some of the things that eXist does specifically. XMLDB as a standard seems kind of dead at this point anyway. >> >> Just one example - this email is getting too long... I can't use XMLDB to return a Sequence of xs:dateTime, which I then pass as a parameter to another XQuery. XMLDB only deals with string output, and the set of input types is very limited. >> >> [rest of quoted message and list footers snipped] -- Adam Retter eXist Developer { United Kingdom } ad...@ex... irc://irc.freenode.net/existdb |
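[Editor's note: since the thread keeps circling around what a fluent, streaming-capable, typed replacement for the XMLDB API might feel like, here is a minimal sketch. Every name in it (FluentSketch, QueryResult, result, years) is a hypothetical illustration, not eXist's actual Fluent API by Piotr Kaminski; the point is typed results that a caller can consume item by item and feed into another query, instead of string-only output.]

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical fluent query-result API (illustration only; not the
// actual eXist Fluent API).
public final class FluentSketch {

    // Results are exposed as Iterable so large sequences can be consumed
    // one item at a time rather than forced into a single string.
    public interface QueryResult<T> extends Iterable<T> {
        <R> QueryResult<R> map(Function<T, R> f);
        List<T> toList();
    }

    static <T> QueryResult<T> result(List<T> items) {
        return new QueryResult<T>() {
            public Iterator<T> iterator() { return items.iterator(); }
            public <R> QueryResult<R> map(Function<T, R> f) {
                List<R> out = new ArrayList<>();
                for (T t : items) out.add(f.apply(t));
                return result(out);
            }
            public List<T> toList() { return new ArrayList<>(items); }
        };
    }

    // Example: a typed result (java.time.LocalDateTime standing in for
    // xs:dateTime) transformed and handed on, with no string round-trip.
    public static List<Integer> years(List<java.time.LocalDateTime> dates) {
        return result(dates).map(java.time.LocalDateTime::getYear).toList();
    }
}
```

The design choice worth noting: because `QueryResult` is an `Iterable`, a remote implementation could fetch items in batches behind the iterator, which is the streaming aspect Adam flags as important for large datasets.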
From: Jason S. <js...@in...> - 2010-09-14 14:50:38
|
> Adam Retter wrote: > I was never really a fan of XMLDB-API. I found it to have severe limitations and be generally cumbersome to use. However, some aspects I liked, particularly the ability to use it remotely. I think XMLDB was hampered by trying to adhere to a standard. Standards always *seem* like a good idea... A good solid flexible API would be more useful though. I'd rather have a non-standard API that works than a standard API that does not work across multiple databases. ;-) > Whatever you decide you should certainly consider the aspect of > streaming in any API that you enhance or develop. This is important > when you have large datasets or long running queries. Not really my decision. :-) But yes, that is one of the drawbacks of the XMLDB API. They didn't get streaming right, but close. If Fluent is far enough along, and it's a good abstraction, then maybe that's a better solution. I'll do some reading/searching on it when I get the chance. Super busy right now, so it will be a few days. :-) |
From: Wolfgang M. <wol...@ex...> - 2010-08-28 16:49:23
|
> You ignored most of what I wrote in my previous emails, which I find a > bit unfriendly: if your query is formulated in the right way and you > have the proper indexes in place, the query engine SHOULD NOT access > dom.dbx AT ALL! I don't think dom.dbx is the bottleneck > - it's the QUERY. Just re-read your last email and I see you did test an optimized query in addition to the slow query. But my point is that you should first try to optimize all queries and only test those. Wolfgang |
From: Wolfgang M. <wol...@ex...> - 2010-08-28 17:06:45
|
> Instead, eXist has artificially limited access to dom.dbx to a single thread (at a time). The assumption is that - during a query - dom.dbx is only read at serialization time and only to read out a sequence of pages to display the final query result to the user. It's a complex interplay between cache manager, transaction log and other components. I agree there could be ways to allow concurrent read access to dom.dbx at the same time, but we would need to carefully discuss the implications. Wolfgang |
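[Editor's note: the "single thread at a time" constraint on dom.dbx described above can be modelled as a single-permit semaphore around every page read. This is a toy model under invented names (SingleThreadGate, readPage), not eXist's actual DOMFile code, which as Wolfgang says also involves the cache manager and transaction log.]

```java
import java.util.concurrent.Semaphore;

// Simplified model of serializing all access to a shared store through
// one permit, as described for dom.dbx (illustration only).
public final class SingleThreadGate {
    private final Semaphore permit = new Semaphore(1, true); // fair FIFO queue
    private int pagesRead = 0;

    // Even pure readers queue behind each other: only one thread may be
    // inside readPage() at any moment. This is the concurrency ceiling
    // the thread is discussing.
    public int readPage(int pageNo) {
        permit.acquireUninterruptibly();
        try {
            pagesRead++;
            return pageNo; // stand-in for returning the page contents
        } finally {
            permit.release();
        }
    }

    public int pagesRead() { return pagesRead; }
}
```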
From: Dmitriy S. <sha...@gm...> - 2010-08-29 06:55:19
Attachments:
smime.p7s
|
On Sat, 2010-08-28 at 19:06 +0200, Wolfgang Meier wrote: > > Instead, eXist has artificially limited access to dom.dbx to a single thread (at a time). > > The assumption is that - during a query - dom.dbx is only read at > serialization time and only to read out a sequence of pages to display > the final query result to the user. > > It's a complex interplay between cache manager, transaction log and > other components. I agree there could be ways to allow concurrent read > access to dom.dbx at the same time, but we would need to carefully > discuss the implications. Would a 'normal' lock mechanism be suitable here? Or are there restrictions that prevent using one? -- Cheers, Dmitriy Shabanov |
From: Wolfgang M. <wol...@ex...> - 2010-08-29 09:33:38
|
> Would a 'normal' lock mechanism be suitable here? Or are there > restrictions that prevent using one? I'm not yet sure of the consequences. I do believe, without further exploration, that we could switch to a multi-read/exclusive-write lock mechanism in some places, though this would require some changes to the cache management (which could - in turn - result in new locks being introduced ;-). The goal, let me repeat, would be to speed up non index-assisted, non-optimized access to the DOM. Index-assisted access itself is pretty fast and does allow for good concurrency. But we have to be very careful here since the architecture is complex: you have to consider transactional integrity, journalling, caching and other aspects. If we change anything, we have to proceed carefully and in very small steps. Stability is always my top priority. Wolfgang |
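[Editor's note: a multi-read/exclusive-write lock of the kind Wolfgang describes maps directly onto java.util.concurrent.locks.ReentrantReadWriteLock. A minimal sketch follows; the class and method names are invented for illustration, not taken from eXist.]

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a multi-reader/exclusive-writer gate, as contrasted with a
// single-thread gate (hypothetical; not eXist's actual code).
public final class ReadWriteGate {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private int value = 0;

    // Any number of readers may hold the read lock at the same time,
    // which is where the concurrency win over a single mutex comes from.
    public int read() {
        rw.readLock().lock();
        try { return value; } finally { rw.readLock().unlock(); }
    }

    // Writers are exclusive: a writer waits for all readers to drain and
    // blocks new readers until it releases.
    public void write(int v) {
        rw.writeLock().lock();
        try { value = v; } finally { rw.writeLock().unlock(); }
    }
}
```

Note the caveat from the thread still applies: swapping the gate is the easy part; the interplay with cache management, journalling, and transactional integrity is where new locks (and new deadlock opportunities) can creep in.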
From: Dmitriy S. <sha...@gm...> - 2010-08-29 13:48:17
Attachments:
smime.p7s
|
On Sun, 2010-08-29 at 11:33 +0200, Wolfgang Meier wrote: > > Would a 'normal' lock mechanism be suitable here? Or are there > > restrictions that prevent using one? > > I'm not yet sure of the consequences. [snip] I was wondering: "are there restrictions?" So, the answer "no" is good here ;-) -- Cheers, Dmitriy Shabanov |
From: Jason S. <js...@in...> - 2010-08-30 16:20:16
|
> Anyway, as I tried to explain a few weeks back, the locking is not > fail-safe. You can produce deadlocks if you do not follow certain > conventions (which are hard to know since they are not documented). In > particular, acquiring a lock across multiple collections can cause a > deadlock. > In this case, eXist-internal code acquires a lock on the global > collection cache, which is a singleton, before acquiring the lock on > more than one collection. This is safe, though it puts the db into > single-task mode (but it happens only in one or two special cases > anyway). If I remember well, I sent you a junit test to demonstrate > this. Did you check your code if it does try to work across multiple > collections? I think it does. If so, please try my fix. I understand. The solution we have in place right now is similar to the solution you mentioned, but we put it in place a while ago. Augmenting the locking with a singleton lock does, indeed, work. That was my first replacement of the locking mechanism. The second replacement I've come up with allows two read queries to run simultaneously, even when they target the same collection, and when multiple collections are used simultaneously. And it enforces atomic read and write operations, something the current locking does not appear to do as well (I am still researching this one). It isn't as concurrent as I would like (due to dom.dbx locking), and we still need to put it through its paces to test for stability. -----Original Message----- From: Wolfgang Meier [mailto:wol...@ex...] Sent: Sunday, August 29, 2010 3:34 AM To: Dmitriy Shabanov Cc: Jason Smith; eXist development Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries. [original message snipped] |
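[Editor's note: the singleton-lock workaround discussed above trades concurrency for safety. An alternative that keeps per-collection locks is the classic deadlock-avoidance rule of always acquiring locks in a fixed global order. A sketch, using a hypothetical Collection stand-in rather than eXist's real class:]

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Deadlock avoidance by global lock ordering (illustration only;
// 'Collection' here is a stand-in, not eXist's collection class).
public final class OrderedLocking {
    public static final class Collection {
        final int id;                          // stable, unique ordering key
        final ReentrantLock lock = new ReentrantLock();
        Collection(int id) { this.id = id; }
    }

    // Acquire every lock in ascending id order, run the action, then
    // release in reverse order. Two threads locking {a, b} and {b, a}
    // both take the lower id first, so neither can hold a lock the other
    // is waiting for - the circular-wait condition never arises.
    public static void withLocks(Runnable action, Collection... cols) {
        Collection[] sorted = cols.clone();
        Arrays.sort(sorted, (x, y) -> Integer.compare(x.id, y.id));
        for (Collection c : sorted) c.lock.lock();
        try {
            action.run();
        } finally {
            for (int i = sorted.length - 1; i >= 0; i--) sorted[i].lock.unlock();
        }
    }
}
```

This is only viable when all the locks needed by an operation are known up front; if a query discovers new collections mid-flight, the ordering guarantee breaks, which is presumably why the global-singleton approach was used instead.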
From: James F. <jam...@ex...> - 2010-08-30 16:29:50
|
On 30 August 2010 18:19, Jason Smith <js...@in...> wrote: > I understand. The solution we have in place right now is similar to the solution you mentioned, but we put it in place a while ago. Augmenting the locking with a singleton lock does, indeed, work. > > That was my first replacement of the locking mechanism. > > The second replacement I've come up with allows two read queries to run simultaneously, even when they target the same collection, and when multiple collections are used simultaneously. And it enforces atomic read and write operations, something the current locking does not appear to do as well (I am still researching this one). > > It isn't as concurrent as I would like (due to dom.dbx locking), and we still need to put it through its paces to test for stability. Guys, I have followed the thread and (as you would expect) mostly agree with Wolfgang's thoughts, and I definitely welcome any and all contributions, but we have to be careful to revisit all the original assumptions. I am not saying it's a bad goal to try to make non index-assisted, non-optimized access to the DOM more performant, but indexes are so important in a database that I think we sometimes ignore corner cases. I would be interested to see what knock-on impact this modification has on general performance; we have to be careful with respect to the stability of the codebase. Can I suggest an IRC chat or Skype call ... email sucks for resolving deeper levels of tech detail. James Fuller |