Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Wolfgang M. <wol...@ex...> - 2010-08-30 16:37:54

> I understand.  The solution we have in place right now is similar to the solution you
> mentioned, but we put it in place a while ago.  Augmenting the locking with a singleton
> lock does, indeed, work.

Internally, eXist does need the singleton lock in rare cases only,
mainly when reading or storing the collection configuration document
for a collection, or when locking documents for an XQuery update
expression.

Otherwise, eXist just avoids locking multiple collections at once
wherever possible as it is known to be an expensive operation and
limits concurrency.

> The second replacement I've come up with allows two read queries to run
> simultaneously, even when they target the same collection, and when multiple collections
> are used simultaneously.

As I said before, I welcome any exploration in this area. As James
just suggested, we may want to have a skype telecon on this to discuss
the possibilities and dangers.

Finally, just as a note to other users who code against the internal
API: Dannes' WebDAV reimplementation shows some clean examples of how
to use internals:

http://exist.svn.sourceforge.net/viewvc/exist/branches/dizzzz/trunk-webdav-upgrade/extensions/webdav/src/org/exist/webdav/

Wolfgang

Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Adam R. <ad...@ex...> - 2010-08-31 13:39:09

> Otherwise, eXist just avoids locking multiple collections at once
> wherever possible as it is known to be an expensive operation and
> limits concurrency.

We have discussed several times replacing eXist-db's current
collection mechanism with a virtualised implementation where
Collections are just another number in the system. This was discussed
for the purposes of performance when large collections are involved.

Would this simplify the overall problem domain? If so, perhaps this
work should be undertaken before a redesign of the locking system?

>> The second replacement I've come up with allows two read queries to run
>> simultaneously, even when they target the same collection, and when multiple collections
>> are used simultaneously.
>
> As I said before, I welcome any exploration in this area. As James
> just suggested, we may want to have a skype telecon on this to discuss
> the possibilities and dangers.

Skype teleconference would be good :-)

-- 
Adam Retter

eXist Developer
{ United Kingdom }
ad...@ex...
irc://irc.freenode.net/existdb

Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Jason S. <js...@in...> - 2010-08-31 21:23:32

I think we need to talk about how much granularity is actually valuable.  For example, if you have a single dom.dbx file, can you write to that from multiple threads at the same time safely?  If you can't, then it doesn't make much sense to write-lock at the collection level, right?

Also, if you have a maximally granular locking mechanism, allowing locks on collections and resources, both deep locks and shallow locks, with multiple readers and a single writer allowed on each, the deadlock detection gets really complex.  Performance on deadlock detection can blow up.
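
One common way to frame deadlock detection is a wait-for graph: each thread that is blocked points at the thread(s) holding what it wants, and a cycle means deadlock. The sketch below is a generic illustration of that idea, not eXist's detector; `WaitForGraph`, `addWait` and `hasDeadlock` are hypothetical names. With multiple readers per lock, a single waiter can point at several holders, which is part of why the graph grows quickly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wait-for graph; thread names are plain strings for illustration.
class WaitForGraph {
    // waiter -> set of holders it is blocked on
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    void addWait(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // A deadlock exists if and only if the graph contains a cycle.
    boolean hasDeadlock() {
        Set<String> done = new HashSet<>();
        for (String start : waitsFor.keySet()) {
            if (onCycle(start, new HashSet<>(), done)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search; 'path' holds the nodes on the current search path,
    // 'done' holds nodes already proven not to lead into a cycle.
    private boolean onCycle(String node, Set<String> path, Set<String> done) {
        if (path.contains(node)) {
            return true;
        }
        if (done.contains(node)) {
            return false;
        }
        path.add(node);
        for (String next : waitsFor.getOrDefault(node, Collections.emptySet())) {
            if (onCycle(next, path, done)) {
                return true;
            }
        }
        path.remove(node);
        done.add(node);
        return false;
    }
}
```

Each check walks every edge at most once, but it has to be repeated as lock requests arrive, so the cost grows with the number of lockable resources and lock modes.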

Still, it seems like it would be nice to write to one document while querying against another document in the same collection...

How the collections are implemented underneath isn't that important to the locking mechanism, other than that it affects how much concurrency you can actually take advantage of.

I think, though, that you **need to have the locking mechanism in place pretty early on.**  If you design a wonderfully concurrent back end, but you control access through a global mutex, well, what have you got?  :-)

Plus, in any non-trivial locking mechanism, there will be deadlocks and deadlock detection-and-recovery.  And the software, as a whole, has to use the standards for detection-and-recovery if you want to be able to take advantage of the more concurrent locking. 

And this all has to be done in a system that currently does not support true transactional rollback (I think).  Which means there are some additional rules when it comes to write locking...

Too much information.  I'll write something up for Thursday, and hopefully all this stuff will become more clear.  This is not an easy topic for anyone, including myself!



Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Wolfgang M. <wol...@ex...> - 2010-08-31 22:04:28

> For example, if you have a single dom.dbx file, can you write to that from multiple threads at the same time safely?  If you can't, then it doesn't make much sense to write-lock at the collection level, right?

The bigger picture is much more complex. You have to talk about
transactions, recovery, caching and more. You do not write to dom.dbx
directly. You write to the page cache. And there's more than just
dom.dbx. Writing the actual data just takes a small part of the
overall indexing time. More time is spent with indexing, maintaining
the transaction log etc.

From my point of view, the next step in any redesign effort should be
to remove the collection locks entirely. We have discussed this
before. It will greatly simplify the locking and transaction log. My
roadmap roughly looks like this:

1) remove collection dependency from core indexes:
1a) structural index, DONE
1b) range index, IN PROGRESS
1c) remove document metadata from collection store and keep it
separately. a collection is just a sequence of (arbitrary) document
ids. a document can be linked to more than one collection.
2) drop all collection locks, except for the case where the collection
metadata itself is modified

Those steps have to be completed before we address other things. We
need to simplify the architecture first, then try to do further
redesigns. Any help will be welcome. As an added value, 1a and b will
improve update/write performance in general.
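
Item 1c of the roadmap above can be pictured with a small data-structure sketch. This is only an illustration under assumed names (`DocumentStore`, `putDocument`, `link`, `list` are hypothetical, and metadata is reduced to a string); it is not how eXist stores collections. The point is that document metadata lives in its own map, while a collection holds nothing but document ids, so the same document can be linked from several collections.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "a collection is just a sequence of document ids".
class DocumentStore {
    // Document metadata is kept separately, keyed by document id.
    private final Map<Integer, String> metadataById = new HashMap<>();
    // A collection path maps to an ordered list of document ids and nothing else.
    private final Map<String, List<Integer>> collections = new HashMap<>();

    void putDocument(int docId, String metadata) {
        metadataById.put(docId, metadata);
    }

    // Linking only records the id; the document itself is not copied or moved.
    void link(String collectionPath, int docId) {
        if (!metadataById.containsKey(docId)) {
            throw new IllegalArgumentException("unknown document id " + docId);
        }
        collections.computeIfAbsent(collectionPath, k -> new ArrayList<>()).add(docId);
    }

    // Resolves the ids of a collection to metadata, in insertion order.
    List<String> list(String collectionPath) {
        List<String> result = new ArrayList<>();
        for (int id : collections.getOrDefault(collectionPath, Collections.emptyList())) {
            result.add(metadataById.get(id));
        }
        return result;
    }
}
```

In this model, changing a document's metadata touches only `metadataById`, which is the property that would let most collection locks go away.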

> Still, it seems like it would be nice to write to one document while querying against another document in the same collection...

Normally, eXist will acquire a lock on the collection, acquire one on
the document, release the collection lock, continue parsing the
document. In some cases (node updates), the transaction handling has
forced us to keep the lock on the collection longer than desired. But
this can be changed (see above).
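
The normal sequence described above (lock the collection, lock the document, release the collection, keep working on the document) is a form of lock coupling. A minimal sketch with plain JDK `ReentrantLock`s follows; `LockCouplingSketch` and `storeDocument` are hypothetical names and the real eXist lock classes are not used here.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of the collection-then-document lock hand-over.
class LockCouplingSketch {
    final ReentrantLock collectionLock = new ReentrantLock();
    final ReentrantLock documentLock = new ReentrantLock();

    // Runs 'parse' while holding only the document lock.
    void storeDocument(Runnable parse) {
        collectionLock.lock();
        try {
            // The document lock is taken while the collection is still locked,
            // so no other thread can slip in between the two steps.
            documentLock.lock();
        } finally {
            // The collection is released before the long-running work starts.
            collectionLock.unlock();
        }
        try {
            parse.run();
        } finally {
            documentLock.unlock();
        }
    }
}
```

While `parse` runs, other threads are free to lock the collection and work on other documents in it.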

> And this all has to be done in a system that currently does not support true transactional rollback (I think).

eXist maintains a transaction log and does redo/undo on recovery. The
only limitation is that the transaction log is incomplete, i.e. it
does not cover any secondary indexes. Still transactional integrity
has to be preserved and puts further requirements on the locking (this
is the reason why collection locks are often not released early).
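
For readers unfamiliar with redo/undo recovery, here is a generic textbook-style sketch: replay every logged change, then roll back the changes of transactions that never committed. It is not eXist's journal format. `RecoverySketch`, `Entry` and `recover` are hypothetical names, and the sketch assumes recovery starts from an empty state rather than from a checkpoint.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical redo/undo recovery over an in-memory key/value state.
class RecoverySketch {
    // An update (key, before, after), or a commit marker when key is null.
    static final class Entry {
        final int txn;
        final String key;
        final String before;
        final String after;

        Entry(int txn, String key, String before, String after) {
            this.txn = txn;
            this.key = key;
            this.before = before;
            this.after = after;
        }

        static Entry commit(int txn) {
            return new Entry(txn, null, null, null);
        }

        boolean isCommit() {
            return key == null;
        }
    }

    static Map<String, String> recover(List<Entry> log) {
        Map<String, String> state = new HashMap<>();
        Set<Integer> committed = new HashSet<>();
        // Redo pass: replay every update in log order and note the commits.
        for (Entry e : log) {
            if (e.isCommit()) {
                committed.add(e.txn);
            } else {
                apply(state, e.key, e.after);
            }
        }
        // Undo pass: walk backwards and restore the before-image of every
        // update that belongs to a transaction without a commit marker.
        for (int i = log.size() - 1; i >= 0; i--) {
            Entry e = log.get(i);
            if (!e.isCommit() && !committed.contains(e.txn)) {
                apply(state, e.key, e.before);
            }
        }
        return state;
    }

    // A null value means "the key did not exist".
    private static void apply(Map<String, String> state, String key, String value) {
        if (value == null) {
            state.remove(key);
        } else {
            state.put(key, value);
        }
    }
}
```

Anything the log does not cover, such as the secondary indexes mentioned above, cannot be repaired this way, which is why the locking has to protect it instead.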

Well, I will stop writing emails now and better explain everything on Thursday.

Wolfgang

Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Jason S. <js...@in...> - 2010-08-31 22:55:20

Quick OS Survey:

What operating systems is everyone using?  For purposes of finding some common white-boarding app.  I'm good with Windows and/or Linux.  Is anyone planning to attend Linux-only?


Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

From: Jason S. <js...@in...> - 2010-08-30 16:09:39

If you are referring to org.exist.storage.lock.ReentrantReadWriteLock, which serves as both the collection locking mechanism and the one used by "dom.dbx", "collections.dbx", etc., the problem is that this lock is a mutex.  The name says ReentrantReadWriteLock, but the implementation uses a single mutex for both reads and writes.

The ideal would be to allow multiple readers and a single writer to any resource at any time.  The standard locking mechanism, when used with "dom.dbx", allows only one reader or writer at any time.  For long, unoptimized read queries, this results in a choke point on dom.dbx that looks to me like it slows down even optimized queries.
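
To make the multiple-readers/single-writer behaviour concrete, here is a small sketch using the JDK's `java.util.concurrent.locks.ReentrantReadWriteLock` (not the eXist class of the same name). `SharedResource` and its methods are hypothetical names standing in for something like dom.dbx. While one thread holds the read lock, a second thread can still obtain a read lock but cannot obtain the write lock.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a shared resource guarded by a read/write lock.
class SharedResource {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void holdRead() {
        lock.readLock().lock();
    }

    void releaseRead() {
        lock.readLock().unlock();
    }

    boolean anotherThreadCanRead() {
        return tryFromOtherThread(lock.readLock());
    }

    boolean anotherThreadCanWrite() {
        return tryFromOtherThread(lock.writeLock());
    }

    // Starts a fresh thread that attempts a non-blocking tryLock on 'target'
    // and reports whether the lock was granted.
    private static boolean tryFromOtherThread(Lock target) {
        boolean[] granted = new boolean[1];
        Thread t = new Thread(() -> {
            granted[0] = target.tryLock();
            if (granted[0]) {
                target.unlock();
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return granted[0];
    }
}
```

After `holdRead()`, `anotherThreadCanRead()` is true and `anotherThreadCanWrite()` is false. With a plain mutex, both would be false, which is the choke point described above.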

I hope I answered the right question...  :-)

-----Original Message-----
From: Dmitriy Shabanov [mailto:sha...@gm...] 
Sent: Sunday, August 29, 2010 12:56 AM
To: Wolfgang Meier
Cc: Jason Smith; eXist development
Subject: Re: [Exist-development] [Exist-open] Performance of concurrent read queries.

On Sat, 2010-08-28 at 19:06 +0200, Wolfgang Meier wrote:
> > Instead, eXist has artificially limited access to dom.dbx to a single thread (at a time).
> 
> The assumption is that - during a query - dom.dbx is only read at
> serialization time and only to read out a sequence of pages to display
> the final query result to the user.
> 
> It's a complex interplay between cache manager, transaction log and
> other components. I agree there could be ways to allow concurrent read
> access to dom.dbx at the same time, but we would need to carefully
> discuss the implications.

Would a 'normal' lock mechanism be suitable here? Or are there any
restrictions that prevent its use?

-- 
Cheers,

Dmitriy Shabanov
