From: Kevin D. <ke...@tr...> - 2006-01-20 00:52:45
|
Hi folks - I've done some noodling on things, and I think I finally "get" the BFIM strategy for long transactions. I've also started coming up with some thoughts on how to make this actually work for jdbm, including fine-grained locking semantics.

Here's what I'm thinking (much of this is similar to DBCache, but with the addition of lower-than-page locking):

First, we have to throw out a few of the posits of the DBCache system - namely, the concept that there are only two versions of a given page in cache. If multiple txs are going to operate concurrently, then each transaction must have a copy of a given page that contains only the changes that tx has made since its inception.

I am going to suggest that what we really need is a 'current' version of each page in cache, plus transaction-specific versions of that page for all active transactions.

Next, we need a way of determining low-level locking regions to ensure compatible updates. To that end, we need what I'm calling a delta mask for each version of each page. The 'current' version of a page will have a master delta mask that tracks the updated bytes of all active transactions. Each transaction's version of a page will have a tx delta mask that tracks the updated bytes of other transactions that have worked on the page since the inception of the tx under consideration (this is a little tricky, but I think it works).

I am putting together some transaction usage cases in Excel that will be very helpful for explaining what I'm talking about - are you both able to open and view XLS files? If so, I'll put something together with the full strategy and get it out for your review.

The long and short of it is that I'm pretty sure I have an algorithm that will allow byte-level locking, with high performance for small transactions and support for long transactions (using a modified BFIM strategy).

If this works the way I think it should, it will give us the low-level tools for supporting multiple transactions with high concurrency, all while avoiding the non-deterministic locking problems that arise with full page locking.

Let's make sure you don't see any issues with the algorithm I describe (tomorrow - too darned tired now), then we can start a discussion about what data structures need to be in place in the translation pages and free row lists to support high concurrency.

Cheerio!

- K

Kevin Day
Trumpet, Inc.
www.trumpetinc.com
ke...@tr...
602-438-7030 |
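[A minimal sketch of the delta-mask idea Kevin describes, assuming a BitSet-backed mask where each bit marks one updated byte within a page; the class and method names are hypothetical, not jdbm API.]

```java
import java.util.BitSet;

// Hypothetical sketch: one delta mask per page version. Each set bit marks
// an updated byte. A write conflicts if it touches a byte already masked.
class PageDeltaMask {
    private final BitSet mask;

    PageDeltaMask(int pageSize) {
        this.mask = new BitSet(pageSize);
    }

    // Record that bytes [offset, offset + length) were updated.
    void markUpdated(int offset, int length) {
        mask.set(offset, offset + length);
    }

    // True if the given byte range overlaps any already-masked byte.
    boolean conflictsWith(int offset, int length) {
        int next = mask.nextSetBit(offset);
        return next >= 0 && next < offset + length;
    }
}
```

In this scheme, a write by a transaction would first be checked against that transaction's tx delta mask (which accumulates the updates of other transactions); on success, the same byte range would be marked in the master delta mask on the 'current' page.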
From: Alex B. <boi...@in...> - 2006-01-20 15:37:06
|
Kevin,

I think your algorithm would work, but it would not account for changes that have ripple effects, such as rebalancing a BTree, or updating an object with a larger size. Both of those cases would ultimately 'lock' bytes beyond the original 'row', and therefore you would still end up with a non-deterministic case of conflicts between transactions.

I am more in favor of splitting the physical record and logical object management, such that transactions operate at the logical level until their commit point (to avoid conflicts) and migrate objects to physical pages after the commit point.

In DBCache parlance, the safe and cache would contain objects (not physical records), and objects would be reallocated (and hopefully clustered) when moved to the database.

Makes sense?

alex |
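[A sketch of the split Alex proposes, under the assumption that transactions buffer logical objects keyed by recid and only touch the physical layer at commit; `LogicalTransaction` and `PhysicalAllocator` are hypothetical names, not part of jdbm.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a transaction works purely on logical objects,
// so no physical bytes are locked while it runs. Only commit hands the
// objects to an allocator, which may place (and cluster) them freely.
class LogicalTransaction {
    private final Map<Long, byte[]> workingSet = new HashMap<>();

    // Reads and writes during the tx stay at the logical level.
    void update(long recid, byte[] serialized) {
        workingSet.put(recid, serialized);
    }

    byte[] read(long recid) {
        return workingSet.get(recid); // a real impl falls back to the store
    }

    // Commit point: migrate buffered objects to physical pages.
    void commit(PhysicalAllocator allocator) {
        for (Map.Entry<Long, byte[]> e : workingSet.entrySet()) {
            allocator.reallocate(e.getKey(), e.getValue());
        }
        workingSet.clear();
    }
}

interface PhysicalAllocator {
    void reallocate(long recid, byte[] data);
}
```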
From: Kevin D. <ke...@tr...> - 2006-01-20 18:34:41
|
Alex-

I'm not sure I understand how this would be an issue... I'd like to explore this further with you if possible.

In the scheme I describe, each transaction would have its own view of the DB as of the time that it started. If a transaction rebalances a tree, all other transactions that were already in process would continue to have a view of the tree as it was before it was rebalanced.

The only time there would be a conflict is if both transactions attempted to rebalance the same tree at the same time, in which case one of the transactions would fail and roll back (obviously, an application would want to use higher-level locking semantics to prevent this from happening - and I could see how this would be complicated by the fact that each transaction would have its own copy of the BTree object... but that's going to be an issue no matter what we do here).

That said, I am quite open to the idea of handling transactions at the object level instead of the page level. So much so that I'm probably not going to bother putting together the full byte-level locking concept for you guys.

From your suggestion, it sounds like you might be thinking about implementing the DBCache at the level of the record manager instead of the page manager. The current jdbm architecture already has a nice split between these two concerns, so maybe this would be a very elegant solution... It may be desirable to do this at the record *content* level (i.e. the bytes of the record, but not the record header).

With this in hand, I see two main areas of discussion:

a) implementation of the safe (a ring buffer may be much more difficult to achieve with variable-length records...)
b) handling translation pages

I'm going to defer thinking about (a) because I'm certain it can be achieved - even an ugly hack of doing linear searches during recovery would work.

Translation Pages:

The crux of the problem as I see it is that the application must have access to the recid of any new records created by a tx, which implies that the translation page must be updated in the middle of a tx. This, in turn, requires updates to the free slot tracking mechanism... What's even more important is that we have to be able to roll these changes back (re-free an allocated slot if the tx that created it rolls back), and do it in such a way that we don't mess up other transactions.

Each transaction will already have a list of rows that it has changed - maybe we just need a list of rows it has created as well. Let's call those the 'updated' and 'new' lists...

Could we manage the translation pages in the DBCache, and have a special transaction for managing changes to the translation pages? Calls against that transaction are synchronized, and the API is designed so that all calls are atomic... Each 'record' managed by the translation-tx would be equivalent to one translation page (without page headers).

When a new record is created, a new slot in the translation page gets pulled from the free slot list and marked as 'NEW' (i.e. it has no offset, because we haven't actually copied data into the DB yet). When a record is deleted, its slot is marked as 'FREE' and it is added to the free slot list.

Commit of a tx would involve:

a) Sync the new and updated records (record content only) to the safe
b) Sync the translation transaction to the safe

If there are long txs, there would be an additional step:

c) Delete the tx log file

Note that I'm pretty sure that step (b) is fine even if you have multiple transactions rolling back at the same time. The translation transaction will always be in a valid state when it syncs to the safe, because all operations on the translation system are synchronized and atomic.

As an optimization, we could track which transactions have actually changed the state of the translation system (by creating a new record, or by freeing one), and only sync the translation transaction if the tx in question is in that list.

Rollback of a transaction would involve:

a) Calling translationSystem.free() for each record in the 'new' list
b) Sync the translation transaction to the safe
c) Discarding tx copies of records (just record content) in cache

In the case of long txs, there would be an additional step:

d) Restore 'updated' cache records from the tx log

Crash Recovery would involve:

This is where I start running into problems... With what I've described above, we can implement the system and we are guaranteed not to have database corruption. However, it is still possible that a crash could leave slots in the translation table consumed when they are not actually used. They would be marked as 'NEW0', so we could find and explicitly free them during recovery - but I'm not sure there is a good way to do this other than actually ripping through the translation table during recovery. That said, I'm not sure this is that big of a deal, because it can be done in parallel with other transactions after the recovery is complete...

OK - now that I've talked myself through that, here is how recovery would happen:

1. The state of the cache is recovered from the safe (txs that didn't succeed in writing BOTH record and translation data to the safe are discarded)
2. A low-priority thread is started that runs through the translation table and explicitly frees any records that are marked as 'NEW0'. During this time period, all newly allocated records will be marked as 'NEW1' in the translation table.

We have to do the NEW0/NEW1 flip-flop to prevent the recovery thread from accidentally blowing away new records for transactions that start after the recovery.

The only thing I don't have a super solid idea of is how to track the NEW0/NEW1 state (the first time a DB is opened it should be in NEW0, the next time NEW1, etc. It may be appropriate to just store this value in the header of the DB when it is initially opened).

Whew... do you guys think this is workable?

- K |
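[A sketch of the translation-slot lifecycle and the NEW0/NEW1 epoch flip Kevin describes, assuming an in-memory model; the names (`TranslationTable`, `reopen`, `sweepStaleSlots`) are hypothetical, and a real implementation would persist the epoch in the DB header as Kevin suggests.]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: new slots are tagged NEW plus the current epoch
// (the NEW0/NEW1 flip-flop), so a post-crash sweep can free stale slots
// from the previous epoch without touching slots allocated after recovery.
class TranslationTable {
    enum State { FREE, NEW, USED }

    static final class Slot {
        State state = State.FREE;
        int epoch;        // meaningful only while state == NEW
        long offset = -1; // physical offset; -1 until data is written
    }

    private final Map<Long, Slot> slots = new HashMap<>();
    private final Deque<Long> freeList = new ArrayDeque<>();
    private int epoch; // would be read from the DB header at open
    private long nextRecid = 1;

    TranslationTable(int epoch) { this.epoch = epoch; }

    // Simulates reopening the DB: the NEW0/NEW1 marker flips.
    synchronized void reopen() { epoch ^= 1; }

    // Allocate a recid for a record with no physical location yet.
    synchronized long allocate() {
        Long recid = freeList.poll();
        if (recid == null) recid = nextRecid++;
        Slot s = slots.computeIfAbsent(recid, k -> new Slot());
        s.state = State.NEW;
        s.epoch = epoch;
        s.offset = -1;
        return recid;
    }

    // Called when the record's bytes actually land in the DB.
    synchronized void assignOffset(long recid, long offset) {
        Slot s = slots.get(recid);
        s.state = State.USED;
        s.offset = offset;
    }

    // Rollback path: re-free a slot from a tx's 'new' list.
    synchronized void free(long recid) {
        Slot s = slots.get(recid);
        s.state = State.FREE;
        s.offset = -1;
        freeList.add(recid);
    }

    // Recovery sweep: free NEW slots left over from the previous epoch.
    synchronized int sweepStaleSlots() {
        int freed = 0;
        for (Map.Entry<Long, Slot> e : slots.entrySet()) {
            Slot s = e.getValue();
            if (s.state == State.NEW && s.epoch != epoch) {
                free(e.getKey());
                freed++;
            }
        }
        return freed;
    }
}
```

Because the sweep only matches NEW slots from the *other* epoch, records allocated after recovery begins (tagged with the current epoch) are never blown away, which is exactly the property the flip-flop is there to guarantee.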