From: Kevin D. <ke...@tr...> - 2006-01-31 22:37:13
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma size=2> <DIV><FONT face=Arial>Hi all - here is the outline of a record level version of the dbCache design... This was at the end of an email sent to the list last week, so it might have gotten lost in the shuffle... (Or you guys looked at 2 pages of dense text and decided that I was just blabbering :-) ). I am actually quite excited about this idea...</FONT></DIV> <DIV><FONT face=Arial></FONT> </DIV> <DIV><FONT face=Arial>Forwarded email section:</FONT></DIV> <DIV><FONT face=Arial></FONT> </DIV> <DIV><FONT face=Arial size=2>From your [Alex's] suggestion, it sounds like you might be thinking about implementing the DBCache at the level of the record manager instead of the page manager. The current jdbm architecture already has a nice split between these two concerns, so maybe this would be a very elegant solution... It may be desirable to do this at the record *content* level (i.e. the bytes of the record, but not the record header).</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>With this in hand, I see that there are two main areas of discussion:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>a) implementation of the safe (a ring buffer may be much more difficult to achieve with variable length records...)</FONT></DIV> <DIV><FONT face=Arial size=2>b) handling translation pages</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>I'm going to defer thinking about (a) because I'm certain it can be achieved - even an ugly hack of doing linear searches during recovery would work.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Translation Pages:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The crux of the problem as I see it is that the application must have access to the recid of any new records created by a tx, which implies that the translation page must be updated in the middle of a tx. This, in turn, requires updates to the free slot tracking mechanism... What's even more important is that we have to be able to roll these changes back (re-free an allocated slot if the tx that created that slot rolls back), and do it in such a way that we don't mess up other transactions.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Each transaction will already have a list of rows that it has changed - maybe we need to just have a list of rows that it has created as well. Let's call those the 'updated' and 'new' lists...</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Could we manage the translation pages in the DBCache, and have a special transaction for managing changes to the translation pages? Calls against the transaction are synchronized, and the api is created so that all calls are atomic... Each 'record' managed by the translation-tx would be equivalent to one translation page (without page headers).</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>When a new record is created, a new slot in the translation page gets pulled from the free slot list, and marked as 'NEW' (i.e. it has no offset, because we haven't actually copied data into the DB yet). When a record is deleted, it's slot is marked as 'FREE' and it is added to the free slot list.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2><U>Commit of a tx would involve:</U></FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>a) Sync the new and updated records (record content only) to Safe</FONT></DIV> <DIV><FONT face=Arial size=2>b) Sync the translation transaction to Safe</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>If there are long tx, there would be additional steps:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>c) Delete the tx log file</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2> <DIV><FONT face=Arial size=2>Note that I'm pretty sure that step (b) is fine if you have multiple transactions rolling back at the same time. The translation transaction will always be in a valid state when it syncs to state because all operations on the translation system are synchronized and atomic.</FONT></DIV></FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>As an optimization, we could track which transactions have actually changed the state of the translation system (by creating a new record, or by freeing a record), and only sync the translation transaction if the tx in question is in that list.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2><U>Rollback of a transaction would involve:</U></FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2> <DIV><FONT face=Arial size=2>a) Calling translationSystem.free() for each record in the 'new' list</FONT></DIV> <DIV>b) Sync the translation transaction to Safe</DIV>c) Discarding tx copies of records (just record content) in cache</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>In the case of long tx, there would be additional steps:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>d) Restore 'updated' cache records from tx log</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2><U>Crash Recovery would involve:</U></FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>This is where I start running into problems... With what I've described above, we can implement the system and we are guaranteed to not have database corruption. However, it is still possible that a crash could result in slots in the translation table being consumed when they are not actually used. They would be marked as 'NEW0', so we could find and explicitly free them during recovery - but I'm not sure there is a good way to do this other than actually ripping through the translation table during recovery. That said, I'm not sure that this is that big of a deal, because it can be done in parallel with other transactions after the recovery is complete...</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>OK - now that I've talked myself through that, here is how recovery would happen:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>1. The state of the cache is recovered from the safe (tx that didn't succed in writing BOTH record and translation data to the safe are discarded)</FONT></DIV> <DIV><FONT face=Arial size=2>2. A low priority thread is started that runs through the translation table and explicitly frees any records that are marked as 'NEW0'. During this time period, all newly allocated records will be marked as 'NEW1' in the translation table.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>We have to do the NEW0/NEW1 flip flop to prevent the recovery thread from accidentally blowing away new records for transactions that start after the recovery.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The only thing that I don't have a super solid idea of is how to track the NEW0/NEW1 state (the first time a DB is opened, it should be in NEW0, the next time, NEW1, etc... It may be appropriate to just store this value into the header of the DB when it is initially opened).</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Whew... do you guys think this is workable?</FONT></DIV> <DIV></DIV> <DIV><FONT face=Arial></FONT></DIV> <DIV><FONT face=Arial></FONT> </DIV> <DIV><FONT face=Arial>Kevin Day<BR>Trumpet, Inc.<BR><A href="http://www.trumpetinc.com"><FONT color=#0000ff>www.trumpetinc.com</FONT></A><BR><A href="mailto:ke...@tr..."><FONT color=#0000ff>ke...@tr...</FONT></A><BR>602-438-7030</FONT></DIV> <DIV><FONT face=Arial></FONT> </DIV></FONT></BODY></HTML> |