From: 'Kevin D. ' <ke...@tr...> - 2006-03-07 00:15:54
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma> <DIV><FONT face=Arial size=2>Bryan-</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2>Geeze - you make it sound like my assumptions are completely off base :-)</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>To answer your concerns, the assumption is that most records in the database will have one version. I feel that this is a reasonable assumption because the only time you will have versions in the DB is if a Tx commits while another transaction is actively using the record. Do you feel that this assumption is not valid? If so, how is it not valid? Can you give any counter-examples? That may help me to rethink the design so it adequately addresses your concerns.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The proposal is not to replace the translation pages with version pages - it is to augment the translation pages with version pages - if versions exist. I think of the newly designed record manager as having several layers:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>[Physical record manager] </FONT></DIV> <DIV><FONT face=Arial size=2>[Version Manager]</FONT></DIV> <DIV><FONT face=Arial size=2>[Logical record manager]</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The logical record manager would make a request to the version manager for the appropriate data, in the context of the current tx. The version manager determines which physical record is required and requests it from the physical record manager. 
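A rough sketch of that layering as Java interfaces might look like the following. This is purely illustrative - the interface names, signatures, and the one-version-per-record shortcut class are assumptions, not the actual jdbm API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three-layer design described above.
public class Layers {

    // Bottom layer: raw record bytes at a physical offset.
    interface PhysicalRecordManager {
        byte[] read(long physOffset);
    }

    // Middle layer: resolves the version of a logical row that is
    // appropriate for the requesting transaction.
    interface VersionManager {
        byte[] fetch(long logicalRowId, long txId);
    }

    // Trivial version manager for the common case of exactly one
    // version per record: the logical row id translates straight to a
    // physical offset and no version-table lookup is needed.
    static class SingleVersionManager implements VersionManager {
        private final Map<Long, Long> translation; // logical row id -> physical offset
        private final PhysicalRecordManager phys;

        SingleVersionManager(Map<Long, Long> translation, PhysicalRecordManager phys) {
            this.translation = translation;
            this.phys = phys;
        }

        public byte[] fetch(long logicalRowId, long txId) {
            return phys.read(translation.get(logicalRowId));
        }
    }

    public static void main(String[] args) {
        Map<Long, Long> translation = new HashMap<>();
        translation.put(1L, 100L);
        PhysicalRecordManager phys = off -> new byte[] { (byte) off };
        VersionManager vm = new SingleVersionManager(translation, phys);
        System.out.println(vm.fetch(1L, 7L)[0]); // prints 100
    }
}
```

The point of the middle layer is that the logical layer never needs to know whether a version table exists for a given row.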
In a strictly segregated architecture, we'd have completely separate data structures for the logical, version and physical record managers. But because </FONT><FONT face=Arial size=2>very few records will actually have versions, this is not efficient, so we do an optimization in the version manager that tells it to only use a version table lookup if there is indeed a version to look up.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>If there is a version table, then the logical row id will point to the version table in the version page system. If there is not a version table, then the logical row id will point to the data record itself. So, the exact read operation will look like this:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>1. Read physical offset of data based on logical row id from translation system</FONT></DIV> <DIV><FONT face=Arial size=2>2. Fetch page for that physical offset</FONT></DIV> <DIV><FONT face=Arial size=2>3. Test the page type of the page retrieved</FONT></DIV> <DIV><FONT face=Arial size=2>4a. If page type (this is already contained in the jdbm page header) is PAGE_DATA, read the record content from the physical offset and build an in-memory version table with a single entry</FONT></DIV> <DIV><FONT face=Arial size=2>4b. If page type is PAGE_VERSION, read the version record content from the physical offset and create an in-memory version table array based on that content (the table will consist of a header containing the logical rec id, and an array of <OID, physical offset> tuples)</FONT></DIV> <DIV><FONT face=Arial size=2>5. Allow the version manager to determine the appropriate version to return for the current transaction context</FONT></DIV> <DIV><FONT face=Arial size=2>6. If the version byte data exists in the version list already, return it, otherwise...</FONT></DIV> <DIV><FONT face=Arial size=2>7. 
Go to the physical page offset of the version, read the bytes for the record, put the bytes into the version list, then return the bytes</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>So, yes, if there actually are versions for a given record, then you will have two disk lookups instead of one. But that only covers a very small percentage of the overall records in the store - and it leaves the lookup as O(log Z) - where Z is the average number of versions of all of the records in the store. log Z reflects using a binary search over Z versions. Z is going to be very, very close to 1, because very few records have more than one version, so the practical lookup performance will be O(1).</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>You are correct that using OID alone for determining whether an abort is necessary may not be as good as it could be. It would certainly be possible (and probably even desirable!) to have the in-memory version of the record capture the read and/or modified time of the records it works with (we have to capture a read and write set anyway for the conflict resolution mechanism - this just increases the record size of that structure a tad). This could be provided as input to the optimistic conflict resolution system. The OID becomes a baseline time then, and the time offsets are milliseconds from the OID. This will allow us to use less memory (and faster calculations, if that matters). 
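The read operation outlined in steps 1-7 above can be sketched roughly as follows. This is a hypothetical sketch: PAGE_DATA/PAGE_VERSION stand in for the jdbm page-header type codes, the version table is flattened to {OID, physical offset} pairs, and the visibility test in step 5 is simplified to "OID not younger than the requesting tx":

```java
// Hypothetical sketch of the version-aware read path (steps 1-7 above).
public class VersionedFetch {
    public static final int PAGE_DATA = 1;    // stand-in for jdbm's data page type
    public static final int PAGE_VERSION = 2; // stand-in for jdbm's version page type

    // Steps 4a/4b: build the in-memory version table as {OID, physOffset} pairs.
    public static long[][] buildVersionTable(int pageType, long physOffset, long[][] onDisk) {
        if (pageType == PAGE_DATA) {
            // Step 4a: single-version record, synthesize a one-entry table.
            return new long[][] { { 0L, physOffset } };
        }
        return onDisk; // step 4b: table content was read from the version page
    }

    // Step 5: newest version visible to the requesting tx. Entries are
    // assumed sorted by OID; a reverse linear scan is used because
    // readers usually want a version near the tail of the list.
    public static long selectOffset(long[][] table, long txId) {
        for (int i = table.length - 1; i >= 0; i--) {
            if (table[i][0] <= txId) return table[i][1];
        }
        return -1L; // no version visible to this tx
    }

    public static void main(String[] args) {
        long[][] t = buildVersionTable(PAGE_VERSION, 0L,
                new long[][] { { 5L, 100L }, { 9L, 200L } });
        System.out.println(selectOffset(t, 7L)); // prints 100
    }
}
```

With Z very close to 1 the table almost always has a single entry, so the scan degenerates to the O(1) case described above.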
We don't have to actually persistently store this information because the OID and EID implicitly bracket the operations from the perspective of other tx, so I'm pretty sure there is no disk usage advantage or disadvantage to capturing this info.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Note that I have really said nothing about how the version manager determines which physical record to return - it could use an algorithm based on the captured read/write timestamps in addition to the OID/EID bracketing rules that I suggested. The point is that the overall strategy is sound, and that we can tweak the details to optimize performance as we need - without further messing with the file format.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2>As for whether it is going to be a performance win, I guess we'll just have to agree to disagree - I am certain that it will be a massive win for read-only tx, and it makes the overall architecture much cleaner for handling VLR, etc... Given that it solves the VLR problem, allows for use of either optimistic or pessimistic locking, and provides a well understood mechanism for managing fragmentation and proximity, I would like to propose that, unless there are specific objections, we give it a whack.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>It should be easy to completely disable the versioning if desired (the version manager would just always return the latest physical record). 
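The baseline-plus-offset timestamp encoding suggested a couple of paragraphs back (read/write times stored as millisecond offsets from the OID baseline rather than as full 8-byte timestamps) could look something like this. The class and method names are illustrative assumptions:

```java
// Illustrative sketch: store read/write times as millisecond offsets
// from the transaction's OID baseline, halving the per-entry footprint
// from 8 bytes to 4.
public class TimestampCodec {

    // Encode an absolute event time as a 4-byte offset from the baseline.
    public static int encode(long oidBaselineMillis, long eventMillis) {
        long delta = eventMillis - oidBaselineMillis;
        if (delta < 0 || delta > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("offset out of range of the baseline");
        }
        return (int) delta;
    }

    // Recover the absolute event time from the baseline and offset.
    public static long decode(long oidBaselineMillis, int offsetMillis) {
        return oidBaselineMillis + offsetMillis;
    }

    public static void main(String[] args) {
        long baseline = 1_000L;
        int off = encode(baseline, 1_500L);
        System.out.println(decode(baseline, off)); // prints 1500
    }
}
```

Since the offsets are only consulted by the in-memory conflict resolver, nothing here needs to reach the disk format, which matches the "no disk usage advantage or disadvantage" observation above.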
Changing the system to perform in-place updates would require some significant changes, because one of the advantages of the proposed system is that it ditches the free lists entirely.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>If that sounds good to you, I can put together an in-memory mockup of the system that we can tie into your transaction serialization testing harness and we can see how it goes. </FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>I am especially interested to see if there are going to be any interaction issues between 2PL and the versioning system... I'm pretty sure that if we make our locks on logical rows, everything should flow from that.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Cheers,</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>- K</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2> </FONT> <TABLE> <TBODY> <TR> <TD width=1 bgColor=blue><FONT face=Arial size=2></FONT></TD> <TD><FONT face=Arial size=2><FONT color=blue>> Kevin,<BR><BR>It sounds like the performance of the design depends critically on the<BR>expectation that there is only a single version for each record and<BR>otherwise we do an additional indirection to find even the most current<BR>version of the record. If your assumption is off base, then you are<BR>going to add a significant overhead every time you go for the current<BR>version of the record. <BR><BR>Or is the proposal to replace the use of translation pages with "version"<BR>pages? 
In this case, the danger is that fewer translations will fit on<BR>a single page and the overall memory burden for the translation from OID<BR>to physical offset will be higher.<BR><BR>There also appears to be an assumption that the transaction creation<BR>order (sequential assignment of transaction identifiers) will produce<BR>transaction restart rates that will be acceptable. My understanding<BR>is that this overconstrains the serialization ordering and results in<BR>many more restarts than a policy which decides the serialization ordering<BR>at the locked point or commit point of the tx. While read-only txs can<BR>benefit by deliberately reading old versions, there may be increased cost<BR>both from the non-transparency of that decision and from the requirement<BR>to indirect to an earlier version.<BR><BR>So, I am still not convinced that this is a hands-down performance win.<BR><BR>With regard to optimistic policies, [1] has a nice approach to hybrid<BR>atomicity which provides the opportunity for per-datatype choice of<BR>pessimistic, timestamp, or optimistic CC and expands the notion of<BR>conflict-based (read-write and write-write conflict) CC to datatype-specific<BR>definitions of conflict (allowing more interleavings of txs)<BR>and also to state-based conflict resolution.<BR><BR>-bryan<BR><BR>[1] Maurice Herlihy. Apologizing Versus Asking Permission: Optimistic<BR> Concurrency Control for Abstract Data Types. Carnegie Mellon<BR> University.<BR> <A href="http://www.cs.brown.edu/~mph/Herlihy90a/p96-herlihy.pdf"><FONT color=#0000ff>http://www.cs.brown.edu/~mph/Herlihy90a/p96-herlihy.pdf</FONT></A><BR><BR><BR>-----Original Message-----<BR>From: <A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A><BR>To: JDBM Developer listserv<BR>Sent: 3/6/2006 3:18 PM<BR>Subject: re[2]: [Jdbm-developer] MVCC and very long transactions<BR><BR>Bryan-<BR><BR>I'm addressing some of your comments/thoughts below!<BR><BR>I thought of another advantage of the 
proposed system: Deleting a<BR>record can occur without writing data to the record itself. We just<BR>append a 'deleted' entry to the version list. If we are deleting a<BR>bunch of records, this will require a bunch of sequential writes of<BR>version tables, and a bunch of translation table updates. If we are<BR>doing in-place deletions, we'd also incur the cost of updating the pages<BR>containing record data.<BR><BR>As an FYI, I am still *very* big on using MVCC. The advantages it<BR>provides for non-blocking read-only tx, and the natural fit between it<BR>and VLR still make it extremely desirable, even given the 'serialization<BR>isolation' level issue I described over the weekend. I think that MVCC<BR>in conjunction with either 2PL or optimistic locking is going to be a<BR>very powerful combination.<BR><BR>- K<BR> <BR> <BR>> Kevin,<BR><BR>Sounds reasonable. In terms of DBCache, all you have to do is force<BR>the pages belonging to the VLR Tx to the DB. I was pointing to this<BR>issue a while back. There is an interaction between the persistent<BR>record allocation logic and the page manager which can simplify some<BR>things. If we have a guarantee that a page has data from one and<BR>only one transaction, then we can force it to the DB without logging<BR>since rollback is not an issue (at the page content level).<BR><BR><BR>My concern with page-per-tx is that it will result in heavy<BR>fragmentation. With the strategy that I envision, there will be no<BR>fragmentation, and reclamation of space by deleted versions becomes a<BR>very natural operation.<BR><BR><BR><BR>Note that the wilderness page will get forced to safe for each<BR>transaction commit - if you have a lot of small transactions committing,<BR>then the same wilderness page would be forced multiple times.<BR><BR><BR><BR>There are some things that I am not convinced about yet:<BR><BR>1. 
The need to maintain a persistent "master transaction table".<BR><BR><BR>This is necessary to ensure durability - but it's also not as difficult<BR>as it sounds. The table will only track data for committed transactions<BR>that have records with versions associated with them - definitely a very<BR>small subset of the total. I suspect that the entire thing will fit in<BR>a single page, but it could extend past that. Regardless, the table<BR>itself would be held in memory at all times - syncing to the safe is<BR>only required to ensure durability.<BR><BR><BR><BR>The rules for purging the table are well defined, and are an integral<BR>part of the garbage collection mechanism.<BR><BR><BR><BR><BR>2. Efficient techniques to find a historical (or simply prior)<BR> version of a record.<BR><BR><BR>As long as old versions can be purged when they are no longer needed<BR>(which applies to every application I can envision except for the<BR>'eternal store' concept that you've mentioned in the past), a simple<BR>linear list per record should suffice. A binary search could be used<BR>against the list if desired, but a reverse linear search will probably<BR>be more efficient (most reads will want versions from the tail end of<BR>the list).<BR><BR><BR><BR>I actually have a design put together for all of this that I think will<BR>work quite nicely. Versions are stored in their own threaded list of<BR>pages (i.e. they are not mixed together with data pages). Once you've<BR>done that, then the version list can be treated very similarly to a<BR>regular data record. As the list is changed, it gets updated by<BR>appending the *entire* list (for a given record) to a clean page.<BR><BR><BR><BR>Having a separate page list for data and versions has an interesting<BR>implication for fetching:<BR><BR><BR><BR>The logical record id lookup will point to a physical address<BR><BR>During a fetch, the system pulls the page for that address. 
If the page<BR>is a data page, then it implies that there is only one version of the<BR>record, which can then be returned immediately. If the page is a<BR>translation page, then the system reads the translation list for the<BR>record at that location, then performs an additional lookup to find the<BR>physical location of the data record.<BR><BR><BR><BR>Note that I am very adamant about having per-record version lists<BR>(instead of a large global version lookup system that captures all<BR>version info for all records that have versions). With the per-record<BR>system, obtaining the version list is a single operation. With a global<BR>system (let's say we used a BTree for it), we'd have to do a<BR>linear/binary search on every single page as we navigate the tree.<BR>That's fine if you have lots of data to search, but accessing the<BR>correct version list directly in a single operation is much faster.<BR><BR><BR><BR>Think of the version system I'm proposing as an abstract BTree, where we<BR>make a rule that all nodes in a given leaf page belong to a single<BR>record, and we have hyper-efficient access to the appropriate leaf node.<BR><BR><BR><BR>This system also significantly improves concurrency, because the only<BR>locking necessary is a quick spin lock on the in-memory version table<BR>itself.<BR><BR><BR><BR>Now, this starts to break down if the version list gets to be insanely<BR>long - but I think that it's safe to say that the typical record will<BR>have only 1 version. Of the remainder, the vast majority will only have 2<BR>or 3 versions, and the rest may have as many as 100 - but only for<BR>a very short period of time.<BR><BR><BR>3. It seems that these are related since versions correspond to<BR> the timestamp associated with a transaction (however we derive<BR> that timestamp).<BR><BR><BR>I agree with this with one exception: timestamps imply an explicit<BR>ordinal. 
Transaction identifiers can be used to implement ordinals in a<BR>ring buffer, so you don't have to keep transaction IDs around forever.<BR>It also means that you don't have to use a full 8 bytes for capturing<BR>the OID/EID information - a value sufficient to store the number of<BR>current active transactions is all that is required (plus some overhead<BR>to allow time for the GC to run without stopping the system).<BR><BR><BR>4. If the timestamp is induced at the locked phase of the tx, then<BR> this will not work for VLR Tx since we could have written too many<BR> versions to buffer before reaching the locked point -- and we will<BR> have to buffer versions until we have the timestamp, right?<BR><BR><BR>The ordinal of the transaction is created when the transaction starts.<BR>All information for specifying OID on a given record is available when<BR>the record is initially updated by a given tx. The EID is specified<BR>when the Tx commits - but the EID is part of the transaction master<BR>table - not part of the record header.<BR><BR><BR>5. An update-in-place store might not use MVCC and hence versions<BR> might not be an issue. I am just suggesting that clustering for<BR> read performance may trade off against other issues such as append-only<BR> allocation and even concurrency for some applications. I do<BR> not think that there is a single solution for top performance for<BR> all applications.<BR><BR><BR>I agree - I think that the advantages outlined by the approach under<BR>discussion far outweigh the disadvantages for the majority of<BR>applications. I also think that we can use the functionality of the<BR>current approach to provide capabilities for clustering that would be<BR>very awkward in an update-in-place system. This means that there<BR>doesn't have to be a tradeoff between clustering and other issues - you<BR>can get both advantages!<BR><BR><BR><BR>I've referred to a "fake" update in the past. 
Imagine that you had a tx<BR>running in a low-priority thread. This tx goes through every so often<BR>and checks to see if all of the records in a 'proximity set' do indeed<BR>reside on the same page. If they don't, then the tx does a fake update<BR>on those records. This would have the effect of moving the current<BR>version of those records so they are laid down one after another. If<BR>another tx does an update on any of those records at the same time, no<BR>big deal - you lose the proximity for the time being until the next pass<BR>of the proximity thread.<BR><BR><BR><BR>This type of process works very naturally with the garbage collector.<BR>As the GC runs, it will see that there is a page with an unused record<BR>on it, and will do a "fake" update on all records from that page, then<BR>move the page to the free list.<BR><BR><BR><BR>The fake update can be performed by making it a low-priority transaction<BR>- if any other tx conflicts with it, then it aborts.<BR><BR><BR>-bryan<BR><BR>-----Original Message-----<BR>From: <A href="mailto:jdb...@li..."><FONT color=#0000ff><mailto:jdb...@li...></FONT></A><BR><A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A><BR>To: JDBM Developer listserv<BR>Sent: 3/6/2006 2:08 PM<BR>Subject: [Jdbm-developer] MVCC and very long transactions<BR><BR>OK - here's a description of how MVCC can be used to elegantly handle<BR>VLR. As an up-front warning, there are aspects of this that do *not*<BR>fit into the strategy outlined in the DBCache paper - but please keep an<BR>open mind, and I think you'll see the advantages.<BR><BR><BR>Let's start by saying that I'm using what Bryan has affectionately named<BR>MVCC(k). 
The idea is that versions of rows are marked with an<BR>"originating transaction identifier" (OID), and that value is used to<BR>determine when a version of a given row should be visible in the scope of<BR>any given transaction.<BR><BR>Transactions can have two states: Active and Committed.<BR><BR>Remember also that each transaction has an "expiry transaction<BR>identifier" (EID), which indicates the youngest active transaction at<BR>the time that the transaction in question was committed.<BR><BR>When a transaction is active, it has a null EID. When it commits, it<BR>gets its EID filled in.<BR><BR><BR>Support for VLR then becomes a matter of detecting whether a version of<BR>a row belongs to a committed transaction or not - if it doesn't, then the<BR>only transaction that can safely read that row is the OID transaction.<BR><BR><BR>As versions of rows fall out of the record cache, they are written to a<BR>clean page in the file, and the version table for that record is updated<BR>to reflect the OID and physical offset of the record. This would<BR>naturally happen in chunks (lots of changed records that are part of a<BR>VLR would be evicted at a time to free up memory). In any case, all<BR>pages affected by the VLR eviction (translation pages, version pages and<BR>data pages) are written to safe, then the records are safely removed<BR>from the cache.<BR><BR>Now, consider the following three scenarios:<BR><BR>1. The VLR itself needs to access the evicted record<BR>2. Another Tx needs to access the evicted record<BR>3. The VLR needs to be aborted<BR><BR><BR>1. The VLR itself needs to access the evicted record<BR><BR>The VLR makes a request for the record. The version manager sees that<BR>the transaction performing the fetch is the same as the OID of the<BR>evicted record, and requests the record from the physical record<BR>manager. 
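The OID/EID visibility rule just described can be condensed into a small predicate. This is a hedged sketch of one plausible reading of the bracketing rule (a version committed before the reader's tx began is visible; an uncommitted version is visible only to its own tx) - the real MVCC(k) selection rules may differ in detail, and all names here are illustrative:

```java
// Hypothetical sketch of the MVCC(k) visibility rule sketched above.
// A null EID is modeled here as 0, meaning the writing tx has not
// committed (or was aborted).
public class Mvcc {
    public static final long NO_EID = 0L; // stand-in for a null EID

    // May the transaction txId see a version stamped with oid/eid?
    public static boolean isVisible(long txId, long oid, long eid) {
        if (txId == oid) return true;    // a tx always sees its own writes
        if (eid == NO_EID) return false; // uncommitted: only the OID tx may read it
        return txId >= eid;              // committed before this tx started
    }

    public static void main(String[] args) {
        // Scenario 2 above: most recent version has no EID and TxID != OID,
        // so the version manager must continue backwards in the chain.
        System.out.println(Mvcc.isVisible(6L, 5L, Mvcc.NO_EID)); // prints false
    }
}
```

Scenario 3 (abort) falls out of the same predicate: the aborted version keeps a null EID forever, so `TxID != OID && EID == null` means no other tx ever sees it.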
The physical record manager determines that the record is on a<BR>page not in the page cache and loads the page into the cache, then<BR>returns the record data.<BR><BR>2. Another Tx needs to access the evicted record<BR><BR>The other Tx makes a request for the record. The version manager sees<BR>that the most recent version does not have an EID and that the Tx is not<BR>equal to the OID, so it continues backwards in the version chain until<BR>it finds a version that meets the MVCC(k) rules for record selection.<BR>The physical record manager either finds the page containing the record<BR>in cache already, or it loads a page from the DB.<BR><BR>3. The VLR needs to be aborted<BR><BR>The VLR Tx in the master transaction table is marked as having been<BR>deleted, and the EID is left null. Any fetch performed by any other tx<BR>will never see the aborted version of the record (TxID != OID && EID ==<BR>null). The aborted version of the record then becomes a candidate for<BR>garbage collection, using exactly the same procedure as for other record<BR>versions that are no longer needed.<BR><BR>Restart from failure is achieved by sweeping the master transaction<BR>table and marking any Active transaction as deleted before processing<BR>actually starts.<BR><BR><BR><BR><BR>Advantages of this approach:<BR><BR>1. Proximity - all records used by a VLR will tend to clump together,<BR>making future queries by the VLR more efficient<BR>2. Efficient updates - the writes for the VLR have the exact same<BR>(fast) performance characteristics as writes for non-VLR. There is no<BR>need to write to two or even three files.<BR><BR>3. Consistent semantics - the handling of VLR requires almost no<BR>special handling - it just works in the natural order of the system.<BR><BR>4. Fast restart - restart performance is identical to that of DBCache,<BR>even if there were VLR active during the system failure<BR>5. 
No extra files lying around - much less accounting work to keep<BR>track of all the files, etc...<BR>6. Efficient abort - abort is very fast (single update to master<BR>transaction table) - this is a secondary consideration, but it is a<BR>benefit<BR><BR><BR>Disadvantages of this approach:<BR><BR>1. Records are never updated in place. This is both an advantage<BR>(higher-performance updates, much easier garbage collection, pack and<BR>rebuild implementations, implicit proximity) and a disadvantage (re:<BR>Bryan's concern about losing proximity). I really believe that this<BR>system will far outweigh the potential downside on this item - and I<BR>have already come up with a design for a mechanism that would allow the<BR>application level to periodically enforce proximity requirements on a<BR>set of records by using a false, non-blocking update.<BR><BR>2. ????<BR><BR><BR><BR>- K<BR><BR><BR><BR>------------------------------------------------------- This SF.Net<BR>email is sponsored by xPML, a groundbreaking scripting language that<BR>extends applications into web and mobile media. 
Attend the live webcast<BR>and join the prime developer group breaking into this new coding<BR>territory!<BR><BR><A href="http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642"><FONT color=#0000ff>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642</FONT></A><BR>_______________________________________________ Jdbm-developer mailing<BR>list <A href="mailto:Jdb...@li..."><FONT color=#0000ff>Jdb...@li...</FONT></A><BR><A href="https://lists.sourceforge.net/lists/listinfo/jdbm-developer"><FONT color=#0000ff>https://lists.sourceforge.net/lists/listinfo/jdbm-developer</FONT></A><BR></FONT></FONT></TD></TR></TBODY></TABLE></DIV></FONT></BODY></HTML>