From: 'Kevin D. ' <ke...@tr...> - 2006-03-07 00:15:54
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma> <DIV><FONT face=Arial size=2>Bryan-</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2>Geeze - you make it sound like my assumptions are completely off base :-)</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>To answer your concerns, the assumption is that most records in the database will have one version. I feel that this is a reasonable assumption because the only time you will have versions in the DB is if a Tx commits while another transaction is actively using the record. Do you feel that this assumption is not valid? If so, how is it not valid? Can you give any counter-examples? That may help me to rethink the design so it adequately addresses your concerns.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The proposal is not to replace the translation pages with version pages - it is to augment the translation pages with version pages - if versions exist. I think of the newly designed record manager as having several layers:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>[Physical record manager] </FONT></DIV> <DIV><FONT face=Arial size=2>[Version Manager]</FONT></DIV> <DIV><FONT face=Arial size=2>[Logical record manager]</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The logical record manager would make a request to the version manager for the appropriate data, in the context of the current tx. The version manager determines which physical record is required and requests it from the physical record manager. 
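A rough sketch of that layering as Java interfaces might look like the following. This is purely illustrative - the interface names, signatures, and the one-version-per-record shortcut class are assumptions, not the actual jdbm API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the three-layer design described above.
public class Layers {

    // Bottom layer: raw record bytes at a physical offset.
    interface PhysicalRecordManager {
        byte[] read(long physOffset);
    }

    // Middle layer: resolves the version of a logical row that is
    // appropriate for the requesting transaction.
    interface VersionManager {
        byte[] fetch(long logicalRowId, long txId);
    }

    // Trivial version manager for the common case of exactly one
    // version per record: the logical row id translates straight to a
    // physical offset and no version-table lookup is needed.
    static class SingleVersionManager implements VersionManager {
        private final Map<Long, Long> translation; // logical row id -> physical offset
        private final PhysicalRecordManager phys;

        SingleVersionManager(Map<Long, Long> translation, PhysicalRecordManager phys) {
            this.translation = translation;
            this.phys = phys;
        }

        public byte[] fetch(long logicalRowId, long txId) {
            return phys.read(translation.get(logicalRowId));
        }
    }

    public static void main(String[] args) {
        Map<Long, Long> translation = new HashMap<>();
        translation.put(1L, 100L);
        PhysicalRecordManager phys = off -> new byte[] { (byte) off };
        VersionManager vm = new SingleVersionManager(translation, phys);
        System.out.println(vm.fetch(1L, 7L)[0]); // prints 100
    }
}
```

The point of the middle layer is that the logical layer never needs to know whether a version table exists for a given row.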
In a strictly segregated architecture, we'd have completely separate data structures for the logical, version and physical record managers. But because </FONT><FONT face=Arial size=2>very few records will actually have versions, this is not efficient, so we do an optimization in the version manager that tells it to only use a version table lookup if there is indeed a version to look up.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>If there is a version table, then the logical row id will point to the version table in the version page system. If there is not a version table, then the logical row id will point to the data record itself. So, the exact read operation will look like this:</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>1. Read physical offset of data based on logical row id from translation system</FONT></DIV> <DIV><FONT face=Arial size=2>2. Fetch page for that physical offset</FONT></DIV> <DIV><FONT face=Arial size=2>3. Test the page type of the page retrieved</FONT></DIV> <DIV><FONT face=Arial size=2>4a. If page type (this is already contained in the jdbm page header) is PAGE_DATA, read the record content from the physical offset and build an in-memory version table with a single entry</FONT></DIV> <DIV><FONT face=Arial size=2>4b. If page type is PAGE_VERSION, read the version record content from the physical offset and create an in-memory version table array based on that content (the table will consist of a header containing the logical rec id, and an array of <OID, physical offset> tuples)</FONT></DIV> <DIV><FONT face=Arial size=2>5. Allow the version manager to determine the appropriate version to return for the current transaction context</FONT></DIV> <DIV><FONT face=Arial size=2>6. If the version byte data exists in the version list already, return it, otherwise...</FONT></DIV> <DIV><FONT face=Arial size=2>7. 
Go to the physical page offset of the version, read the bytes for the record, put the bytes into the version list, then return the bytes</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>So, yes, if there actually are versions for a given record, then you will have two disk lookups instead of one. But that only covers a very small percentage of the overall records in the store - and it leaves the lookup as O(log Z) - where Z is the average number of versions of all of the records in the store. log Z reflects using a binary search over Z versions. Z is going to be very, very close to 1, because very few records have more than one version, so the practical lookup performance will be O(1).</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>You are correct that using OID alone for determining whether an abort is necessary may not be as good as it could be. It would certainly be possible (and probably even desirable!) to have the in-memory version of the record capture the read and/or modified time of the records it works with (we have to capture a read and write set anyway for the conflict resolution mechanism - this just increases the record size of that structure a tad). This could be provided as input to the optimistic conflict resolution system. The OID becomes a baseline time then, and the time offsets are milliseconds from the OID. This will allow us to use less memory (and faster calculations, if that matters). 
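The read operation outlined in steps 1-7 above can be sketched roughly as follows. This is a hypothetical sketch: PAGE_DATA/PAGE_VERSION stand in for the jdbm page-header type codes, the version table is flattened to {OID, physical offset} pairs, and the visibility test in step 5 is simplified to "OID not younger than the requesting tx":

```java
// Hypothetical sketch of the version-aware read path (steps 1-7 above).
public class VersionedFetch {
    public static final int PAGE_DATA = 1;    // stand-in for jdbm's data page type
    public static final int PAGE_VERSION = 2; // stand-in for jdbm's version page type

    // Steps 4a/4b: build the in-memory version table as {OID, physOffset} pairs.
    public static long[][] buildVersionTable(int pageType, long physOffset, long[][] onDisk) {
        if (pageType == PAGE_DATA) {
            // Step 4a: single-version record, synthesize a one-entry table.
            return new long[][] { { 0L, physOffset } };
        }
        return onDisk; // step 4b: table content was read from the version page
    }

    // Step 5: newest version visible to the requesting tx. Entries are
    // assumed sorted by OID; a reverse linear scan is used because
    // readers usually want a version near the tail of the list.
    public static long selectOffset(long[][] table, long txId) {
        for (int i = table.length - 1; i >= 0; i--) {
            if (table[i][0] <= txId) return table[i][1];
        }
        return -1L; // no version visible to this tx
    }

    public static void main(String[] args) {
        long[][] t = buildVersionTable(PAGE_VERSION, 0L,
                new long[][] { { 5L, 100L }, { 9L, 200L } });
        System.out.println(selectOffset(t, 7L)); // prints 100
    }
}
```

With Z very close to 1 the table almost always has a single entry, so the scan degenerates to the O(1) case described above.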
We don't have to actually persistently store this information because the OID and EID implicitly bracket the operations from the perspective of other tx, so I'm pretty sure there is no disk usage advantage or disadvantage to capturing this info.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Note that I have really said nothing about how the version manager determines which physical record to return - it could use an algorithm based on the captured read/write timestamps in addition to the OID/EID bracketing rules that I suggested. The point is that the overall strategy is sound, and that we can tweak the details to optimize performance as we need - without further messing with the file format.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2>As for whether it is going to be a performance win, I guess we'll just have to agree to disagree - I am certain that it will be a massive win for read-only tx, and it makes the overall architecture much cleaner for handling VLR, etc... Given that it solves the VLR problem, allows for use of either optimistic or pessimistic locking, and provides a well understood mechanism for managing fragmentation and proximity, I would like to propose that, unless there are specific objections, we give it a whack.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>It should be easy to completely disable the versioning if desired (the version manager would just always return the latest physical record). 
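The baseline-plus-offset timestamp encoding suggested a couple of paragraphs back (read/write times stored as millisecond offsets from the OID baseline rather than as full 8-byte timestamps) could look something like this. The class and method names are illustrative assumptions:

```java
// Illustrative sketch: store read/write times as millisecond offsets
// from the transaction's OID baseline, halving the per-entry footprint
// from 8 bytes to 4.
public class TimestampCodec {

    // Encode an absolute event time as a 4-byte offset from the baseline.
    public static int encode(long oidBaselineMillis, long eventMillis) {
        long delta = eventMillis - oidBaselineMillis;
        if (delta < 0 || delta > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("offset out of range of the baseline");
        }
        return (int) delta;
    }

    // Recover the absolute event time from the baseline and offset.
    public static long decode(long oidBaselineMillis, int offsetMillis) {
        return oidBaselineMillis + offsetMillis;
    }

    public static void main(String[] args) {
        long baseline = 1_000L;
        int off = encode(baseline, 1_500L);
        System.out.println(decode(baseline, off)); // prints 1500
    }
}
```

Since the offsets are only consulted by the in-memory conflict resolver, nothing here needs to reach the disk format, which matches the "no disk usage advantage or disadvantage" observation above.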
Changing the system to perform in-place updates would require some significant changes, because one of the advantages of the proposed system is that it ditches the free lists entirely.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>If that sounds good to you, I can put together an in-memory mockup of the system that we can tie into your transaction serialization testing harness and we can see how it goes. </FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>I am especially interested to see if there are going to be any interaction issues between 2PL and the versioning system... I'm pretty sure that if we make our locks on logical rows, everything should flow from that.</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>Cheers,</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>- K</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2></FONT></DIV> <DIV><FONT face=Arial size=2> </FONT> <TABLE> <TBODY> <TR> <TD width=1 bgColor=blue><FONT face=Arial size=2></FONT></TD> <TD><FONT face=Arial size=2><FONT color=blue>> Kevin,<BR><BR>It sounds like the performance of the design depends critically on the<BR>expectation that there is only a single version for each record and<BR>otherwise we do an additional indirection to find even the most current<BR>version of the record. If your assumption is off base, then you are<BR>going to add a significant overhead every time you go for the current<BR>version of the record. <BR><BR>Or is the proposal to replace the use of translation pages with "version"<BR>pages? 
In this case, the danger is that fewer translations will fit on<BR>a single page and the overall memory burden for the translation from OID<BR>to physical offset will be higher.<BR><BR>There also appears to be an assumption that the transaction creation<BR>order (sequential assignment of transaction identifiers) will produce<BR>transaction restart rates that will be acceptable. My understanding<BR>is that this overconstrains the serialization ordering and results in<BR>many more restarts than a policy which decides the serialization ordering<BR>at the locked point or commit point of the tx. While read-only txs can<BR>benefit by deliberately reading old versions, there may be increased cost<BR>both from the non-transparency of that decision and from the requirement<BR>to indirect to an earlier version.<BR><BR>So, I am still not convinced that this is a hands-down performance win.<BR><BR>With regard to optimistic policies, [1] has a nice approach to hybrid<BR>atomicity which provides the opportunity for per-datatype choice of<BR>pessimistic, timestamp, or optimistic CC and expands the notion of<BR>conflict-based (read-write and write-write conflict) CC to datatype-specific<BR>definitions of conflict (allowing more interleavings of txs)<BR>and also to state-based conflict resolution.<BR><BR>-bryan<BR><BR>[1] Maurice Herlihy. Apologizing Versus Asking Permission: Optimistic<BR> Concurrency Control for Abstract Data Types. Carnegie Mellon<BR> University.<BR> <A href="http://www.cs.brown.edu/~mph/Herlihy90a/p96-herlihy.pdf"><FONT color=#0000ff>http://www.cs.brown.edu/~mph/Herlihy90a/p96-herlihy.pdf</FONT></A><BR><BR><BR>-----Original Message-----<BR>From: <A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A><BR>To: JDBM Developer listserv<BR>Sent: 3/6/2006 3:18 PM<BR>Subject: re[2]: [Jdbm-developer] MVCC and very long transactions<BR><BR>Bryan-<BR><BR>I'm addressing some of your comments/thoughts below!<BR><BR>I thought of another advantage of the 
proposed system: Deleting a<BR>record can occur without writing data to the record itself. We just<BR>append a 'deleted' entry to the version list. If we are deleting a<BR>bunch of records, this will require a bunch of sequential writes of<BR>version tables, and a bunch of translation table updates. If we are<BR>doing in-place deletions, we'd also incur the cost of updating the pages<BR>containing record data.<BR><BR>As an FYI, I am still *very* big on using MVCC. The advantages it<BR>provides for non-blocking read-only tx, and the natural fit between it<BR>and VLR still make it extremely desirable, even given the 'serialization<BR>isolation' level issue I described over the weekend. I think that MVCC<BR>in conjunction with either 2PL or optimistic locking is going to be a<BR>very powerful combination.<BR><BR>- K<BR> <BR> <BR>> Kevin,<BR><BR>Sounds reasonable. In terms of DBCache, all you have to do is force<BR>the pages belonging to the VLR Tx to the DB. I was pointing to this<BR>issue a while back. There is an interaction between the persistent<BR>record allocation logic and the page manager which can simplify some<BR>things. If we have a guarantee that a page has data from one and<BR>only one transaction, then we can force it to the DB without logging<BR>since rollback is not an issue (at the page content level).<BR><BR><BR>My concern with page-per-tx is that it will result in heavy<BR>fragmentation. With the strategy that I envision, there will be no<BR>fragmentation, and reclamation of space by deleted versions becomes a<BR>very natural operation.<BR><BR><BR><BR>Note that the wilderness page will get forced to safe for each<BR>transaction commit - if you have a lot of small transactions committing,<BR>then the same wilderness page would be forced multiple times.<BR><BR><BR><BR>There are some things that I am not convinced about yet:<BR><BR>1. 
The need to maintain a persistent "master transaction table".<BR><BR><BR>This is necessary to ensure durability - but it's also not as difficult<BR>as it sounds. The table will only track data for committed transactions<BR>that have records with versions associated with them - definitely a very<BR>small subset of the total. I suspect that the entire thing will fit in<BR>a single page, but it could extend past that. Regardless, the table<BR>itself would be held in memory at all times - syncing to the safe is<BR>only required to ensure durability.<BR><BR><BR><BR>The rules for purging the table are well defined, and are an integral<BR>part of the garbage collection mechanism.<BR><BR><BR><BR><BR>2. Efficient techniques to find a historical (or simply prior)<BR> version of a record.<BR><BR><BR>As long as old versions can be purged when they are no longer needed<BR>(which applies to every application I can envision except for the<BR>'eternal store' concept that you've mentioned in the past), a simple<BR>linear list per record should suffice. A binary search could be used<BR>against the list if desired, but a reverse linear search will probably<BR>be more efficient (most reads will want versions from the tail end of<BR>the list).<BR><BR><BR><BR>I actually have a design put together for all of this that I think will<BR>work quite nicely. Versions are stored in their own threaded list of<BR>pages (i.e. they are not mixed together with data pages). Once you've<BR>done that, then the version list can be treated very similarly to a<BR>regular data record. As the list is changed, it gets updated by<BR>appending the *entire* list (for a given record) to a clean page.<BR><BR><BR><BR>Having a separate page list for data and versions has an interesting<BR>implication for fetching:<BR><BR><BR><BR>The logical record id lookup will point to a physical address<BR><BR>During a fetch, the system pulls the page for that address. 
If the page<BR>is a data page, then it implies that there is only one version of the<BR>record, which can then be returned immediately. If the page is a<BR>translation page, then the system reads the translation list for the<BR>record at that location, then performs an additional lookup to find the<BR>physical location of the data record.<BR><BR><BR><BR>Note that I am very adamant about having per-record version lists<BR>(instead of a large global version lookup system that captures all<BR>version info for all records that have versions). With the per-record<BR>system, obtaining the version list is a single operation. With a global<BR>system (let's say we used a BTree for it), we'd have to do a<BR>linear/binary search on every single page as we navigate the tree.<BR>That's fine if you have lots of data to search, but accessing the<BR>correct version list directly in a single operation is much faster.<BR><BR><BR><BR>Think of the version system I'm proposing as an abstract BTree, where we<BR>make a rule that all nodes in a given leaf page belong to a single<BR>record, and we have hyper-efficient access to the appropriate leaf node.<BR><BR><BR><BR>This system also significantly improves concurrency, because the only<BR>locking necessary is a quick spin lock on the in-memory version table<BR>itself.<BR><BR><BR><BR>Now, this starts to break down if the version list gets to be insanely<BR>long - but I think that it's safe to say that the typical record will<BR>have only 1 version. Of the remainder, the vast majority will only have 2<BR>or 3 versions, and the rest may have as many as 100 - but only for<BR>a very short period of time.<BR><BR><BR>3. It seems that these are related since versions correspond to<BR> the timestamp associated with a transaction (however we derive<BR> that timestamp).<BR><BR><BR>I agree with this with one exception: timestamps imply an explicit<BR>ordinal. 
Transaction identifiers can be used to implement ordinals in a<BR>ring buffer, so you don't have to keep transaction IDs around forever.<BR>It also means that you don't have to use a full 8 bytes for capturing<BR>the OID/EID information - a value sufficient to store the number of<BR>current active transactions is all that is required (plus some overhead<BR>to allow time for the GC to run without stopping the system).<BR><BR><BR>4. If the timestamp is induced at the locked phase of the tx, then<BR> this will not work for VLR Tx since we could have written too many<BR> versions to buffer before reaching the locked point -- and we will<BR> have to buffer versions until we have the timestamp, right?<BR><BR><BR>The ordinal of the transaction is created when the transaction starts.<BR>All information for specifying OID on a given record is available when<BR>the record is initially updated by a given tx. The EID is specified<BR>when the Tx commits - but the EID is part of the transaction master<BR>table - not part of the record header.<BR><BR><BR>5. An update-in-place store might not use MVCC and hence versions<BR> might not be an issue. I am just suggesting that clustering for<BR> read performance may trade off against other issues such as append-only<BR> allocation and even concurrency for some applications. I do<BR> not think that there is a single solution for top performance for<BR> all applications.<BR><BR><BR>I agree - I think that the advantages outlined by the approach under<BR>discussion far outweigh the disadvantages for the majority of<BR>applications. I also think that we can use the functionality of the<BR>current approach to provide capabilities for clustering that would be<BR>very awkward in an update-in-place system. This means that there<BR>doesn't have to be a tradeoff between clustering and other issues - you<BR>can get both advantages!<BR><BR><BR><BR>I've referred to a "fake" update in the past. 
Imagine that you had a tx<BR>running in a low-priority thread. This tx goes through every so often<BR>and checks to see if all of the records in a 'proximity set' do indeed<BR>reside on the same page. If they don't, then the tx does a fake update<BR>on those records. This would have the effect of moving the current<BR>version of those records so they are laid down one after another. If<BR>another tx does an update on any of those records at the same time, no<BR>big deal - you lose the proximity for the time being until the next pass<BR>of the proximity thread.<BR><BR><BR><BR>This type of process works very naturally with the garbage collector.<BR>As the GC runs, it will see that there is a page with an unused record<BR>on it, and will do a "fake" update on all records from that page, then<BR>move the page to the free list.<BR><BR><BR><BR>The fake update can be performed by making it a low-priority transaction<BR>- if any other tx conflicts with it, then it aborts.<BR><BR><BR>-bryan<BR><BR>-----Original Message-----<BR>From: <A href="mailto:jdb...@li..."><FONT color=#0000ff><mailto:jdb...@li...></FONT></A><BR><A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A><BR>To: JDBM Developer listserv<BR>Sent: 3/6/2006 2:08 PM<BR>Subject: [Jdbm-developer] MVCC and very long transactions<BR><BR>OK - here's a description of how MVCC can be used to elegantly handle<BR>VLR. As an up-front warning, there are aspects of this that do *not*<BR>fit into the strategy outlined in the DBCache paper - but please keep an<BR>open mind, and I think you'll see the advantages.<BR><BR><BR>Let's start by saying that I'm using what Bryan has affectionately named<BR>MVCC(k). 
The idea is that versions of rows are marked with an<BR>"originating transaction identifier" (OID), and that value is used to<BR>determine when a version of a given row should be visible in the scope of<BR>any given transaction.<BR><BR>Transactions can have two states: Active and Committed.<BR><BR>Remember also that each transaction has an "expiry transaction<BR>identifier" (EID), which indicates the youngest active transaction at<BR>the time that the transaction in question was committed.<BR><BR>When a transaction is active, it has a null EID. When it commits, it<BR>gets its EID filled in.<BR><BR><BR>Support for VLR then becomes a matter of detecting whether a version of<BR>a row belongs to a committed transaction or not - if it doesn't, then the<BR>only transaction that can safely read that row is the OID transaction.<BR><BR><BR>As versions of rows fall out of the record cache, they are written to a<BR>clean page in the file, and the version table for that record is updated<BR>to reflect the OID and physical offset of the record. This would<BR>naturally happen in chunks (lots of changed records that are part of a<BR>VLR would be evicted at a time to free up memory). In any case, all<BR>pages affected by the VLR eviction (translation pages, version pages and<BR>data pages) are written to safe, then the records are safely removed<BR>from the cache.<BR><BR>Now, consider the following three scenarios:<BR><BR>1. The VLR itself needs to access the evicted record<BR>2. Another Tx needs to access the evicted record<BR>3. The VLR needs to be aborted<BR><BR><BR>1. The VLR itself needs to access the evicted record<BR><BR>The VLR makes a request for the record. The version manager sees that<BR>the transaction performing the fetch is the same as the OID of the<BR>evicted record, and requests the record from the physical record<BR>manager. 
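The OID/EID visibility rule just described can be condensed into a small predicate. This is a hedged sketch of one plausible reading of the bracketing rule (a version committed before the reader's tx began is visible; an uncommitted version is visible only to its own tx) - the real MVCC(k) selection rules may differ in detail, and all names here are illustrative:

```java
// Hypothetical sketch of the MVCC(k) visibility rule sketched above.
// A null EID is modeled here as 0, meaning the writing tx has not
// committed (or was aborted).
public class Mvcc {
    public static final long NO_EID = 0L; // stand-in for a null EID

    // May the transaction txId see a version stamped with oid/eid?
    public static boolean isVisible(long txId, long oid, long eid) {
        if (txId == oid) return true;    // a tx always sees its own writes
        if (eid == NO_EID) return false; // uncommitted: only the OID tx may read it
        return txId >= eid;              // committed before this tx started
    }

    public static void main(String[] args) {
        // Scenario 2 above: most recent version has no EID and TxID != OID,
        // so the version manager must continue backwards in the chain.
        System.out.println(Mvcc.isVisible(6L, 5L, Mvcc.NO_EID)); // prints false
    }
}
```

Scenario 3 (abort) falls out of the same predicate: the aborted version keeps a null EID forever, so `TxID != OID && EID == null` means no other tx ever sees it.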
The physical record manager determines that the record is on a<BR>page not in the page cache and loads the page into the cache, then<BR>returns the record data.<BR><BR>2. Another Tx needs to access the evicted record<BR><BR>The other Tx makes a request for the record. The version manager sees<BR>that the most recent version does not have an EID and that the Tx is not<BR>equal to the OID, so it continues backwards in the version chain until<BR>it finds a version that meets the MVCC(k) rules for record selection.<BR>The physical record manager either finds the page containing the record<BR>in cache already, or it loads a page from the DB.<BR><BR>3. The VLR needs to be aborted<BR><BR>The VLR Tx in the master transaction table is marked as having been<BR>deleted, and the EID is left null. Any fetch performed by any other tx<BR>will never see the aborted version of the record (TxID != OID && EID ==<BR>null). The aborted version of the record then becomes a candidate for<BR>garbage collection, using exactly the same procedure as for other record<BR>versions that are no longer needed.<BR><BR>Restart from failure is achieved by sweeping the master transaction<BR>table and marking any Active transaction as deleted before processing<BR>actually starts.<BR><BR><BR><BR><BR>Advantages of this approach:<BR><BR>1. Proximity - all records used by a VLR will tend to clump together,<BR>making future queries by the VLR more efficient<BR>2. Efficient updates - the writes for the VLR have the exact same<BR>(fast) performance characteristics as writes for non-VLR. There is no<BR>need to write to two or even three files.<BR><BR>3. Consistent semantics - the handling of VLR requires almost no<BR>special handling - it just works in the natural order of the system.<BR><BR>4. Fast restart - restart performance is identical to that of DBCache,<BR>even if there were VLR active during the system failure<BR>5. 
No extra files lying around - much less accounting work to keep<BR>track of all the files, etc...<BR>6. Efficient abort - abort is very fast (single update to master<BR>transaction table) - this is a secondary consideration, but it is a<BR>benefit<BR><BR><BR>Disadvantages of this approach:<BR><BR>1. Records are never updated in place. This is both an advantage<BR>(higher-performance updates, much easier garbage collection, pack and<BR>rebuild implementations, implicit proximity) and a disadvantage (re:<BR>Bryan's concern about losing proximity). I really believe that this<BR>system will far outweigh the potential downside on this item - and I<BR>have already come up with a design for a mechanism that would allow the<BR>application level to periodically enforce proximity requirements on a<BR>set of records by using a false, non-blocking update.<BR><BR>2. ????<BR><BR><BR><BR>- K<BR><BR><BR><BR>------------------------------------------------------- This SF.Net<BR>email is sponsored by xPML, a groundbreaking scripting language that<BR>extends applications into web and mobile media. 
Attend the live webcast<BR>and join the prime developer group breaking into this new coding<BR>territory!<BR><BR><A href="http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642"><FONT color=#0000ff>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642</FONT></A><BR>_______________________________________________ Jdbm-developer mailing<BR>list <A href="mailto:Jdb...@li..."><FONT color=#0000ff>Jdb...@li...</FONT></A><BR><A href="https://lists.sourceforge.net/lists/listinfo/jdbm-developer"><FONT color=#0000ff>https://lists.sourceforge.net/lists/listinfo/jdbm-developer</FONT></A><BR></FONT></FONT></TD></TR></TBODY></TABLE></DIV></FONT></BODY></HTML>