RecordFile class using java.nio.MappedByteBuf

bormi
2005-07-15
2013-06-03
  • bormi
    bormi
    2005-07-15

    Hello,

    I have a patch for jdbm.recman.RecordFile class that can optionally use a java.nio.MappedByteBuffer for all io operations. It maps the whole data file. This speeds up inserts and searches by a factor from 10 - 100. Mapping mode can be turned on and off even during transactions without loss of data. If there is interest to incorporate this into CVS I can send a patch.

    The old RecordFile class could still be the default (for jdk<1.4) and the nio RecordFile class could be used only if desired.

    Cheers,
    Ingo

     
    • Bryan Thompson
      Bryan Thompson
      2005-07-15

      Hello,

      I am definately interested in this -- I've been looking NIO issues myself -- and I would be quite happy if this feature made it into the base release.  I would be happy to try out the patch myself, but I am not one of the project maintainers.

      Two questions:

      - Which version of the code base did you develop this against?

      - What OS platforms have you tested this on?

      Thanks,

      -bryan

       
      • Alex Boisvert
        Alex Boisvert
        2005-07-15

        This is interesting stuff.  I've always wondered about the performance gains of using NIO with JDBM.   Have you done any measurements/comparisons?

        I see no problem adding this to CVS as the new stardard implementation of RecordFile.   We could keep the current implementation for backwards compatibility with JDK 1.3 (with some kind of "auto-detection" mechanism) and use the NIO implementation whenever it is available on the target platform.

        alex

         
        • Kevin Day
          Kevin Day
          2005-07-18

          For what it's worth, I've seen some pretty big improvements in performance when using NIO - as long as it is properly designed (I've seen 3-5 X performance improvements - not 10-100 X, but this may be application specific).

          The biggest thing I've found with using mapped buffers and NIO is to remember that you MUST not pull data into a temporary byte[] array to work with it.  If you do that, you've pretty much killed most of the benefit of mapped buffers...

          This may be a problem for the serialization scheme implemented in the current JDBM implementation.

          I could be quite wrong on this (if there is a huge performance gain, even with the current serialization scheme, then I may be doing something wrong in the tests I've run in my other projects)...

          Just my $0.02.  I'm extremely interested in mapped file support for JDBM, and would be happy to help out if needed.

          One other thing that I ran into:  The current implementation of the memory mapped file in JDK 1.5 does NOT properly close the mapped file.  The native handle is ONLY closed in the finalize() method of the buffer object.  This means that if you close the file, then re-open it a few seconds later, it will fail (unpredictably).

          There is a workaround that I can dig up if need be, but it is specific to Sun's implementation of the JDK...

          - K

           
    • bormi
      bormi
      2005-07-16

      I submitted a patch for this. I tested it only on Win2000 and WinXP. The performance gains I posted yesterday, where measured with the database file on a samba mounted drive, which was probably kind of stupid. I tried again on a local drive, there I see something in the order of 30%, but this depends heavily on the cache size that is used. If everything is in cache anyway, you don't need fast file access ;-).  My test case is also included, so you can trie yourself.

      The RecordFile class I modified is originally from the 0.12 release, but I checked against CVS and found no code differences. The patch works with the current CVS revision.

       
    • bormi
      bormi
      2005-07-18

      In principle one should not copy data from the mapped file into a temp byte[] buffer. But you cannot do this with jdbm, if you want to keep the transaction support. From what I saw transactions are based on buffering the data in a byte[] buffer (class BlockIo) before they are written to disk. One could think of some copy on write scheme for the BlockIo class, but this would be a bit more work than just changing the file access type in RecordFile.

      Performance wise I am not sure if the temp byte[] buffer is bad, because in the end you have to read the data from the mapped buffer at some place in your code, if you directly serialize from the mapped buffer or a temp buffer is probably not a big issue. What I see as the main advantage of the mapped buffer is, that it seems to be faster for search operations, where you need to read data from a lot of arbitrary positions in a single file.

       
      • Kevin Day
        Kevin Day
        2005-07-18

        Sounds good - all my theory in the world isn't worth a thing compared to a single test result showing a benefit.

        I'm still not sure that I understand why mapped IO would matter during the search operations - at least in the case of a BTree - if the tree is sized so each leaf is sized for the disk page size, then the entire page will be read in a single operation anyway...  The mapped IO would be doing the same thing...

        Are you testing with a BTree or an HTree?  Have you adjusted the btree page size to be approximately to the disk page size?

        For raw reads in the record manager (without using a BTree or HTree), I could see where mapped IO could make a big difference.

        I actually experimented in the early days with using the trees for indexes only, and storing my data as inidividual records in the record manager, but the performance was terrible.  Maybe with the mapped IO, the performance hit isn't so bad?

        Just guessing...  There certainly isn't anything wrong with using NIO for file access (heck, that's exactly what the RandomAccessFile class is doing) - I just want to make sure we understand the problem that is being solved...

        - K

         
        • Bryan Thompson
          Bryan Thompson
          2005-07-19

          I've seen you mention a few times that using raw recids as values to lookup data elsewhere results in poor performance.  This is somewhat of a concern to us since we have multiple indices over the same data.  Perhaps this can be restructured so that the data is distributed across multiple indices, but that is a very different approach.

          I was wonder if anyone could elaborate on what the recman is doing when it maps a recid to an object and why this operation might be considered "slow", or whether it is just doing a BTree lookup followed by a record lookup that is slow.

          One possible reason that I am extracting from this thread by reading into what is said is that the BTree page is de-serialized as a whole, so if there a N application objects on that page, then you have those N objects in a single cache entry (since they are part of the object graph for the BTree page).

          If this is the case, then perhaps the performance of the record manager could be improved by introducing a similar kind of blocked container mechanism.

          -bryan

           
          • Bryan Thompson
            Bryan Thompson
            2005-07-19

            For example, a page-level cache.

             
          • Kevin Day
            Kevin Day
            2005-07-19

            The record manager isn't doing anything wrong when it maps a recid to an object - you have to keep in mind that the absolute slowest thing you can do on a computer is read from disk (with the exception of reading a file over a dial-up connection or something). 

            When the record manager reads a record, it pulls just that record's contents from the disk.  If you turn right around and read the next record from the disk, it has to initiate a completely new read.  Back in the ancient days of database design, this problem was fixed with the BTree.  BTree is insanely elegant - there may be slightly better approaches today - but the BTree has absolutely stood the test of time.

            Another potential issue (though probably of smaller magnitude) is file fragmentation.  When records are added to the recordmanager, they are added where there is room in the file, which can lead to records that are typically read in sequence being stored in different areas of the disk.  JDBM really needs a "pack & rebuild" capability that will remove empty space from the data file, etc... - but the speed and filesize penalties of the current implementation just haven't been big enough for me to want to take a crack at it...

            In terms of performance:

            At first, I thought that the slow lookups based on raw recids was going to be a major problem for us - but then I realized that I just had to re-think the problem.

            The reason you use a BTree in the first place is to optimize disk reads (that's why you want your BTreePage size to be the same as a disk page).

            The solution was pretty simple:  Instead of storing data with individual recids, I just store the data in a BTree, with the recid as the key.  So, I have a set of what I call PrimaryKey BTrees that map a recid to a piece of data.  Then I have a set of Index BTrees that contain no values - the keys contain all the information needed to determine which record id should be pulled from the primary index.

            This has been remarkably effective for us (you do have to be sure to put in business logic rules to ensure that your indexes are always intact, but this is actually pretty easy to do).  The primary index takes care of disk caching, optimizing disk writes, etc... and the index trees are used to locate things quickly without having to deserialize objects all over the place.

            Hope this helps!

            - K

             
            • Bryan Thompson
              Bryan Thompson
              2005-07-19

              Kevin,

              I am still slightly confused by your approach.  How are you creating the recid's that you use as the keys in your "primary" btrees to locate your data?

              - Do you just insert empty data into the record manager in order to get a recid to serve as the key in your btree?

              - Do you have the data duplicated so that it is stored in both the record manager and the primary key index?

              -  Are you simply using your own factory for the recid's that are used as the keys in your primary index?  E.g., storing the next recid to hand out in the root page of the btree that is your primary index?

              Given your description it sounds like you simple do not use the RecordManager interface directly.  This means that the cache is just caching pages in your primary and 2nd-ary indices, right?

              Thanks,

              -bryan

               
              • Kevin Day
                Kevin Day
                2005-07-19

                Brian-

                You've got it exactly right.  I generate the "next" recid by looking up the last recid in the primary index and adding one to it.  I use the BTree for all record management, and leave the record manager itself completely alone (actually, I do store some items in the record manager directly, but they are not items that have lots of read/write operations performed against them).

                Just to clarify:  the recid I'm talking about is NOT the recid returned by new objects added to the record manager.  It is a sequential counter managed by the primary index.

                Basically, I recognized that the BTree is the optimal storage mechanism for my application, and used it as such.

                Data is stored only in the values of the primary index BTree.  Indirectly, of course, this means that it is being saved in the record manager.

                Per your final question, the answer is "yes" - the cache is caching pages, and the pages are holding references to my actual business objects.

                I guess the point is that there would be no point in modifying the record manager to try to optimize all this - we already have a very efficient database data structure that we can use.

                It would be extremely nice if there were a way to somehow tie the lookup and primary indexes together to make updates automatically atomic, but it's really not that difficult to do by hand - it just requires some basic accounting skills...

                Basically, what I do is identify a given business object class that "owns" a given set of indexes.  It is solely responsible for updates to those indexes, and any other classes that want to update those indexes must do it through the owner...

                I keep staring at my code thinking that there ought to be a way to abstract some functionality and make this into a framework, but I haven't gotten my head around a solution just yet.  Maybe some sort of event driven mechanism...

                Cheers!

                - K

                 
    • bormi
      bormi
      2005-07-18

      I have tried both, BTree and HTree, the results were similar. BTW, I am not sure at all we understand the problem. In order to get a better feeling for the whole thing, you could just take my test case and play a bit with both methods. If you turn off the mapped buffer the RecordFile class should behave exactly like the old RandomAccesFile based aproach and gives identical performance although it uses channels instead of the Random Access file (which probably boils down to the same system calls anyway).

      The test does the following. It creates a tree with 50000 entries for a class containing some members (mostly primitives). Then (after restart) it reads 50000 entries back in random order(via get() or find(), no Iterator methods). This means reads spread over the whole file. If I set cache size to just 1 page, it takes 42s with RAF and 30s with the mapped buffer. If the cache is 100 pages, it's 24s with RAF and 17s with a mapped buffer (using the HTree and WinXP jdk1.5). I have no explanation, why the results are like this, but this is what I get. I did not dig into this very deeply, I did the whole thing more or less out of curiosity. Maybe with a different setup your results vary.

      Anyway, I am not sure, if a mapped buffer is the default implementation of choice, because the buffer seems to add to the application memory like reported by Win Task Manager.

       
      • Kevin Day
        Kevin Day
        2005-07-19

        I can help with a few of the questions:

        When you set the cache size to 1, only a single page of the tree can be loaded in memory at a time.  As you bridge from one page to another, the application is having to re-read the "next higher" page from disk, before descending  again to the next page.  That is most likely why your time is much larger for a cache of 1 than for a cache of 100.

        Remember:  The cache size has nothing to do with the number of records in the BTree - or the number of records in a page.  The page is loaded, in it's entirety, as a single operation.  So, if your page contains 25 entries, then all 25 are loaded into memory at the same time, and consume exactly *1* space in the MRU cache.

        In terms of optimization, instead of fiddling with the MRU cache size (just set it to 100 and forget about it), play with the number of objects stored in each page.  This is achieved using the pageSize parameter of the BTree.createInstance method (the 5th parameter).

        Picking a value for that can be a tad tricky because you need to be able to determine how big a given key/value pair will be - if you are just using simple primitives, then this is easy enough to calculate in advance, but if you are using Java serialization, then the size could be quite variable.  Anyway, I have a set of BTree indexes that I maintain in my application that consist of multi-value longs as keys, and null as the values.

        So a given index may consist of 3 long values - for a total of 24 bytes.  The values are all null, so that consumes 0 bytes.  Then there's the overhead of the BTreePage itself - I can't remember what that is exactly - but it's probably something like 100 bytes.

        Page size for a typical PC with NTFS is ~4K, so you should try to make your BTreePage less than that in size, but not too small.

        I pick a value of 64 for my page size, but I could probably get away with 128, and maybe even 256 (I do a rudimentary form of key compression, so my longs really don't take up the full 8 bytes apiece).

        I haven't done a lot of parameteric studies because we aren't at a point in our project where that level of performance optimization is necessary, but at some point we will.

        I find it extremely intersting that using NIO memory mapped files would increase the memory usage of your application...  RAF is using something similar behind the scenes, so there really shouldn't be an issue.

        One thing that *could* cause it:  Are you opening a map once, then never creating another?  Or are you closing and opening memory mapped files throughout the run?

        If you are opening and closing throughout the run, then you are most likely running into the problem I described in my earlier post where the system resources for the map are not being freed until the GC hits...

        - K

         
        • Bryan Thompson
          Bryan Thompson
          2005-07-19

          Kevin,

          Ok.  I think that I am getting closer.  However, I still do not understand how you get away with using long[3] as your key and not having any "value" in your secondary indices.

          This sounds a little like a triple store, where the index would be on s:p:o (subject, predicate, object), and where the subject, predicate and object were each refied as objects in your primary index.  I would be very curious if this is what you are actually doing - this sort of long[3] index could then be used to lookup any statement in the store.

          However I still don't see how you are using an index to impose a total ordering over something like a SSN, a date field, a person's last name, etc.  It seems like these sorts of indices need to have keys that are comparable in terms of the ordering on the value space, e.g., "Bar" comes before "Foo", in which case the value is naturally the recid of the data in your primary key index.

          So I think that I am still confused, or maybe I just don't understand your application.

          Thanks,

          -bryan

           
          • Kevin Day
            Kevin Day
            2005-07-21

            Woops - I had a reply to this, but I must have closed the browser without clicking Post.

            I'm not famliar with the s:p:o terminology, so my answer here may not be exactly what you are looking for - please expand on that if I missed the point.

            For my particular application, the majority of the indexes that I use are for relational mapping of records stored in three or four primary indexes.  For these kinds of secondary indexes, I have a lot of one-to-many relationships that I'm tracking, so using a 2 or 3 level key is appropriate.

            For example, if I have a parent-child relationship, between records in a single primary index, I would have:

            The primary index defined as:

            key:  [recid]
            value:  [object]

            A secondary index that uses:

            key:  [parent recid][child recid]
            value: null

            and another secondary that uses:

            key:  [child recid][parent recid]
            value: null

            Each of these can be used to quickly locate all of the recids of the children of a parent (or vice versa), then those can be used to look up the actual child (or parent) object(s) in the primary index.

            I do have a couple of indexes that are used to track non-relational lookups - and these are of the form:

            key: [sortable value][recid]
            value: null

            Where sortable value may be a string, a set of bytes, etc...  In this case, the value can be looked up quickly, and the recid used to do the secondary lookup into the primary table.

            Technically, I could have put the recid into the value (instead of having it at the end of the key), but then you run into problems with the unique key requirement of this particular BTree implementation.

            The only real downside that I've encountered with the above approach is that you have to be careful with the book keeping to make sure that the secondary indexes stay up to date as changes are made to the primaries - but the [object] stored in the primary has all of the data required to recreate the secondaries, so the data is all there - it is just a matter of remembering to do the updates.

            One thing I've done to make this a bit easier to manage for myself is a self-imposed rule that you are not allowed to modify any member value of a stored object that is required to access a secondary key.  This means that the only secondary key updates I have to worry about are ADD and DELETE.

            Hope this helps - please let me know if you have any thoughts - this is just something I came up with to meet the needs of our project - maybe there is a better way to approach it.

            Cheers,

            - Kevin

             
            • Bryan Thompson
              Bryan Thompson
              2005-07-21

              Kevin,

              Ok.  I think that I understand your approach now.

              The s:p:o "terminology" concerns RDF triple stores for the semantic web.  RDF breaks everything down into a subject (s), a predicate (p), and an object (o).  The s:p:o relation is called a statement.  Statements must be unique for RDF.

              Indexing into a triple store is often modeled using a set of btrees:

              s:p:o
              p:o:s
              o:p:s
              p:s:o
              o:s:p
              s:o:p

              These indices are designed to provide the optimal index for any query.  So, if you want to find all records with an unknown subject but a known predicate and object, then you would use the p:o:s index.

              The subject, predicate and object values are restricted to URIs, Literals (which can be typed), and unnamed nodes (BNodes).  If these are normalized out of the statement relation, then you need a non-relational index to locate a specific URI, literals, etc.

              ----

              Given that your approach releases you from the choice of a jdbm-imposed "long" factory for recids, have you considered using methods that would let you bias the primary object index so that objects that were frequently co-materialized (or which the application expected to be frequently co-materalized) had nearby values in the recid key space, and hence tended to be stored on the same page(s) of the btree?

              In the framework I am working with there is a notion that the application can provide a hint to the persistence layer that it tends to be co-materialized with another object, which is known as its "container".  Containers are flattened, so that if you are in a sub-sub-container, it is the same as if you were in the parent container.

              If I were to use the approach that you are describing, then I could simply manipulate the recid keys assigned when the objects were created so as to bias the contained objects to be near neighbors in the primary key space.

              For example, assuming that we stick with long keys, let's just increment the next_recid by 10,000 each time a persistent object is created that is not in any container, so:

              o1: recid=0
              o2: recid=10000
              o3: recid=20000
              o4: recid=30000
              etc.

              If I then create objects that tend to be co-materialized with o1, then I can assign them the next recid within that container, e.g., 1, 2, 3, 4, 5, ....

              Given the way that application objects are blocked on a page of a btree, this would provide a caching mechanism that is strongly biased by the application's requirements for co-materialization of objects.

              Thanks,

              -bryan

               
              • Kevin Day
                Kevin Day
                2005-07-21

                OK - I understand how you are using it now.

                It absolutely makes sense to clump recids in the primary BTree to minimize page reads.  The fact that your framework provides these kinds of hints is great.

                In my application, sequential ordering does clump records that are commonly used - so I get this for free (I guess my "recid clumping" algorithm is the simplest it could be :-)  ).

                - K

                 
                • Bryan Thompson
                  Bryan Thompson
                  2005-07-22

                  Kevin,

                  I was wondering if you had some means to cluster access in your application.  Without that I do not think that this approach is going to win.  There are two places in which it could be expected to outperform the recman: (1) clustered access to application objects; and (2) the recman cache for a given size of the cache.  However it simply has to require more record retrievals to locate something using the BTree that it does to retrieve a given record using the record manager - simply because the BTree is using the record manager for its own pages, so you need to walk the btree which means multiple trips through the record manager.

                  Also, I expect that you could do this with Unicode keys, e.g., URIs, if you have support for compressed keys in the primary key index (this is not something that I have tried to get working with the current cvs source).

                  -bryan

                   
                  • Kevin Day
                    Kevin Day
                    2005-07-22

                    Bryan-

                    Yes - in my application, the data naturally clusters.

                    In my scheme, there's no rule that says that you have to use a long as the key for the primary index BTree - you could easily use a string, or some other value that forces the primary BTree to cluster...  it just means that your secondary indexes will consume more disk space.

                    If you can come up with an ordered hash for your data that results in clustering, then you could use that instead of the full string value, as long as you check for hash collisions on insert into the primary key.

                    Regarding your comment on key compression:

                    The things I've seen on this forum regarding key compression have seemed to be primarily targeted at string keys.  I haven't taken a look at the patch that was submitted recently, but it is definitely something I'm interested in.  Maybe the latest implementations actually work on arbitrary byte streams.

                    I actually do a form of key compression in my long key values (I reserver the first 2 bits of the 64 bits to specify whether the number of bytes in use by the value).  This at least saves me the space of all those leading 0's until I actually need them.

                    I'd prefer to use key compression that takes all of the keys on a page into consideration - it would be even more efficient than what I'm doing.

                    - K

                     
    • bormi
      bormi
      2005-07-19

      Hi,

      I think the MRU cache size is the most important parameter in terms of performance; that's why I chose a cache size of 1, which means that the data need to be read from disk for every access, and this is what should reflect the raw IO performance, right?

      I am opening the mapped buffer only once, and only remap it, if the file needs to be resized, because of inserts that go to addresses that are outside the current buffer boundaries. In some cases (a lot of inserts), this can lead to frequent remapping of the whole buffer (sometimes 20 remappings/s), but I never ran into the problems you observed. The memory adder for the mapped buffer is there as soon as you map the buffer once and access the whole file - every part of the file, that was read seems to add up to the total app memory. Once you have accessed the whole file memory usage is stable. Seems there is some caching going on behind the scenes. If you set mappedBuffer=null and call System.gc(), memory is immediatly released.

      What do you suggest in terms of optimization of opjects per page? Page size seems to default to  8K in BlockIo, should I select the number of objects per page to just fill the page size, but under no circumstance exceed the page size?

       
    • Kevin Day
      Kevin Day
      2005-08-11

      Just responding to the following:

      "In some cases (a lot of inserts), this can lead to frequent remapping of the whole buffer (sometimes 20 remappings/s), but I never ran into the problems you observed"

      If you open and close the recman in quick succession, it will eventually fail due to file locking (the old map file won't release it's lock, etc...).  I thought I was OK with this in my system until I really started running lots of unit tests, then every now and again the thing would blow.  This was not related to jdbm - it was an NIO project I did awhile back.

      Here are some entries in Sun's bug parade regarding this issue:

      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4715154

      If it does become an issue, I have some code laying around that calls into Sun's classes directly to force the GC (and subsequent unmap) of the file (don't have it handy right this instant, but I can get it).

      - K