
Strategies for paging

Anonymous
2010-07-03
2013-06-03

  • Anonymous
    2010-07-03

    Hi. Thanks for your previous help with implementing MultiMaps.

    My latest question is regarding the best way to implement paging. Imagine I only want to return the first n results, or results 2400->2412. What's the most efficient way of doing that?

    Obviously, the browser loads Tuples from disk on each getNext(), so I don't really want to go through every result until I hit the 'page'.

    Is there something kooky I could do, maybe constructing a BTree with its keys as the page index? That would require some way of creating values lazily I think.

    Any ideas, I'd love to know…

  • Kevin Day
    2010-07-07

    I've thought about this a bit…

    One option is to keep track of the last record of the current page, then use that to start your browser on the next page request.
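    Sketching that first option as a hypothetical helper (not jdbm code - java.util.TreeMap's tailMap stands in here for jdbm's BTree.browse(key), which positions a TupleBrowser at a given key):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class KeysetPaging {
    // Return up to pageSize entries starting strictly after lastKey;
    // pass null for the first page. TreeMap.tailMap stands in for
    // jdbm's BTree.browse(key) + TupleBrowser.getNext().
    static List<Map.Entry<Integer, String>> nextPage(
            TreeMap<Integer, String> tree, Integer lastKey, int pageSize) {
        SortedMap<Integer, String> from =
                (lastKey == null) ? tree : tree.tailMap(lastKey + 1);
        List<Map.Entry<Integer, String>> page = new ArrayList<>();
        for (Map.Entry<Integer, String> e : from.entrySet()) {
            if (page.size() == pageSize) break;
            page.add(e);
        }
        return page;
    }
}
```

    A "next page" request always resumes exactly where the previous one left off, regardless of how many records have been inserted or deleted in between.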

    A second option is to enhance the BTree a bit.  It should be relatively efficient to move to a numbered index in a BTree by starting at the root, and using the number of leaves in each page.  I haven't even begun to look into actually doing this, but it may be an interesting exercise ;-)

    If you do go this route, I recommend that you do the work in the code base at http://code.google.com/p/jdbm2/ - my suspicion is that is where future development activity for jdbm will be focused.


  • Anonymous
    2010-07-09

    Thanks. I'll mull this over myself.

    I didn't even realise the other project existed!


  • Anonymous
    2010-09-12

    Hi, I've only just begun to look at this again. Thanks for your suggestions. Regarding the two options:

    The paging I am talking about is of the type "records 12-24" or similar. So I don't think the first option can help - if I understand what you mean by record (Tuple?) - because what happens if many records are added before the last record of the current page?

    I'm interested by the second option. Can you go into a little more detail? What type of index do you mean, do you mean one maintained within the BPages objects or one as part of the key for the BPages, as stored in the RecordManager?

  • Kevin Day
    2010-09-12

    By index I mean a record number - i.e. the 12th record, the 27th record.  With the BTree as it is now, you could certainly just open a tuple iterator, and walk the leaf nodes in order to obtain a record at a specific index - this is akin to moving to a specific index in a linked list.  If your paging requirements only involve accessing the first several thousand records in a BTree, then this may be fine for you (open a tuple iterator, skip over X records, read the next Y records).  If you are needing to do this on many, many records, then linked list performance stinks, and it would be better to be able to quickly find the first node of your range of interest using the tree itself.
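    The skip-then-read approach could look like this against any ordered iterator (a sketch, not jdbm code - with jdbm the iterator role would be played by a TupleBrowser from BTree.browse(), calling getNext() once per skipped or returned tuple):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class SkipScanPaging {
    // Linked-list style paging: discard the first 'offset' entries from
    // the iterator, then collect the next 'limit' of them.
    static <K, V> List<Map.Entry<K, V>> page(
            Iterator<Map.Entry<K, V>> it, int offset, int limit) {
        for (int i = 0; i < offset && it.hasNext(); i++) it.next();  // skip X
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (int i = 0; i < limit && it.hasNext(); i++) out.add(it.next());  // read Y
        return out;
    }
}
```

    For "records 12-24" that is page(it, 11, 13) - fine for small offsets, but the cost grows linearly with the offset, which is exactly the linked-list behaviour described above.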

    If each node in each page maintains a count of the total number of records beneath that node, then moving to a specific record number in the tree becomes a very efficient operation.  This is certainly feasible, but it would require a little bit of careful coding to ensure that the counts in each node are always correct - and it could theoretically cause insert/delete operations on the tree to be less efficient (they require all pages above the changed page to be updated).  I'm not sure how much of a performance hit this could be - my gut instinct says 'not much' - but it would have to be tested.
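    To illustrate the counted-node idea (the node layout below is invented for this sketch, it is not jdbm's BPage), a seek by record number then touches only one node per level:

```java
public class CountedTree {
    // Toy counted tree: internal nodes keep, per child, the number of
    // records beneath that child; leaves hold the records themselves.
    static class Node {
        Node[] children;  // null for a leaf
        int[] counts;     // counts[i] = records beneath children[i]
        int[] records;    // leaf payload
    }

    static Node leaf(int... records) {
        Node n = new Node();
        n.records = records;
        return n;
    }

    static Node internal(Node... children) {
        Node n = new Node();
        n.children = children;
        n.counts = new int[children.length];
        for (int i = 0; i < children.length; i++) n.counts[i] = size(children[i]);
        return n;
    }

    static int size(Node n) {
        if (n.children == null) return n.records.length;
        int total = 0;
        for (int c : n.counts) total += c;
        return total;
    }

    // Move to the record at 0-based position 'index' using the counts
    // alone: O(tree depth) instead of O(index).
    static int seek(Node node, int index) {
        while (node.children != null) {
            int i = 0;
            while (index >= node.counts[i]) {  // step over whole subtrees
                index -= node.counts[i];
                i++;
            }
            node = node.children[i];
        }
        return node.records[index];
    }
}
```

    The seek costs O(depth) regardless of the index, at the price of keeping every ancestor's count current on insert and delete.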

    Another intermediate solution would be to come up with a slightly more efficient mechanism for traversing the leaf nodes.  Right now, the iterator would be notified of every single node.  A more efficient implementation (for your specific case) would traverse the leaf BPages (so instead of an iterator callback for every single node, the seek(index) call would do the page traversal internally).  I doubt that this would be sufficiently different from the current strategy (the performance bottleneck here isn't in the iterator callback - it's in having to read every single leaf page from disk until you find the one you want).


  • Anonymous
    2010-09-13

    Yeah, my first thought on this was a lazier approach to TupleBrowsing. But of course, given we don't know how many records are in each page until they are loaded, we have to do pretty much the same number of loads as before anyway.

    In terms of data loads we are talking in the tens of thousands, and I would think with fairly large-ish objects (in the value, not the key).

    By "each node in each page maintains a count of the total number of records beneath that node" I take that to mean a companion array to _children would hold this: where a recid is in _children, the count of records in that child and its descendants appears in (say) _childrenRecordCounts. This is rather than having the count as an int field in the child BPage itself, which would imply loading pages (still more efficient than currently, but with more loading; the gain would be greater the deeper the BTree is).

    Updating the record count upon update could be done utilising the xResult objects - the fact that the update bubbles down means that we can update metadata on the subsequent bubble-up.
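    A sketch of that bubble-up, with a toy page layout standing in for BPage (splits and high-key maintenance omitted, and the key is assumed to route into an existing child's range): the descent remembers which child was taken at each level, and the counts are incremented on the way back up, much as the xResult objects would allow.

```java
import java.util.ArrayList;
import java.util.List;

public class CountMaintenance {
    static class Page {
        Page[] children;   // null for a leaf page
        int[] counts;      // counts[i] = records beneath children[i]
        int[] highKeys;    // highKeys[i] = largest key routed to children[i]
        List<Integer> records = new ArrayList<>();  // leaf payload
    }

    // Insert 'key', then bump the per-child record count in every
    // ancestor on the way back up.
    static void insert(Page root, int key) {
        List<Page> path = new ArrayList<>();
        List<Integer> slots = new ArrayList<>();
        Page p = root;
        while (p.children != null) {
            int i = 0;  // route to the first child whose high key admits the key
            while (i < p.highKeys.length - 1 && key > p.highKeys[i]) i++;
            path.add(p);
            slots.add(i);
            p = p.children[i];
        }
        p.records.add(key);  // simplistic leaf insert, no split handling
        for (int d = path.size() - 1; d >= 0; d--) {
            path.get(d).counts[slots.get(d)]++;  // the bubble-up
        }
    }

    static int totalCount(Page root) {
        int total = 0;
        for (int c : root.counts) total += c;
        return total;
    }
}
```

    The extra cost per insert is one counter increment per level of the tree, which is why the performance hit should be modest for shallow trees.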

    Apologies, I tend to find myself stating what should be done declaratively in these kinds of discussions. The above is an invitation to highlight my misunderstanding!

  • Kevin Day
    2010-09-13

    Your understanding is consistent with my thoughts on the matter.  The reason that bubbling the update up the tree is a performance hit is that you'll wind up updating multiple BPages for each update - the current implementation only updates the leaf page (assuming, of course, that rebalancing doesn't have to occur).  I suspect the performance hit won't be that big of a deal, but it should be tested on large trees to make sure (or have the behavior be configurable).

    Note that for a lot of use-cases, paging over 10K entries using the lazy approach may be performant enough.  In a web app, for example, it is crazy unlikely that users would ever actually page out to that extent.  Even if they do, the tuple iterator just walks the leaf nodes (no navigation of the tree), so the number of pages that have to be read from disk is really just X/pagesize.

    I strongly recommend that you test and make sure you have a problem before you try fixing it :-)