making a B-tree persistent

  • liseg

    liseg - 2005-02-22


    I'm trying to use a B-tree to store an index of words on web-pages - hence the words are keys and a list of urls is the value. But I seems that the whole data structure stays in internal memory - and hence it doesn't make much sense to use a B-tree - the reason I wanted to use a B-tree was because the index won't fit in internal memory if I use a 'normal' data structure. Windows runs out of virtual memory when I try to run my program. Anything specific I need to do to write to disk?
    PS. I do recman.commit() quite often.

    • Alex Boisvert

      Alex Boisvert - 2005-02-22

      Hi Lise,

      If you program runs out of memory, it is probably because your program keeps one or more references to your objects.

      JDBM will not keep references to your objects beyond the commit() except if you use a cache.   Also, your objects should not normally reference each other as this would also  provoke memory retension.

      Let me know if you need further help.


    • liseg

      liseg - 2005-02-23

      Hi Alex

      Thanks a lot - that helped! I have another question though: What is the connection between the page-size, which you set when you construct a B-tree and the block-size, which is used when writing to the RandomAccessFile? I would think that one page would go on one block, but is that correct? Or does it depend on the size of the data which you put in your pages? But what if the data is too big, and a page can't fit in one block? Or if each data item doesn't take up very much space - will there then be a lot of wasted space in each block?

      Also, as I'm using jdbm in my thesis about search algorithms, I need to keep a close look at the number of I/O's I make at all times. Is this easy with jdbm? At the moment I have trouble controlling this.

      I hope that you will be able to help,
      Lise :-)

      • Alex Boisvert

        Alex Boisvert - 2005-02-25


        There's no real connection between b-tree page size and physical block size, although it's benefical to have b-tree pages closely fit the physical page size to optimize I/O efficiency.

        The b-tree page size is the number of b-tree key/value pairs to be kept on the b-page; it is the "fan-out" of the b-tree.  Again, you want the size of your keys and values to closely fit the physical page size but there's no easy way to figure this out in (plain) Java, since object serialization may yield different key/value sizes.  If you're serious about optimization, you'll want to use custom serializers that use fixed lengths to map your keys and values into fixed physical page sizes.

        My typical approach is to figure my (hopefully fixed) key/value size and then compute the ideal b-tree page size based on my physical block size.   The b-tree size is usually picked as a power of two such that dichotomic search is most efficient as well.

        As for measuring I/O's, I think it should be quite straightforward to instrument the RecordFile and TransactionManager to measure actualy I/Os.   Your operating system might also have built-in monitoring tools to do this (e.g. iostat on Linux/Solaris)



Log in to post a comment.