On Thu, 19 Sep 2002, Neal Richter wrote:
> We would also be able to avoid any dynamic resizing of the LOCATION
> Value-field in BDB by making it a fixed width.
> Ex: Let's say this LOCATION-value is 'Full' @ 32 characters. Further
> locations of 'affect' in doc 400 get new rows
As I said before, this is probably a good idea. But it's going to take
some work to get the right balance. Since the keys need to be unique, you
have to introduce at least some "padding" by adding a row field to the key:
  word // doc id // row
OK, now you have a fixed-length record for:
(location // field // anchor) (location // field // anchor) ...
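To make the layout concrete, here is a minimal sketch of that key/record
packing. The field widths (a 1-byte row counter, 4 slots per record, and
the struct formats) are illustrative assumptions, not a proposal for the
actual on-disk format:

```python
import struct

LOCS_PER_ROW = 4  # assumed fixed number of (location, field, anchor) slots

def make_key(word: bytes, doc_id: int, row: int) -> bytes:
    # key = word // doc id // row; the row byte keeps keys unique
    # when one document needs more than one record for a word
    return word + b"\x00" + struct.pack(">IB", doc_id, row)

def make_record(entries):
    # fixed-length value: LOCS_PER_ROW slots of (location, field, anchor),
    # unused slots zero-padded so every record has the same width
    padded = list(entries) + [(0, 0, 0)] * (LOCS_PER_ROW - len(entries))
    return b"".join(struct.pack(">IHH", loc, fld, anc)
                    for loc, fld, anc in padded)

# e.g. two occurrences of "affect" in doc 400, row 0 (values hypothetical)
key = make_key(b"affect", 400, 0)
rec = make_record([(17, 1, 0), (95, 1, 1)])
```

With these assumed widths each slot is 8 bytes, so a "full" record is
exactly 32 bytes, matching the 32-character example above.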
So the trick will be to find:
a) A short field length for "row" to minimize overhead.
b) The "right" fixed-length record for a "row."
(a) is partially offset by the reduction in the BDB control structures if
you cut down on the number of keys. But you don't want to make it too
small, since you don't know how often a word can recur in a long document.
Fortunately, we can make educated guesses:
e.g. I just counted word frequencies in Project Gutenberg's text versions:
  "Adventures of Huckleberry Finn" (563KB), most frequent: "and", 6138 times
  "20,000 Leagues Under the Sea" (567KB), most frequent: "the", 7469 times
  King James Bible (Old & New Test.) (4240KB), most frequent:
    "the" 62162 times
    "and" 38611 times
    "of"  34506 times
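Counts like those above can be reproduced with a short script. This is just
an illustrative word counter (the tokenization rule and the file path are
assumptions, so exact numbers may differ slightly from the figures quoted):

```python
import re
from collections import Counter

def top_words(text: str, n: int = 3):
    # lowercase the text, pull out runs of letters/apostrophes, and count
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# usage (file name is hypothetical):
# with open("huckleberry_finn.txt") as f:
#     print(top_words(f.read()))
```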
(b) is trickier. For short rows, you'll waste space, since you've reserved
a record you aren't fully using. But the shorter the record is, the more
rows (and hence keys) you'll need. So the question is how often we'd waste
space (and how much) on short rows, versus how much we (might) regain from
BDB control structures. Experimentation will be needed.
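Before running real experiments, the trade-off can be eyeballed with a
back-of-the-envelope model. The per-key overhead and slot size below are
assumed numbers, purely for illustration:

```python
KEY_OVERHEAD = 32   # assumed per-key BDB control-structure cost, in bytes
SLOT_SIZE = 8       # assumed size of one (location, field, anchor) slot

def bytes_used(n_locations: int, slots_per_row: int) -> int:
    # rows needed to hold n_locations occurrences, each row padded
    # out to a full fixed-length record plus its key overhead
    rows = -(-n_locations // slots_per_row)  # ceiling division
    return rows * (KEY_OVERHEAD + slots_per_row * SLOT_SIZE)

# compare candidate record sizes for a frequent word (6138 occurrences,
# like "and" in Huckleberry Finn) versus a rare word (2 occurrences)
for slots in (4, 16, 64):
    print(slots, bytes_used(6138, slots), bytes_used(2, slots))
```

Under these assumptions, bigger records shrink the total for the frequent
word (fewer keys) but inflate the cost of rare words (more padding), which
is exactly the balance point experimentation would have to find.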
Neal, do you think we can actually save bits across just the key/record
pair? It seems like you'll need to add bits for the row location.