From: Geoff H. <ghu...@ws...> - 2002-09-20 01:41:08
On Thu, 19 Sep 2002, Neal Richter wrote:

> We would also be able to avoid any dynamic resizing of the LOCATION
> Value-field in BDB by making it a fixed width.
>
> Ex: Let's say this LOCATION-value is 'Full' @ 32 characters. Further
> locations of 'affect' in doc 400 get new rows

As I said before, this is probably a good idea. But it's going to take
some work to get the right balance. Since the keys need to be unique, you
have to introduce at least some "padding" by putting a row field in the
key:

    word // doc id // row

OK, now you have a fixed-length record for:

    (location // field // anchor) (location // field // anchor) ...

So the trick will be to find:

a) A short field length for "row" to minimize overhead.
b) The "right" fixed-length record for a "row."

(a) is partially offset by the reduction in the BDB control structures if
you cut down on the number of keys. But you don't want to make the row
field too small, since you don't know how many words will be in long
documents. Fortunately, we can make guesses. E.g., I just did some
counting on Project Gutenberg's text versions of:

    "Adventures of Huckleberry Finn" (563KB)
        most frequent: "and"  6138 times
    "20,000 Leagues Under the Sea" (567KB)
        most frequent: "the"  7469 times
    King James Bible (Old & New Test.) (4240KB)
        most frequent: "the" 62162 times
                       "and" 38611 times
                       "of"  34506 times

(b) is trickier. For short rows, you'll waste space, since you've reserved
record space you aren't using. But the shorter the record is, the fewer
keys you can condense. So the question is how often we'd waste space (and
how much) on short rows, versus how much we (might) regain from BDB
control structures. Experimentation will be needed.

Neal, do you think we can actually save bits across just the key/record
pair? It seems like you'll need to add bits for the row location.

-Geoff
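
To make the (a)/(b) trade-off above concrete, here is a rough Python sketch of the "word // doc id // row" key and the row arithmetic. Everything specific in it is an assumption for illustration, not the actual ht://Dig or BDB layout: the 4-byte doc id, the NUL separator after the word, the 2-byte row field, the 8 locations per fixed-length record, and all of the function names.

```python
import struct

# Assumptions for illustration only (not the real on-disk format):
LOCATIONS_PER_ROW = 8   # e.g. a 32-byte record holding 4-byte entries
ROW_FIELD_BYTES = 2     # width of the "row" field in the key


def make_key(word: str, doc_id: int, row: int) -> bytes:
    """Build a unique key laid out as word // doc id // row."""
    # The NUL byte separates the variable-length word from the fixed-width
    # fields; big-endian integers compare in numeric order as bytes.
    return (word.encode("utf-8") + b"\x00"
            + struct.pack(">I", doc_id)
            + row.to_bytes(ROW_FIELD_BYTES, "big"))


def rows_needed(occurrences: int, per_row: int = LOCATIONS_PER_ROW) -> int:
    """Number of fixed-length rows needed to hold every location."""
    return -(-occurrences // per_row)  # ceiling division


def wasted_slots(occurrences: int, per_row: int = LOCATIONS_PER_ROW) -> int:
    """Reserved but unused location slots in the final row."""
    return rows_needed(occurrences, per_row) * per_row - occurrences


def min_row_field_bytes(occurrences: int,
                        per_row: int = LOCATIONS_PER_ROW) -> int:
    """Smallest whole-byte row field that can number every row from 0."""
    highest_row = max(rows_needed(occurrences, per_row) - 1, 0)
    return max((highest_row.bit_length() + 7) // 8, 1)


if __name__ == "__main__":
    # Per-document counts of the most frequent word, from the figures above.
    for label, count in [("Huckleberry Finn 'and'", 6138),
                         ("20,000 Leagues 'the'", 7469),
                         ("King James Bible 'the'", 62162)]:
        print(label, rows_needed(count), wasted_slots(count),
              min_row_field_bytes(count))
```

Under these assumed sizes, a one-byte row field would run out well before the most frequent word in any of the three texts, while a word that occurs once in a document leaves most of its single row unused. Changing LOCATIONS_PER_ROW shows how the two costs move against each other.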