|
From: Geoff H. <ghu...@ws...> - 2002-10-23 04:11:57
|
On Tuesday, October 22, 2002, at 02:50 PM, Neal Richter wrote: > It looks to me like the db.words.db is using only a 'key' value, and has > a blank 'value' for each and every key! Nope. Remember that "value" as it currently stands is the anchor--if any. So if your documents don't have anchors defined, the value will never be entered. The key needs to be unique, so it has a structure: word // DocID // flags // location -> anchor > A more efficient solution to make the index smaller would be this: > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00' OK, remember what I said a while ago about the data structure here. (a) Yes, it's "brain dead" for an traditional inverted index--but a traditional inverted index is done in 2 steps: index, then sort/invert. You have different problems if you're creating a "live" database in one step. (b) B-Tree databases get pretty convoluted if you start growing the size of the value record as you index: i.e. I add a word and then start tacking on new DocIDs... This happens because the database essentially becomes fragmented in the same way that files can get fragmented over a hard drive. So we take the above and think and performance test... * While the Berkeley DB allows you to have duplicate keys, performance suffers some. IIRC this happens b/c you get non-uniform trees. * You want to store as much as possible in the value field, *but* need to have a unique key. OK, you could probably shove "flags" into the value, like so--but I'm not sure if that'll make a huge difference: word // DocID // location -> flags // anchor I'd be glad to go into more depth into why the current database structure has been set up this way. There is probably a better structure, like I outlined a few weeks ago. But you simply can't build an on-line inverted word index in one fell swoop and use a traditional design. (Google doesn't do it, the current 3.2 code has never attempted to do it...) If you'd like, I can probably pull up some of the original message threads. -Geoff |