Re: [htdig-dev] db.words.db using only key and empty value?

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Foot in Mouth:  I didn't mean anything with the "brain dead" remark.. poor
choice of words.  Sorry!

On Tue, 22 Oct 2002, Geoff Hutchison wrote:

>
> On Tuesday, October 22, 2002, at 02:50  PM, Neal Richter wrote:
>
> > It looks to me like the db.words.db is using only a 'key' value, and has
> > a blank 'value' for each and every key!
>
> Nope. Remember that "value" as it currently stands is the anchor--if
> any. So if your documents don't have anchors defined, the value will
> never be entered. The key needs to be unique, so it has a structure:
>
> word // DocID // flags // location -> anchor

  Could you give an example where the anchor won't be empty?  I tried a
couple things and couldn't get one.

> > A more efficient solution to make the index smaller would be this:
> > key = 'people\02', value = '00\00\00\00\c9\cb\ce\00'
>
> OK, remember what I said a while ago about the data structure here.
> (a) Yes, it's "brain dead" for an traditional inverted index--but a
> traditional inverted index is done in 2 steps: index, then sort/invert.
> You have different problems if you're creating a "live" database in one
> step.
> (b) B-Tree databases get pretty convoluted if you start growing the size
> of the value record as you index: i.e. I add a word and then start
> tacking on new DocIDs... This happens because the database essentially
> becomes fragmented in the same way that files can get fragmented over a
> hard drive.

  Now I understand what you meant ;-).  I was confused since my conceptual
model was using the value for word-num and flags.

  Would you explain your previous idea in more detail?

> So we take the above and think and performance test...
> * While the Berkeley DB allows you to have duplicate keys, performance
> suffers some. IIRC this happens b/c you get non-uniform trees.
> * You want to store as much as possible in the value field, *but* need
> to have a unique key.

  I think we can avoid duplicate keys with this scenario:

['people' is in a document 9 times]

[Row 1]
key =  people\02\00\00\00\00\c9\00    value == \00\c9\00\cb\00\ce\00\d1

[Row 2]
key =  people\02\00\00\00\00\d4\00    value == \00\d4\00\df\00\ee\00\00

I'm not sure where it gets implemented, but a row isn't written until
there are X references to it or until we change documents.   This would
involve a cache of rows in memory, which is probably in the classes
somewhere.

This avoids updating previously written rows and keeps the value size to a
known quantity.

> I'd be glad to go into more depth into why the current database
> structure has been set up this way. There is probably a better
> structure, like I outlined a few weeks ago. But you simply can't build
> an on-line inverted word index in one fell swoop and use a traditional
> design. (Google doesn't do it, the current 3.2 code has never attempted
> to do it...)

> If you'd like, I can probably pull up some of the original message
> threads.

  I need to read this stuff... is it in the archive?  Just point me in the
right direction.

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485