From: Geoff H. <ghu...@ws...> - 2002-11-04 14:58:31
Sorry, I've been really busy and haven't had much time to comment on this.

On Saturday, November 2, 2002, at 08:29 PM, Gilles Detillieux wrote:

> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.

It's the latter.

> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.

No, words are only sent out once the document is parsed. It's not
writing "along the way", if you will. But remember that while the
document itself has the vast majority of the words, we would still have
to update the records when we're indexing another document and happen to
hit a link back to the first one. :-(

> I implemented this type of caching very quickly using the STL with
> slight modifications to the WordDB object. Mifluz contains a
> WordDBCache object, but 3.2 doesn't use it, and it's excessively
> complicated in my opinion.

Basically, it does two things. One, it caches up to a certain amount of
data--and this is already used by 3.2 by default. Two, the mifluz code
will write out small temporary files along the way. When I wrote the
mifluz-merge code, it collapsed the temporary files every 10 documents
(and I've checked performance by varying that number). There is a
decided performance improvement because the data can be stuffed back
into the main DB with better ordering. There's also better performance
on the initial writes because you're adding to a temporary file with a
smaller lookup penalty.

> Do we REALLY need this function to go to the complexity of calling
> 'Unpack' and comparing the keys? Why not treat the entire key
> bitstream after the word-string as a binary compare and return?

That depends on how the packing is done.
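Stepping back to the caching scheme above, the idea of buffering postings per document, flushing them as sorted runs, and collapsing the runs into the main DB every N documents can be sketched roughly as follows. All names here are hypothetical illustrations, not the actual WordDB/WordDBCache API, and in-memory maps stand in for the temporary files:

```cpp
// Sketch of the batched write-out idea: postings accumulate per document
// in a sorted map, each finished document is flushed as one sorted run,
// and every `mergeEvery` documents the runs are merged into the main DB,
// so the big database sees ordered bulk inserts instead of many
// record-lengthening in-place updates.
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<int> > Run;  // word -> doc IDs

struct BatchedIndexer {
    std::vector<Run> runs;    // stand-ins for the small temporary files
    Run current;              // postings for the document being parsed
    Run mainDB;               // stand-in for the main word database
    int docsSinceMerge;
    int mergeEvery;

    explicit BatchedIndexer(int n) : docsSinceMerge(0), mergeEvery(n) {}

    // Called once per (word, document) hit while parsing.
    void Add(const std::string &word, int docID) {
        current[word].push_back(docID);
    }

    // Called when a document is fully parsed: flush its postings as one
    // sorted run; merge all runs once enough documents have accumulated.
    void EndDocument() {
        runs.push_back(current);
        current.clear();
        if (++docsSinceMerge >= mergeEvery)
            Merge();
    }

    // Collapse the pending runs into the main DB in sorted word order.
    void Merge() {
        for (size_t i = 0; i < runs.size(); ++i)
            for (Run::const_iterator it = runs[i].begin();
                 it != runs[i].end(); ++it) {
                std::vector<int> &dst = mainDB[it->first];
                dst.insert(dst.end(), it->second.begin(), it->second.end());
            }
        runs.clear();
        docsSinceMerge = 0;
    }
};
```

Because each run is already sorted by word, the merge step touches the main DB in key order, which is where the performance win described above comes from.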
Udi Manber (among others, probably) outlined various strategies for
comparing compressed strings without unpacking them. In this case, I
think that as long as you have a one-to-one mapping and you're
consistently doing comparisons that way for *every* key, you're fine.
The ordering will be different on the compressed strings, but as long as
everything is unique, I can't think of a problem.

-Geoff
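The byte-wise comparison being discussed can be sketched like this. The packing layout here (word, NUL terminator, then a fixed-width binary doc ID) is a hypothetical illustration, not the actual mifluz key format:

```cpp
// Sketch of comparing packed keys as raw bytes instead of calling
// Unpack() and comparing field by field. The byte-wise order generally
// differs from the field-wise order (e.g. for little-endian integers),
// but the packing is one-to-one, so as long as *every* comparison uses
// the same rule, equal keys compare equal and distinct keys stay
// distinct.
#include <algorithm>
#include <cstring>
#include <string>

// Hypothetical packed key: word string, NUL terminator, then a
// fixed-width binary document ID appended verbatim.
std::string PackKey(const std::string &word, unsigned docID) {
    std::string key = word;
    key.push_back('\0');
    key.append(reinterpret_cast<const char *>(&docID), sizeof docID);
    return key;
}

// Byte-wise compare of two packed keys; no unpacking required.
int ComparePacked(const std::string &a, const std::string &b) {
    size_t n = std::min(a.size(), b.size());
    int c = std::memcmp(a.data(), b.data(), n);
    if (c != 0)
        return c;
    return a.size() < b.size() ? -1 : (a.size() > b.size() ? 1 : 0);
}
```

The resulting sort order is consistent but not the same as sorting on the unpacked fields, which is exactly the trade-off noted above: fine for uniqueness and lookup, as long as the comparator is applied uniformly.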