From: Geoff H. <ghu...@ws...> - 2002-11-04 14:58:31
Sorry, I've been really busy and haven't had much time to comment on this.

On Saturday, November 2, 2002, at 08:29 PM, Gilles Detillieux wrote:

> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.

It's the latter.

> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.

No, words are only sent out once the document is parsed. It's not
writing "along the way", if you will. But remember that while the
document itself has the vast majority of the words, we would still have
to update the records when we're indexing another document and happen to
hit a link back to the first one. :-(

> I implemented this type of caching very quickly using the STL with
> slight modifications to the WordDB object. Mifluz contains a
> WordDBCache object, but 3.2 doesn't use it, and it's excessively
> complicated in my opinion.

Basically, it does two things. One, it caches up to a certain amount of
data--and this is already used by 3.2 by default. Two, the mifluz code
will write out small temporary files along the way. When I wrote the
mifluz-merge code, it collapsed the temporary files every 10 documents
(and I've checked performance by varying that number). There is a
decided performance improvement because the data can be stuffed back
into the main DB with better ordering. There's also better performance
on the initial writes because you're adding to a temporary file with a
smaller lookup penalty.

> Do we REALLY need this function to go to the complexity of calling
> 'Unpack' and comparing the keys? Why not treat the entire key
> bitstream after the word-string as a binary compare and return?

That depends on how the packing is done.
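Stepping back to the caching scheme above, the idea of buffering postings per document, flushing them as sorted runs, and collapsing the runs into the main DB every N documents can be sketched roughly as follows. All names here are hypothetical illustrations, not the actual WordDB/WordDBCache API, and in-memory maps stand in for the temporary files:

```cpp
// Sketch of the batched write-out idea: postings accumulate per document
// in a sorted map, each finished document is flushed as one sorted run,
// and every `mergeEvery` documents the runs are merged into the main DB,
// so the big database sees ordered bulk inserts instead of many
// record-lengthening in-place updates.
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<int> > Run;  // word -> doc IDs

struct BatchedIndexer {
    std::vector<Run> runs;    // stand-ins for the small temporary files
    Run current;              // postings for the document being parsed
    Run mainDB;               // stand-in for the main word database
    int docsSinceMerge;
    int mergeEvery;

    explicit BatchedIndexer(int n) : docsSinceMerge(0), mergeEvery(n) {}

    // Called once per (word, document) hit while parsing.
    void Add(const std::string &word, int docID) {
        current[word].push_back(docID);
    }

    // Called when a document is fully parsed: flush its postings as one
    // sorted run; merge all runs once enough documents have accumulated.
    void EndDocument() {
        runs.push_back(current);
        current.clear();
        if (++docsSinceMerge >= mergeEvery)
            Merge();
    }

    // Collapse the pending runs into the main DB in sorted word order.
    void Merge() {
        for (size_t i = 0; i < runs.size(); ++i)
            for (Run::const_iterator it = runs[i].begin();
                 it != runs[i].end(); ++it) {
                std::vector<int> &dst = mainDB[it->first];
                dst.insert(dst.end(), it->second.begin(), it->second.end());
            }
        runs.clear();
        docsSinceMerge = 0;
    }
};
```

Because each run is already sorted by word, the merge step touches the main DB in key order, which is where the performance win described above comes from.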
Udi Manber (among others, probably) outlined various strategies for
comparing compressed strings without unpacking them. In this case, I
think that as long as you have a one-to-one mapping and you're
consistently doing comparisons that way for *every* key, you're fine.
The ordering will be different on the compressed strings, but as long as
everything is unique, I can't think of a problem.

-Geoff
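The byte-wise comparison being discussed can be sketched like this. The packing layout here (word, NUL terminator, then a fixed-width binary doc ID) is a hypothetical illustration, not the actual mifluz key format:

```cpp
// Sketch of comparing packed keys as raw bytes instead of calling
// Unpack() and comparing field by field. The byte-wise order generally
// differs from the field-wise order (e.g. for little-endian integers),
// but the packing is one-to-one, so as long as *every* comparison uses
// the same rule, equal keys compare equal and distinct keys stay
// distinct.
#include <algorithm>
#include <cstring>
#include <string>

// Hypothetical packed key: word string, NUL terminator, then a
// fixed-width binary document ID appended verbatim.
std::string PackKey(const std::string &word, unsigned docID) {
    std::string key = word;
    key.push_back('\0');
    key.append(reinterpret_cast<const char *>(&docID), sizeof docID);
    return key;
}

// Byte-wise compare of two packed keys; no unpacking required.
int ComparePacked(const std::string &a, const std::string &b) {
    size_t n = std::min(a.size(), b.size());
    int c = std::memcmp(a.data(), b.data(), n);
    if (c != 0)
        return c;
    return a.size() < b.size() ? -1 : (a.size() > b.size() ? 1 : 0);
}
```

The resulting sort order is consistent but not the same as sorting on the unpacked fields, which is exactly the trade-off noted above: fine for uniqueness and lookup, as long as the comparator is applied uniformly.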