From: Gilles D. <gr...@sc...> - 2002-10-03 22:17:27
According to Sean Downey:
> Hi Geoff, Gilles
>
> Have ye had a chance to look at this problem since?
> Nobody could fix it here :-(
>
> Thanks,
> Sean

Sorry, I've been swamped here for a few months, and haven't found time to
code anything in ht://Dig beyond simple 5-line patches for some bugs. I
think when Geoff has a spare moment or two, he's focused on getting the
new mifluz code into the package and getting that debugged.

> -----Original Message-----
> From: Geoff Hutchison
> Sent: Tuesday, July 02, 2002 5:55 PM
> To: Sean Downey
> Cc: Gilles Detillieux
> Subject: RE: [htdig] HTMerge memory problem
>
> > Is it a problem that could be explained, and is it confined to a few
> > code files?
>
> It's definitely confined to one file: httools/htmerge.cc.
> Nothing else will need to change, only the code there.
>
> In the code, the "merge" prefix refers to the database being merged into
> the other. So mergeWordDB would be the word database being merged into
> wordDB.
>
> I'll do my best to explain. Basically, the current htmerge code grabs a
> List of all URLs in both databases and figures out duplicates. Then it
> constructs the "merged" list of URLs. This eats some memory, but it's
> not quite as bad as the next bit.
>
> The big memory hog starts with:
>
>     // OK, after merging the doc DBs, we do the same for the words
>
> Then you'll see this, which is what's really bad (actually, I just
> noticed the comment before this says "URLs" when it should say "words"):
>
>     // Start the merging by going through all the URLs that are in
>     // the database to be merged
>
>     words = mergeWordDB.WordRefs();
>
> So then the code loops through and checks the DocIDs for each word -- if
> they're duplicates that we should ignore, it keeps going. Otherwise, it
> adds the word to the other database (with a new DocID).
>
> Finally:
>
>     words = wordDB.WordRefs();
>
> Now it loops through the target DB (i.e. the one that received
> everything) and deletes words that are in duplicate documents, i.e.
> the ones that were made obsolete by the mergeWordDB.
>
> OK, documentation for htword/mifluz can be found at:
>
>     http://www.gnu.org/software/mifluz/doc.en.html
>
> It's actually for a newer version of mifluz (0.14) than is currently
> used by ht://Dig, but most of the API is similar. Obviously, see the
> headers in htword/ for the exact details. :-)
>
> For the first loop, you'll want to use the WordList::Cursor methods to
> loop -- you'll need to set up a callback, as the previous patch did too.
> The callback function would add the words from the mergeWordDB.
>
> For the second loop, you'll want to use the WalkDelete method of the
> WordList object to delete the words (rather than constructing a full
> list in memory all at once!).

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)