From: Gilles D. <gr...@sc...> - 2002-10-03 22:17:27
According to Sean Downey:
> Hi Geoff, Gilles
>
> Have ye had a chance to look at this problem since?
> Nobody could fix it here :-(
>
> Thanks,
> Sean

Sorry, I've been swamped here for a few months, and haven't found time to
code anything in ht://Dig beyond simple 5-line patches for some bugs. I
think when Geoff has a spare moment or two, he's focused on getting the
new mifluz code into the package and getting that debugged.

> -----Original Message-----
> From: Geoff Hutchison
> Sent: Tuesday, July 02, 2002 5:55 PM
> To: Sean Downey
> Cc: Gilles Detillieux
> Subject: RE: [htdig] HTMerge memory problem
>
> > Is it a problem that could be explained, and is it confined to a few
> > code files?
>
> It's definitely confined to one file: httools/htmerge.cc.
> Nothing else will need to change, only the code there.
>
> In the code, the "merge" prefix refers to the database being merged into
> the other. So mergeWordDB would be the word database being merged into
> wordDB.
>
> I'll do my best to explain. Basically, the current htmerge code grabs a
> List of all URLs in both databases and figures out duplicates. Then it
> constructs the "merged" list of URLs. This eats some memory, but it's
> not quite as bad as the next bit.
>
> The big memory hog starts with:
>
>     // OK, after merging the doc DBs, we do the same for the words
>
> Then you'll see this, which is what's really bad (actually, I just
> noticed the comment before this says "URLs" when it should say "words"):
>
>     // Start the merging by going through all the URLs that are in
>     // the database to be merged
>
>     words = mergeWordDB.WordRefs();
>
> So then the code loops through and checks the DocIDs for each word -- if
> they're duplicates that we should ignore, it keeps going. Otherwise, it
> adds the word to the other database (with a new DocID).
>
> Finally:
>
>     words = wordDB.WordRefs();
>
> Now it loops through the target DB (i.e. the one that received
> everything) and deletes words that are in duplicate documents, i.e.
> the ones that were made obsolete by the mergeWordDB.
>
> OK, documentation for htword/mifluz can be found at:
>
>     http://www.gnu.org/software/mifluz/doc.en.html
>
> It's actually for a newer version of mifluz (0.14) than is currently
> used by ht://Dig, but most of the API is similar. Obviously, see the
> headers in htword/ for the exact details. :-)
>
> For the first loop, you'll want to use the WordList::Cursor methods to
> loop -- you'll need to set up a callback, as the previous patch did too.
> The callback function would add the words from the mergeWordDB.
>
> For the second loop, you'll want to use the WalkDelete method of the
> WordList object to delete the words (rather than constructing a full
> list in memory all at once!).

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)