From: Joe R. J. <jj...@cl...> - 2001-11-12 03:49:19
On Wed, 17 Oct 2001, Gilles Detillieux wrote:

> Date: Wed, 17 Oct 2001 15:35:53 -0500 (CDT)
> From: Gilles Detillieux <gr...@sc...>
> To: Joe R. Jah <jj...@cl...>
> Cc: htd...@li...
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> > I found 82 links from one document with META ROBOT: Noindex tag;) I could
> > not find an efficient way of hunting down the other 138 links that were
> > unaccounted for in two 20 meg+ files; however, I must assume that they are
> > some sort of duplicates;-/
>
> Hmm. Too bad we couldn't get something more definitive. I'm fairly
> confident that the changes to the HTML parser didn't break anything, but
> I'd feel much more comfortable if we could explain the missing files you
> discovered rather than just assuming it's OK. If I recall, there were
> 88 URLs with doubled slashes that were eliminated in an earlier test,
> but that still leaves around 50 URLs unaccounted for.
>
> If there's any way you can take a snapshot of your site, or a few major
> subdirectories, and duplicate them somewhere else where they won't get
> modified, it would be a big help in getting conclusive results. If you
> index the exact same files with 3.1.5 and 3.1.6, you should be able to
> diff the output of htdig -vvv from both, and pinpoint exactly where the
> differences are happening. I know this is asking a lot, but it would be
> a shame to release 3.1.6 after all the work that's gone into it, only to
> discover afterward that it introduced a serious bug.

Sorry it took such a long time to respond, but I have been very busy lately.
It is not easy to prove a negative; however, I have tried a few times to
make 3.1.6 miss indexing files in stable snapshots of my site, without
success;) Here is a comparison of the latest 3.1.6 snapshot with
3.1.6-072901, both run on a snapshot of my site (163 HTML-only documents):

_______3.1.6-072901 + Armstrong patch + ssl.4_______
htdig: Start digging: Sun Nov 11 18:15:43 PST 2001
htmerge: Start merging: Sun Nov 11 18:16:16 PST 2001
33 seconds
htmerge: Total word count: 13171
htmerge: Total documents: 163
htmerge: Total doc db size (in K): 1888
-------------------------8<-------------------------

__________3.1.6-111101 + ssl.5 + FAQ#5.14___________
htdig: Start digging: Sun Nov 11 18:19:19 PST 2001
htmerge: Start merging: Sun Nov 11 18:20:58 PST 2001
99 seconds
htmerge: Total word count: 13171
htmerge: Total documents: 163
htmerge: Total doc db size (in K): 1888
-------------------------8<-------------------------

CPU: 350 MHz Pentium
RAM: 384 Megs
OS: BSDi-4.2

They both index exactly the same number of documents; this is as
conclusive a result as I can produce. The only difference is the time
they take.

Incidentally, ssl.4 fails to apply to the latest snapshot because of the
recent changes to Connection.cc. I have modified the patch to apply
cleanly to the latest snapshot of 3.1.6:

ftp://ftp.ccsf.org/htdig-patches/3.1.6/ssl.5

Regards,

Joe
--
Joe R. Jah
jj...@cl...
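
The comparison suggested in the quoted message (index the same files with two versions and compare the htdig -vvv output) could be sketched roughly as below. This is only an illustration, not part of ht://Dig: the function names and the URL pattern are assumptions, and since the exact layout of the verbose output is not shown in this thread, the sketch simply collects anything that looks like an http(s) URL from each log and reports the URLs that appear in only one of them.

```python
import re
import sys

# Assumption: URLs in the verbose log can be recognised by a plain http(s)
# pattern. The real htdig -vvv line format is not relied on here.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def urls_in_log(lines):
    """Return the set of distinct URLs mentioned in an iterable of log lines."""
    found = set()
    for line in lines:
        for match in URL_RE.findall(line):
            # Drop punctuation that commonly trails a URL in log text.
            found.add(match.rstrip(".,;:)"))
    return found


def compare_logs(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as two sorted lists of URLs."""
    old_urls = urls_in_log(old_lines)
    new_urls = urls_in_log(new_lines)
    return sorted(old_urls - new_urls), sorted(new_urls - old_urls)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: two files holding captured verbose output.
    with open(sys.argv[1], errors="replace") as old_log, \
         open(sys.argv[2], errors="replace") as new_log:
        only_old, only_new = compare_logs(old_log, new_log)
    for url in only_old:
        print("only in first log: ", url)
    for url in only_new:
        print("only in second log:", url)
```

It would be run with two saved logs as arguments, for example one captured from each version (the file names are up to the user). A URL that is mentioned in both logs but indexed by only one version would not show up in this comparison, so a line-by-line diff of the two logs remains the more thorough check.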