From: Gilles D. <gr...@sc...> - 2001-10-17 20:36:03
According to Joe R. Jah:
> On Wed, 3 Oct 2001, Gilles Detillieux wrote:
> > Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT)
> > From: Gilles Detillieux <gr...@sc...>
> > To: Joe R. Jah <jj...@cl...>
> > Cc: htd...@li...
> > Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
> >
> > > > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > > > compare the outputs, you may be able to track down the source of the
> > > > > > different URLs that are parsed in both cases. To do this in a meaningful
> > > > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > > > your site, so you don't get thrown off in your comparisons by updates
> > > > > > to the site between digs.
> > > > >
> > > > > Yes, I have kept that snapshot for a happy occasion like that;)
> > > >
> > > > Keep me posted if you get a chance to run this test with both snapshots.
> > > > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > > > URLs, but it would be good to confirm without a doubt that the lost URLs
> > > > on your system are all indeed URLs that should not have been indexed.
> > >
> > > In the happy hour;)))
> >
> > It might be best if you're sober when you do this test. ;-)
>
> The happy hour turned into a couple of unhappy weeks:(
>
> -r--r--r--  1 jjah  www  24621528 Oct  2 13:20 rundig_vvv.082901
> -r--r--r--  1 jjah  www  20266702 Oct  2 14:15 rundig_vvv.093001
>
> I found 82 links from one document with META ROBOT: Noindex tag;) I could
> not find an efficient way of hunting down the other 138 links that were
> unaccounted for in two 20 meg+ files; however, I must assume that they are
> some sort of duplicates;-/

Hmm. Too bad we couldn't get something more definitive. I'm fairly
confident that the changes to the HTML parser didn't break anything, but
I'd feel much more comfortable if we could explain the missing files you
discovered rather than just assuming it's OK.
If I recall, there were 88 URLs with doubled slashes that were eliminated
in an earlier test, but that still leaves around 50 URLs unaccounted for.

If there's any way you can take a snapshot of your site, or a few major
subdirectories, and duplicate them somewhere else where they won't get
modified, it would be a big help in getting conclusive results. If you
index the exact same files with 3.1.5 and 3.1.6, you should be able to
diff the output of htdig -vvv from both, and pinpoint exactly where the
differences are happening.

I know this is asking a lot, but it would be a shame to release 3.1.6
after all the work that's gone into it, only to discover afterward that
it introduced a serious bug.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
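
As a rough sketch of the comparison suggested above: rather than reading
two 20 MB logs by eye, one could pull every URL-looking string out of each
htdig -vvv output and list the ones that appear only in the older log.
This is Python, not part of ht://Dig. The function names and the URL regex
are assumptions made for illustration, and the regex deliberately does not
rely on any particular htdig log layout.

```python
import re
from typing import Iterable, List, Set

# Assumption: URLs show up in the verbose output as plain http:// or
# https:// strings. No specific htdig log format is assumed beyond that.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def urls_in(lines: Iterable[str]) -> Set[str]:
    """Collect every distinct URL-looking token from an iterable of lines."""
    found: Set[str] = set()
    for line in lines:
        for url in URL_RE.findall(line):
            # Drop punctuation that may trail a URL inside a log message.
            found.add(url.rstrip(".,;:)"))
    return found


def missing_urls(old_lines: Iterable[str], new_lines: Iterable[str]) -> List[str]:
    """Return URLs seen in the old output but absent from the new, sorted."""
    return sorted(urls_in(old_lines) - urls_in(new_lines))


def missing_urls_from_files(old_path: str, new_path: str) -> List[str]:
    """Same comparison, reading both logs line by line from disk."""
    with open(old_path, errors="replace") as old, \
            open(new_path, errors="replace") as new:
        return missing_urls(old, new)
```

With the two files listed earlier, calling
missing_urls_from_files("rundig_vvv.082901", "rundig_vvv.093001") would
give a list of candidates to check by hand against the noindex document,
the doubled-slash URLs, and possible duplicates. It only narrows the
search; it does not say why a URL was dropped.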