From: Gilles D. <gr...@sc...> - 2002-04-03 17:21:25
According to Gabriele Bartolini:
> I was attempting to use the url_rewrite_urls attribute, because I need
> it in a special case.
>
> While trying it, I noticed this thing, and if it is possible I would
> like to have an explanation from you (particularly by Gilles and Geoff, I
> guess).
>
> Is there a reason why URLs belonging to the start list are neither
> normalized nor rewritten? Just wondering ... Otherwise we should add these
> two lines to the Initial method of the Retriever class:
>
>     u.normalize();
>     u.rewrite();
>
> after the 'URL u(tokens[i]);' row.

I'm guessing it was just an oversight, or an assumption that the URLs you
feed it via start_url would already be in the form you want. I don't see
a problem with the modification you suggest, with one very important
condition: the rewriting should not be done more than once on a given URL.
So, if I'm not mistaken, the URLs from db.docdb and those from db.log have
already gone through the process of being normalized and rewritten, and
only the URLs from start_url should be processed. I think if you only do
the rewrite if from == 1 you should be safe.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
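
For illustration only, here is a rough, self-contained sketch of the guard
described above. It is not the actual Retriever code: the Url class is a
hypothetical stand-in for ht://Dig's URL class, the normalization and rewrite
rules inside it are invented, and loadInitial() is a made-up name. The one
point it is meant to show is that normalize() and rewrite() are applied only
when from == 1, which is assumed here to mean "the list came from start_url".

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for ht://Dig's URL class. Only the two calls
// discussed in the message are modelled; their behaviour is invented.
class Url {
public:
    explicit Url(const std::string& s) : text_(s) {}

    // Assumed normalization: lower-case the scheme and host part only.
    void normalize() {
        std::string::size_type sep = text_.find("://");
        if (sep == std::string::npos) return;
        std::string::size_type end = text_.find('/', sep + 3);
        if (end == std::string::npos) end = text_.size();
        for (std::string::size_type i = 0; i < end; ++i)
            text_[i] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(text_[i])));
    }

    // Assumed rewrite rule: map the site root to a /mirror/ prefix.
    // Deliberately not idempotent, so a second pass corrupts the URL.
    void rewrite() {
        const std::string from = "http://www.example.com/";
        const std::string to = "http://www.example.com/mirror/";
        if (text_.compare(0, from.size(), from) == 0)
            text_.replace(0, from.size(), to);
    }

    const std::string& get() const { return text_; }

private:
    std::string text_;
};

// Hypothetical analogue of the loop in Retriever::Initial.
// from == 1 is assumed to mean the list came from start_url; any other
// value stands for lists that were already processed (db.docdb, db.log).
std::vector<std::string> loadInitial(const std::string& list, int from) {
    std::vector<std::string> out;
    std::istringstream tokens(list);
    std::string token;
    while (tokens >> token) {
        Url u(token);
        if (from == 1) {  // process each URL at most once
            u.normalize();
            u.rewrite();
        }
        out.push_back(u.get());
    }
    return out;
}

// Small self-check: a stored URL passed with from != 1 is left untouched.
void selfTest() {
    const std::string stored = "http://www.example.com/mirror/a.html";
    assert(loadInitial(stored, 0)[0] == stored);
}
```

With this toy rule, feeding an already rewritten URL through the from == 1
path would yield a doubled /mirror/mirror/ prefix, which is the kind of
damage the condition is meant to prevent.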