From: Neal R. <ne...@ri...> - 2003-04-18 00:21:13

On Thu, 17 Apr 2003, Geoff Hutchison wrote:
>
> > 1) The function reparses 'bad_extensions' & 'valid_extensions' each time
> > through. This seems wasteful. Any good reason to do this?
>
> Depends. Once upon a time, we thought that these should be configurable on
> a per-URL basis. (Which is why they're "reparsed.") Now maybe it's better
> to re-think this in terms of improved performance?

It would certainly be easy to add two hashes (or lists), fill them during
'Initial', and check each URL against them in IsValidURL. We could also
preserve the present functionality by keeping local copies of the
'bad_extensions' & 'valid_extensions' strings, comparing the current values
against those copies to detect changes, and reparsing only when necessary.

Is there a good example of the utility of per-URL changes to these? How
would they change in the middle of a spidering run?

> > 2) Toward the end of the function, just before we test the URL against
> > 'limits' & 'limit_normalized', we check the server's robots.txt file.
> > Wouldn't it make sense to do the robots.txt check AFTER the limits
> > check, so as not to waste network connections on servers that will get
> > rejected by the next two tests?
>
> Good point.

I fixed this and checked it in. I also ran 'GNU indent' on the file and did
a second commit to clean up the formatting. I did this in two steps so the
first change is easily readable.

Thanks.

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
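
P.S. For reference, a rough sketch of the caching idea above, in plain C++
with std::unordered_set and made-up names (ExtensionFilter, Rebuild,
RefreshIfChanged) standing in for htdig's own classes. Treat it as the shape
of the change, not a patch:

    // Sketch only: the containers and names below are stand-ins for
    // whatever htdig actually uses; nothing here is copied from the code.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>

    class ExtensionFilter
    {
    public:
        // Parse the space-separated extension lists once, e.g. during
        // startup ('Initial'), instead of on every call to IsValidURL.
        void Rebuild(const std::string &bad, const std::string &valid)
        {
            cachedBad_   = bad;
            cachedValid_ = valid;
            badSet_      = Parse(bad);
            validSet_    = Parse(valid);
        }

        // Preserves the old per-URL behaviour: if the attribute strings
        // changed since the last call, reparse; otherwise keep the cache.
        void RefreshIfChanged(const std::string &bad, const std::string &valid)
        {
            if (bad != cachedBad_ || valid != cachedValid_)
                Rebuild(bad, valid);
        }

        // Hash lookups replacing the per-URL string scans.
        bool IsBad(const std::string &ext) const
        {
            return badSet_.count(ext) != 0;
        }
        bool IsAllowed(const std::string &ext) const
        {
            // An empty valid_extensions list means "no restriction".
            return validSet_.empty() || validSet_.count(ext) != 0;
        }

    private:
        static std::unordered_set<std::string> Parse(const std::string &list)
        {
            std::unordered_set<std::string> result;
            std::istringstream in(list);
            std::string token;
            while (in >> token)
                result.insert(token);
            return result;
        }

        std::string cachedBad_, cachedValid_;
        std::unordered_set<std::string> badSet_, validSet_;
    };

    int main()
    {
        ExtensionFilter filter;
        filter.Rebuild(".gif .jpg .mpg", ".html .htm");

        // Unchanged attribute strings, so this is a cheap no-op.
        filter.RefreshIfChanged(".gif .jpg .mpg", ".html .htm");

        std::cout << filter.IsBad(".gif") << " "       // 1
                  << filter.IsAllowed(".html") << " "  // 1
                  << filter.IsAllowed(".exe") << "\n"; // 0
        return 0;
    }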