From: Gilles D. <gr...@sc...> - 2002-10-17 18:29:32
This rightly belongs on htdig-general, not htdig-dev...

Session IDs are the bane of search engines, and just generally make life
difficult for getting around the web. If getting rid of them altogether
isn't an option, then you can at least remove them while indexing. See
http://www.htdig.org/attrs.html#url_rewrite_rules for an example of how
to do this. This will only work if you can access the documents without
a session ID, as that's what htdig will do - it rewrites the URL before
fetching the document, and presents the rewritten URL (without session ID)
in search results.

According to Paolo Subiaco:
> Hi all.
> I see there is a problem spidering forums like phpBB and electrifiedpenguin.
> Because these forums return a Session ID, it occurs that htdig spidering the
> forum pages will get more than one SID.
> The result is that
> 1. the same forum page is indexed more than one time
> 2. the amount of CPU and time used for indexing is very large.
>
> Take a look at the log below....
> Thank you. Paolo
>
> 217.168.237.106 - - [14/Oct/2002:01:06:54 +0200] "GET /forum/index.php HTTP/1.0" 200 35398 "http://www.ir3ip.net/forum/" "htdig/3.1.5
>
> 217.168.237.106 - - [14/Oct/2002:01:06:55 +0200] "GET /forum/search.php HTTP/1.0" 200 19754 "http://www.ir3ip.net/forum/" "htdig/3.1.5
>
> 217.168.237.106 - - [14/Oct/2002:01:06:57 +0200] "GET /forum/faq.php HTTP/1.0" 200 51949 "http://www.ir3ip.net/forum/" "htdig/3.1.5 (ro
>
> 217.168.237.106 - - [14/Oct/2002:01:07:00 +0200] "GET /forum/memberlist.php HTTP/1.0" 200 22715 "http://www.ir3ip.net/forum/" "htdig/3.
>
> 217.168.237.106 - - [14/Oct/2002:01:07:02 +0200] "GET /forum/index.php?sid=6923db608dd988b9167c2464278dcffb HTTP/1.0" 200 35398 "http:/
>
> 217.168.237.106 - - [14/Oct/2002:01:07:04 +0200] "GET /forum/faq.php?sid=6923db608dd988b9167c2464278dcffb HTTP/1.0" 200 51949 "http://w
>
> ..... another dozen lines with the same sid were removed....
> after a few minutes, several dozen accesses with another sid:
>
> 217.168.237.106 - - [14/Oct/2002:01:09:19 +0200] "GET /forum/index.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 35398 "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5
>
> 217.168.237.106 - - [14/Oct/2002:01:09:21 +0200] "GET /forum/faq.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 51949 "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5 (r
>
> 217.168.237.106 - - [14/Oct/2002:01:09:23 +0200] "GET /forum/search.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 19754 "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3.1.5
>
> 217.168.237.106 - - [14/Oct/2002:01:09:26 +0200] "GET /forum/memberlist.php?sid=27619eca3a821c36bbfe3222b99f62aa HTTP/1.0" 200 22715 "http://www.ir3ip.net/forum/viewforum.php?f=12" "htdig/3

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
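
P.S. To illustrate why stripping the session ID before fetching collapses
the duplicates, here is a minimal Python sketch. It is only an illustration
of the idea and not htdig's own code or configuration syntax: the pattern
SID_PARAM and the function strip_session_id are hypothetical names, and the
pattern assumes the session ID is always a "sid=" query parameter followed
by 32 hexadecimal digits, as in the quoted log. The actual attribute syntax
is described at the attrs.html link given in the reply.

```python
import re

# Hypothetical pattern (an assumption based on the quoted log): a "sid="
# query parameter with a 32-hex-digit value, preceded by "?" or "&".
SID_PARAM = re.compile(r"(?<=[?&])sid=[0-9a-f]{32}(&|$)")


def strip_session_id(url: str) -> str:
    """Return the URL with any sid=... query parameter removed."""
    cleaned = SID_PARAM.sub("", url)
    # Drop a "?" or "&" left dangling at the end after the removal.
    return cleaned.rstrip("?&")


# URLs taken from the quoted access log: two different session IDs
# for the same two pages, plus one request without a session ID.
crawled = [
    "/forum/index.php",
    "/forum/index.php?sid=6923db608dd988b9167c2464278dcffb",
    "/forum/index.php?sid=27619eca3a821c36bbfe3222b99f62aa",
    "/forum/faq.php?sid=6923db608dd988b9167c2464278dcffb",
    "/forum/faq.php?sid=27619eca3a821c36bbfe3222b99f62aa",
]

# After rewriting, the five requests reduce to two distinct documents.
unique = sorted({strip_session_id(u) for u in crawled})
print(unique)
```

As noted in the reply, this only helps if the forum still serves each page
when it is requested without a session ID.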