From: Matthew S. <mws...@la...> - 2003-06-11 21:14:23
I am trying to index a website using htdig and I am having a hard time understanding why some of my links are being followed and others aren't. The site that I am trying to index is:

http://www.law.upenn.edu/

Links on the front page are followed properly. One of those links leads to http://www.law.upenn.edu/departments/, which htdig "pushes" and then requests. htdig then fails to follow the links in that second document but I can't figure out why -- it doesn't seem to be rejecting them, just silently ignoring them.

I have increased htdig's verbose output to -vvv and have posted two segments of the generated log here:

http://faculty.law.upenn.edu/~mwsnyder/log1.txt
http://faculty.law.upenn.edu/~mwsnyder/log2.txt

I am running htdig-3.1.6. These are the possibly relevant config options:

database_dir: /usr/local/htdig/db
start_url: http://www.law.upenn.edu/
limit_urls_to: ${start_url}
exclude_urls: /cgi-bin/ .cgi /bll/ulc/
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css .pdf
max_head_length: 10000
max_doc_size: 200000
no_excerpt_show_top: true
search_algorithm: exact:1 synonyms:0.5 endings:0.1

Can anyone tell me how to convince htdig to follow the links within http://www.law.upenn.edu/departments ?

Thanks.

--
Matthew Snyder
University of Pennsylvania Law School
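
One rough way to rule out the URL filters in the config above is to run the links from the departments page through an approximation of them. The sketch below is only that: it assumes limit_urls_to and exclude_urls are applied as substring matches and bad_extensions as suffix matches on the path, which is an assumption about htdig's behaviour rather than its actual code. The function name check_url and the example links are hypothetical and are not taken from the site or the logs.

```python
from urllib.parse import urljoin, urlsplit

# Values copied from the config above. The matching rules are assumptions.
START_URL = "http://www.law.upenn.edu/"
LIMIT_URLS_TO = [START_URL]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi", "/bll/ulc/"]
BAD_EXTENSIONS = [
    ".wav", ".gz", ".z", ".sit", ".au", ".zip", ".tar", ".hqx", ".exe",
    ".com", ".gif", ".jpg", ".jpeg", ".aiff", ".class", ".map", ".ram",
    ".tgz", ".bin", ".rpm", ".mpg", ".mov", ".avi", ".css", ".pdf",
]


def check_url(base, href):
    """Resolve href against base and return (absolute_url, reason).

    reason is None when the link passes every assumed filter,
    otherwise it names the first filter that would reject it.
    """
    url = urljoin(base, href)
    # Assumed rule: the URL must contain at least one limit_urls_to pattern.
    if not any(pattern in url for pattern in LIMIT_URLS_TO):
        return url, "outside limit_urls_to"
    # Assumed rule: any exclude_urls pattern found in the URL rejects it.
    for pattern in EXCLUDE_URLS:
        if pattern in url:
            return url, "matches exclude_urls"
    # Assumed rule: bad_extensions are compared against the end of the path.
    path = urlsplit(url).path.lower()
    for ext in BAD_EXTENSIONS:
        if path.endswith(ext):
            return url, "bad extension"
    return url, None


if __name__ == "__main__":
    base = "http://www.law.upenn.edu/departments/"
    # Hypothetical hrefs, for illustration only.
    for href in ["admissions/", "/cgi-bin/search.cgi", "brochure.pdf",
                 "http://faculty.law.upenn.edu/~mwsnyder/"]:
        print(check_url(base, href))
```

If every link on the page passes a check like this, the filters are probably not the cause, and the -vvv log is the better place to look. Note that under these assumptions a link to a different host name (for example faculty.law.upenn.edu instead of www.law.upenn.edu) falls outside limit_urls_to.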