From: Brian R. <br...@bj...> - 2006-01-03 16:53:05
Hi,

I am trying to use htdig 3.1.6-6 to index and search a hierarchy of .doc files saved on a server based on Red Hat 7.3. I am using doc2html, which runs fine on documents from the command line, but as far as I can see it does not get called from htdig. I have run rundig with -vvvv and can see it accessing the doc files (HTML indexed with modindex).

Here is an example from the log file (the file size is correct):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<snip>
Tag: <td align="left" class="tbl_files">, matched -1
Tag: <a href="Aune_2392.doc" class ="files">, matched 2
Tag: <img src="/ModIndex_Files/file.gif" height="13" width="13" alt="file" />, matched 18
word: file@937
image: http://bjsserver/ModIndex_Files/file.gif
word: Aune_2392.doc@938
word part: Aune@938
word part: 2392@938
word part: doc@938
word part: Aune_2392@938
word part: 2392.doc@938
Tag: </a>, matched 3
href: http://bjsserver/testdocs/files/A/Aune_2392.doc (file Aune_2392.doc)
resolving 'http://bjsserver/testdocs/files/A/Aune_2392.doc'
pushing http://bjsserver/testdocs/files/A/Aune_2392.doc
<snip>
pick: bjsserver, # servers = 1
203:203:2:http://bjsserver/testdocs/files/A/Aune_2392.doc: Retrieval command for http://bjsserver/testdocs/files/A/Aune_2392.doc: GET /testdocs/files/A/Aune_2392.doc HTTP/1.0
User-Agent: htdig/3.1.6 (br...@bj...)
Referer: http://bjsserver/testdocs/files/A/
Host: bjsserver

Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 03 Jan 2006 16:40:31 GMT
Header line: Server: Apache
Header line: Last-Modified: Thu, 18 Aug 2005 17:56:00 GMT
Converted Thu, 18 Aug 2005 17:56:00 GMT to Thu, 18 Aug 2005 17:56:00
Header line: ETag: "4400d9-15c00-4304cbb0"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 89088
Header line: Connection: close
Header line: Content-Type: text/plain
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 7168 from document
Read a total of 89088 bytes
size = 89088
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Here is my config file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
start_url: http://bjsserver/testdocs/files/
limit_urls_to: http://bjsserver/testdocs/files/
local_urls: http://bjsserver/testdocs/files/
#common_url_parts: http://bjsserver/testdocs/files/ .html
database_dir: /opt/mailman/htdig/db/uretek
bad_word_list: /opt/mailman/htdig/common/bad_words
nothing_found_file: /opt/mailman/htdig/common/nomatch.html
search_results_wrapper: /opt/mailman/htdig/common/wrapper.html
exclude_urls: .dir .htaccess .mhonarc.db
max_head_length: 10000
remove_bad_urls: true
use_star_image: no
maintainer: br...@bj...
search_algorithm: exact:1 synonyms:0.5 endings:0.1
allow_virtual_hosts: true
allow_numbers: true
no_next_page_text:
no_prev_page_text:
backlink_factor: 0
sort: date
maximum_pages: 30
next_page_text: <img src=/icons/buttonr.gif border=0 align=middle width=30 height=30 alt=next>
prev_page_text: <img src=/icons/buttonl.gif border=0 align=middle width=30 height=30 alt=prev>
page_number_text: "<img src=/icons/button1.gif border=0 align=middle width=30 height=30 alt=1>" \
    "<img src=/icons/button2.gif border=0 align=middle width=30 height=30 alt=2>" \
    "<img src=/icons/button3.gif border=0 align=middle width=30 height=30 alt=3>" \
    "<img src=/icons/button4.gif border=0 align=middle width=30 height=30 alt=4>" \
    "<img src=/icons/button5.gif border=0 align=middle width=30 height=30 alt=5>" \
    "<img src=/icons/button6.gif border=0 align=middle width=30 height=30 alt=6>" \
    "<img src=/icons/button7.gif border=0 align=middle width=30 height=30 alt=7>" \
    "<img src=/icons/button8.gif border=0 align=middle width=30 height=30 alt=8>" \
    "<img src=/icons/button9.gif border=0 align=middle width=30 height=30 alt=9>" \
    "<img src=/icons/button10.gif border=0 align=middle width=30 height=30 alt=10>"
no_page_number_text: "<img src=/icons/button1.gif border=2 align=middle width=30 height=30 alt=1>" \
    "<img src=/icons/button2.gif border=2 align=middle width=30 height=30 alt=2>" \
    "<img src=/icons/button3.gif border=2 align=middle width=30 height=30 alt=3>" \
    "<img src=/icons/button4.gif border=2 align=middle width=30 height=30 alt=4>" \
    "<img src=/icons/button5.gif border=2 align=middle width=30 height=30 alt=5>" \
    "<img src=/icons/button6.gif border=2 align=middle width=30 height=30 alt=6>" \
    "<img src=/icons/button7.gif border=2 align=middle width=30 height=30 alt=7>" \
    "<img src=/icons/button8.gif border=2 align=middle width=30 height=30 alt=8>" \
    "<img src=/icons/button9.gif border=2 align=middle width=30 height=30 alt=9>" \
    "<img src=/icons/button10.gif border=2 align=middle width=30 height=30 alt=10>"
external_parsers: application/rtf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    text/rtf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/pdf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/postscript->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/msword->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/wordperfect5.1->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/wordperfect6.0->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/msexcel->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/vnd.ms-excel->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/powerpoint->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/vnd.ms-powerpoint->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/x-shockwave-flash->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/x-shockwave-flash2-preview->text/html /opt/mailman/htdig/bin/doc2html.pl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Any ideas?

--
Cheers
Brian
http://www.abandonmicrosoft.co.uk