From: Brian R. <br...@bj...> - 2006-01-03 16:53:05
|
Hi

I am trying to use htdig 3.1.6-6 to index and search a hierarchy of .doc files saved on a server based on Red Hat 7.3. I am using doc2html, which runs fine on documents from the command line but, as far as I can see, does not get called from htdig.

I have run rundig with -vvvv and can see it accessing the doc files (html indexed with modindex). Here is an example of the log file (the filesize is correct):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
<snip>
Tag: <td align="left" class="tbl_files">, matched -1
Tag: <a href="Aune_2392.doc" class ="files">, matched 2
Tag: <img src="/ModIndex_Files/file.gif" height="13" width="13" alt="file" />, matched 18
word: file@937
image: http://bjsserver/ModIndex_Files/file.gif
word: Aune_2392.doc@938
word part: Aune@938
word part: 2392@938
word part: doc@938
word part: Aune_2392@938
word part: 2392.doc@938
Tag: </a>, matched 3
href: http://bjsserver/testdocs/files/A/Aune_2392.doc (file Aune_2392.doc)
resolving 'http://bjsserver/testdocs/files/A/Aune_2392.doc'
pushing http://bjsserver/testdocs/files/A/Aune_2392.doc
<snip>
pick: bjsserver, # servers = 1
203:203:2:http://bjsserver/testdocs/files/A/Aune_2392.doc: Retrieval command for http://bjsserver/testdocs/files/A/Aune_2392.doc: GET /testdocs/files/A/Aune_2392.doc HTTP/1.0
User-Agent: htdig/3.1.6 (br...@bj...)
Referer: http://bjsserver/testdocs/files/A/
Host: bjsserver
Header line: HTTP/1.1 200 OK
Header line: Date: Tue, 03 Jan 2006 16:40:31 GMT
Header line: Server: Apache
Header line: Last-Modified: Thu, 18 Aug 2005 17:56:00 GMT
Converted Thu, 18 Aug 2005 17:56:00 GMT to Thu, 18 Aug 2005 17:56:00
Header line: ETag: "4400d9-15c00-4304cbb0"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 89088
Header line: Connection: close
Header line: Content-Type: text/plain
Header line: returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 7168 from document
Read a total of 89088 bytes
size = 89088
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Here is my config file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
start_url: http://bjsserver/testdocs/files/
limit_urls_to: http://bjsserver/testdocs/files/
local_urls: http://bjsserver/testdocs/files/
#common_url_parts: http://bjsserver/testdocs/files/ .html
database_dir: /opt/mailman/htdig/db/uretek
bad_word_list: /opt/mailman/htdig/common/bad_words
nothing_found_file: /opt/mailman/htdig/common/nomatch.html
search_results_wrapper: /opt/mailman/htdig/common/wrapper.html
exclude_urls: .dir .htaccess .mhonarc.db
max_head_length: 10000
remove_bad_urls: true
use_star_image: no
maintainer: br...@bj...
search_algorithm: exact:1 synonyms:0.5 endings:0.1
allow_virtual_hosts: true
allow_numbers: true
no_next_page_text:
no_prev_page_text:
backlink_factor: 0
sort: date
maximum_pages: 30
next_page_text: <img src=/icons/buttonr.gif border=0 align=middle width=30 height=30 alt=next>
prev_page_text: <img src=/icons/buttonl.gif border=0 align=middle width=30 height=30 alt=prev>
page_number_text: "<img src=/icons/button1.gif border=0 align=middle width=30 height=30 alt=1>" \
    "<img src=/icons/button2.gif border=0 align=middle width=30 height=30 alt=2>" \
    "<img src=/icons/button3.gif border=0 align=middle width=30 height=30 alt=3>" \
    "<img src=/icons/button4.gif border=0 align=middle width=30 height=30 alt=4>" \
    "<img src=/icons/button5.gif border=0 align=middle width=30 height=30 alt=5>" \
    "<img src=/icons/button6.gif border=0 align=middle width=30 height=30 alt=6>" \
    "<img src=/icons/button7.gif border=0 align=middle width=30 height=30 alt=7>" \
    "<img src=/icons/button8.gif border=0 align=middle width=30 height=30 alt=8>" \
    "<img src=/icons/button9.gif border=0 align=middle width=30 height=30 alt=9>" \
    "<img src=/icons/button10.gif border=0 align=middle width=30 height=30 alt=10>"
no_page_number_text: "<img src=/icons/button1.gif border=2 align=middle width=30 height=30 alt=1>" \
    "<img src=/icons/button2.gif border=2 align=middle width=30 height=30 alt=2>" \
    "<img src=/icons/button3.gif border=2 align=middle width=30 height=30 alt=3>" \
    "<img src=/icons/button4.gif border=2 align=middle width=30 height=30 alt=4>" \
    "<img src=/icons/button5.gif border=2 align=middle width=30 height=30 alt=5>" \
    "<img src=/icons/button6.gif border=2 align=middle width=30 height=30 alt=6>" \
    "<img src=/icons/button7.gif border=2 align=middle width=30 height=30 alt=7>" \
    "<img src=/icons/button8.gif border=2 align=middle width=30 height=30 alt=8>" \
    "<img src=/icons/button9.gif border=2 align=middle width=30 height=30 alt=9>" \
    "<img src=/icons/button10.gif border=2 align=middle width=30 height=30 alt=10>"
external_parsers: application/rtf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    text/rtf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/pdf->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/postscript->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/msword->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/wordperfect5.1->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/wordperfect6.0->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/msexcel->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/vnd.ms-excel->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/powerpoint->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/vnd.ms-powerpoint->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/x-shockwave-flash->text/html /opt/mailman/htdig/bin/doc2html.pl \
    application/x-shockwave-flash2-preview->text/html /opt/mailman/htdig/bin/doc2html.pl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Any ideas?

-- 
Cheers

Brian

http://www.abandonmicrosoft.co.uk
|
From: G. T. Stresen-R. <ted...@ma...> - 2006-01-03 17:17:50
|
Make sure the path to doc2html.pl is right and that htdig has permission to launch it.

Make sure you have a carriage return at the end of the .conf file. I recall finding that if I didn't have the carriage return, some things failed (but that might just have been me not understanding what was _really_ going on...)

Make sure that doc2html.pl is also properly configured. If I'm not mistaken, there are a few variables that need to be set in order for it to work properly (but considering you've run it manually, I would assume it is).

Make sure that the max_doc_size (or something like that) is significantly larger than your largest document (like twice the size). This will definitely keep htdig from indexing the documents if it is too small and, looking at your config file, it appears to be set at the default of 100000. I've got mine set to 24 MB since we have people completely unaware of what file size is all about adding things to the web site.

Hope these ideas help.

Ted Stresen-Reuter

On Jan 3, 2006, at 4:52 PM, Brian Read wrote:

> Any ideas?
|
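The "twice the size of your largest document" rule of thumb above can be checked with a short script. This is only a sketch: the function names and the doubling factor as a parameter are illustrative choices, and the directory you pass in (for example the folder behind /testdocs/files/ on the server) is an assumption about where the documents live on disk.

```python
import os


def largest_file_size(root):
    """Return the size in bytes of the largest file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            largest = max(largest, size)
    return largest


def suggested_max_doc_size(root, factor=2):
    """Apply the 'like twice the size' rule of thumb from the message above."""
    return largest_file_size(root) * factor
```

Calling `suggested_max_doc_size("/path/to/documents")` gives a value to compare against the max_doc_size line in the htdig config file.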
From: brian r. <br...@bj...> - 2006-01-03 19:58:50
|
Ted

Thanks for the suggestions...

> Make sure the path to doc2html.pl is right and that htdig has permission
> to launch it.

Tried that - widened the permissions to 0777 just in case.

> Make sure you have a carriage return at the end of the .conf file. I
> recall finding that if I didn't have the carriage return, some things
> failed (but that might just have been me not understanding what was
> _really_ going on...)

That looks ok

> Make sure that doc2html.pl is also properly configured. If I'm not
> mistaken, there are a few variables that need to be set in order for it
> to work properly (but considering you've run it manually, I would assume
> it is).

Agreed

> Make sure that the max_doc_size (or something like that) is
> significantly larger than your largest document (like twice the size).
> This will definitely keep htdig from indexing the documents if it is too
> small and looking at your config file, it appears to be set at the
> default of 100000. I've got mine set to 24 MB since we have people
> completely unaware of what file size is all about adding things to the
> web site.

I set this to 24000000, but still no luck.

-- 
Cheers

Brian

http://www.abandonmicrosoft.co.uk
|
From: G. T. Stresen-R. <ted...@ma...> - 2006-01-03 20:10:18
|
Reviewing the output from -vvvv I see this line:

Header line: Content-Type: text/plain

The .doc parser is triggered by a content-type of application/msword.

Using Firefox with the web developer extensions installed (or using telnet and manually sending an HTTP request) you can see what content-type headers Apache is really sending for Word documents. If it is not sending application/msword (and it appears that it is not), then the doc2html.pl script will not be triggered...

Let us know if that is the problem. Note that how to configure Apache to send the right header varies with your version of Apache... check the documentation to find out how to do it.

Good luck!

Ted

On Jan 3, 2006, at 7:58 PM, brian read wrote:

> Ted
>
> Thanks for the suggestions...
>
>> Make sure the path to doc2html.pl is right and that htdig has
>> permission to launch it.
>
> Tried that - widened the permissions to 0777 just in case.
>
>> Make sure you have a carriage return at the end of the .conf file. I
>> recall finding that if I didn't have the carriage return, some things
>> failed (but that might just have been me not understanding what was
>> _really_ going on...)
>
> That looks ok
>
>> Make sure that doc2html.pl is also properly configured. If I'm not
>> mistaken, there are a few variables that need to be set in order for
>> it to work properly (but considering you've run it manually, I would
>> assume it is).
>
> Agreed
>
>> Make sure that the max_doc_size (or something like that) is
>> significantly larger than your largest document (like twice the
>> size). This will definitely keep htdig from indexing the documents if
>> it is too small and looking at your config file, it appears to be set
>> at the default of 100000. I've got mine set to 24 MB since we have
>> people completely unaware of what file size is all about adding
>> things to the web site.
>
> I set this to 24000000, but still no luck.
>
>
> --
> Cheers
>
> Brian
>
> http://www.abandonmicrosoft.co.uk
|
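For anyone without Firefox or telnet to hand, the same check can be sketched in Python. The helper names below are hypothetical; the first pulls the Content-Type out of htdig -vvvv output in the "Header line:" format shown in this thread, and the second asks the server directly with a HEAD request (this assumes the server answers HEAD requests for the document).

```python
import re
import urllib.request


def content_types_from_log(log_text):
    """Return every Content-Type value found in 'Header line:' log output."""
    pattern = re.compile(r"Header line: Content-Type:\s*([^\s;]+)", re.IGNORECASE)
    return pattern.findall(log_text)


def fetch_content_type(url):
    """Send a HEAD request and return the raw Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")


# Example use against a live server (not run here):
#   fetch_content_type("http://bjsserver/testdocs/files/A/Aune_2392.doc")
# The .doc parser entry in external_parsers is keyed on application/msword,
# so any other value means doc2html.pl will not be called for that file.
```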
From: Brian R. <br...@bj...> - 2006-01-03 22:45:40
|
G. T. Stresen-Reuter wrote:

> Reviewing the output from -vvvv I see this line:
>
> Header line: Content-Type: text/plain
>
> The .doc parser is triggered by a content-type of application/msword.
>
> Using Firefox with the web developer extensions installed (or using
> telnet and manually sending an HTTP request) you can see what
> content-type headers Apache is really sending for Word documents. If
> it is not sending application/msword (and it appears that it is not),
> then the doc2html.pl script will not be triggered...
>
> Let us know if that is the problem. Note that depending on your
> version of Apache, how to configure it to send the right header
> varies... check the documentation to find out how to do it.

The version of Apache is 1.3.27-8. This is an SMEserver system (aka E-Smith), which is based on RH 7.3.

I have looked in /etc/mime.types and can see that there is no entry for application/msword. I have added one and restarted apache.

And it works!!

Many thanks for your help.

-- 
Cheers

Brian

http://www.abandonmicrosoft.co.uk
|
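The exact line added to /etc/mime.types is not quoted in the thread. In the usual mime.types layout (a MIME type followed by whitespace-separated extensions) it would presumably be `application/msword doc`. The sketch below, with hypothetical helper names, checks whether the text of such a file maps a given extension to a given type, which is a quick way to confirm the entry is present before restarting Apache.

```python
def extensions_by_type(mime_types_text):
    """Parse mime.types-style text into {mime_type: set of extensions}."""
    mapping = {}
    for raw_line in mime_types_text.splitlines():
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        extensions = {ext.lower() for ext in fields[1:]}
        mapping.setdefault(fields[0].lower(), set()).update(extensions)
    return mapping


def maps_extension(mime_types_text, mime_type, extension):
    """Return True if the text maps the extension to the given MIME type."""
    known = extensions_by_type(mime_types_text).get(mime_type.lower(), set())
    return extension.lower().lstrip(".") in known
```

To use it, read the file with `open("/etc/mime.types").read()` and call `maps_extension(text, "application/msword", ".doc")`.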