From: shams k. <sha...@ho...> - 2002-11-07 01:13:42
Hi,

I've tried doc2html.pl, but I am also having problems indexing Word
documents.

I have the following line in the doc2html.pl script:

    # version of catdoc for Word6, Word7 & Word97 files:
    my $CATDOC = '/usr/local/bin';

I also have the catdoc package installed (with the catdoc binaries in
/usr/local/bin and /usr/local/lib).

After changing the line in htdig.conf to use doc2html.pl, I get the
following error message when rundig tries to index Word documents:

    http://10.5.1.35/sme/micro/test.doc: ! UNABLE to convert size = 11264

Any suggestions on what could be wrong?

Also, is there any benefit to using doc2html.pl over conv_doc.pl to
index PDF documents for ht://Dig? I have been using conv_doc.pl and it
has given me very satisfactory results. I did try doc2html.pl as well to
see if there was any difference; however, with doc2html.pl I found that
fewer PDF documents were indexed and all the excerpts in the ht://Dig
search results were garbled, e.g.:

    0000000016 00000 n 0000001025 00000 n 0000001337 00000 n
    0000001543 00000 n 0000001750 00000 n 0000001789 00000 n
    0000002255 00000 n 0000002455 00000 n 0000002643 00000 n
    0000003042 00000 n 0000003064 00000 n 00000

Thanks for your help,
Shams

----- Original Message -----
From: "Gilles Detillieux" <gr...@sc...>
To: "shams khan" <sha...@ho...>
Cc: "ht://Dig" <htd...@li...>
Sent: Monday, November 04, 2002 9:52 PM
Subject: Re: [htdig] using conv_doc.pl to index MS Word documents

> According to shams khan:
> > I've used conv_doc.pl (with XPDF) to index PDF documents. I am now
> > trying to index MS Word documents, but am having problems.
> >
> > I've copied the conv_doc.pl script into /usr/local/bin, which contains
> > the line:
> >
> > $CATDOC = "/usr/local/bin/catdoc";
> >
> > I've installed the CATDOC package (which has placed the catdoc binary in
> > /usr/local/bin and /usr/local/lib)
> >
> > I've placed the following line within the htdig.conf file:
> >
> > application/msword->text/html /usr/local/bin/conv_doc.pl
> >
> > But when I try to re-index my website (this time, with the hope of
> > indexing Word documents too), I get the following error message, which
> > appears next to the Word documents:
> >
> > test.doc: can't determine type of file /var/www/html/htdig/dv/htdex.8KvYOL; content-type: application/msword; URL: http://10.5.1.35/sme/micro/management_self_assessment_guide/test/doc size = 11264
>
> I suggest you try doc2html.pl instead of conv_doc.pl. conv_doc allows
> only one "magic number" for recognizing Word documents, whereas I think
> doc2html allows a few different ones. Not all Word documents have the
> same identifying byte sequence at the start.
>
> --
> Gilles R. Detillieux E-mail: <gr...@sc...>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
>
> _______________________________________________
> htdig-general mailing list <htd...@li...>
> To unsubscribe, send a message to <htd...@li...> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
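
To illustrate the "magic number" point in the quoted reply, here is a
minimal Python sketch. It is not taken from conv_doc.pl or doc2html.pl;
it only shows the general idea of comparing a file's leading bytes
against more than one known signature. The OLE2 compound-file signature
(D0 CF 11 E0 A1 B1 1A E1) and the RTF prefix are well known; the Word 2.0
entry, the table as a whole, and the names sniff_word_type and sniff_file
are assumptions made for illustration.

```python
import sys
from typing import Optional

# Leading byte sequences ("magic numbers") and what they suggest.
# Illustrative table only: the Word 2.0 entry is an assumption, and the
# real converter scripts may recognize a different set of signatures.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (Word 6/95/97)"),
    (b"\xdb\xa5\x2d\x00", "Word 2.0 document (assumed signature)"),
    (b"{\\rtf", "Rich Text Format"),
]


def sniff_word_type(head: bytes) -> Optional[str]:
    """Return a description if head starts with a known signature."""
    for magic, description in SIGNATURES:
        if head.startswith(magic):
            return description
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read the first 8 bytes of path and classify them."""
    with open(path, "rb") as handle:
        return sniff_word_type(handle.read(8))


if __name__ == "__main__":
    # Usage (hypothetical script name): python sniff.py test.doc
    for path in sys.argv[1:]:
        result = sniff_file(path) or "unrecognized leading bytes"
        print(f"{path}: {result}")
```

Running something like this against a saved copy of test.doc would show
whether the file starts with one of the listed signatures; a converter
that checks for only one of them would reject the others with a "can't
determine type of file" style of error.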