From: David A. <D.J...@so...> - 2002-03-04 11:38:51
Deleted, no excerpt with pdf files

Try running doc2html.pl from the command line:

    /opt/www/htdig/bin/doc2html.pl filename.pdf application/pdf

where filename.pdf is the full path name of a PDF document.

--
David Adams
Computing Services
Southampton University

----- Original Message -----
From: Steve Marshall
To: htd...@li...
Sent: Monday, March 04, 2002 10:08 AM
Subject: [htdig] Deleted, no excerpt with pdf files

htDig is working fine for us with a large intranet (2Gig or so) which is entirely graphics & .html. I want to index pdfs too, of course.

I am running the doc2html.pl script on a very simple (test) index.html file which links only to a .GIF and a small .pdf file. (I have tried parse_doc & conv_doc too.)

I have the latest XPDF, and pdftotext works fine on the same .pdf at the command line and produces a perfect .txt file.

When I run htDig with the -vvvvv option it lists all the lines in that .pdf file as plain text, so it is apparently parsing properly.

However, when I try to htmerge I get a "Deleted, no excerpt" message. The wordlist file is tiny.

I can see from an earlier response that the problem might be that the parser hasn't emitted a usable "h" record - how would I go about fixing that? Would this apply to a .txt file - the test output hasn't got any tags (of course).

This is the only relevant uncommented line in htdig.conf:

    external parsers application/pdf->text/html /opt/www/htdig/bin/doc2html.pl

Any help gratefully appreciated.

Steve Marshall
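
The command-line check suggested in the reply at the top of this message can be wrapped in a small shell sketch. It runs the converter by hand with the same two arguments (file name, then MIME type) and counts the words left once anything that looks like an HTML tag is stripped; a count of zero would suggest the converter is producing no indexable text. The helper names count_words and check_converter are hypothetical, the CONVERTER default is simply the path quoted in the message, and the tag stripping is deliberately crude (it does not handle tags that span lines).

```shell
#!/bin/sh
# Sketch only: run an external converter by hand and count the words it emits.
# count_words and check_converter are hypothetical helper names, not part of
# ht://Dig. CONVERTER defaults to the path quoted in the message.

CONVERTER="${CONVERTER:-/opt/www/htdig/bin/doc2html.pl}"

# Replace anything that looks like an HTML tag on stdin with a space,
# then count the remaining words.
count_words() {
    sed -e 's/<[^>]*>/ /g' | wc -w | tr -d ' '
}

# Run the converter as: <converter> <file> <mime-type>.
# Prints the word count; returns non-zero if the converter fails
# or emits no text at all.
check_converter() {
    file="$1"
    mime="${2:-application/pdf}"
    output=$("$CONVERTER" "$file" "$mime") || {
        echo "converter exited with an error" >&2
        return 2
    }
    words=$(printf '%s\n' "$output" | count_words)
    echo "$words"
    [ "$words" -gt 0 ]
}

# Only run when a file name is given on the command line.
if [ "$#" -ge 1 ]; then
    check_converter "$@"
fi
```

Saved under a name of your choosing and invoked with the full path of a PDF document as its argument, it prints a single number. If that number is healthy but htmerge still reports "Deleted, no excerpt", the problem probably lies in how htdig is invoking the converter rather than in the converter itself.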