RE: [htdig] Error Msg when ht:Digging PDF files


>-----Original Message-----
>From: htd...@li... [mailto:htd...@li...] On Behalf Of Milan Andric
>Sent: Monday, November 22, 2004 2:49 PM
>To: htd...@li...
>Subject: Re: [htdig] Error Msg when ht:Digging PDF files
>
>
>On Mon, Nov 22, 2004 at 02:32:05PM -0600, Cutts III, James H. wrote:
>> I am slowly working my way through the process of getting PDF files to
>> be indexed by ht://Dig.  I've found and installed the xpdf 2.01-11 and
>> verified that pdftotext works. I've installed and modified the
>> doc2html.pl.  I've modified the pdf2html.pl files.  And I've created
>> an html file that points to my PDF files and tweaked my htdig.conf
>> to include the external_parsers: command.
>>
>> I run htdig -vv -i -c htdig.conf and I get the following errors
>>
>> External parser error: can't parse Content-Type "txt/html"
>>  URL: http://128.206.75.187/cori_kbase_jhc/pdfs/missouri_hmo/Commuity%20CarePlus-Hospitals%20Expansion%202-99pc.pdf
>>
>> Once for each pdf file.
>>
>> Any suggestions?  The file displays nicely in a web browser.  I
>> suspect that it may be the setup of the Apache server and the mime
>> type that it's sending.
>
>the default content-type header for html/apache is
>Content-Type: text/html; charset=ISO-8859-1
>
>for pdf you probably want to use
>Content-Type: application/pdf
>
>this should happen automatically with mime_module apache module. the
>mime.types file by default should contain
>application/pdf                      pdf
>
>some browsers will figure out by the file extension how to
>open a pdf file.
>
>"Content-Type: txt/html" header is just wrong, i think.
>
>maybe if you fix the header, it should work better.
>
>--
>Milan
>
>

Milan,

Thanks for the suggestion.  That was what I had thought.  I worked
carefully through my Apache configuration files and they certainly
looked correct.  I then started poking at the doc2html.pl script, as I
could add debugging statements to it.  My testing version of the
doc2html.pl script is now quite verbose.  Here is what I've identified.

1. htDig calls Apache for a page
2. Apache returns page to htDig
(I have htDig configured for a maximum size larger than the largest .PDF
in my test collection. So it's not an issue of the parser choking on a
partial PDF file.)
3. htDig identifies the page as a PDF file.
4. htDig passes the page to the external parser.
(I've correctly configured htDig to use the doc2html.pl script that
comes with htDig.  I know this is working because I've stuck debugging
comments through the script and I'm able to trace the execution through
the script.)
5. The doc2html.pl uses the file extension, magic code and MIME type to
determine the specific appropriate conversion utility.
(In this case, the conversion utility is the pdf2html.pl script that
also comes with htDig.  I know that this script is working because I've
stuck debugging comments through the script and I'm able to trace the
execution through the script.)
6. The pdf2html.pl script calls the pdftotext program to convert the
file from pdf to a text stream.  The pdf2html wraps the text stream in
HTML tags to be returned to (eventually) htDig.
(The system is dying where the pdf2html.pl script calls the pdftotext
application.  The pdftotext application is opened as a pipe returning
the results of the conversion to the pdf2html.pl script.  However, it
appears that the pdftotext program is failing in a way that causes
pdf2html.pl to abend, as the error trapping statements are not being
executed.

I checked the syntax of the command being executed from within the
pdf2html.pl at the command line.  It works perfectly, converting the
document nicely.

The documents I am trying to process are ones that we have created
ourselves with Adobe Acrobat from scanned sources. They successfully
convert when run from the command line.)
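
One way to see how the child program fails when it is not started from an
interactive shell is to capture its exit status and standard error
separately from the converted text.  The following is only a minimal
sketch, written in Python rather than Perl, and not what pdf2html.pl
does; run_and_capture is a hypothetical helper name, not part of htDig or
xpdf.

```python
import subprocess


def run_and_capture(cmd):
    """Run cmd (a list of arguments) without a shell.

    Returns (exit_status, stdout_text, stderr_text) so that a failure
    in the child can be inspected instead of silently breaking a pipe.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # On Unix, a negative exit status means the child was killed by a
    # signal (for example -11 for SIGSEGV).
    return proc.returncode, proc.stdout, proc.stderr
```

For example, run_and_capture(["pdftotext", "some.pdf", "-"]) (the file
name is a placeholder) would return whatever pdftotext wrote to standard
error along with its exit status, which could then be logged from the
same environment that htDig runs the parser in.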

So that's where I've gotten to.  I've found where the system is blowing
up, but I can't identify why.  I've been doing my testing as root.
While this is not really how I'm going to want to run the system in the
future, it minimizes rights issues.

Any further assistance and suggestions would be much appreciated.  I
could rather easily rewrite the pdf2html.pl script to call the pdftotext
application in a different manner, but I don't know if that would really
solve the problem.
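
Since the error message itself names the Content-Type "txt/html", one
more thing that may be worth ruling out is a mistyped type in htdig.conf.
The sketch below is an assumption-laden check, not a diagnosis: it
assumes the external_parsers value is a list of type/program pairs, that
a type may be written as "input->output", and that lines may be continued
with a trailing backslash.  suspicious_types and TOP_LEVEL_TYPES are
hypothetical names of my own.

```python
import shlex

# Top-level MIME types; anything else (such as "txt") is reported.
TOP_LEVEL_TYPES = {
    "application", "audio", "font", "image", "message",
    "model", "multipart", "text", "video",
}


def suspicious_types(conf_text):
    """Return MIME types in external_parsers whose major type is unknown."""
    # Join backslash-continued lines into one logical line (assumed syntax).
    joined = conf_text.replace("\\\n", " ")
    flagged = []
    for line in joined.splitlines():
        name, sep, value = line.partition(":")
        if not sep or name.strip() != "external_parsers":
            continue
        tokens = shlex.split(value)
        # Even positions are type specs, odd positions are programs.
        for spec in tokens[0::2]:
            for mime in spec.split("->"):
                if mime.split("/")[0].lower() not in TOP_LEVEL_TYPES:
                    flagged.append(mime)
    return flagged
```

Given a line such as
"external_parsers: application/pdf->txt/html /path/to/doc2html.pl" (the
path is a placeholder), the function returns ["txt/html"]; with
"text/html" in its place it returns an empty list.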

Thanks,

James H. Cutts III
CORI - 143C Mumford