From: Cutts I. J. H. <Cu...@mi...> - 2004-11-23 23:37:28
>-----Original Message-----
>From: htd...@li... [mailto:htd...@li...] On Behalf Of Milan Andric
>Sent: Monday, November 22, 2004 2:49 PM
>To: htd...@li...
>Subject: Re: [htdig] Error Msg when ht:Digging PDF files
>
>
>On Mon, Nov 22, 2004 at 02:32:05PM -0600, Cutts III, James H. wrote:
>> I am slowly working my way through the process of getting PDF files to
>> be indexed by ht://Dig. I've found and installed the xpdf 2.01-11 and
>> verified that pdftotext works. I've installed and modified the
>> doc2html.pl. I've modified the pdf2html.pl files. And I've created
>> an html file that points to my PDF files and tweaked my htdig.conf
>> to include the external_parsers: command.
>>
>> I run htdig -vv -i -c htdig.conf and I get the following errors
>>
>> External parser error: can't parse Content-Type "txt/html"
>> URL:
>> http://128.206.75.187/cori_kbase_jhc/pdfs/missouri_hmo/Commuity%20CarePlus-Hospitals%20Expansion%202-99pc.pdf
>>
>> Once for each pdf file.
>>
>> Any suggestions? The file displays nicely in a web browser. I
>> suspect that it may be the setup of the Apache server and the mime
>> type that it's sending.
>
>the default content-type header for html/apache is
>Content-Type: text/html; charset=ISO-8859-1
>
>for pdf you probably want to use
>Content-Type: application/pdf
>
>this should happen automatically with mime_module apache module. the mime.types file by default should contain
>application/pdf pdf
>
>some browsers will figure out by the file extension how to
>open a pdf file.
>
>"Content-Type: txt/html" header is just wrong, i think.
>
>maybe if you fix the header, it should work better.
>
>--
>Milan
>
>

Milan,

Thanks for the suggestion. That was what I had thought. I worked carefully through my Apache configuration files and they certainly looked correct. I then started poking at the doc2html.pl script, since I could add debugging statements to it.
My testing version of the doc2html.pl script is now quite verbose. Here is what I've identified.

1. htDig calls Apache for a page.

2. Apache returns the page to htDig. (I have htDig configured for a maximum size larger than the largest .PDF in my test collection, so it's not an issue of the parser choking on a partial PDF file.)

3. htDig identifies the page as a PDF file.

4. htDig passes the page to the external parser. (I've correctly configured htDig to use the doc2html.pl script that comes with htDig. I know this is working because I've stuck debugging comments through the script and I'm able to trace the execution through it.)

5. The doc2html.pl script uses the file extension, magic code and MIME type to determine the appropriate conversion utility. (In this case, the conversion utility is the pdf2html.pl script that also comes with htDig. I know that this script is working because I've stuck debugging comments through the script and I'm able to trace the execution through it.)

6. The pdf2html.pl script calls the pdftotext program to convert the file from PDF to a text stream. The pdf2html.pl script wraps the text stream in HTML tags to be returned (eventually) to htDig.

   (The system is dying where the pdf2html.pl script calls the pdftotext application. The pdftotext application is opened as a pipe that returns the results of the conversion to the pdf2html.pl script. However, it appears that the pdftotext program is failing in a way that causes pdf2html.pl to abend, as the error trapping statements are not being executed.

   I checked the syntax of the command being executed from within pdf2html.pl by running it at the command line. It works perfectly, converting the document nicely.

   The documents I am trying to process are ones that we have created ourselves with Adobe Acrobat from scanned sources. They convert successfully when run from the command line.)

So that's where I've gotten to. I've found where the system is blowing up, but I can't identify why.
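
For a failure like the one in step 6, one way to narrow it down is to run the converter as a child process and keep its exit status and stderr, rather than only reading from a pipe. The following is a minimal sketch in Python (not the Perl of pdf2html.pl, and not what that script actually does). The function name run_converter, the timeout value, and the "pdftotext <file> -" invocation that writes the text to stdout are all assumptions made for illustration.

```python
import subprocess
import sys


def run_converter(cmd, timeout=60):
    """Run an external converter; return (exit_code, stdout, stderr).

    exit_code is None when the program could not be started or timed out.
    The name and the default timeout are illustrative assumptions.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,   # keep both stdout and stderr
            text=True,
            errors="replace",      # do not die on odd bytes in the output
            timeout=timeout,
        )
    except FileNotFoundError as exc:
        # Typical when the program is not on the PATH of the calling process
        return None, "", f"could not start {cmd[0]!r}: {exc}"
    except subprocess.TimeoutExpired:
        return None, "", f"{cmd[0]!r} timed out after {timeout}s"
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed invocation: "pdftotext <file> -" sends the text to stdout.
    code, text, err = run_converter(["pdftotext", sys.argv[1], "-"])
    # Diagnostics go to stderr so stdout carries only the converted text.
    print(f"exit code: {code!r}, stderr: {err.strip()!r}", file=sys.stderr)
    if code == 0:
        sys.stdout.write(text)
```

Run by hand with a PDF path as its only argument, this prints the exit code and any error text from pdftotext. Calling the same function from inside the parser chain would show whether the converter behaves differently there than at the command line.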

I've been doing my testing as root. While this is not really how I'm going to want to run the system in the future, it minimizes rights issues.

Any further assistance and suggestions would be much appreciated. I could rather easily rewrite the pdf2html.pl script to call the pdftotext application in a different manner, but I don't know if that would really solve the problem.

Thanks,

James H. Cutts III
CORI - 143C Mumford