From: Steve Y. <sy...@fr...> - 2004-04-15 21:00:50
Ok, I'm going mad trying to get PDFs indexed. I've got things configured as best my little brain can fathom, but rundig -vvv tells me the following for each PDF:

    Deleted, no excerpt: 50/http://www.domain.com/download/xxxx.pdf

It appears that the files are being read:

    pick: www.domain.com, # servers = 1
    50:50:2:http://www.domain.com/download/xxxx.pdf:
    Retrieval command for http://www.domain.com/download/xxxx.pdf: GET /download/xxxx.pdf HTTP/1.0
    User-Agent: htdig/3.1.6 (web...@do...)
    Referer: http://www.domain.com/download/pdf.html
    Host: www.domain.com

    Header line: HTTP/1.1 200 OK
    Header line: Date: Thu, 15 Apr 2004 20:17:47 GMT
    Header line: Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) mod_ssl/2.8.12 OpenSSL/0.9.6b DAV/1.0.3 PHP/4.1.2
    Header line: Last-Modified: Fri, 19 Dec 2003 16:51:20 GMT
    Converted Fri, 19 Dec 2003 16:51:20 GMT to Fri, 19 Dec 2003 16:51:20
    Header line: ETag: "136619f-c76f-3fe32c88"
    Header line: Accept-Ranges: bytes
    Header line: Content-Length: 51055
    Header line: Connection: close
    Header line: Content-Type: application/pdf
    Header line:
    returnStatus = 0
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 8192 from document
    Read 1903 from document
    Read a total of 51055 bytes
    size = 51055

I've confirmed that pdf2html.pl and pdftotext both work from the command line. doc2html.pl, however, just spits out garbage between the HTML tags when I try to convert a PDF with it on the command line.

I have the following line in htdig.conf:

    external_parsers: application/pdf->text/html /path/to/convertor/htdig/scripts/doc2html.pl

I've also tried calling pdf2html.pl directly in the conf file, to no avail.

Any ideas? Am I missing some config somewhere? I'm not getting any errors in the doc2html log file, so I don't know where to look...
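
One way to put a number on "garbage between the HTML tags" is to run each converter by hand and measure how much readable text is left once the tags are stripped. The following is only a minimal sketch in Python, not part of ht://Dig or its scripts. The names `visible_text`, `printable_ratio` and `check_converter` are hypothetical helpers made up for illustration, and the assumption that the converter writes its HTML to standard output should be checked against the script being tested.

```python
import re
import subprocess
import sys


def visible_text(html: str) -> str:
    """Strip tags and collapse whitespace, leaving roughly the text an indexer could excerpt."""
    no_tags = re.sub(r"<[^>]*>", " ", html)
    return " ".join(no_tags.split())


def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable ASCII; a low value suggests binary garbage."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text if " " <= ch <= "~")
    return ok / len(text)


def check_converter(command: list[str]) -> tuple[int, float]:
    """Run a converter command (assumed to write HTML to stdout) and
    return (visible character count, printable ratio) for its output."""
    result = subprocess.run(command, capture_output=True, timeout=60)
    # latin-1 never fails to decode, so raw binary output is still measurable
    text = visible_text(result.stdout.decode("latin-1"))
    return len(text), printable_ratio(text)


if __name__ == "__main__":
    # Hypothetical usage: everything after the script name is the converter
    # command, typed exactly as it was run by hand.
    if len(sys.argv) > 1:
        chars, ratio = check_converter(sys.argv[1:])
        print(f"visible chars: {chars}, printable ratio: {ratio:.2f}")
```

Running it once with the doc2html.pl command line and once with the pdf2html.pl command line, each exactly as typed by hand, gives two figures to compare. A visible character count near zero, or a low printable ratio, would suggest that the converter is handing back nothing usable for an excerpt.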