From: Chris B. <cbl...@te...> - 2003-04-14 17:03:05
I just set up a new web server (FreeBSD 4.7, Apache 2) and have htdig working correctly with HTML files.

http://www.arlington.k12.va.us/

However, none of the PDF files on my site are indexing correctly. I have the right path set in external parsers. I can run pdftotext and pdfinfo on files and get the expected info.

However, when I search for a known PDF file, this is an example result:

[40-tech-gifts-PIM.pdf]****
...> /CreationDate (D:20020730143656) /Title (40-Technology-Gifts-PIM) /Author <feff006d0076006f0075006c007400730069...
http://www.arlington.k12.va.us/schoolboard/sbp/Finance/40-tech-gifts-PIM.pdf 09-18-2002, 12033 bytes

The title isn't there and neither is any of the content.

Here is pdfinfo:

apsweb3# pdfinfo 40-tech-gifts-PIM.pdf
Title: 40-Technology-Gifts-PIM
Author: mvoultsi
Creator: Microsoft Word
Producer: Acrobat PDFWriter 4.0 for Windows
CreationDate: Tue Jul 30 14:36:56 2002
ModDate: Tue Jul 30 14:55:13 2002
Tagged: no
Pages: 2
Encrypted: no
Page size: 612 x 792 pts (letter)
File size: 12033 bytes
Optimized: no
PDF version: 1.3

Here is the first part of pdftotext on the file:

ARLINGTON PUBLIC SCHOOLS Policy Implementation Manual 40-Technology (Gifts) GENERAL GUIDELINES ASD 40-1.03 GIFTS TO SCHOOLS states "The Arlington Public Schools welcome all gifts that assist or enhance its programs and schools." The number of

Our web server is running in a FreeBSD "jail", where the server is basically chrooted to a subdirectory on the server. This means we are duplicating some programs in /usr/local/bin and /directoryjail/usr/local/bin, for example. I have copied all of the pdftotext programs and the doc2html folder into the jail.

Does anyone know what I'm doing wrong?

Thanks
Chris