From: Gilles D. <gr...@sc...> - 2002-08-07 14:12:32
According to vir...@bu...:
> I have a problem with pdf files and indexing.
> I get this error message:
> "Deleted, no excerpt: xx/http://www.mywebsite.com/doc.pdf"
> for each pdf files i have on this website....
> It recognizes pdf files because i also get this message during the
> indexing:
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> (many lines like that)
> Read a total of 3942250 bytes
> PDF::setContents(3942250 bytes)
> PDF::parse(http://www.mywebsite.com/doc.pdf) size = 3942250
> etc...
> I don't see what is wrong, the max docsize is set well, no disable in the
> robots.txt file.
> If anyone got an idea, thanks by advance.

Well, given that it's using the internal parser, which calls acroread
to convert the PDF to PS, presumably you have a working acroread program
installed and that's what you want to use. However, there are some
problems with using acroread - it crashes occasionally (especially
version 4), and there are some PDFs that it just can't deal with and
give indexable PS output. Mind you, there are some PDFs that xpdf's
pdftotext has problems with too. Are you sure that your PDF files
contain text in them, and not just scanned images of text?

In any case, you may want to try an external converter like doc2html.
See http://www.htdig.org/FAQ.html#q4.9

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
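
One rough way to check the question raised above, whether a PDF holds real text rather than scanned images, is to run it through xpdf's pdftotext and see whether any words come out. The sketch below is only an illustration and is not part of ht://Dig: it assumes pdftotext is installed and on the PATH, and the function names and the five-word threshold are hypothetical choices made for the example.

```python
import subprocess


def extract_pdf_text(pdf_path):
    """Run pdftotext on pdf_path and return whatever text it extracts.

    Passing "-" as the output file makes pdftotext write to stdout.
    Assumes the pdftotext binary (from xpdf) is available on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def has_indexable_text(text, min_words=5):
    """Heuristic check: does the extracted text contain enough real words?

    min_words is an arbitrary, hypothetical threshold. A PDF made only of
    scanned page images usually yields little more than whitespace and
    form feeds, so it falls below any small threshold.
    """
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    return len(words) >= min_words
```

Calling has_indexable_text(extract_pdf_text("doc.pdf")) on one of the affected files gives a quick hint: if it returns False, the "no excerpt" message is more likely caused by the PDF itself than by the indexer configuration.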