Re: [htdig] Deleted, no excerpt with pdf files

According to vir...@bu...:
> I have a problem with pdf files and indexing.
> I get this error message:
> "Deleted, no excerpt: xx/http://www.mywebsite.com/doc.pdf"
> for each PDF file I have on this website.
> It recognizes PDF files, because I also get these messages during
> indexing:
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> (many lines like that)
> Read a total of 3942250 bytes
> PDF::setContents(3942250 bytes)
> PDF::parse(http://www.mywebsite.com/doc.pdf) size = 3942250
> etc...
> I don't see what is wrong: the max doc size is set high enough, and
> nothing is disallowed in the robots.txt file.
> If anyone has an idea, thanks in advance.

Well, given that it's using the internal parser, which calls acroread to
convert the PDF to PS, presumably you have a working acroread program
installed and that's what you want to use.  However, there are some
problems with using acroread: it crashes occasionally (especially
version 4), and there are some PDFs that it just can't convert into
indexable PS output.  Mind you, there are some PDFs that xpdf's
pdftotext has problems with too.  Are you sure that your PDF files
contain actual text, and not just scanned images of text?
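
One quick way to check that last point is to run a file through
pdftotext by hand and see whether any real words come out.  The
following is only a minimal sketch, assuming xpdf's pdftotext is on
your PATH; the function names and the 20-word threshold are arbitrary
choices for illustration, not anything ht://Dig itself uses.

```python
import subprocess


def looks_like_text(extracted: str, min_words: int = 20) -> bool:
    """Return True if the extracted output has at least min_words
    tokens that contain a letter (threshold is an arbitrary assumption)."""
    words = [w for w in extracted.split() if any(c.isalpha() for c in w)]
    return len(words) >= min_words


def pdf_has_text(path: str, min_words: int = 20) -> bool:
    """Run pdftotext on a local PDF and report whether it yields words.

    Assumes the pdftotext binary is installed; "-" as the output file
    tells it to write the extracted text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate output that is not valid UTF-8
        check=False,
    )
    if result.returncode != 0:
        # pdftotext could not process the file at all.
        return False
    return looks_like_text(result.stdout, min_words)
```

Calling pdf_has_text("doc.pdf") on a downloaded copy of one of the
problem files would then give a rough answer: a scanned-image PDF
typically yields little or no text, in which case no converter will
produce an excerpt from it.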

In any case, you may want to try an external converter like doc2html.
See http://www.htdig.org/FAQ.html#q4.9
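
As a rough sketch of what that FAQ entry describes, an external
converter is hooked in through the external_parsers attribute in
htdig.conf.  The install path below is hypothetical, and the exact
script name should be checked against the documentation that comes
with the converter:

```
# htdig.conf (sketch; the script path is an assumption, adjust to your install)
external_parsers: application/pdf->text/html /usr/local/bin/doc2html.pl
```

After a change like this, the site needs to be reindexed from scratch
so that the PDF files are fetched and converted again.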

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)