Thread: [htdig] Processing pdf Files

Several ways of including the text content of PDF files have been mentioned
on the list and in the documentation.  What are the pros and cons of each?

Is there any advantage to xpdf version 3.00 over version 1.00?

doc2html.pl, pdf2html.pl, and pdftotext are all mentioned.  Is there an
advantage to one over another?  How about in comparison to other means?  Is
there some advantage to having more than one of these?

The http://www.htdig.org/contrib/ page on the Web site lists more possibilities
like acroconv.pl, conv_doc.pl, and parsepdf.pl.  What are their pros and cons?

The description of conv_doc.pl makes a distinction between parsing and
converting with the statement, "External converters have two advantages over
external parsers.  They are easier to write, and the parsing is done in a more
consistent way for all document types."  I'm not sure I understand this.  Does
an external parser do more than an external converter, by taking over some of
the work htdig would otherwise do in searching for strings?  Would an external
parser have some efficiency advantage?  If an external converter parses "in a
more consistent way for all document types", then how does it differ from an
external parser, and what kinds of inconsistencies might arise?  Wouldn't any
of these tools identify the strings in a PDF file unambiguously?

TIA.

Douglas

========
Douglas Kline
kl...@he...

htdig-general