From: Douglas K. <kl...@he...> - 2004-04-20 22:40:00
Several ways of including the text content of PDF files have been mentioned on the list and in the documentation. What are the pros and cons of each?

Is there some advantage of xpdf version 3.00 over version 1.00?

doc2html.pl, pdf2html.pl, and pdftotext are all mentioned. Is there an advantage to one over another? How do they compare with other means? Is there some advantage to having more than one of these installed?

The http://www.htdig.org/contrib/ page on the Web site lists more possibilities, such as acroconv.pl, conv_doc.pl, and parsepdf.pl. What are their pros and cons?

The description of conv_doc.pl makes a distinction between parsing and converting with the statement, "External converters have two advantages over external parsers. They are easier to write, and the parsing is done in a more consistent way for all document types." I'm not sure I understand this. Does the external parser do more than the external converter, by doing some of what htdig itself would do in searching for strings? Would there be some efficiency advantage to an external parser? If an external converter parses "in a more consistent way for all document types", then how is it different from an external parser, and what kind of inconsistencies might arise? Wouldn't strings be unambiguously identified in a PDF file by any of these tools?

TIA.

Douglas

========
Douglas Kline
kl...@he...