Using:
downloaded on:
2013-01-12
and applied to:
http://www.irs.gov/pub/irs-pdf/f1040.pdf
which was downloaded on:
2013-03-11
produces what looks like a <TOKEN>...</TOKEN> element for each word.
For example, the attachment shows a portion of the XML output after
running it through xmlindent.
Could pdf2xml be modified so that words on the same line are concatenated
in a single say,
The code here:
http://www.mobipocket.com/dev/pdf2xml/pdf2xml.zip
does that; hence, it must be possible.
Also, f1040.pdf has many PDF form fields which don't appear in the
resulting .xml file produced by pdf2xml. Could pdf2xml be modified to
produce some type of XForms fields, something like those shown here:
http://xformsinstitute.com/essentials/browse/ch02s02.php
Thanks for all the work on this.
I'm a pretty good C++ programmer and I'm trying to understand PDF;
hence, maybe I could provide some help on these features.
-regards,
Larry
Anonymous
Lines roughly correspond to TEXT tags: simply concatenating the content of the TOKEN elements in a TEXT element recreates the line. TOKEN elements are generated separately because they carry typographical information for each token.
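As a rough illustration of that concatenation, here is a minimal Python sketch that post-processes the XML output. It assumes TOKEN elements are nested inside TEXT elements, as described above; the sample document, its attribute names, and the function name lines_from_xml are made up for the example and are not taken from real pdf2xml output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element nesting follows the description above,
# but the attributes and values are invented for illustration only.
SAMPLE = """<PAGE>
  <TEXT>
    <TOKEN font-size="8">Form</TOKEN>
    <TOKEN font-size="18" bold="yes">1040</TOKEN>
  </TEXT>
  <TEXT>
    <TOKEN>Your</TOKEN>
    <TOKEN>first</TOKEN>
    <TOKEN>name</TOKEN>
  </TEXT>
</PAGE>"""


def lines_from_xml(xml_text):
    """Return one string per TEXT element, joining its TOKEN contents."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_el in root.iter("TEXT"):
        # Collect the text of each TOKEN; empty tokens are skipped.
        words = [(tok.text or "").strip() for tok in text_el.iter("TOKEN")]
        lines.append(" ".join(w for w in words if w))
    return lines


if __name__ == "__main__":
    for line in lines_from_xml(SAMPLE):
        print(line)
```

Joining with a single space discards the per-token typographical attributes, so this only suits cases where the plain text of each line is all that is needed.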
RE: forms, pdf2xml extracts the information found in the PDF. Your PDF form is a set of text and graphical information. The form structure is not given explicitly; it has to be generated.