#7 pdf forms to xml forms

Milestone: Next_Release_(example)

Status: open

Owner: nobody

Labels: None

Priority: 1

Updated: 2013-12-16

Created: 2013-11-24

Creator: Anonymous

Private: No

Using:

http://hivelocity.dl.sourceforge.net/project/pdf2xml/binaries/Linux%2064%20v1.2.7/pdftoxml.linux64.exe.1.2_7.gz

downloaded on:

2013-01-12

and applied to:

http://www.irs.gov/pub/irs-pdf/f1040.pdf

which was downloaded on:

2013-03-11

produces what looks like one <token>...</token> element per word.
For example, the attachment shows a portion of the XML output after
running it through xmlindent.

Could pdf2xml be modified so that words on the same line are concatenated
into a single element, say <text>...</text>, to make the XML easier to read?
The code here:

http://www.mobipocket.com/dev/pdf2xml/pdf2xml.zip

does that; hence, it must be possible.

Also, f1040.pdf has many PDF form fields that don't appear in the
resulting .xml file produced by pdf2xml. Could pdf2xml be modified to
produce some type of XForms fields, something like those shown here:

http://xformsinstitute.com/essentials/browse/ch02s02.php

Thanks for all the work on this.

I'm a pretty good C++ programmer and I'm trying to understand PDF;
hence, maybe I could provide some help on these features.

-regards,
Larry

1 Attachment

portion_xml.txt

Discussion

Herve Dejean - 2013-12-16

Lines roughly correspond to TEXT tags. Simply concatenating the content of the TOKEN elements inside a TEXT element recreates the line. TOKEN elements are generated because they carry the typographical information for each token.
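
The concatenation described above can be done as a post-processing step on the pdf2xml output, without changing pdf2xml itself. The following is a minimal Python sketch, not part of pdf2xml. It assumes the element names TEXT and TOKEN mentioned in this reply; the sample document, its PAGE wrapper, and its attribute names are hypothetical and only stand in for real output, which carries more attributes per TOKEN.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the PAGE wrapper and the attribute names are
# assumptions made for illustration, not copied from real pdf2xml output.
SAMPLE = """<PAGE>
  <TEXT>
    <TOKEN font-size="8">Form</TOKEN>
    <TOKEN font-size="8" bold="yes">1040</TOKEN>
  </TEXT>
  <TEXT>
    <TOKEN font-size="7">Your</TOKEN>
    <TOKEN font-size="7">first</TOKEN>
    <TOKEN font-size="7">name</TOKEN>
  </TEXT>
</PAGE>"""


def lines_from_tokens(xml_text):
    """Return one string per TEXT element, built by joining its TOKEN contents."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_el in root.iter("TEXT"):
        # Collect the text of every TOKEN under this TEXT, skipping empty ones.
        words = [(tok.text or "").strip() for tok in text_el.iter("TOKEN")]
        lines.append(" ".join(w for w in words if w))
    return lines


if __name__ == "__main__":
    for line in lines_from_tokens(SAMPLE):
        print(line)
```

Joining with a single space drops the per-token typographical attributes, which is the trade-off noted above: the line becomes easier to read, but the font information is only available on the individual TOKEN elements.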

Re: forms, pdf2xml extracts the information found in the PDF. Your PDF form is a set of text and graphical information. The form structure is not given explicitly; it has to be generated.
