#12 extract information from tagged PDF

closed
PDModel (41)
5
2010-04-07
2003-09-13
No

Add the ability to extract information from a tagged PDF
document. See taggedPDF.pdf for an example.

Discussion

  • Ben Litchfield

    Ben Litchfield - 2004-08-31
    • assigned_to: nobody --> benlitchfield
     
  • qumar

    qumar - 2006-03-13

    Logged In: YES
    user_id=1468838

    It would be nice if PDFBox could provide the ability to
    extract information from tagged PDF. Since Adobe Acrobat Reader
    shows the tags of a PDF, PDFBox should also be able to read
    them.

    For example, suppose we have a PDF file with para 1 under
    Header1, para 2 under Header2, and a table with rows and
    columns, something like:

    Header1
    This is a para 1 ,it describes about a disease.
    Header2
    This is a para2,describes remedies of disease.
    Table
    A B
    C D

    Now the tagged PDF looks like this in Adobe Acrobat Reader:

    <Heading 1>
    Header1
    <Normal>
    This is a para 1 ,it describes about a disease.
    <Heading 1>
    Header2
    <Normal>
    This is a para2,describes remedies of disease.
    <Heading 1>
    Table
    <Table>
    <TBody>
    <TR>
    <TD>
    <Normal>
    A
    <TD>
    <Normal>
    B
    <TR>
    <TD>
    <Normal>
    C
    <TD>
    <Normal>
    D

    How can we extract the headings (Header1, Header2) and the
    tabular data using PDFBox?

    This is a good feature that should be added to PDFBox's
    armory.

    Please provide this feature.
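
The tag listing above is effectively a tree, so the request amounts to walking that tree and picking elements by type. The following is a minimal sketch of that idea only; `StructNode` and `StructTreeSketch` are hypothetical stand-ins invented for illustration, not PDFBox classes, and the tree is built by hand rather than read from a PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a structure element; not a PDFBox class.
final class StructNode {
    final String type;   // e.g. "Heading 1", "Table", "TR", "TD", "Normal"
    final String text;   // text directly under this element, or null
    final List<StructNode> kids = new ArrayList<>();

    StructNode(String type, String text) {
        this.type = type;
        this.text = text;
    }

    StructNode add(StructNode kid) {
        kids.add(kid);
        return this;
    }
}

final class StructTreeSketch {
    // Collects the text of every element of the wanted type, in document order.
    static List<String> textOf(StructNode node, String wantedType) {
        List<String> out = new ArrayList<>();
        collect(node, wantedType, out);
        return out;
    }

    private static void collect(StructNode node, String wantedType, List<String> out) {
        if (node.type.equals(wantedType) && node.text != null) {
            out.add(node.text);
        }
        for (StructNode kid : node.kids) {
            collect(kid, wantedType, out);
        }
    }

    // Flattens every TR found below the given node into a row of TD cell texts.
    static List<List<String>> tableRows(StructNode node) {
        List<List<String>> rows = new ArrayList<>();
        collectRows(node, rows);
        return rows;
    }

    private static void collectRows(StructNode node, List<List<String>> rows) {
        for (StructNode kid : node.kids) {
            if (kid.type.equals("TR")) {
                List<String> row = new ArrayList<>();
                for (StructNode cell : kid.kids) {
                    if (cell.type.equals("TD")) {
                        row.add(String.join(" ", textOf(cell, "Normal")));
                    }
                }
                rows.add(row);
            } else {
                collectRows(kid, rows); // descend through TBody and other wrappers
            }
        }
    }

    private static StructNode row(String left, String right) {
        return new StructNode("TR", null)
            .add(new StructNode("TD", null).add(new StructNode("Normal", left)))
            .add(new StructNode("TD", null).add(new StructNode("Normal", right)));
    }

    // Hand-built tree mirroring the sample document from the comment.
    static StructNode sample() {
        StructNode table = new StructNode("Table", null)
            .add(new StructNode("TBody", null).add(row("A", "B")).add(row("C", "D")));
        return new StructNode("Document", null)
            .add(new StructNode("Heading 1", "Header1"))
            .add(new StructNode("Normal", "This is a para 1 ,it describes about a disease."))
            .add(new StructNode("Heading 1", "Header2"))
            .add(new StructNode("Normal", "This is a para2,describes remedies of disease."))
            .add(new StructNode("Heading 1", "Table"))
            .add(table);
    }

    public static void main(String[] args) {
        StructNode root = sample();
        System.out.println(textOf(root, "Heading 1"));
        System.out.println(tableRows(root));
    }
}
```

In a real tagged PDF the element text is not stored in the tree itself; each element points at marked content in the page streams, so a working extractor would still have to resolve those references.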

     
  • qumar

    qumar - 2006-03-15

    Logged In: YES
    user_id=1468838

    Hi,

    I was reading the PDF specification and learned that the
    structure information of a PDF is exposed through the PDSEdit
    layer. The PDSEdit layer gives access to the structure tree
    within a PDF, and its methods and objects are prefixed with PDS.
    So how can we get access to the PDSEdit layer of a PDF?

     
  • Ben Litchfield

    Ben Litchfield - 2006-03-31

    Logged In: YES
    user_id=601708

    More comments from users

    Tagged PDF will be a big thing in government because
    federal government procurement of Acrobat publishing
    technology falls under Section 508. States will likely
    follow.

    see:
    www.section508.gov

    http://www.irs.gov/pub/irs-access/
    or
    ftp://ftp.irs.gov/pub/irs-access/

     
  • Ben Litchfield

    Ben Litchfield - 2006-04-20

    Logged In: YES
    user_id=601708

    http://www.irs.gov/pub/irs-access/f1040ez_accessible.pdf
    would be a good form to start with.

    Notice that they are putting labels on the form fields.
    These labels contain metadata critical to building tax
    software in rapid fashion. Without this metadata, the
    name of the form field is meaningless. It would be nice to
    extract this information so I can combine it with other
    data about the field (name, type, location, etc). I
    already know PDFBox can extract the other information about
    the fields. I haven't done it with PDFBox, but I did it
    with iText.

     
  • qumar

    qumar - 2006-04-26

    Logged In: YES
    user_id=1468838

    Hi,
    From what I got from the PDF Reference, we have to:
    - Parse the PDF object structure tree; all structural
      elements are inside the object tree (see e.g. PDF
      Reference 1.4, chapter 9.6, "Logical Structure").
    - Parse the PDF page streams to extract drawing and text
      operations; these contain the actual content of the
      structural elements. This content is surrounded by
      BMC/BDC and EMC marked-content operators, which carry the
      information about which element object the enclosed
      content belongs to.

    Regards,
    Qumar.
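
The second step above (finding the marked-content sequences in a page stream) can be sketched roughly as follows. This is an illustration under heavy assumptions, not how PDFBox does it: the class `MarkedContentSketch` is hypothetical, the stream is already decoded to text, sequences are not nested, the MCID sits in an inline property dictionary, and text is shown only with `Tj` and simple unescaped literal strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox.
final class MarkedContentSketch {
    // Matches: /Tag << ... /MCID n ... >> BDC ... EMC  (non-nested sequences only)
    private static final Pattern SEQUENCE = Pattern.compile(
        "/(\\w+)\\s*<<[^>]*?/MCID\\s+(\\d+)[^>]*>>\\s*\\bBDC\\b(.*?)\\bEMC\\b",
        Pattern.DOTALL);

    // Matches a simple literal string followed by the Tj operator: (text) Tj
    private static final Pattern SHOWN_TEXT = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    // Maps each marked-content ID to the text shown inside its sequence.
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher seq = SEQUENCE.matcher(contentStream);
        while (seq.find()) {
            int mcid = Integer.parseInt(seq.group(2));
            StringBuilder text = new StringBuilder();
            Matcher shown = SHOWN_TEXT.matcher(seq.group(3));
            while (shown.find()) {
                text.append(shown.group(1));
            }
            result.put(mcid, text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up content stream fragment, already decoded.
        String stream =
            "/H1 <</MCID 0>> BDC BT /F1 14 Tf (Header1) Tj ET EMC\n"
            + "/P <</MCID 1>> BDC BT /F1 10 Tf (This is a para 1) Tj ET EMC\n";
        System.out.println(textByMcid(stream));
    }
}
```

The MCID values are what the structure tree elements refer to, so joining this map with the tree gives the text per element. Real content streams also use `TJ` arrays, hex strings, escaped parentheses, nested sequences, property lists named in the resource dictionary, and font encodings where the string bytes are not readable text, so a proper tokenizer is needed in practice.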

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07

    PDFBox has moved to Apache. Please log the issue there.

    http://pdfbox.apache.org

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07
    • status: open --> closed
     
