Extract Excel sheet from PDF

  • Bernhard

    Bernhard - 2009-05-05

    Hi all,

    I'm new to the PDF format and would appreciate some help.

    I have a one-page PDF document that was generated from an Excel sheet, and I need to extract the text from the PDF in a structured way that mirrors the Excel sheet.
    I ran the text extraction example from the distribution and managed to get the text out of the document, but there is no structure to it.

    Is there a fairly simple way of doing this? Where to start?

    Thanks, all, for the help


    • mtraut

      mtraut - 2009-05-05

      Sorry - there is no way to do this in a SIMPLE manner...

      To extract such information you have to implement a lot of heuristic strategies, mostly based on alignment, spacing and other graphical elements available on the page.

      We currently do not provide such algorithms.

      The most basic approach may be to detect "virtual" tab characters wherever the gap between two pieces of text exceeds a certain threshold. The tabbed text may then reflect the Excel structure...
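
      A minimal sketch of that gap idea, written in plain Java and independent of the library. The Fragment type, the joinWithTabs name and the tabGap threshold are hypothetical, introduced only for illustration; it assumes you have already collected the text fragments of one line together with their horizontal positions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of any PDF library.
final class VirtualTabs {

    /** One piece of text on a line with its left edge and width (page units). */
    static final class Fragment {
        final String text;
        final double x;
        final double width;

        Fragment(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the fragments of a single line from left to right and inserts a
     * tab wherever the horizontal gap between two neighbours exceeds tabGap.
     * Smaller gaps are left alone (fragments are simply concatenated).
     */
    static String joinWithTabs(List<Fragment> line, double tabGap) {
        // Work on a sorted copy so the caller's list is not modified.
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble(f -> f.x));

        StringBuilder sb = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : sorted) {
            if (prev != null) {
                double gap = f.x - (prev.x + prev.width);
                if (gap > tabGap) {
                    sb.append('\t'); // the "virtual" tab
                }
            }
            sb.append(f.text);
            prev = f;
        }
        return sb.toString();
    }
}
```

      The right value for tabGap depends on the font size and the layout of the sheet, so it would have to be tuned per document.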

    • Bernhard

      Bernhard - 2009-05-05

      Thanks for the reply,

      A basic approach may work, because I "only" need to find which "column" (tabbed distance) a piece of text is located in.

      How can I find these "virtual" tab characters and the mapping to the text?


      • mtraut

        mtraut - 2009-05-05

        In class CSTextExtractor, the method onCharacterFound shows how the standard text extraction takes place. You can play around with the distances there to trigger a column switch.

        In the content stream itself there is nothing like a "TAB" character.

    • Bernhard

      Bernhard - 2009-05-06

      Thanks for the tip,

      After investigating the input parameters of onCharacterFound() (PDGlyphs glyphs, Rectangle2D rect), I found that each character is mapped to a rectangle, which in turn is located at a position in some kind of grid layout on the page. With this information I can find the position of one character relative to another. Bingo, that's all I need.
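
      A rough sketch of how such a rectangle could be mapped to a column, assuming the left edge of every column is already known (for example, measured once from the header row). ColumnLocator, columnOf, columnStarts and tolerance are made-up names for illustration and not part of the library; only java.awt.geom.Rectangle2D is real.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of any PDF library.
final class ColumnLocator {

    /**
     * Returns the index of the column a character rectangle falls into,
     * or -1 if it lies to the left of the first column.
     *
     * columnStarts holds the left x coordinate of each column in ascending
     * order; tolerance absorbs small positioning differences.
     */
    static int columnOf(Rectangle2D rect, double[] columnStarts, double tolerance) {
        int column = -1;
        for (int i = 0; i < columnStarts.length; i++) {
            if (rect.getX() + tolerance >= columnStarts[i]) {
                column = i; // still at or right of this column's start
            } else {
                break;      // starts are ascending, no later column can match
            }
        }
        return column;
    }
}
```

      Called with the rectangle of each character, this gives a column index that can be used to group the extracted text per column.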


