
Extract Excel sheet from PDF

Help
Bernhard
2009-05-05
2013-05-28
  • Bernhard
    Bernhard
    2009-05-05

    Hi all,

    I'm new to the PDF format and would appreciate some help.

    I have a one-page PDF document that was generated from an Excel sheet, and I need to extract the text from the PDF in a structured way, like in the Excel sheet.
    I ran the text extraction example from the distribution and managed to get the text out of the document, but it has no structure.

    Is there a fairly simple way of doing this? Where should I start?

    Thanks, all, for your help

    //Bernhard

     
    • mtraut
      mtraut
      2009-05-05

      Sorry - there is no way to do this in a SIMPLE manner...

      To extract such information you have to implement a lot of heuristic strategies, mostly based on alignment, spacing and other graphical elements available on the page.

      We currently do not provide such algorithms.

      The most basic approach may be to insert a "virtual" tab character wherever the space between two pieces of text exceeds a certain threshold. The tabbed text may then reflect the Excel structure...
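
A minimal sketch of that gap-based idea, not taken from the library: it assumes you have already collected the characters of one text line together with their bounding boxes. The `Glyph` holder, the `joinWithTabs` method and the `tabGap` threshold are hypothetical names introduced only for illustration, and a suitable threshold has to be found by experiment for each document.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VirtualTabs {

    /** Hypothetical holder for one extracted character and its bounding box. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Joins the glyphs of a single line from left to right and inserts a tab
     * wherever the horizontal gap between two neighbours exceeds tabGap.
     */
    static String joinWithTabs(List<Glyph> line, double tabGap) {
        // Work on a copy sorted by left edge so the caller's list is untouched.
        List<Glyph> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.box.getMinX()));

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            // Gap = left edge of this glyph minus right edge of the previous one.
            if (prev != null && g.box.getMinX() - prev.box.getMaxX() > tabGap) {
                sb.append('\t');
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

Splitting the result on the tab character then gives one string per presumed cell. Empty cells are not detected this way, because a single wide gap yields only one tab.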

       
    • Bernhard
      Bernhard
      2009-05-05

      Thanks for the reply,

      A basic approach may work, because I "only" need to find out in which "column" (tab distance) a piece of text is located.

      How can I find these "virtual" tab characters and the mapping to the text?

      //B

       
      • mtraut
        mtraut
        2009-05-05

        In the class CSTextExtractor, the method onCharacterFound shows how the standard text extraction takes place. You can play around with the distances there to trigger a column switch.

        In the content stream itself there is nothing like a "TAB" character.

         
    • Bernhard
      Bernhard
      2009-05-06

      Thanks for the tip,

      After investigating the input parameters (PDGlyphs glyphs, Rectangle2D rect) of the method onCharacterFound(), I found that each character is mapped to a rectangle, which in turn is located at a position in some kind of grid layout in the file. With this information I can find the position of one character relative to another. Bingo, that's all I need.

      //Bernhard
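
One way to turn those relative positions into column numbers is sketched below. It is an illustration under assumptions, not the poster's actual code: it assumes one `Rectangle2D` per cell (for example, the box of the first character of each cell) and left-aligned cells. `ColumnFinder`, `columnStarts`, `columnOf` and `tolerance` are hypothetical names.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ColumnFinder {

    /**
     * Collects the left edges of all cell boxes in ascending order and keeps
     * one representative per column: an edge closer than tolerance to the
     * previously kept edge is treated as belonging to the same column.
     */
    static List<Double> columnStarts(List<Rectangle2D> cellBoxes, double tolerance) {
        TreeSet<Double> xs = new TreeSet<>();
        for (Rectangle2D r : cellBoxes) {
            xs.add(r.getMinX());
        }
        List<Double> starts = new ArrayList<>();
        for (double x : xs) {
            if (starts.isEmpty() || x - starts.get(starts.size() - 1) > tolerance) {
                starts.add(x);
            }
        }
        return starts;
    }

    /**
     * Returns the index of the last column whose start lies at or left of the
     * box (within tolerance), or -1 if the box is left of the first column.
     */
    static int columnOf(Rectangle2D box, List<Double> starts, double tolerance) {
        int col = -1;
        for (int i = 0; i < starts.size(); i++) {
            if (box.getMinX() >= starts.get(i) - tolerance) {
                col = i;
            }
        }
        return col;
    }
}
```

Right-aligned cells, which Excel uses for numbers by default, would need the same logic applied to `getMaxX()` instead of `getMinX()`.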