I'm new to the PDF format and would appreciate some help.
I have a one-page PDF document that was generated from an Excel sheet, and I need to extract the text from the PDF in a structured way, like the Excel sheet.
I ran the text extraction example from the distribution and managed to get the text from the document, but there is no structure to it.
Is there a fairly simple way of doing this? Where should I start?
Thanks, all, for your help.
Sorry, there is no way to do this in a SIMPLE manner...
To extract such information you have to implement a lot of heuristic strategies, mostly based on alignment, spacing, and other graphical elements available on the page.
We currently do not provide such algorithms.
The most basic approach may be to detect "virtual" tab characters wherever the horizontal space between pieces of text exceeds a certain threshold. The tabbed text may then reflect the Excel structure...
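A minimal sketch of that gap heuristic, in case it helps. It does not use the library's API at all: the Glyph holder, the joinWithTabs method, and the tabGap threshold are hypothetical names made up for illustration. It assumes you have already collected the glyphs of one text line, in left-to-right order, together with their bounding boxes.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

public class VirtualTabs {

    /** Hypothetical holder for a glyph's text and bounding box; not a library type. */
    public static final class Glyph {
        final String text;
        final Rectangle2D box;

        public Glyph(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Joins the glyphs of a single line, which must already be ordered left to right.
     * A '\t' is emitted whenever the horizontal gap between two neighbours
     * is larger than tabGap (an assumed threshold, in the same units as the boxes).
     */
    public static String joinWithTabs(List<Glyph> line, double tabGap) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance from the right edge of the previous glyph
                // to the left edge of the current one.
                double gap = g.box.getMinX() - prev.box.getMaxX();
                if (gap > tabGap) {
                    sb.append('\t');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }
}
```

A sensible value for tabGap probably depends on the font size and on how the sheet was rendered, so expect to tune it against your own document.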
Thanks for the reply.
A basic approach may work, because I "only" need to find which "column" (tabbed distance) a piece of text is located in.
How can I find these "virtual" tab characters and map them to the text?
In class CSTextExtractor, method onCharacterFound, you can see how the standard text extraction takes place. You may play around with the distances to trigger a column switch.
In the content stream itself there is nothing like a "TAB" character.
Thanks for the tip.
After investigating the input parameters (PDGlyphs glyphs, Rectangle2D rect) of method onCharacterFound(), I found that each character is mapped to a rectangle, which in turn is located at a position in some kind of grid layout in the file. With this information I can find the position of a character relative to another. Bingo, that's all I need.
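For anyone finding this later, here is a rough sketch of how a character's rectangle could be mapped to a column. It only uses java.awt.geom.Rectangle2D; the ColumnLocator class, its methods, the columnStarts array, and the tolerance are hypothetical and not part of the library. It assumes you already know (or have measured) the left edge of each column.

```java
import java.awt.geom.Rectangle2D;

public class ColumnLocator {

    /**
     * Returns the index of the column the rectangle starts in, or -1 if it lies
     * left of the first column. columnStarts holds the assumed left edge of each
     * column, sorted ascending; tolerance absorbs small positioning differences.
     */
    public static int columnOf(Rectangle2D rect, double[] columnStarts, double tolerance) {
        int col = -1;
        for (int i = 0; i < columnStarts.length; i++) {
            if (rect.getMinX() + tolerance >= columnStarts[i]) {
                col = i;
            } else {
                break; // starts are sorted, so no later column can match
            }
        }
        return col;
    }

    /** Treats two rectangles as being on the same row if their vertical extents overlap. */
    public static boolean sameRow(Rectangle2D a, Rectangle2D b) {
        return a.getMinY() < b.getMaxY() && b.getMinY() < a.getMaxY();
    }
}
```

Calling columnOf with the rect passed to onCharacterFound() would then give a column index per character, and sameRow can be used to group characters into rows.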