Can anybody guide me on how to read superscripted characters from a PDF? The jPod library does not differentiate them from normal characters. Also, how can I identify a TABLE? The jPod library does not differentiate between table rows and boxes.
jPod acts on the PDF object model, which is not HTML or anything like it. There's a subset of PDF (Tagged PDF) that may better support your needs, but it is not currently in the scope of jPod.
If we look only at classic rendering, the PDF creator *MAY* use the operator "Ts" (text rise) to indicate super/subscript, but most of the time doesn't. Instead, it simply scales and moves the rendered text itself. A "Table" concept simply does not exist.
Ergo: You can build additional semantics on top of the jPod PDF object model, taking into account the distances and relative sizes. But there's nothing prebuilt in jPod and nothing in the PDF rendering itself.
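To illustrate what "taking into account the distances and relative sizes" could look like, here is a minimal sketch in plain Java. Nothing in it is jPod API: `ScriptClassifier`, `Glyph` and the threshold values are made up for illustration, and the code assumes you have already collected a baseline position and font size for each glyph, with y growing upwards as in PDF user space.

```java
/**
 * Hypothetical heuristic, not part of jPod: guess super/subscript from the
 * baseline shift and size change between two neighbouring glyphs.
 */
public final class ScriptClassifier {

    public enum Script { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Illustrative glyph data; a real extractor would fill this in. */
    public static final class Glyph {
        final double baselineY; // baseline position, y axis pointing up
        final double fontSize;  // effective font size after scaling

        public Glyph(double baselineY, double fontSize) {
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    // Assumed thresholds, expressed as fractions of the previous font size.
    private static final double SIZE_RATIO = 0.85; // "noticeably smaller"
    private static final double MIN_SHIFT = 0.2;   // smallest shift that counts
    private static final double MAX_SHIFT = 1.0;   // larger shifts look like a new line

    public static Script classify(Glyph previous, Glyph current) {
        double shift = current.baselineY - previous.baselineY;
        boolean smaller = current.fontSize < previous.fontSize * SIZE_RATIO;
        double min = previous.fontSize * MIN_SHIFT;
        double max = previous.fontSize * MAX_SHIFT;

        if (smaller && shift > min && shift < max) {
            return Script.SUPERSCRIPT;
        }
        if (smaller && shift < -min && shift > -max) {
            return Script.SUBSCRIPT;
        }
        return Script.NORMAL;
    }
}
```

The thresholds are guesses and will need tuning per document. Footnote markers, drop caps and formulas are typical cases where a rule this simple gives the wrong answer.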
Thank you for this information.
I observed that in jPod, code was written for "Ts" operator but the extractor cannot recognize super-scripted fonts. Could you please guide me, from where should I start if I need this feature enabled?
If you want to enhance the CSTextExtractor (or write one of your own) you first need an appropriate representation. The CSTextExtractor only records glyphs in a String - so you have to adapt this to something you can work on later.
The current text rise is already handled and can be read from "textState.rise" whenever a new character event is happening.
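As a rough sketch of such a representation: instead of appending to a single String, the extractor could hand every character, together with the rise in effect at that moment, to a small recorder that groups characters into runs. `RiseRecorder` and `onCharacter` are hypothetical names, not jPod API, and the rise is assumed to be a plain number. The call from your own extractor's character handling, passing the value read from "textState.rise", is left out here on purpose.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not part of jPod: collects characters into runs
 * that share the same text rise.
 */
public final class RiseRecorder {

    /** A stretch of consecutive characters with one rise value. */
    public static final class Run {
        private final StringBuilder text = new StringBuilder();
        private final double rise;

        Run(double rise) {
            this.rise = rise;
        }

        public String getText() {
            return text.toString();
        }

        public double getRise() {
            return rise;
        }
    }

    private final List<Run> runs = new ArrayList<>();

    /** Call once per character event with the rise that is currently set. */
    public void onCharacter(char c, double rise) {
        Run last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // Start a new run whenever the rise differs from the previous one.
        if (last == null || Double.compare(last.rise, rise) != 0) {
            last = new Run(rise);
            runs.add(last);
        }
        last.text.append(c);
    }

    public List<Run> getRuns() {
        return runs;
    }
}
```

A run with a positive rise is then a superscript candidate and a negative one a subscript candidate. This only covers PDFs whose creator actually used "Ts".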
Thanks for your guidance. I am able to read super-scripted font :)
Any idea on how to enhance/customize jPod to recognize HTML-like structures?
Sorry - we are not involved in dark arts….
To start with, you may build a CSTextExtractor acting on a model that remembers properties like size, font, color etc. The next step would take distances into account to build paragraphs. "Ts" or a slight change of the baseline goes into a "super/subscript" property and so on…
After this you serialize the result in HTML syntax….
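The serialization step could look roughly like the following, assuming the earlier steps have already produced paragraphs made of spans. `HtmlSerializer` and `Span` are hypothetical and not part of jPod, and only the super/subscript property is mapped. Size, font and color would need further attributes or CSS.

```java
import java.util.List;

/**
 * Hypothetical serializer, not part of jPod: writes paragraphs of spans
 * as simple HTML, mapping the rise to sup/sub elements.
 */
public final class HtmlSerializer {

    /** Illustrative span: text plus the rise it was drawn with. */
    public static final class Span {
        final String text;
        final double rise;

        public Span(String text, double rise) {
            this.text = text;
            this.rise = rise;
        }
    }

    /** Escapes the characters that are special in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static String toHtml(List<List<Span>> paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (List<Span> paragraph : paragraphs) {
            sb.append("<p>");
            for (Span span : paragraph) {
                String text = escape(span.text);
                if (span.rise > 0) {
                    sb.append("<sup>").append(text).append("</sup>");
                } else if (span.rise < 0) {
                    sb.append("<sub>").append(text).append("</sub>");
                } else {
                    sb.append(text);
                }
            }
            sb.append("</p>\n");
        }
        return sb.toString();
    }
}
```

The hard part is not this step but deciding where spans and paragraphs begin and end.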
Problems will start on all more sophisticated elements, especially tables. Just try to write down the rules that make you recognize tables…