I want to read PDFs in Indian languages like Tamil, Malayalam, Hindi, etc. I used jPod to read the content from the PDF, but I could not get proper text.
If anybody has tried the same, can you please share the code?
As I'm not fluent in Tamil, this is really not a tested feature and *may* be a jPod deficiency. But in theory, encodings, multibyte encodings and /ToUnicode are completely implemented.
So be sure to check that your PDF contains a proper (multi)byte font with an invertible character map or a ToUnicode map.
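To make that a bit more concrete: an extractable font usually has a font dictionary carrying a /ToUnicode entry (or an encoding that can be inverted). A crude first check, sketched here in plain Java without any jPod calls, is to scan the raw file for that name. The class name, the sample font dictionary, its font name and its object numbers are all made up for illustration. The scan will also miss fonts stored in compressed object streams, so a count of zero is not conclusive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ToUnicodeScanSketch {
    // Counts how often a PDF name such as "/ToUnicode" appears in the raw bytes.
    // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
    static int countToken(byte[] pdfBytes, String token) {
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int count = 0;
        for (int at = text.indexOf(token); at >= 0;
                 at = text.indexOf(token, at + token.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical composite (two-byte) font dictionary, as it might
        // appear uncompressed in a PDF file.
        String sample =
              "12 0 obj\n"
            + "<< /Type /Font /Subtype /Type0 /BaseFont /ABCDEF+SomeTamilFont\n"
            + "   /Encoding /Identity-H /DescendantFonts [13 0 R]\n"
            + "   /ToUnicode 14 0 R >>\n"
            + "endobj\n";

        // Scan a real file if a path is given, otherwise the sample above.
        byte[] data = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : sample.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("/ToUnicode entries found: " + countToken(data, "/ToUnicode"));
        System.out.println("/Type0 fonts found: " + countToken(data, "/Type0"));
    }
}
```

If the file has /Type0 fonts but no /ToUnicode entries at all, that is at least a hint that text extraction will depend entirely on the font's encoding.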
Thanks for your reply. A Tamil character takes 2 bytes, and in this situation I am not getting proper output: only the first byte is displayed, and the second byte of that character is ignored. I faced the same problem with Hindi. I think you are familiar with Hindi. Could you check this once for me with a code snippet of yours?
I know that it takes two bytes. But I don't know who the "it" is that is displaying or ignoring. I assume "it" is your code. I don't know what you're doing, so I can't check. I don't know Hindi. I don't have a code snippet. I don't have Hindi documents.
Check the font structure as stated above.
What I mean is that while reading each character, I am getting only the first byte, not the second.
One more thing…
check that your PDF contains a proper (multi)byte font with an invertible character map or a ToUnicode map
I couldn't understand the above suggestion. Can you elaborate on it?
To read the PDF I used the ExtractText.java class from the samples package of the jPod library.
If you have a two byte character set, a (correct) glyph extraction will always read two bytes.
After that, either the associated encoding or a /ToUnicode entry in the font object is used to guess the correct Unicode character.
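As a rough illustration of those two steps (this is not jPod's actual code, just a minimal sketch in plain Java): the bytes of a shown string are consumed two at a time to form a character code, and each code is then looked up in a map that stands in for the font's /ToUnicode data. The class name, the character codes 0x0012 and 0x0034, and the map contents are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class TwoByteDecodeSketch {
    // Decodes a string shown with a two-byte font.
    // 'toUnicode' is a hypothetical stand-in for the font's /ToUnicode map:
    // character code -> Unicode text.
    static String decode(byte[] shown, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        // Step 1: always consume two bytes per character code (big-endian).
        for (int i = 0; i + 1 < shown.length; i += 2) {
            int code = ((shown[i] & 0xFF) << 8) | (shown[i + 1] & 0xFF);
            // Step 2: map the code to Unicode; unmapped codes become U+FFFD.
            String text = toUnicode.get(code);
            out.append(text != null ? text : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0012, "\u0BA4"); // made-up code -> TAMIL LETTER TA
        toUnicode.put(0x0034, "\u0BAE"); // made-up code -> TAMIL LETTER MA

        byte[] shown = {0x00, 0x12, 0x00, 0x34};
        System.out.println(decode(shown, toUnicode));
    }
}
```

If the loop stepped one byte at a time instead, it would look up 0x00, 0x12, 0x00 and 0x34 as four separate codes and find none of them in the map, which is the kind of wrong output to expect when the font is treated as single-byte.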
In the ExtractText example the decoding has already been done, whether successfully or not. You can see the mechanics in PDGlyphs#getUnicode. The codepoint is the (two-byte) codepoint in the graphics stream. ToUnicode or the character encoding is used to map it to Unicode.
So, if you can debug this and your document does have a correct encoding or ToUnicode map, jPod may have an issue. Otherwise, your document is simply not extractable (like millions of other carelessly manufactured ones).