Read PDF for Indian Languages Like Tamil etc.

RameshNIC
2012-02-21
2013-05-28
  • RameshNIC
    2012-02-21

    I want to read PDFs in Indian languages such as Tamil, Malayalam and Hindi. I used jPod to read the content of a PDF, but I could not get proper text.

    If anybody has tried the same, could you please share the code?

     

  • Anonymous
    2012-02-21

    As I'm not fluent in Tamil, this is not really a tested feature and *may* be a jPod deficiency. But in theory, encodings, multibyte encodings and /ToUnicode are completely implemented.

    So, be sure to check that your PDF contains a proper (multi)byte font with an invertible character map or a ToUnicode map.
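
    A rough way to run that check without any PDF library is to scan the raw file for the /ToUnicode and /Encoding names. The sketch below is only a first-pass diagnostic and not part of jPod; the class and method names are made up for illustration. It assumes the font dictionaries are stored uncompressed. If the file keeps them in compressed object streams, the scan reports "false" even though the entries exist, so treat a negative result as inconclusive.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a jPod class: scans raw PDF bytes for a name token.
final class ToUnicodeScanSketch {

    // Returns true if the ASCII token occurs anywhere in the raw bytes.
    static boolean containsToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.US_ASCII);
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ToUnicodeScanSketch.java <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        // A hit only shows that some object mentions the name; it does not
        // prove that the font used for the Tamil text has a usable map.
        System.out.println("/ToUnicode present: " + containsToken(data, "/ToUnicode"));
        System.out.println("/Encoding present:  " + containsToken(data, "/Encoding"));
    }
}
```

    If neither name shows up in a file that is known to be uncompressed, the text is unlikely to be extractable by any tool.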

     
  • RameshNIC
    2012-02-21

    Thanks for your reply. A Tamil character takes 2 bytes, and in this situation I am not getting proper output: only the first byte is displayed, and the second byte of that character is ignored. I faced the same problem with Hindi. I think you are familiar with Hindi. Can you check this once for me with your code snippet?

     

  • Anonymous
    2012-02-21

    I know that it takes two bytes. But I don't know who the "it" is that is displaying or ignoring. I assume "it" is your code. I don't know what you're doing, so I can't check. I don't know Hindi. I don't have a code snippet. I don't have Hindi documents.

    Check the font structure as stated above.

     
  • RameshNIC
    2012-02-21

    Hi,

    What I mean is that while reading each character, I get only the first byte, not the second.

    One more thing:
    "check your PDF to contain a proper (multi)byte font with an invertible character map or a ToUnicode map"
    I couldn't understand the above suggestion. Can you elaborate on it?

    To read the PDF, I used the ExtractText.java class from the samples package of the jPod library.

     

  • Anonymous
    2012-02-21

    If you have a two-byte character set, a (correct) glyph extraction will always read two bytes.

    After that, either the associated encoding or a /ToUnicode entry in the font object is used to guess the correct Unicode character.

    In the ExtractText example the process of decoding is already done, whether successful or not. You can see the mechanics in PDGlyphs#getUnicode. The codepoint is the (two-byte) codepoint in the graphics stream. ToUnicode or the character encoding is used to map it to Unicode.
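
    To illustrate those mechanics outside of jPod, here is a minimal, self-contained sketch. It is not the jPod implementation and does not use the jPod API. The class name, the glyph codes 0x0103 to 0x0105 and the small map are all hypothetical stand-ins for what a real font's /ToUnicode CMap would provide. It assumes every code is a fixed number of bytes, stored big-endian, and shows why reading one byte per character cannot produce the right text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not a jPod class.
final class TwoByteDecodeSketch {

    // Splits the string operand into fixed-width, big-endian character codes
    // and maps each code through a ToUnicode-style table. Codes without an
    // entry become U+FFFD; an incomplete trailing code is ignored.
    static String decode(byte[] operand, int bytesPerCode, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + bytesPerCode <= operand.length; i += bytesPerCode) {
            int code = 0;
            for (int j = 0; j < bytesPerCode; j++) {
                code = (code << 8) | (operand[i + j] & 0xFF);
            }
            text.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph codes; a real table comes from the font in the PDF.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0103, "\u0BA4");       // TAMIL LETTER TA
        toUnicode.put(0x0104, "\u0BAE\u0BBF"); // MA + VOWEL SIGN I: one code, two code points
        toUnicode.put(0x0105, "\u0BB4\u0BCD"); // LLLA + VIRAMA

        byte[] operand = {0x01, 0x03, 0x01, 0x04, 0x01, 0x05};

        // Correct: two bytes per code, every code is found in the map.
        System.out.println(decode(operand, 2, toUnicode));
        // Wrong: one byte per code, none of the codes is in the map.
        System.out.println(decode(operand, 1, toUnicode));
    }
}
```

    Note that a single code may map to several Unicode code points, which is common for Indic scripts where one glyph represents a consonant plus a vowel sign.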

    So, if you can debug this and your document has a correct encoding or ToUnicode, then jPod may have an issue. Otherwise your document is simply not extractable (like millions of other carelessly manufactured ones).