Ligatures in PDF

  • AwayFromTheSun

    AwayFromTheSun - 2009-07-02

    Hi everyone,

    I'm using jPOD to extract text from given PDFs. In general, this works very well, but whenever the text contains a ligature (a single character standing for ff, fl or fi), the extracted text only contains the second character.

    I can provide an example PDF which demonstrates the problem.

    Any ideas?

    best regards
    Andreas Haufler

    • mtraut

      mtraut - 2009-07-09

      Text extraction currently works on a Unicode basis. So, if your document contains a correct "backmapping" (either via a correct /Encoding or a /ToUnicode map), you should receive, for example, Unicode 64257 (U+FB01) for "fi" (at least that is what I get for my test documents).
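
      If the extracted text does contain the ligature code point and you would rather have the plain letters, one option is to normalize the string afterwards. The following is only a minimal sketch using the JDK's java.text.Normalizer, not anything jPOD provides; the class name LigatureExpander is made up for illustration. Note that NFKC also folds other compatibility characters (full-width forms, superscripts and so on), which may or may not be what you want.

```java
import java.text.Normalizer;

public class LigatureExpander {

    // NFKC compatibility normalization replaces the Latin ligature
    // presentation forms (such as U+FB00 "ff", U+FB01 "fi", U+FB02 "fl")
    // with their component letters.
    public static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "\uFB01le" stands in for text returned by the extractor:
        // the single code point 64257 followed by "le".
        String extracted = "\uFB01le";
        System.out.println(expand(extracted));
    }
}
```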

      There is still the rare, unsupported case where the /ToUnicode map maps a character code to a character sequence instead of a single Unicode character. We are still waiting for a real-world document of this kind...
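
      To illustrate what such a mapping looks like: in a /ToUnicode CMap the destination of an entry is a hex string of UTF-16BE code units, so <FB01> is the single ligature character while <00660069> is the two-character sequence "fi". The sketch below decodes such a destination into a Java String. It is an assumption-laden illustration only, not jPOD code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ToUnicodeTarget {

    // Decodes the hex digits of a ToUnicode destination (without the
    // angle brackets) as UTF-16BE. The result may be one character or
    // a sequence of several characters.
    public static String decode(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected whole UTF-16 code units: " + hex);
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Two hex digits form one byte.
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decode("FB01").length());     // single ligature character
        System.out.println(decode("00660069").length()); // sequence of two characters
    }
}
```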

      Feel free to upload your testcase for further inspection.

