Ligatures in PDF

  • AwayFromTheSun

    AwayFromTheSun - 2009-07-02

    Hi everyone,

    I'm using jPOD to extract text from given PDFs. In general this works very well, but when the text contains a ligature (a single glyph for ff, fl or fi), the extracted text only contains the second character.

    I can provide an example PDF which demonstrates the problem.

    Any ideas?

    best regards
    Andreas Haufler

    • mtraut

      mtraut - 2009-07-09

      Text extraction currently works on a Unicode basis. So if your document contains a correct "backmapping" (either via a correct /Encoding or a /ToUnicode map), you should receive, for example, Unicode 64257 (U+FB01, the single "fi" ligature character) for "fi" (at least that is what I get for my test documents).
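
      If the extracted string does contain ligature code points such as U+FB01 and you would rather have plain letters, one option on the caller's side is Unicode compatibility normalization. This is a minimal sketch using only the JDK's java.text.Normalizer; the class name LigatureExpander is made up for illustration and is not part of jPOD. Note that NFKC also rewrites other compatibility characters (for example superscripts and full-width forms), so it may change more than just ligatures.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of jPOD: post-processes already extracted text.
class LigatureExpander {

    // NFKC replaces compatibility characters such as U+FB00..U+FB04
    // (ff, fi, fl, ffi, ffl) with their plain-letter equivalents.
    static String expand(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "e" + U+FB03 (ffi) + "cient", then U+FB01 (fi) + "le"
        String extracted = "e\uFB03cient \uFB01le";
        System.out.println(expand(extracted)); // prints "efficient file"
    }
}
```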

      There is still the rare, unsupported case where the /ToUnicode map maps a character code to a character sequence instead of a single Unicode character. We are still waiting for a real-world document of this kind...
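
      To illustrate what such a mapping looks like: a /ToUnicode CMap entry could map one character code to the hex string 00660069, which is UTF-16BE for the two characters "f" and "i". The sketch below only shows how a destination hex string of that kind can be decoded into a Java String of any length. It is not jPOD code; the class name ToUnicodeTarget is hypothetical, and it assumes the hex digits come in complete groups of four with the angle brackets already stripped.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jPOD.
class ToUnicodeTarget {

    // Decodes the hex digits of a ToUnicode destination (UTF-16BE code units)
    // into a String, which may hold one character or a whole sequence.
    static String decode(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected groups of 4 hex digits: " + hex);
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decode("00660069")); // prints "fi" as two separate letters
    }
}
```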

      Feel free to upload your test case for further inspection.

