Ligatures in PDF

Help
2009-07-02
2013-05-28
  • AwayFromTheSun
    2009-07-02

    Hi everyone,

    I'm using jPOD to extract text from given PDFs. In general this works very well, but as soon as the text contains a ligature (a single glyph standing for ff, fl or fi), the extracted text contains only the second character.

    I can provide an example PDF which demonstrates the problem.

    Any ideas?

    best regards
    Andreas Haufler

     
    • mtraut
      2009-07-09

      Text extraction currently works on a Unicode basis. So if your document contains a correct "backmapping" (either via a correct /Encoding or a /ToUnicode map), you should receive, for example, Unicode 64257 (U+FB01) for "fi" (at least that is what I get for my test documents).
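
      As an illustration of what to do with such a result: code point 64257 is U+FB01, the "fi" ligature, and it can be expanded into plain letters after extraction. The following is only a minimal sketch using the standard java.text.Normalizer; it does not use any jPOD API, and the class and method names are made up for this example.

      ```java
      import java.text.Normalizer;

      public class LigatureDemo {

          // Hypothetical helper: expands compatibility characters such as
          // U+FB00 (ff), U+FB01 (fi) and U+FB02 (fl) into their plain letters.
          static String expandLigatures(String extracted) {
              return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
          }

          public static void main(String[] args) {
              // Stand-in for text returned by an extractor: "find" with an fi ligature.
              String extracted = "\uFB01nd";
              System.out.println((int) extracted.charAt(0));
              System.out.println(expandLigatures(extracted));
          }
      }
      ```

      Note that NFKC also rewrites other compatibility characters (superscript digits, for instance), so it may be too aggressive if the extracted text has to stay exactly as the PDF maps it.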

      There is still the rare, unsupported case where the /ToUnicode map maps a glyph to a character sequence instead of a single Unicode character. We are waiting for a real-world document of this kind...
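
      To make the two cases concrete: in a /ToUnicode CMap the target of a mapping is a hex string holding UTF-16BE code units, so an entry such as <0C> <FB01> yields a single character, while <0C> <00660069> yields the sequence "fi". The sketch below only decodes such a target string. The glyph code 0C and the class name are invented for the example, and this is not how jPOD itself is implemented.

      ```java
      import java.nio.charset.StandardCharsets;

      public class ToUnicodeTarget {

          // Decodes the hex digits of a ToUnicode target (without the angle
          // brackets) as UTF-16BE and returns the resulting string.
          static String decode(String hex) {
              byte[] bytes = new byte[hex.length() / 2];
              for (int i = 0; i < bytes.length; i++) {
                  bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
              }
              return new String(bytes, StandardCharsets.UTF_16BE);
          }

          public static void main(String[] args) {
              // Single-character target: the fi ligature itself.
              System.out.println(decode("FB01").length());
              // Sequence target: the two letters f and i.
              System.out.println(decode("00660069"));
          }
      }
      ```

      An extractor that assumes one character per glyph would need to keep the whole decoded string in the second case.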

      Feel free to upload your testcase for further inspection.