I'm using jPOD to extract text from PDFs. In general this works very well, but when the text contains a ligature (a single character for ff, fl, or fi), the extracted text contains only the second character.
I can provide an example PDF which demonstrates the problem.
Text extraction currently works on a Unicode basis. So, if your document contains a correct "backmapping" (either via a correct /Encoding or a /ToUnicode map), you should receive, for example, Unicode code point 64257 (U+FB01) for "fi" (at least that is what I get for my test documents).
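If the extracted string does contain ligature code points such as U+FB01 and you would rather have the plain letters, something along these lines may help. It is only a sketch using the JDK's `java.text.Normalizer`, not part of jPOD; the class and method names (`LigatureExpander`, `expand`) are made up for illustration, and it assumes you already have the extracted text as a `String`.

```java
import java.text.Normalizer;

// Hypothetical helper, not a jPOD class: expands Latin ligature code points
// (U+FB00..U+FB06, e.g. U+FB01 "fi") into their component letters.
final class LigatureExpander {

    static String expand(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        extracted.codePoints().forEach(cp -> {
            if (cp >= 0xFB00 && cp <= 0xFB06) {
                // NFKC applies the compatibility decomposition, e.g. U+FB01 -> "fi".
                String single = new String(Character.toChars(cp));
                out.append(Normalizer.normalize(single, Normalizer.Form.NFKC));
            } else {
                // Everything else is copied through unchanged.
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FB03 is the "ffi" ligature, U+FB01 is "fi".
        System.out.println(expand("e\uFB03cient \uFB01le")); // prints "efficient file"
    }
}
```

Only the ligature range is normalized here, because running NFKC over the whole string would also rewrite other compatibility characters (superscripts, full-width forms and so on), which may not be what you want.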
There is still the rare, unsupported case in which the /ToUnicode map maps to a character sequence instead of a single Unicode character. We are still waiting for a real-world document of this kind...
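To illustrate what such a mapping looks like: in a /ToUnicode CMap the target of an entry is a hex string holding UTF-16BE code units, so `<FB01>` is the single ligature character while `<00660069>` is the two-character sequence "fi". The following is a rough sketch of decoding such a target, assuming you have the hex string in hand; `ToUnicodeTarget` and `decode` are hypothetical names and do not reflect how jPOD handles this internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a jPOD class: decodes the destination hex string
// of a /ToUnicode CMap entry (UTF-16BE code units) into a Java String.
final class ToUnicodeTarget {

    static String decode(String hex) {
        // Drop the angle brackets and any whitespace around the hex digits.
        String digits = hex.replaceAll("[<>\\s]", "");
        if (digits.length() % 4 != 0) {
            throw new IllegalArgumentException("expected whole UTF-16 code units: " + hex);
        }
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decode("<00660069>"));       // prints "fi" (two characters)
        System.out.println(decode("<FB01>").length());  // prints 1 (single ligature character)
    }
}
```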
Feel free to upload your testcase for further inspection.