Using the CSTextExtract example, the ligature in the given PDF is recognized as f and not fi.
Ligature in Schweißrauchfilter
Thanks for the test case. This is one of the "rare" cases and is not currently supported. The /ToUnicode map maps to a two-character string where we expect only a single character.
This has to be included in an upcoming release.
Is there a roadmap or a specified release date for the next version?
As jPod can also manipulate PDFs, would it also be possible to replace the text by two "real" characters?
best regards Andy
The next release 5.1 is already in the release process. This feature is not included.
The next major release can be expected at the end of the year, and we will schedule this feature for this version.
Maybe (and perhaps with some patch from someone) we could include this earlier with a small update.
With jPod you CAN manipulate this content stream, but I don't think you really want to. In practice you would have to re-render the content, as the word's run length will change and the result would be less visually pleasing. You could instead change the /ToUnicode mapping to map to the Unicode character for the ligature and then post-process the result string of the extraction.
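The post-processing step suggested above could look roughly like the following. This is only a sketch: LigatureExpander is a hypothetical helper and not part of jPod, and it assumes the /ToUnicode entry has already been changed to one of the Unicode ligature characters U+FB00 to U+FB04.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a jPod class): expands Unicode ligature
// characters in an already extracted string into plain letters.
final class LigatureExpander {
    private static final Map<Character, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put('\uFB00', "ff");
        LIGATURES.put('\uFB01', "fi");
        LIGATURES.put('\uFB02', "fl");
        LIGATURES.put('\uFB03', "ffi");
        LIGATURES.put('\uFB04', "ffl");
    }

    static String expand(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length() + 4);
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            String replacement = LIGATURES.get(c);
            // Keep every character that is not a known ligature unchanged.
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The extracted word with the fi ligature as a single character.
        System.out.println(expand("Schwei\u00DFrauch\uFB01lter"));
    }
}
```

The explicit table keeps the change limited to the five Latin f-ligatures; a broader compatibility normalization would also alter other characters.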
I found the bug:
In PDGlyphs.class line 77 you do:
int unic16BE = toUnicode.getDecoded(codepoint);
This is basically returned and downcast to a char (in CSTextExtractor).
There are 2 problems here: This int is a UTF-16 encoded char sequence, so you should split it into a byte[] and call new String(toByteArr(unic16BE), "UTF-16") to get the correct result (with all characters supplied).
BUT, the PDF spec states that this could be a sequence of any size, so an int is probably too short. Maybe the CMap should return a byte[] instead of an int?
I tried to find my way through the code, but didn't quite understand how StreamBasedCMap works.
Example 2 on Page 294 of the PDF-Reference 1.7 shows a longer mapping:
<0000> <005E> <0020>
<005F> <0061> [<00660066> <00660069> <00660066006C>]
The last one is mapped to f f l and is longer than Java's 32-bit ints...
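To illustrate the size problem, here is a small sketch that decodes such a destination value from its hex digits into a String of any length. ToUnicodeHex is a made-up name for illustration and has nothing to do with how StreamBasedCMap actually parses the map; it only assumes that the value is big-endian UTF-16, as the PDF Reference describes for ToUnicode CMaps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a jPod class): turns the hex digits of a
// ToUnicode destination such as "00660066006C" into a Java String.
final class ToUnicodeHex {
    static String decode(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected whole UTF-16 code units: " + hex);
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Two hex digits make one byte.
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        // Interpret the bytes as big-endian UTF-16.
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(decode("00660066"));     // two code units, fits in an int
        System.out.println(decode("00660066006C")); // three code units, 48 bits
    }
}
```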
Thanks for your input.
As we stated earlier, this is a currently unsupported feature, and we will have to implement the multi-character mapping that is explicitly introduced with the Unicode map.
I didn't get what exactly you mean by the conversion snippets - the (existing) Unicode map returns a Unicode character in the form of an int. There is no deficiency in casting this result. Unicode conversion etc. is done when the map is constructed.
The upcoming map will have an additional char[] or int[] result to handle the string mappings.
What I meant is what happens in CSTextExtractorExample.java:65
Here you take an integer and cast it to a char, expecting it to contain a valid UTF-16 char sequence. Using this cast on the given mapping FI for the ligature fi only returns the i, because the F which is still contained in the integer is thrown away in the type conversion. If you take the integer, split it into a byte array and use new String(arr, "UTF-16"), you get both chars... Therefore I'd recommend returning a byte[] and explicitly calling the String constructor with UTF-16 as charset; this seems very transparent and more logical than using ints for chars...
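A minimal sketch of the difference, assuming the map really hands back the two UTF-16 code units packed into one int (0x00660069 for fi). PackedUtf16 is a hypothetical name, not something from the jPod sources.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a jPod class): decodes an int that carries
// up to two big-endian UTF-16 code units.
final class PackedUtf16 {
    static String decode(int packed) {
        // ByteBuffer writes the int in big-endian order by default.
        byte[] bytes = ByteBuffer.allocate(4).putInt(packed).array();
        // Skip the upper code unit when it is empty (ordinary single-char mapping).
        int offset = (packed >>> 16) == 0 ? 2 : 0;
        return new String(bytes, offset, 4 - offset, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        int packed = 0x00660069;          // 'f' (0x0066) followed by 'i' (0x0069)
        char truncated = (char) packed;   // the cast keeps only the low 16 bits
        System.out.println(truncated);    // only the second letter survives
        System.out.println(decode(packed));
    }
}
```

This only works as a workaround for mappings of at most two code units; anything longer no longer fits into the int in the first place.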
Anyways, I just wanted to point out that it is already possible with the current release of jPod to extract ligatures, in case someone has the same problem. So I guess the fix can wait until the end of the year or whenever it makes its way onto the fix list.
BTW: Looking at the source code, I have to say that the library is made very well in an engineering sense. It's almost a joy to read and debug through it (although you don't have debug info in the jars, which screws up the debugger when using breakpoints). You might consider adding debug info to the jar since it doesn't add much to the total size afaik.
Now I see what you meant - I had to go back to the code to find this artifact. Indeed this "feature" is more of a bug; the intended API is really "return the Unicode as an integer". That the most significant two bytes are still included is due to the fact that the special behavior of the Unicode map is not supported - just imagine a codepoint that maps to an even longer character sequence: only the last two characters are forwarded...
The mapping API is designed following the general contract of the maps in PDF, where ONLY single codepoints are mapped to other (Unicode) codepoints. The int is due to the fact that we need an "undefined" result (as found in Java streams, for example). As maps/encodings are used VERY often throughout the spec, I think it is critical from a performance standpoint to have these (character) values at hand immediately in the form needed for further computation.
The ToUnicode maps define an explicit exception to the rule, as they can return a char[] instead of a single character, and I'd like to treat it this way - as an extension that is only used in text extraction.
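Purely as an illustration of that split, and not the actual jPod API: the sketch below keeps an int result with an "undefined" value for the single-character contract and adds a separate char[] accessor for extraction. The class name, the method getDecodedChars and the constant UNDEFINED are all assumptions made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the jPod implementation): a map with the fast
// single-character contract plus a string extension for text extraction.
final class ToUnicodeSketch {
    static final int UNDEFINED = -1;

    private final Map<Integer, char[]> mappings = new HashMap<>();

    void define(int codepoint, String unicode) {
        mappings.put(codepoint, unicode.toCharArray());
    }

    // General contract: one codepoint maps to one character, or UNDEFINED.
    int getDecoded(int codepoint) {
        char[] chars = mappings.get(codepoint);
        return (chars == null || chars.length != 1) ? UNDEFINED : chars[0];
    }

    // Extension for text extraction: the full character sequence, or null.
    char[] getDecodedChars(int codepoint) {
        char[] chars = mappings.get(codepoint);
        return chars == null ? null : chars.clone();
    }

    public static void main(String[] args) {
        ToUnicodeSketch map = new ToUnicodeSketch();
        map.define(0x01, "fi");
        System.out.println(map.getDecoded(0x01) == UNDEFINED);
        System.out.println(new String(map.getDecodedChars(0x01)));
    }
}
```

Callers that only need single characters never touch the array path, which matches the performance concern raised above.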
Alright, you're clearly deeper into the code than I am. Thanks for the fast and friendly feedback and keep up the good work!
Better late than never. An extension for ToUnicode string mappings has found its way into the codebase and is contained in the next release. This will hopefully be available in May.