Fonts with difference encoding can have ToUnicode, as well

A PDF parsing, modification and creation library.

Brought to you by: domseichter

#107 Fonts with difference encoding can have ToUnicode, as well

Milestone: SVN TRUNK

Status: closed

Owner: nobody

Labels: None

Updated: 2021-08-19

Created: 2020-11-05

Creator: Christopher Creutzig

Private: No

PdfDifferenceEncoding needs to support explicit ToUnicode tables, too. PDF examples requiring this can be created, for example, at https://www.canva.com/design/play?category=tACFat6uXco

1 Attachment

podofo.patch

Discussion

zyx - 2021-08-18

I'm sorry, but I hate to open random sites with their cookie consents and whatever. Would it be possible to attach such a file and state what precisely your patch is fixing, please? Like: without it, PoDoFo cannot..., but with it PoDoFo can...

PdfObject* pToUnicode = nullptr);

Use NULL instead, please, the same as the other code in PoDoFo.

    if( m_differences.Contains( static_cast<int>(pszInput[i]), name, value ) ) pszUtf16[i] = value;
    if(m_bToUnicodeIsLoaded)
    {
        value = GetUnicodeValue(pszInput[i]);

The m_toUnicode can be empty (you may check m_toUnicode.empty()).
When there are both differences and ToUnicode, the latter overwrites the value of the former. Can that happen? Should there be an else clause?

Otherwise the patch looks good, though I have only read it, not tested it in action.

Christopher Creutzig - 2021-08-19

The attached file should extract text like “Wear proper.” Without the patch, I am getting random-looking character substitutions, text like “Wlaeba cr lpot.”

GetUnicodeValue should be able to handle requests for glyphs not in m_toUnicode, which includes it being empty.
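To illustrate why a separate empty() check may be unnecessary, here is a minimal sketch of such a lookup. It is not PoDoFo's implementation: ToUnicodeTable and LookupToUnicode are hypothetical names, and returning 0 for a missing entry is an assumed convention.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for a ToUnicode table: character code -> UTF-16 code unit.
using ToUnicodeTable = std::map<std::uint16_t, std::uint16_t>;

// Returns the mapped value, or 0 (assumed "not found" sentinel) when the
// code has no entry. find() on an empty map simply returns end(), so an
// empty table needs no special handling.
std::uint16_t LookupToUnicode(const ToUnicodeTable& table, std::uint16_t code)
{
    const auto it = table.find(code);
    return it != table.end() ? it->second : 0;
}
```

With this shape, the caller only has to decide what to do when the result is 0, regardless of whether the table was empty or merely lacked the code.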

When there are both differences and toUnicode, I expect toUnicode to take precedence, yes.

canva.pdf

zyx - 2021-08-19

Thanks for the file. I checked the PDF ISO specification, and according to section 9.10.2, "Mapping Character Codes to Unicode Values", ToUnicode takes precedence over the differences. Your patch does both, but I think in a good way.
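The precedence discussed here can be sketched roughly as follows. This is an illustration only, not the code committed to PoDoFo: CodeMap, ResolveCode and Decode are hypothetical names, both tables are simplified to plain code-to-UTF-16 maps, and falling back to the raw code when neither table has an entry is an assumption standing in for the base encoding.

```cpp
#include <map>
#include <string>

// Simplified, hypothetical table type: character code -> UTF-16 code unit.
using CodeMap = std::map<unsigned char, char16_t>;

// Resolve one character code: ToUnicode first, then Differences,
// then (assumed fallback) the raw code itself.
char16_t ResolveCode(const CodeMap& toUnicode, const CodeMap& differences,
                     unsigned char code)
{
    const auto tu = toUnicode.find(code);
    if (tu != toUnicode.end())
        return tu->second;              // ToUnicode takes precedence

    const auto diff = differences.find(code);
    if (diff != differences.end())
        return diff->second;            // used only when ToUnicode has no entry

    return static_cast<char16_t>(code); // assumed stand-in for the base encoding
}

// Decode a whole byte string with the rule above.
std::u16string Decode(const CodeMap& toUnicode, const CodeMap& differences,
                      const std::string& input)
{
    std::u16string out;
    out.reserve(input.size());
    for (const char c : input)
        out.push_back(ResolveCode(toUnicode, differences,
                                  static_cast<unsigned char>(c)));
    return out;
}
```

Because the Differences lookup sits in the fallback branch, a ToUnicode entry can never be overwritten by a Differences entry, which is the effect the else clause asked about earlier would have.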

I committed your patch (slightly modified) as [r2044].

Related

Commit: [r2044]

zyx - 2021-08-19

status: open --> closed