gnuplot / Patches / #764 Unicode escape conversion using "own" code or iconv

Ethan Merritt - 2018-03-26

The in-line conversion to UTF-8 looks OK to me. I'd even be OK with substituting it for the library call unconditionally. The only limitation I see is that it explicitly restricts the accepted range of unicode code points to Plane 0 and Plane 1. If we are going to the trouble of allowing codepoints above BMP (plane 0) it might be worth allowing the whole range (planes 0-16) so that private use areas are also covered. Not that I have any easy way to test use of those higher planes in practice.

I agree that it would be cleaner to move this chunk of code into a separate conversion routine that "always works" so that the call site in term.c is a single line and the routine might be shared by other callers.
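A separate conversion routine of the kind discussed here could look roughly like the following. This is only a sketch, not the code from the patch: the name `ucs4_to_utf8`, its signature, and the decision to reject surrogate code points are all assumptions made for illustration. It accepts the whole range U+0000 to U+10FFFF (planes 0-16).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions, not gnuplot's).
 * Encodes one Unicode code point as UTF-8 into out, which must hold at
 * least 5 bytes (4 data bytes plus the terminating NUL).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t ucs4_to_utf8(uint32_t cp, unsigned char *out)
{
    size_t n;

    /* Reject values beyond plane 16 and the UTF-16 surrogate range
     * (rejecting surrogates is a choice made for this sketch). */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        out[0] = '\0';
        return 0;
    }
    if (cp < 0x80) {                    /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {            /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {          /* 3 bytes: rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {                            /* 4 bytes: planes 1-16 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}
```

With this sketch, U+221E (the infinity sign) encodes as the three bytes E2 88 9E, and U+10FFFF as the four bytes F4 8F BF BF.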

I'm not sold on use of iconv in this context. It was intended to handle internationalization, not fonts per se. As you point out, there is limited gain from allowing unicode entry points for legacy 8-bit encodings. What is the real-world scenario in which the user is working in UTF-8 or unicode but needs to produce output in the legacy encoding? All I can think of is PostScript output to a printer that supports only non-Western fonts. Is there such a thing?

If it is specifically PostScript output that benefits, then a more interesting case is conversion between Adobe Symbol font and UTF-8. This isn't handled by iconv but is handled, in one direction, by existing code in gp_cairo.c. Perhaps that is the code that should be moved to a separate module and shared? Although even here the gain is I think limited to legacy gnuplot scripts that were written for the PostScript terminal.

Bastian Märkisch - 2018-03-28

Maybe I don't understand enough about Unicode, but I thought the code covers all 17 planes, as the last range is code < 0x10ffff.

As for the use case for 8-bit encodings, I could just imagine that it might be easier to enter characters which are normally not directly accessible via the keyboard, like \U00D7 for ×.
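For reference, the iconv side of that use case might look like the sketch below. It uses only the standard POSIX calls `iconv_open`, `iconv` and `iconv_close`; the wrapper name `utf8_to_legacy` and its interface are assumptions for illustration, and encoding names such as "ISO-8859-1" depend on what the local iconv implementation supports.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper (not gnuplot code): convert inlen bytes of UTF-8
 * text to the named legacy encoding. Returns the number of bytes written
 * to out, or (size_t)-1 if the conversion could not be set up or failed. */
size_t utf8_to_legacy(const char *encoding, const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    iconv_t cd = iconv_open(encoding, "UTF-8");   /* (to, from) */
    char *inptr = (char *)in;   /* iconv() takes a non-const pointer */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outsize;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

Converting the UTF-8 form of U+00D7 (bytes C3 97) to ISO-8859-1 this way should yield the single byte D7. What happens for a character the target encoding lacks varies between iconv implementations, so the sketch makes no promise there.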

Which file should the code go to? We might now have enough fragments all over the place to warrant a new file encoding.c. We could e.g. move the special char code (degree sign etc.) there, too. Also, I'd like to generalise the iconv code for use in term.c, gd.trm and emf.trm.

- Ethan Merritt - 2018-03-28
  
  You are right about covering all 16 supplementary Unicode planes. I misread the code. I saw 4 Fs in a row and thought "what about 5 Fs as in 0xFFFFF". I totally missed the leading 10. The confusion is compounded by the fact that there are actually 17 planes (BMP + 16 supplementary) so the highest plane doesn't have a leading F.
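The plane arithmetic behind that mix-up can be made explicit. In the sketch below (the function name `unicode_plane` and the macro are assumptions for illustration), the plane number is simply the code point shifted right by 16 bits, so 0xFFFFF is the last code point of plane 15 while 0x10FFFF is the last code point of plane 16.

```c
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu   /* last code point of plane 16 */

/* Hypothetical helper: return the plane (0..16) that a code point
 * belongs to, or -1 if the value lies beyond the Unicode range. */
int unicode_plane(uint32_t cp)
{
    if (cp > UNICODE_MAX)
        return -1;
    return (int)(cp >> 16);     /* each plane holds 0x10000 code points */
}
```

So `unicode_plane(0xFFFFF)` gives 15 and `unicode_plane(0x10FFFF)` gives 16, which is why the upper limit has a leading 10 rather than a fifth F.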
  
  So let's revert the libfontconfig-dependent bits and replace them with the inline code.
  
  The legacy 8-bit encodings for languages provide hardly any of the interesting non-alphabetic symbols. They need those extra 128 slots to hold alphabetic characters with diacritical marks. From browsing the Wikipedia pages for the ones we support, I think the degree sign is the only symbol provided by more than a couple of them. So I don't think the Unicode escape -> legacy 8-bit conversion via iconv gains anything useful in practice.
  
  As to where the code should go - yeah, a new file encoding.c is a good idea.
  I particularly like the consequence that if the locale code is moved out of variable.c the rest of that file can probably go away altogether ("set loadpath" is the only other thing in there that is still needed by the core code).
  
Bastian Märkisch - 2018-03-28

One more observation: currently the code emits the same string for \\U+221E and \U+221E if the encoding is not UTF-8. This makes it difficult to convert them in the driver. So I suppose something like this would be needed before the code handling Unicode escapes:

if ((p[1] == '\\') && !(term->flags & TERM_IS_POSTSCRIPT)) {
    /* Escaped escape character */
    (term->enhanced_writec)('\\');
    /* Pass through the escape for a Unicode escape */
    if ((p[2] == 'U') && (p[3] == '+')
        && (encoding != S_ENC_UTF8) && (term->flags & TERM_UNICODE_ESCAPES))
        (term->enhanced_writec)('\\');
    p++;
    break;
}
- Ethan Merritt - 2018-03-28
  
  Can you give me a reproducing test case for that? I am thinking it may be a side effect of double-quote processing vs single-quote processing. See also the small fragment of code in util.c (parse_esc) that does not collapse \U+ into U+ even though double-quote processing normally would do that. The intention there was to be compatible with the way single character octal escapes of the form \012 have always been handled. That is
  print "\101"
  produces A rather than 101 even though the normal rules for double-quote strings would collapse \1 into 1.
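The double-quote rules described here can be modelled in a few lines. The sketch below is not gnuplot's parse_esc; the function name `collapse_escapes` is hypothetical, and it deliberately ignores other escapes such as \n and \t. It covers only the three cases under discussion: octal escapes become a single byte, \U+ is left intact, and any other backslash pair collapses to its second character.

```c
#include <stddef.h>

/* Hypothetical model of double-quote escape processing (a simplification,
 * not gnuplot's parse_esc). Rewrites the string s in place. */
void collapse_escapes(char *s)
{
    char *out = s;

    while (*s) {
        if (*s != '\\' || s[1] == '\0') {
            /* ordinary character, or a lone trailing backslash */
            *out++ = *s++;
        } else if (s[1] == 'U' && s[2] == '+') {
            /* keep "\U+" intact so later stages can still see the escape */
            *out++ = *s++;      /* the backslash */
            *out++ = *s++;      /* the 'U' */
        } else if (s[1] >= '0' && s[1] <= '7') {
            /* up to three octal digits become one byte, e.g. \101 -> A */
            int value = 0;
            int digits = 0;
            s++;
            while (digits < 3 && *s >= '0' && *s <= '7') {
                value = value * 8 + (*s++ - '0');
                digits++;
            }
            *out++ = (char)value;
        } else {
            /* any other "\x" collapses to "x"; this includes "\\" -> "\" */
            s++;
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

Under this model the four characters \101 collapse to A, while both \U+03f5 and \\U+03f5 end up as \U+03f5, so the two forms become indistinguishable after double-quote processing.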
  
  - Bastian Märkisch - 2018-03-28
    
    Just set encoding to something other than utf8. The printout you get is what is sent to the driver. If the driver is supposed to interpret the escapes, the double backslash should be kept.
    
    Otherwise unicode.dem would get both codes on each line translated.
    
Ethan Merritt - 2018-03-28

I may not understand or I may be testing the wrong thing. I see a difference in how single-quote and double-quote strings are stored that is independent of the encoding. The difference is already there before sending it to the driver and whether or not it goes through the enhanced text processing. Single quotes maintain a difference between \U and \\U; double quotes do not.
[the first of those is backslash-U, the second one is backslash-backslash-U; this forum interface tends to swallow a backslash, which is the same problem I am trying to describe]

Is this what you mean, or are you seeing something else in addition to this?

input:

t1 = '\ID \U+03f5 single quote'
t2 = '\ID \\U+03f5 single quote'
t3 = "\ID \U+03f5 double quote"
t4 = "\ID \\U+03f5 double quote"
print t1
print t2
print t3
print t4

Output:

\ID \U+03f5 single quote
\ID \\U+03f5 single quote
ID \U+03f5 double quote
ID \U+03f5 double quote

Last edit: Ethan Merritt 2018-03-28

Ethan Merritt - 2018-03-30

I coded up an svg-specific unicode handler so that I had an actual driver to work with. This led me to the same conclusion that you reached (I think). I misunderstood your description of the problem. The enhanced text code that recognizes an escape sequence was replacing the whole thing in the case of UTF-8 but consumed the opening backslash otherwise. That wasn't intentional - it was a bug on my part when I introduced the escape handling.
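To make the intended behavior concrete, here is a sketch of how an escape might be recognized. It is not the code from term.c: the function name `scan_unicode_escape` is hypothetical, and accepting 4 to 6 hex digits after \U+ is an assumption of this sketch (it also means a hex letter directly following the escape would be absorbed into it).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scanner: if p starts with "\U+" followed by 4 to 6 hex
 * digits (the digit count is an assumption), store the code point in *cp
 * and return the length of the whole escape in bytes, backslash included.
 * Returns 0 if p does not start with a valid escape. */
size_t scan_unicode_escape(const char *p, unsigned long *cp)
{
    size_t n = 3;               /* index of the first hex digit */
    unsigned long value = 0;

    if (p[0] != '\\' || p[1] != 'U' || p[2] != '+')
        return 0;
    while (n < 9 && isxdigit((unsigned char)p[n])) {
        int c = tolower((unsigned char)p[n]);
        value = value * 16
              + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        n++;
    }
    if (n < 7 || value > 0x10FFFF)  /* too few digits or out of range */
        return 0;
    *cp = value;
    return n;
}
```

A caller working in UTF-8 would convert `cp` and emit the resulting bytes in place of the escape; with any other encoding it would forward all of the returned bytes starting at `p` unchanged, opening backslash included, so that the driver can still recognize the escape.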

I have fixed that now and placed the ucs4->utf8 conversion code in a separate routine in util.c. I didn't do anything about the iconv code. The next step is to back out all the libfontconfig checks and calls but I haven't done that yet.

I would be happy to leave any further consolidation or rearrangement of code in your hands, including the iconv bits. I like your idea of gathering it all together in a new file.

The unicode chunk in svg.trm can be used to test the escape handling either with or without UTF-8 and with various combinations of quotes and backslashes. I attach my test script below.

test_escapes.bug

Bastian Märkisch - 2018-03-31

status: open --> closed-accepted