#764 Unicode escape conversion using "own" code or iconv

Version 5
closed-accepted
nobody
enhanced (2)
5
2018-03-31
2018-03-26
No

This patch adds two more conversion mechanisms for Unicode escapes in enhanced text as alternatives to fontconfig's FcUcs4ToUtf8(), which may not be available.

The first variant does the UTF-32 to UTF-8 conversion in place, using code slightly adapted from https://stackoverflow.com/a/42013433 by "Nominal Animal".
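For readers unfamiliar with the bit layout, a minimal sketch of such a conversion is given here. It is not the code from the patch or from the linked answer; the function name `ucs4_to_utf8_sketch` is hypothetical, and rejecting surrogates is a choice made for this sketch only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the patch's code): encode one code point as UTF-8.
 * Writes up to 4 bytes to out, without a NUL terminator, and returns the
 * number of bytes written. Returns 0 for surrogates and for values above
 * U+10FFFF (an assumption of this sketch). */
size_t ucs4_to_utf8_sketch(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: rest of plane 0 */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogate halves are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: planes 1 to 16 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

With this layout, U+221E (the infinity sign) becomes the three bytes E2 88 9E.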

The second variant makes use of libiconv. In addition to UTF-8 it handles all other gnuplot encodings. Of course, using Unicode escapes with 8-bit charsets is much less useful.
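As an illustration of the iconv route, here is a rough sketch that converts a single code point to an arbitrary target encoding. It is not the patch's code: the function name `codepoint_to_encoding_sketch` is hypothetical, and it assumes the iconv implementation accepts "UTF-32LE" as a source name and that a gnuplot encoding has already been mapped to an iconv charset name such as "ISO-8859-1".

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the patch's code): convert one code point to the
 * encoding named by tocode, which must be a charset name known to iconv.
 * Returns the number of bytes written to out, or 0 on any failure. */
size_t codepoint_to_encoding_sketch(uint32_t cp, const char *tocode,
                                    char *out, size_t outsize)
{
    /* Serialize the code point as little-endian UTF-32 so the result does
     * not depend on the byte order of the host. */
    char in[4];
    char *inp = in;
    char *outp = out;
    size_t inleft = sizeof(in);
    size_t outleft = outsize;
    size_t rc;
    iconv_t cd;

    in[0] = (char)(cp & 0xFF);
    in[1] = (char)((cp >> 8) & 0xFF);
    in[2] = (char)((cp >> 16) & 0xFF);
    in[3] = (char)((cp >> 24) & 0xFF);

    cd = iconv_open(tocode, "UTF-32LE");
    if (cd == (iconv_t)-1)
        return 0;                 /* conversion pair not supported */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return 0;                 /* e.g. character missing in target charset */
    return outsize - outleft;
}
```

Converting U+00D7 to "ISO-8859-1" this way should give the single byte 0xD7. How a character that does not exist in the target charset is reported differs between iconv implementations, so a caller would need a fallback for that case.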

Arguably, both variants add too much code locally and should therefore be moved to their own functions. The iconv code in particular could be re-used by other terminals to handle encodings.

1 Attachments

Discussion

  • Ethan Merritt

    Ethan Merritt - 2018-03-26

    The in-line conversion to UTF-8 looks OK to me. I'd even be OK with substituting it for the library call unconditionally. The only limitation I see is that it explicitly restricts the accepted range of unicode code points to Plane 0 and Plane 1. If we are going to the trouble of allowing codepoints above BMP (plane 0) it might be worth allowing the whole range (planes 0-16) so that private use areas are also covered. Not that I have any easy way to test use of those higher planes in practice.

    I agree that it would be cleaner to move this chunk of code into a separate conversion routine that "always works" so that the call site in term.c is a single line and the routine might be shared by other callers.

    I'm not sold on use of iconv in this context. It was intended to handle internationalization, not fonts per se. As you point out, there is limited gain from allowing unicode entry points for legacy 8-bit encodings. What is the real-world scenario in which the user is working in UTF-8 or unicode but needs to produce output in the legacy encoding? All I can think of is PostScript output to a printer that supports only non-Western fonts. Is there such a thing?

    If it is specifically PostScript output that benefits, then a more interesting case is conversion between Adobe Symbol font and UTF-8. This isn't handled by iconv but is handled, in one direction, by existing code in gp_cairo.c. Perhaps that is the code that should be moved to a separate module and shared? Although even here the gain is I think limited to legacy gnuplot scripts that were written for the PostScript terminal.

     
  • Bastian Märkisch

    Maybe I don't understand enough about Unicode, but I thought the code covers all 17 planes, as the last range is code < 0x10ffff.

    As for the use case for 8-bit encodings, I could just imagine that it might be easier to enter characters that are normally not directly accessible via the keyboard, like \U+00D7 for ×.

    Which file should the code go to? We might now have enough fragments all over the place to warrant a new file encoding.c. We could e.g. move the special char code (degree sign etc.) there, too. Also, I'd like to generalise the iconv code for use in term.c, gd.trm and emf.trm.

     
    • Ethan Merritt

      Ethan Merritt - 2018-03-28

      You are right about covering all 16 supplementary unicode planes. I misread the code. I saw 4 Fs in a row and thought "what about 5 Fs as in 0xFFFFF". I totally missed the leading 10. The confusion is compounded by the fact that there are actually 17 planes (BMP + 16 supplementary) so the highest plane doesn't have a leading F.
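The plane arithmetic behind this can be checked with a small sketch. The helper name `unicode_plane_sketch` is hypothetical and not part of gnuplot; it only illustrates that the plane is the code point shifted right by 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper: plane index (0 to 16) of a code point,
 * or -1 if the value lies beyond U+10FFFF. */
int unicode_plane_sketch(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return -1;
    return (int)(cp >> 16);   /* each plane holds 0x10000 code points */
}
```

Under this definition 0xFFFFF falls in plane 15, while the last valid code point 0x10FFFF falls in plane 16, which is why the upper limit starts with 10 rather than F.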

      So let's revert the libfontconfig-dependent bits and replace them with the inline code.

      The legacy 8-bit encodings for languages provide hardly any of the interesting non-alphabetic symbols. They need those extra 127 slots to hold alphabetic characters with diacritical marks. From browsing the Wikipedia pages for the ones we support, I think the degree sign is the only symbol provided by more than a couple of them. So I don't think the unicode escape->legacy 8-bit conversion via iconv gains anything useful in practice.

      As to where the code should go - yeah, a new file encoding.c is a good idea.
      I particularly like the consequence that if the locale code is moved out of variable.c the rest of that file can probably go away altogether ("set loadpath" is the only other thing in there that is still needed by the core code).

       
  • Bastian Märkisch

    One more observation: currently the code emits the same string for \\U+221E and \U+221E if the encoding is not UTF-8. This makes it difficult to convert them in the driver. So I suppose something like this would be needed before the code handling Unicode escapes:

    if ((p[1] == '\\') && !(term->flags & TERM_IS_POSTSCRIPT)) {
            /* Escaped escape character */
            (term->enhanced_writec)('\\');
            /* Pass through the escape for a Unicode escape */
            if ((p[2] == 'U') && (p[3] == '+') &&
                (encoding != S_ENC_UTF8) && (term->flags & TERM_UNICODE_ESCAPES))
                    (term->enhanced_writec)('\\');
            p++;
            break;
    }
    
     
    • Ethan Merritt

      Ethan Merritt - 2018-03-28

      Can you give me a reproducing test case for that? I am thinking it may be a side effect of double-quote processing vs single-quote processing. See also the small fragment of code in util.c (parse_esc) that does not collapse \U+ into U+ even though double-quote processing normally would do that. The intention there was to be compatible with the way single character octal escapes of the form \012 have always been handled. That is
      print "\101"
      produces A rather than 101 even though the normal rules for double-quote strings would collapse \1 into 1.
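To make the octal case concrete, the following is a minimal sketch of decoding such an escape. It is not the code of parse_esc in util.c; the function name `decode_octal_escape_sketch` is hypothetical and the behaviour for values above 0377 is an assumption of the sketch.

```c
#include <stddef.h>

/* Hypothetical helper (not gnuplot's parse_esc): decode a backslash followed
 * by one to three octal digits. Stores the resulting byte in *value and
 * returns the number of input characters consumed, or 0 if s does not start
 * with an octal escape. Values above 0377 are truncated to one byte. */
size_t decode_octal_escape_sketch(const char *s, unsigned char *value)
{
    unsigned int v = 0;
    size_t i = 1;

    if (s[0] != '\\')
        return 0;
    while (i <= 3 && s[i] >= '0' && s[i] <= '7') {
        v = v * 8 + (unsigned int)(s[i] - '0');
        i++;
    }
    if (i == 1)
        return 0;              /* backslash not followed by an octal digit */
    *value = (unsigned char)v;
    return i;
}
```

For the four-character input \101 this yields the byte 65, i.e. the letter A, whereas an input starting with \U is left alone because U is not an octal digit.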

       
      • Bastian Märkisch

        Just set encoding to something other than utf8. The printout you get is what is sent to the driver. If the driver is supposed to interpret escapes, the double backslash should be kept.

        Otherwise both codes on each line of unicode.dem would get translated.

         
  • Ethan Merritt

    Ethan Merritt - 2018-03-28

    I may not understand or I may be testing the wrong thing. I see a difference in how single-quote and double-quote strings are stored that is independent of the encoding. The difference is already there before sending it to the driver and whether or not it goes through the enhanced text processing. Single quotes maintain a difference between \U and \U; double quotes do not.
    [damn this interface for suffering from the same problem I am trying to describe; the first of those is backslash-U, the second one is backslash-backslash-U]

    Is this what you mean, or are you seeing something else in addition to this?

    input:

    t1 = '\ID \U+03f5 single quote'
    t2 = '\ID \\U+03f5 single quote'
    t3 = "\ID \U+03f5 double quote"
    t4 = "\ID \\U+03f5 double quote"
    
    print t1
    print t2
    print t3
    print t4
    

    Output:

    \ID \U+03f5 single quote
    \ID \\U+03f5 single quote
    ID \U+03f5 double quote
    ID \U+03f5 double quote
    
     

    Last edit: Ethan Merritt 2018-03-28
  • Ethan Merritt

    Ethan Merritt - 2018-03-30

    I coded up an svg-specific unicode handler so that I had an actual driver to work with. This led me to the same conclusion that you reached (I think). I misunderstood your description of the problem. The enhanced text code that recognizes an escape sequence was replacing the whole thing in the case of UTF-8 but consumed the opening backslash otherwise. That wasn't intentional - it was a bug on my part when I introduced the escape handling.

    I have fixed that now and placed the ucs4->utf8 conversion code in a separate routine in util.c. I didn't do anything about the iconv code. The next step is to back out all the libfontconfig checks and calls but I haven't done that yet.

    I would be happy to leave any further consolidation or rearrangement of code in your hands, including the iconv bits. I like your idea of gathering it all together in a new file.

    The unicode chunk in svg.trm can be used to test the escape handling either with or without UTF-8 and with various combinations of quotes and backslashes. I attach my test script below.

     
  • Bastian Märkisch

    • status: open --> closed-accepted
     
