#1418 utf8 token coloring

Bug
closed-fixed
Neil Hodgson
Scintilla (791)
5
2013-07-21
2012-10-29
No

I have written a lexer for Visual Prolog, but it has some UTF-8-related problems that I am not sure how to handle (and perhaps the lexer concept is a little too simplified in this respect).

The first problem is that a char literal like 'q' should contain only one character (unless there are escape sequences); if there are too many characters, the style is set to SCE_VISUALPROLOG_CHARACTER_TOO_MANY:

    case SCE_VISUALPROLOG_CHARACTER:
        if (sc.atLineEnd) {
            sc.SetState(SCE_VISUALPROLOG_STRING_EOL_OPEN); // reuse STRING_EOL_OPEN for this
        } else if (sc.ch == '\'') {
            sc.SetState(SCE_VISUALPROLOG_CHARACTER_ESCAPE_ERROR);
            sc.ForwardSetState(SCE_VISUALPROLOG_DEFAULT);
        } else {
            if (sc.ch == '\\') {
                sc.SetState(SCE_VISUALPROLOG_CHARACTER_ESCAPE_ERROR);
                forwardEscapeLiteral(sc, SCE_VISUALPROLOG_CHARACTER);
            }
            sc.ForwardSetState(SCE_VISUALPROLOG_CHARACTER);
            if (sc.ch == '\'') {
                sc.ForwardSetState(SCE_VISUALPROLOG_DEFAULT);
            } else {
                sc.SetState(SCE_VISUALPROLOG_CHARACTER_TOO_MANY);
            }
        }
        break;

We always use the editor in UTF-8 mode, and if the char is, for example, the Danish letter 'ø', which is a two-byte sequence in UTF-8, then ch seems to hold the first byte rather than the entire character. It therefore gets the style SCE_VISUALPROLOG_CHARACTER_TOO_MANY (with the code above).

It seems that code like StyleContext::GetNextChar:

    void GetNextChar(unsigned int pos) {
        chNext = static_cast<unsigned char>(styler.SafeGetCharAt(pos+1));
        if (styler.IsLeadByte(static_cast<char>(chNext))) {
            chNext = chNext << 8;
            chNext |= static_cast<unsigned char>(styler.SafeGetCharAt(pos+2));
        }

tries to combine multi-byte sequences into the single character they represent, but the code seems wrong for UTF-8.

I suppose that the best handling of UTF-8 would be for 'ch' (which is an int) to contain the character as UTF-32.

Assuming that it (soon) does contain the UTF-32 character, we have the next problem: Visual Prolog has the unusual feature that identifiers may contain letters from any language in the world (or, more correctly, any Unicode character that is classified as a letter). Furthermore, identifiers starting with an uppercase letter are "Variables" (and should be green), whereas identifiers starting with a lowercase letter are constants (and should be black).

The letters a-z and A-Z are easy, but what about æøå and ÆØÅ (the three additional Danish letters in lower/upper case)? On Microsoft Windows (where we only run) there are API functions IsCharLower, IsCharUpper, etc. that will answer such questions, but how do I answer the question in Scintilla, which also runs on other platforms?

Discussion

  • Neil Hodgson
    Neil Hodgson
    2012-11-06

    StyleContext does not support UTF-8. Lexers must add their own code when UTF-8 is syntactically significant. As part of implementing Unicode line ends, this may change in the future.

     
  • Neil Hodgson
    Neil Hodgson
    2012-11-06

    • assigned_to: nobody --> nyamatongwe
     
  • OK.

But how can I make the code portable between Linux and Windows? Platform.h does not seem to offer any Unicode support. (Perhaps I am looking in the wrong place.)

Do you know if any lexer has special Unicode needs/support?

     
  • LexAccessor with IsUnicodeMode

     
    Attachments
  • StyleContext with Unicode support

     
    Attachments
  • StyleContext with necessary include directive

     
    Attachments
I have uploaded files that add Unicode/UTF-8 support to StyleContext. I will be using it in our own context, but if you approve, it might as well be put into the official version. I do not think it makes any difference to people who don't need it.

     
  • Visual Prolog with Unicode handling

     
    Attachments
I have also uploaded a new lexer for Visual Prolog, which treats identifiers correctly. I have used the functions iswlower, iswupper and iswalnum, so I hope they are supported on Linux.

     
iswlower() & friends are part of the C99 standard, so they should work almost everywhere -- but they bring a C99 dependency. However, as far as I know, nothing tells you that the wide characters of the standard library use a Unicode representation. Actually, they are only required to store any character of the current locale, which could very well not be Unicode.
Most modern GNU/Linux distributions use a UTF-8 locale, and thus probably use Unicode for wide characters; but I doubt that is the case for every OS. Say, does Windows XP with a CP-1252 locale support Unicode wide characters? The same question goes for a GNU/Linux distribution with a non-Unicode locale.

Additionally, the GNU/Linux manpages for isupper() and islower() warn about using them for Unicode:
"[These functions are] not very appropriate for dealing with Unicode characters, because Unicode knows about three cases: upper, lower and title case".
This may or may not be a concern for you, but the C standard definitely doesn't include an istitle() function, so using these won't allow for full Unicode character classification. They also warn that those functions behave differently depending on the LC_CTYPE part of the current locale.

     
Thank you for your comments; they are quite interesting.

In this context I think the mentioned problems are worse than the alternative. I.e. without these changes my lexer will definitely color wrongly whenever there are national letters of any kind. With the change it seems correct on Windows (which apparently considers the 'w' versions to work with Unicode). Should it not work the same way on other operating systems, then things are just back to the situation they would have been in anyway.

     
  • Sorry, I meant that this solution is BETTER than the alternative, because it is correct in more cases.

     
  • Neil Hodgson
    Neil Hodgson
    2012-11-08

    I won't be able to review this code until after I return from vacation at year's end.

     
  • Neil Hodgson
    Neil Hodgson
    2013-01-06

Regarding iswlower, iswupper and iswalnum, it may be worthwhile wrapping these so they can be re-implemented if there is a platform problem. The lexlib/CharacterSet module would be an appropriate place, with the function definitions in the .cxx file so any system headers and preprocessor hackery can be hidden.

    The current state of my Unicode line ends patch is available from http://www.scintilla.org/unidecode1.patch
That is an excerpt concentrating on the changes to StyleContext, but much of the patch is concerned with Unicode line ends, which were difficult to handle at the byte level, so a current line-end position is maintained instead. The code for converting bytes to UTF-32 should be similar to yours. The Unicode line ends change modifies some strongly versioned interfaces, so it won't be committed until some other changes to those interfaces are complete; it is unlikely to land soon.

     
    • Neil Hodgson
      Neil Hodgson
      2013-06-02

      I looked into the behaviour of iswlower and iswupper on various platforms and they are very inconsistent so shouldn't be used. Something that reports the Unicode classification of a character would be worthwhile since that is well defined.

       
  • Neil Hodgson
    Neil Hodgson
    2013-02-26

    There is now some support for Unicode character handling in StyleContext.

     
  • Neil Hodgson
    Neil Hodgson
    2013-07-21

    A new module CharacterCategory reports the Unicode general category of a character.

     
  • Neil Hodgson
    Neil Hodgson
    2013-07-21

    • status: open --> closed-fixed
    • Group: --> Bug