#1418 utf8 token coloring

Bug
closed-fixed
Scintilla (792)
5
2013-07-21
2012-10-29
No

I have written a lexer for Visual Prolog, but it has some UTF-8 related problems that I am not sure how to handle (and perhaps the lexer concept is a little too simplified in this respect).

The first problem is that a char literal like 'q' should only contain one char (unless there are escape sequences). If there are too many characters, the style is set to SCE_VISUALPROLOG_CHARACTER_TOO_MANY:

case SCE_VISUALPROLOG_CHARACTER:
    if (sc.atLineEnd) {
        sc.SetState(SCE_VISUALPROLOG_STRING_EOL_OPEN); // reuse STRING_EOL_OPEN for this
    } else if (sc.ch == '\'') {
        sc.SetState(SCE_VISUALPROLOG_CHARACTER_ESCAPE_ERROR);
        sc.ForwardSetState(SCE_VISUALPROLOG_DEFAULT);
    } else {
        if (sc.ch == '\\') {
            sc.SetState(SCE_VISUALPROLOG_CHARACTER_ESCAPE_ERROR);
            forwardEscapeLiteral(sc, SCE_VISUALPROLOG_CHARACTER);
        }
        sc.ForwardSetState(SCE_VISUALPROLOG_CHARACTER);
        if (sc.ch == '\'') {
            sc.ForwardSetState(SCE_VISUALPROLOG_DEFAULT);
        } else {
            sc.SetState(SCE_VISUALPROLOG_CHARACTER_TOO_MANY);
        }
    }
    break;

We always use the editor in UTF-8 mode, and if the char is, for example, the Danish letter 'ø', which is a two-byte sequence in UTF-8, then it seems that ch holds the first byte rather than the entire char. Therefore it gets the style SCE_VISUALPROLOG_CHARACTER_TOO_MANY (with the code above).

It seems that code like StyleContext::GetNextChar:

void GetNextChar(unsigned int pos) {
    chNext = static_cast<unsigned char>(styler.SafeGetCharAt(pos+1));
    if (styler.IsLeadByte(static_cast<char>(chNext))) {
        chNext = chNext << 8;
        chNext |= static_cast<unsigned char>(styler.SafeGetCharAt(pos+2));
    }

tries to combine multi-byte sequences into the single char they represent, but the code seems wrong for UTF-8.

I suppose that the best handling of UTF-8 would be that 'ch' (which is an int) contains the char in UTF-32.
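To illustrate what "ch contains the char in UTF-32" would involve, here is a minimal sketch of decoding one UTF-8 sequence into a code point. The names Decoded, IsTrail and DecodeUTF8 are hypothetical and are not part of Scintilla. The sketch assumes pos is a valid index into the text, and it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper types and functions, not Scintilla's API.
struct Decoded {
    int codePoint;  // UTF-32 value, or the raw byte if the sequence is malformed
    int length;     // number of bytes consumed, always at least 1
};

// UTF-8 trail bytes have the bit pattern 10xxxxxx.
inline bool IsTrail(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Decode the UTF-8 sequence starting at text[pos] (pos must be < text.size()).
// Simplification: overlong forms and surrogate values are not rejected.
Decoded DecodeUTF8(const std::string &text, std::size_t pos) {
    const unsigned char b0 = static_cast<unsigned char>(text[pos]);
    if (b0 < 0x80)
        return {b0, 1};                 // plain ASCII byte

    int trailCount = 0;
    int cp = 0;
    if ((b0 & 0xE0) == 0xC0) {          // 110xxxxx: two-byte sequence
        trailCount = 1;
        cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {   // 1110xxxx: three-byte sequence
        trailCount = 2;
        cp = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {   // 11110xxx: four-byte sequence
        trailCount = 3;
        cp = b0 & 0x07;
    } else {
        return {b0, 1};                 // stray trail byte or invalid lead byte
    }

    if (pos + trailCount >= text.size())
        return {b0, 1};                 // sequence is truncated

    for (int i = 1; i <= trailCount; i++) {
        const unsigned char b = static_cast<unsigned char>(text[pos + i]);
        if (!IsTrail(b))
            return {b0, 1};             // malformed: treat the lead byte alone
        cp = (cp << 6) | (b & 0x3F);    // append the 6 payload bits
    }
    return {cp, trailCount + 1};
}
```

With something like this, the literal 'ø' (bytes 0xC3 0xB8) decodes to the single code point U+00F8 with a length of 2, so the closing quote is found right after one character and the literal would not be styled as CHARACTER_TOO_MANY.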

Assuming that it (soon) does contain the UTF-32 char, we then have the next problem: Visual Prolog has the unusual feature that identifiers may contain letters from any language in the world (or more correctly, any Unicode character that is classified as a letter). Furthermore, identifiers starting with an uppercase letter are "Variables" (and should be green), whereas identifiers starting with a lowercase letter are constants (and should be black).

The letters a-z and A-Z are easy, but what about æøå and ÆØÅ (three additional Danish letters in lower/upper case)? On Microsoft Windows (where we only run) there are API functions IsCharLower, IsCharUpper, etc. that will answer such questions, but how do I answer the question in Scintilla, which also runs on other platforms?
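One platform-independent way to answer the question is to classify the decoded code point against fixed tables instead of calling an operating system function. The sketch below is only an illustration under stated assumptions: the names IdentifierKind, IsUpperLetter, IsLowerLetter and ClassifyIdentifierStart are hypothetical, and the ranges cover just ASCII and the Latin-1 Supplement block (enough for æøå and ÆØÅ). A real lexer would need data for the full Unicode general categories.

```cpp
// Hypothetical classification of the first character of an identifier.
enum class IdentifierKind { Variable, Constant, Other };

// Partial table: ASCII plus Latin-1 Supplement uppercase letters only.
bool IsUpperLetter(int cp) {
    if (cp >= 'A' && cp <= 'Z')
        return true;
    // U+00C0..U+00DE are uppercase letters, except U+00D7 (multiplication sign).
    return cp >= 0xC0 && cp <= 0xDE && cp != 0xD7;
}

// Partial table: ASCII plus Latin-1 Supplement lowercase letters only.
bool IsLowerLetter(int cp) {
    if (cp >= 'a' && cp <= 'z')
        return true;
    // U+00DF..U+00FF are lowercase letters, except U+00F7 (division sign).
    return cp >= 0xDF && cp <= 0xFF && cp != 0xF7;
}

// Uppercase start means a variable, lowercase start means a constant,
// following the rule described in the text above.
IdentifierKind ClassifyIdentifierStart(int cp) {
    if (IsUpperLetter(cp))
        return IdentifierKind::Variable;
    if (IsLowerLetter(cp))
        return IdentifierKind::Constant;
    return IdentifierKind::Other;
}
```

Here ClassifyIdentifierStart(0xD8) (Ø) yields Variable and ClassifyIdentifierStart(0xF8) (ø) yields Constant on every platform, because no locale or system call is involved.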

Discussion

  • Neil Hodgson

    Neil Hodgson - 2012-11-06

    StyleContext does not support UTF-8. Lexers must add their own code when UTF-8 is syntactically significant. As part of implementing Unicode line ends, this may change in the future.

     
  • Neil Hodgson

    Neil Hodgson - 2012-11-06
    • assigned_to: nobody --> nyamatongwe
     
  • Thomas Linder Puls

    OK.

    But how can I make the code portable between Linux and Windows? Platform.h does not seem to offer any Unicode support. (Perhaps I am looking in the wrong place.)

    Do you know if any lexer has special Unicode needs/support?

     
  • Thomas Linder Puls

    LexAccessor with IsUnicodeMode

     
  • Thomas Linder Puls

    StyleContext with Unicode support

     
  • Thomas Linder Puls

    StyleContext with necessary include directive

     
  • Thomas Linder Puls

    I have uploaded files that add unicode/utf8 support to StyleContext. I will be using it in our own context, but if you approve it might as well be put into the official version. I do not think it makes any difference to people that don't need it.

     
  • Thomas Linder Puls

    I have also uploaded a new lexer for Visual Prolog, which treats identifiers correctly. I have used the functions iswlower, iswupper and iswalnum, so I hope they are supported on Linux.

     
  • Colomban Wendling

    iswlower() & friends are part of the C99 standard, so they should work almost everywhere, but they bring a C99 dependency. However, as far as I know, nothing guarantees that the wide characters of the standard library use a Unicode representation. In fact, wchar_t is only required to be able to store any character of the current locale, which could very well not be Unicode.
    Most modern GNU/Linux distributions use a UTF-8 locale, and thus probably use Unicode for wide characters, but I doubt that is the case for every OS. Say, does Windows XP with a CP-1252 locale support Unicode wide characters? The same question goes for a GNU/Linux distribution with a non-Unicode locale.

    Additionally, the GNU/Linux manpages for isupper() and islower() warn against using them for Unicode:
    "[These functions are] not very appropriate for dealing with Unicode characters, because Unicode knows about three cases: upper, lower and title case".
    This may or may not be a concern for you, but the C standard definitely doesn't include an istitle() function, so these functions won't allow full Unicode character classification. The manpages also warn that these functions behave differently depending on the LC_CTYPE part of the current locale.

     
  • Thomas Linder Puls

    Thank you for your comments, it is quite interesting.

    In this context I think the mentioned problems are worse than the alternative. I.e. without these changes my lexer will definitely color wrong whenever there are any national letters of any kind. With the change it seems correct on Windows (which apparently considers the 'w' versions to work with Unicode). Should it not work in the same way on other operating systems, then things are just back to the situation they would have been in anyway.

     
  • Thomas Linder Puls

    Sorry, I meant that this solution is BETTER than the alternative, because it is correct in more cases.

     
  • Neil Hodgson

    Neil Hodgson - 2012-11-08

    I won't be able to review this code until after I return from vacation at year's end.

     
  • Neil Hodgson

    Neil Hodgson - 2013-01-06

    Regarding iswlower, iswupper and iswalnum, it may be worthwhile wrapping these so they can be re-implemented if there is a platform problem. The lexlib/CharacterSet module would be an appropriate place, with the function definition in the .cxx file so any system headers and preprocessor hackery can be hidden.

    The current state of my Unicode line ends patch is available from http://www.scintilla.org/unidecode1.patch
    That is an excerpt that concentrates on the changes to StyleContext, but much of the patch is concerned with Unicode line ends, which were difficult to handle at the byte level, so a current line end position is maintained instead. The code for converting bytes to UTF-32 should be similar to yours. The Unicode line ends change modifies some strongly versioned interfaces, so it won't be committed until some other changes to those interfaces are complete, and that is unlikely to be soon.

     
    • Neil Hodgson

      Neil Hodgson - 2013-06-02

      I looked into the behaviour of iswlower and iswupper on various platforms and they are very inconsistent, so they shouldn't be used. Something that reports the Unicode classification of a character would be worthwhile, since that is well defined.

       
  • Neil Hodgson

    Neil Hodgson - 2013-02-26

    There is now some support for Unicode character handling in StyleContext.

     
  • Neil Hodgson

    Neil Hodgson - 2013-07-21

    A new module CharacterCategory reports the Unicode general category of a character.

     
  • Neil Hodgson

    Neil Hodgson - 2013-07-21
    • status: open --> closed-fixed
    • Group: --> Bug
     
