Menu

#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Bug
closed-fixed
5
2015-02-12
2013-06-01
No

Now StyleContext reports characters rather than bytes, its GetRelative() method should follow:

  1. count characters, not bytes
  2. get the character relative to the position, not the byte

'1' breaks at least the Perl lexer which uses GetRelative() and the passes an offset found with it to Forward() -- Forward() uses characters, so it will forward too much. For example this code will not highlight as expected:

" ü " keyword

(see https://sourceforge.net/p/geany/bugs/962/)


Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.

On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:

if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
    pos += 4;
else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
    pos += 3;
else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
    pos += 2;
else
    pos++;

This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1) it would not change anything.

Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar() on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:

int c = sc.ch;
sc.Forward();
assert(c == sc.GetRelative(-1));

might perhaps fail on invalid input.

2 Attachments

Related

Geany: Bugs: #995
Bugs: #1497
Bugs: #1528

Discussion

1 2 3 > >> (Page 1 of 3)
  • Kein-Hong Man

    Kein-Hong Man - 2013-06-02

    I'll wait for Neil to weigh in, then I'll dig in and see what bits of the forward and backward scanning need to be changed. Backward scanning that involve checking styles may still work.

     
  • Neil Hodgson

    Neil Hodgson - 2013-06-02

    Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.

    Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems. Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.

    There are a few idioms that use GetRelative and it may be better to directly support these idioms. One is to seek for a string or set of strings and it would be possible to add a search method that took an encoded string (byte string in the document's encoding) and found the byte offset of this text. Another idiom is Match(byte string) and then examine the character after that byte string. Some byte-oriented calls like ForwardBytes may also be useful for lexers that want to continue working in bytes.

    The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.

     
    • Colomban Wendling

      Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.

      Yeah, seeking over non-constant-sized bytes has a cost indeed. However, my point here is that (almost) everything in StyleContext is now character-based, but GetRelative() is byte based, which is inconsistent -- and did lead to breakage.

      BTW, looking at LexTxt2tags makes me guess that this lexer could probably suffer of the same problem. BTW, this lexer seems to overuse GetRelative() a bit -- but OK, I don't know Txt2Tags and won't update the lexer.

      Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems.

      Indeed, that's the reason I pointed out it might not be trivial to fix.

      Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.

      Would it be that expensive? OK, it would require storing a tuple array for position/character (or maybe character is enough?), but Forward() would benefit from it and also fill it, so the added cost would roughly only be maintaining the array. I mean, if you call GetRelative(3) and then Forward(3), the GetRelative() call would fill the cache and the Forward() call would only be a cache fetch. Same for backwards, the cache would already be filled by previous Forward()s.

      It would probably be a bit of work to implement this, but I doubt it would have a important cost in performances. Maybe to avoid a huge cache, it could default to a small size and a caller could ask for more if it uses large backward/forward searches (like StyleContext::SetCacheSize() or something). Maybe a circular buffer could also avoid most memory movements.

      There are a few idioms that use GetRelative and it may be better to directly support these idioms. [...]

      Indeed. It might even render caller code simpler and would probably allow to avoid a lot of encoding issues, since generally what callers are searching for is plain ASCII. However again, some caller use the search to get a position later passed to Forward(), and then needs a character offset -- which comes back to the character issue.

      The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.

      Ah sorry, maybe my SciTE is configured to load UTF-8 or something (and yes, Geany always loads UTF-8). But try any MB encoding and you'll see the string styles breaks.

       
      • Neil Hodgson

        Neil Hodgson - 2013-06-10

        A circular buffer sounds reasonable. It should be implemented in a way that performs well for a wide set of situations from holding a couple of lines in a normal file with decent handling of the inevitable Megabyte line cases.

        If no one else starts work, I'll probably look at it some time. The code could be divided into a generic circular buffer class and the adaptation of that into StyleContext or a subclass of StyleContext.

         
  • Neil Hodgson

    Neil Hodgson - 2013-06-02

    One way to avoid quadratic costs in GetRelative (Character) would be to cache the previous (relative character -> byte offset) access and start from there when the new request is same or later.

    Could also store the current position active for last GetRelative so that the cached position is invalidated correctly without adding to the cost of GetNextChar.

     
    • Colomban Wendling

      Indeed, this could be a solution that would avoid maintaining a complete cache. Its usefulness depends on how the callers use GetRelative(), but it looks kinda sensible (but for backward searches, where it wouldn't help).

      But instead of maintaining the old GetRelative() position I'd probably rather update the cached offset in GetNextChar() -- it is quite cheap and probably would make it possible not to invalidate the cache without adding more complex code. But that's a detail anyway.

       
  • Kein-Hong Man

    Kein-Hong Man - 2013-06-02

    Okay, I'll attack the thing during the next week.
    I have converted the sample to utf-8 and I can see the breakage. It also breaks some of my old utf-8 test cases.

     
  • Neil Hodgson

    Neil Hodgson - 2013-06-07

    Attached is a patch that fixes a small part of this. A ForwardBytes method is added to StyleContext and this is used inside the lexer where bytes have been counted. Another change in the lexer retrieves the number of bytes moved by Forward to keep things in sync.

    The BackwardChar method in the original patch does not handle DBCS. It is difficult to move backwards through DBCS since trail byte values can also be lead byte values. See Document::NextPosition.

    Edited to be explicit about which patch doesn't do DBCS.

     

    Last edit: Neil Hodgson 2013-06-07
  • Neil Hodgson

    Neil Hodgson - 2013-06-08
    • labels: --> scintilla, lexer
    • assigned_to: Neil Hodgson
     
  • Neil Hodgson

    Neil Hodgson - 2013-06-08

    Attached is an attempt to classify the uses of GetRelative, into direction (forwards or backwards) and whether the use is in a loop with a variable position or a simple constant.

    Since moving backwards is more difficult, particularly for DBCS, it would be better to avoid it. Most of the uses appear safe with GetRelative using byte offsets. When GetRelative is used to extract strings, its safer to leave these as byte strings, especially if they are to be compared, than it is to convert to 32bit-character strings.

    The constant offsets to GetRelative range from -4 to +3. Some negative cases look for line ends so can't be satisfied by remembering a small buffer of recently current characters.

     
1 2 3 > >> (Page 1 of 3)

Log in to post a comment.