Since StyleContext reports characters rather than bytes, its
GetRelative() method should follow suit.
The current byte-based behavior breaks at least the Perl lexer, which uses
GetRelative() and then passes an offset found with it to
Forward(). Forward() works in characters, so it will forward too much. For example, this code will not highlight as expected:
" ü " keyword
Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.
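To make the mismatch in the example above concrete, here is a small standalone sketch. It is not Scintilla code: CharsInPrefix() is a hypothetical helper, and it assumes the buffer is UTF-8. It counts how many characters fit in a given number of bytes, which is the conversion a byte offset would need before being handed to a character-based Forward().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Scintilla): number of UTF-8 characters
// that start within the first `bytes` bytes of `s`. Every byte that is not
// a continuation byte (0b10xxxxxx) is counted as the start of a character.
inline std::size_t CharsInPrefix(const std::string &s, std::size_t bytes) {
    assert(bytes <= s.size());
    std::size_t chars = 0;
    for (std::size_t i = 0; i < bytes; i++) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the sample line, the closing quote sits 5 bytes after the opening quote but only 4 characters after it, because "ü" takes two bytes in UTF-8. Passing the byte offset 5 to a character-based Forward() would therefore move one character too far.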
On the implementation side, though, there is currently a conflict between avoiding code duplication, avoiding divergent handling of invalid characters, and avoiding unnecessary data access. Currently,
StyleContext has code to get the next character, but it needs to know the full value of the character, so it reads the whole set of bytes (1-4). When fast-forwarding to find the character at
pos+n, one could rely on some encoding properties to avoid fetching some of the data, e.g. the first byte of a UTF-8 sequence encodes how many bytes the sequence contains, so one could do:
    if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
        pos += 4;
    else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
        pos += 3;
    else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
        pos += 2;
    else
        pos++;
This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for
GetRelative(1) it would not change anything.
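As a rough illustration of the lead-byte idea, the sketch below skips n characters forward while reading only the first byte of each sequence. Utf8LengthFromLead() and SkipForward() are hypothetical names, the text is a plain std::string rather than a Scintilla document accessor, and treating stray continuation bytes or other invalid lead bytes as one-byte characters is an assumption that may well differ from what StyleContext::GetNextChar() does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of a UTF-8 sequence, judged from its
// lead byte alone. Anything that is not a valid 2-4 byte lead is treated
// as a single byte (an assumption, see the lead-in).
inline std::size_t Utf8LengthFromLead(unsigned char lead) {
    if ((lead & 0xF8) == 0xF0) return 4; // 0b11110xxx
    if ((lead & 0xF0) == 0xE0) return 3; // 0b1110xxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 0b110xxxxx
    return 1;
}

// Byte position reached after skipping n characters starting at pos,
// clamped to the end of the text. Only lead bytes are read.
inline std::size_t SkipForward(const std::string &text, std::size_t pos, std::size_t n) {
    assert(pos <= text.size());
    while (n > 0 && pos < text.size()) {
        pos += Utf8LengthFromLead(static_cast<unsigned char>(text[pos]));
        n--;
    }
    return pos < text.size() ? pos : text.size();
}
```

The character at the final position still has to be decoded in full, which is why this buys nothing for an offset of 1.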
Also, going backwards is a little tricky, so it probably won't give exactly the same result as the current
GetNextChar() on invalid data anyway (since to go backward we pretty much have no other option than to seek backward until we find the previous lead byte). I mean:
    int c = sc.ch;
    sc.Forward();
    assert(c == sc.GetRelative(-1));
might perhaps fail on invalid input.
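A minimal sketch of the backward seek, again on a plain std::string and with a hypothetical name (PreviousCharStart()), shows where the asymmetry can come from. It steps back over at most three continuation bytes and stops at the first byte that is not one. This is only one possible policy for invalid data, not necessarily what the patch or Scintilla does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte position where the character preceding `pos`
// starts, found by seeking backward past continuation bytes (0b10xxxxxx).
// At most 3 continuation bytes are skipped, the maximum in valid UTF-8.
inline std::size_t PreviousCharStart(const std::string &text, std::size_t pos) {
    assert(pos <= text.size());
    if (pos == 0)
        return 0;
    pos--;
    int remaining = 3;
    while (pos > 0 && remaining > 0 &&
           (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80) {
        pos--;
        remaining--;
    }
    return pos;
}
```

On valid input this lands on the lead byte of the previous character. On a truncated sequence such as the bytes 0xC3 'a' 'b', a forward step driven by the lead byte alone would jump from 0 to 2, while the backward seek from 2 stops at 1 because 'a' is not a continuation byte. That is the kind of disagreement the assertion above could trip over.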