#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Bug
closed-fixed
Neil Hodgson
5
2015-02-12
2013-06-01
No

Now StyleContext reports characters rather than bytes, its GetRelative() method should follow:

  1. count characters, not bytes
  2. get the character relative to the position, not the byte

'1' breaks at least the Perl lexer which uses GetRelative() and the passes an offset found with it to Forward() -- Forward() uses characters, so it will forward too much. For example this code will not highlight as expected:

" ü " keyword

(see https://sourceforge.net/p/geany/bugs/962/)


Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.

On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:

if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
    pos += 4;
else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
    pos += 3;
else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
    pos += 2;
else
    pos++;

This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1) it would not change anything.

Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar() on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:

int c = sc.ch;
sc.Forward();
assert(c == sc.GetRelative(-1));

might perhaps fail on invalid input.

2 Attachments

Related

Geany: Bugs: #995
Bugs: #1497
Bugs: #1528

Discussion

<< < 1 2 (Page 2 of 2)
  • Neil Hodgson
    Neil Hodgson
    2013-06-30

    • status: open --> open-fixed
     
  • Neil Hodgson
    Neil Hodgson
    2013-07-21

    • status: open-fixed --> closed-fixed
     
  • Matthew Brush
    Matthew Brush
    2013-09-16

    Hi Neil,

    I just happened by here from a link on a new Geany bug report that references this one and after reading the comments, it made me think of a library I started using lately in an unrelated project which might be of interest to Scintilla. It's called utf8-cpp and is a very small (4 header-only C++ files) and provides simple and C++ idiomatic ways to iterate over and operate on a UTF-8 encoded buffer by code unit. I've no idea about DBCS but it supports to/from 16 and 32 bit Unicode code units.

    Sorry for random drive-by comment, I didn't want to pester the whole mailing list :)

     
<< < 1 2 (Page 2 of 2)