#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Bug
closed-fixed
5
2015-02-12
2013-06-01
No

Now StyleContext reports characters rather than bytes, its GetRelative() method should follow:

  1. count characters, not bytes
  2. get the character relative to the position, not the byte

'1' breaks at least the Perl lexer which uses GetRelative() and the passes an offset found with it to Forward() -- Forward() uses characters, so it will forward too much. For example this code will not highlight as expected:

" ü " keyword

(see https://sourceforge.net/p/geany/bugs/962/)


Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.

On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:

if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
    pos += 4;
else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
    pos += 3;
else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
    pos += 2;
else
    pos++;

This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1) it would not change anything.

Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar() on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:

int c = sc.ch;
sc.Forward();
assert(c == sc.GetRelative(-1));

might perhaps fail on invalid input.

2 Attachments

Related

Geany: Bugs: #995
Bugs: #1497
Bugs: #1528

Discussion

<< < 1 2 3 (Page 3 of 3)
  • Kein-Hong Man

    Kein-Hong Man - 2013-06-27

    Make that: File handle (in angle brackets) scan now does DBCS/utf-8 too. (Gee, I wonder if this SF thing is handling angle brackets properly...)

     
  • Neil Hodgson

    Neil Hodgson - 2013-06-28

    That seems good. I like dropping the calls to styler.GetRelativePosition since that method may change signature to possibly return multiple (character,width) values in a single call as an optimization. Or be split into two calls. Have to do some benchmarks of new and old code to ensure there hasn't been too much of a performance cost.

    To placate some MSVC warnings, I had to change some loops from

    while ((c = sc.GetRelativeCharacter(++sLen))) {
    

    to

    while ((c = sc.GetRelativeCharacter(++sLen)) != 0) {
    

    Yes, the tracker uses a markup language that eats some characters but there are also some features like inline images that make up for it.

    BTW, would it be OK to omit the "this->" in the implementation of the QuoteCls constructor? cppcheck gets confused and complains about fields not being initialized.

     
  • Kein-Hong Man

    Kein-Hong Man - 2013-06-28

    Sure, whack away on the "this->". My C++ is always rusty so I tend to avoid C++ cleanups...

     
  • Neil Hodgson

    Neil Hodgson - 2013-06-29

    [a4ae33] Splits GetRelativePosition into 2 calls as the most common call in GetNextChar knows exactly which position to retrieve. Slightly faster. Most lexers are approximately equivalent speeds for UTF-8 and Latin-1 although there are sometimes differences of up to 20%.

    A test added in [d46f56] which checks that UTF-8 labels in Lua are lexed correctly. This test failed with the previous implementation as the label style continued past "::" into line end.

     

    Related

    Commit: [a4ae33]
    Commit: [d46f56]

  • Neil Hodgson

    Neil Hodgson - 2013-06-30
    • status: open --> open-fixed
     
  • Neil Hodgson

    Neil Hodgson - 2013-07-21
    • status: open-fixed --> closed-fixed
     
  • Anonymous - 2013-09-16

    Hi Neil,

    I just happened by here from a link on a new Geany bug report that references this one and after reading the comments, it made me think of a library I started using lately in an unrelated project which might be of interest to Scintilla. It's called utf8-cpp and is a very small (4 header-only C++ files) and provides simple and C++ idiomatic ways to iterate over and operate on a UTF-8 encoded buffer by code unit. I've no idea about DBCS but it supports to/from 16 and 32 bit Unicode code units.

    Sorry for random drive-by comment, I didn't want to pester the whole mailing list :)

     
<< < 1 2 3 (Page 3 of 3)

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks