#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Bug
closed-fixed
Neil Hodgson
5
2015-01-13
2013-06-01
No

Now StyleContext reports characters rather than bytes, its GetRelative() method should follow:

  1. count characters, not bytes
  2. get the character relative to the position, not the byte

'1' breaks at least the Perl lexer which uses GetRelative() and the passes an offset found with it to Forward() -- Forward() uses characters, so it will forward too much. For example this code will not highlight as expected:

" ü " keyword

(see https://sourceforge.net/p/geany/bugs/962/)


Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.

On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:

if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
    pos += 4;
else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
    pos += 3;
else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
    pos += 2;
else
    pos++;

This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1) it would not change anything.

Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar() on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:

int c = sc.ch;
sc.Forward();
assert(c == sc.GetRelative(-1));

might perhaps fail on invalid input.

2 Attachments

Related

Geany: Bugs: #995
Bugs: #1497
Bugs: #1528

Discussion

<< < 1 2 3 > >> (Page 2 of 3)
  • Neil Hodgson
    Neil Hodgson
    2013-06-18

    ForwardBytes method added to StyleContext.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-18

    Studying DBCS and the various Windows codepages now.
    The perldoc most relevant seems to be:
    http://perldoc.perl.org/encoding.html
    which mainly talks about encoding of various string types q// etc. Some EUC-JP examples there but that's not DBCS, but anyway I have the codepage tables to prepare some DBCS test cases to supplement existing utf-8 test cases.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-18

    Ignoring the rest of the byte-based scanning stuff for now, I have checked usage of GetRelative() for everything not related to interpolated strings, and I don't think anything need to be changed there for now. Some notes and tests attached. Now I will deal with the interpolated strings stuff.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-19

    Here is a first stab at DBCS and utf-8 recognition within all kinds of delimited strings. The added function is awkward, but works... unsure whether it can be written in a more elegant way (without doing a complete overhaul). Added more DBCS samples. Still need more tests to check the codepaths etc., will cook it for a few days more.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-25

    Attached is a patch that provides a GetRelativeCharacter method on StyleContext. It moves much of the code into Document as a call GetRelativePosition which combines moving an integral number of characters and retrieving the character at the resulting position.

    It might be simpler to provide separate methods to find a relative position and to retrieve the character/width at a position but that may not optimize as well and its hidden inside StyleContext. StyleContext remembers the last access position so should be efficient when looping over a range.

    The patch will need more testing. Tried using it for Lua, TCL, Txt2Tags and Markdown and it looks like most places are better served by byte-oriented access. Its common to want to retrieve byte-strings for comparison with a list of keywords provided as byte-strings, not character-strings.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    Tried using the patch, but it currently breaks existing stuff and breaks refresh when editing. I have tried converting the GetRelative items in LexPerl to be character-oriented, LexPerl draft attached. So it will go back and forth a bit more but that's fine by me.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-27

    The GetRelativePosition code has been working OK for me and I'm reasonably sure its an OK direction even if not yet bug-free so its committed with some small changes as [b7f382].

    A quick check showed the draft Perl lexer worked OK on the UTF-8 and DBCS test files although I'm not very knowledgeable about Perl syntax corner cases. If there's a particular refresh problem, I can look into it.

     

    Related

    Commit: [b7f382]

  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    I'll pull from hg and compile again, I might have missed something blindingly obvious. Hope it's not due to compiler differences. Will report later...

     
    • Kein-Hong Man
      Kein-Hong Man
      2013-06-27

      False alarm :-) hg HEAD + draft LexPerl builds and works fine. Will test a bit more. Must've messed up the patch somewhere.

       
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    A cleaned-up update. No more usage ForwardBytes() or GetRelativePosition(). <filehandle> scan now does DBCS/utf-8 too. Only HERE doc delimiters and -barewords may still break for DBCS/utf-8.

    As for maxSeg usage in InterpolateSegment(), looks like it cannot be cleaned up trivially because of custom delimiters checks, but if factorized out, I guess the repetition can be avoided.

     
<< < 1 2 3 > >> (Page 2 of 3)