Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Bug
closed-fixed
Neil Hodgson
5
2014-06-16
2013-06-01
No

Now StyleContext reports characters rather than bytes, its GetRelative() method should follow:

  1. count characters, not bytes
  2. get the character relative to the position, not the byte

'1' breaks at least the Perl lexer which uses GetRelative() and the passes an offset found with it to Forward() -- Forward() uses characters, so it will forward too much. For example this code will not highlight as expected:

" ü " keyword

(see https://sourceforge.net/p/geany/bugs/962/)


Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.

On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:

if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx
    pos += 4;
else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx
    pos += 3;
else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx
    pos += 2;
else
    pos++;

This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1) it would not change anything.

Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar() on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:

int c = sc.ch;
sc.Forward();
assert(c == sc.GetRelative(-1));

might perhaps fail on invalid input.

2 Attachments

Related

Geany: Bugs: #995
Bugs: #1497
Bugs: #1528

Discussion

  • Kein-Hong Man
    Kein-Hong Man
    2013-06-02

    I'll wait for Neil to weigh in, then I'll dig in and see what bits of the forward and backward scanning need to be changed. Backward scanning that involve checking styles may still work.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-02

    Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.

    Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems. Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.

    There are a few idioms that use GetRelative and it may be better to directly support these idioms. One is to seek for a string or set of strings and it would be possible to add a search method that took an encoded string (byte string in the document's encoding) and found the byte offset of this text. Another idiom is Match(byte string) and then examine the character after that byte string. Some byte-oriented calls like ForwardBytes may also be useful for lexers that want to continue working in bytes.

    The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.

     
    • Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.

      Yeah, seeking over non-constant-sized bytes has a cost indeed. However, my point here is that (almost) everything in StyleContext is now character-based, but GetRelative() is byte based, which is inconsistent -- and did lead to breakage.

      BTW, looking at LexTxt2tags makes me guess that this lexer could probably suffer of the same problem. BTW, this lexer seems to overuse GetRelative() a bit -- but OK, I don't know Txt2Tags and won't update the lexer.

      Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems.

      Indeed, that's the reason I pointed out it might not be trivial to fix.

      Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.

      Would it be that expensive? OK, it would require storing a tuple array for position/character (or maybe character is enough?), but Forward() would benefit from it and also fill it, so the added cost would roughly only be maintaining the array. I mean, if you call GetRelative(3) and then Forward(3), the GetRelative() call would fill the cache and the Forward() call would only be a cache fetch. Same for backwards, the cache would already be filled by previous Forward()s.

      It would probably be a bit of work to implement this, but I doubt it would have a important cost in performances. Maybe to avoid a huge cache, it could default to a small size and a caller could ask for more if it uses large backward/forward searches (like StyleContext::SetCacheSize() or something). Maybe a circular buffer could also avoid most memory movements.

      There are a few idioms that use GetRelative and it may be better to directly support these idioms. [...]

      Indeed. It might even render caller code simpler and would probably allow to avoid a lot of encoding issues, since generally what callers are searching for is plain ASCII. However again, some caller use the search to get a position later passed to Forward(), and then needs a character offset -- which comes back to the character issue.

      The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.

      Ah sorry, maybe my SciTE is configured to load UTF-8 or something (and yes, Geany always loads UTF-8). But try any MB encoding and you'll see the string styles breaks.

       
      • Neil Hodgson
        Neil Hodgson
        2013-06-10

        A circular buffer sounds reasonable. It should be implemented in a way that performs well for a wide set of situations from holding a couple of lines in a normal file with decent handling of the inevitable Megabyte line cases.

        If no one else starts work, I'll probably look at it some time. The code could be divided into a generic circular buffer class and the adaptation of that into StyleContext or a subclass of StyleContext.

         
  • Neil Hodgson
    Neil Hodgson
    2013-06-02

    One way to avoid quadratic costs in GetRelative (Character) would be to cache the previous (relative character -> byte offset) access and start from there when the new request is same or later.

    Could also store the current position active for last GetRelative so that the cached position is invalidated correctly without adding to the cost of GetNextChar.

     
    • Indeed, this could be a solution that would avoid maintaining a complete cache. Its usefulness depends on how the callers use GetRelative(), but it looks kinda sensible (but for backward searches, where it wouldn't help).

      But instead of maintaining the old GetRelative() position I'd probably rather update the cached offset in GetNextChar() -- it is quite cheap and probably would make it possible not to invalidate the cache without adding more complex code. But that's a detail anyway.

       
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-02

    Okay, I'll attack the thing during the next week.
    I have converted the sample to utf-8 and I can see the breakage. It also breaks some of my old utf-8 test cases.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-07

    Attached is a patch that fixes a small part of this. A ForwardBytes method is added to StyleContext and this is used inside the lexer where bytes have been counted. Another change in the lexer retrieves the number of bytes moved by Forward to keep things in sync.

    The BackwardChar method in the original patch does not handle DBCS. It is difficult to move backwards through DBCS since trail byte values can also be lead byte values. See Document::NextPosition.

    Edited to be explicit about which patch doesn't do DBCS.

     
    Last edit: Neil Hodgson 2013-06-07
    Attachments
  • Neil Hodgson
    Neil Hodgson
    2013-06-08

    • labels: --> scintilla, lexer
    • assigned_to: Neil Hodgson
     
  • Neil Hodgson
    Neil Hodgson
    2013-06-08

    Attached is an attempt to classify the uses of GetRelative, into direction (forwards or backwards) and whether the use is in a loop with a variable position or a simple constant.

    Since moving backwards is more difficult, particularly for DBCS, it would be better to avoid it. Most of the uses appear safe with GetRelative using byte offsets. When GetRelative is used to extract strings, its safer to leave these as byte strings, especially if they are to be compared, than it is to convert to 32bit-character strings.

    The constant offsets to GetRelative range from -4 to +3. Some negative cases look for line ends so can't be satisfied by remembering a small buffer of recently current characters.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-18

    ForwardBytes method added to StyleContext.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-18

    Studying DBCS and the various Windows codepages now.
    The perldoc most relevant seems to be:
    http://perldoc.perl.org/encoding.html
    which mainly talks about encoding of various string types q// etc. Some EUC-JP examples there but that's not DBCS, but anyway I have the codepage tables to prepare some DBCS test cases to supplement existing utf-8 test cases.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-18

    Ignoring the rest of the byte-based scanning stuff for now, I have checked usage of GetRelative() for everything not related to interpolated strings, and I don't think anything need to be changed there for now. Some notes and tests attached. Now I will deal with the interpolated strings stuff.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-19

    Here is a first stab at DBCS and utf-8 recognition within all kinds of delimited strings. The added function is awkward, but works... unsure whether it can be written in a more elegant way (without doing a complete overhaul). Added more DBCS samples. Still need more tests to check the codepaths etc., will cook it for a few days more.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-25

    Attached is a patch that provides a GetRelativeCharacter method on StyleContext. It moves much of the code into Document as a call GetRelativePosition which combines moving an integral number of characters and retrieving the character at the resulting position.

    It might be simpler to provide separate methods to find a relative position and to retrieve the character/width at a position but that may not optimize as well and its hidden inside StyleContext. StyleContext remembers the last access position so should be efficient when looping over a range.

    The patch will need more testing. Tried using it for Lua, TCL, Txt2Tags and Markdown and it looks like most places are better served by byte-oriented access. Its common to want to retrieve byte-strings for comparison with a list of keywords provided as byte-strings, not character-strings.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    Tried using the patch, but it currently breaks existing stuff and breaks refresh when editing. I have tried converting the GetRelative items in LexPerl to be character-oriented, LexPerl draft attached. So it will go back and forth a bit more but that's fine by me.

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-27

    The GetRelativePosition code has been working OK for me and I'm reasonably sure its an OK direction even if not yet bug-free so its committed with some small changes as [b7f382].

    A quick check showed the draft Perl lexer worked OK on the UTF-8 and DBCS test files although I'm not very knowledgeable about Perl syntax corner cases. If there's a particular refresh problem, I can look into it.

     

    Related

    Commit: [b7f382]

  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    I'll pull from hg and compile again, I might have missed something blindingly obvious. Hope it's not due to compiler differences. Will report later...

     
    • Kein-Hong Man
      Kein-Hong Man
      2013-06-27

      False alarm :-) hg HEAD + draft LexPerl builds and works fine. Will test a bit more. Must've messed up the patch somewhere.

       
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-27

    A cleaned-up update. No more usage ForwardBytes() or GetRelativePosition(). <filehandle> scan now does DBCS/utf-8 too. Only HERE doc delimiters and -barewords may still break for DBCS/utf-8.

    As for maxSeg usage in InterpolateSegment(), looks like it cannot be cleaned up trivially because of custom delimiters checks, but if factorized out, I guess the repetition can be avoided.

     
    • Kein-Hong Man
      Kein-Hong Man
      2013-06-27

      Make that: File handle (in angle brackets) scan now does DBCS/utf-8 too. (Gee, I wonder if this SF thing is handling angle brackets properly...)

       
  • Neil Hodgson
    Neil Hodgson
    2013-06-28

    That seems good. I like dropping the calls to styler.GetRelativePosition since that method may change signature to possibly return multiple (character,width) values in a single call as an optimization. Or be split into two calls. Have to do some benchmarks of new and old code to ensure there hasn't been too much of a performance cost.

    To placate some MSVC warnings, I had to change some loops from

    while ((c = sc.GetRelativeCharacter(++sLen))) {
    

    to

    while ((c = sc.GetRelativeCharacter(++sLen)) != 0) {
    

    Yes, the tracker uses a markup language that eats some characters but there are also some features like inline images that make up for it.

    BTW, would it be OK to omit the "this->" in the implementation of the QuoteCls constructor? cppcheck gets confused and complains about fields not being initialized.

     
  • Kein-Hong Man
    Kein-Hong Man
    2013-06-28

    Sure, whack away on the "this->". My C++ is always rusty so I tend to avoid C++ cleanups...

     
  • Neil Hodgson
    Neil Hodgson
    2013-06-29

    [a4ae33] Splits GetRelativePosition into 2 calls as the most common call in GetNextChar knows exactly which position to retrieve. Slightly faster. Most lexers are approximately equivalent speeds for UTF-8 and Latin-1 although there are sometimes differences of up to 20%.

    A test added in [d46f56] which checks that UTF-8 labels in Lua are lexed correctly. This test failed with the previous implementation as the label style continued past "::" into line end.

     

    Related

    Commit: [a4ae33]
    Commit: [d46f56]

  • Neil Hodgson
    Neil Hodgson
    2013-06-30

    • status: open --> open-fixed
     
  • Neil Hodgson
    Neil Hodgson
    2013-07-21

    • status: open-fixed --> closed-fixed
     
  • Matthew Brush
    Matthew Brush
    2013-09-16

    Hi Neil,

    I just happened by here from a link on a new Geany bug report that references this one and after reading the comments, it made me think of a library I started using lately in an unrelated project which might be of interest to Scintilla. It's called utf8-cpp and is a very small (4 header-only C++ files) and provides simple and C++ idiomatic ways to iterate over and operate on a UTF-8 encoded buffer by code unit. I've no idea about DBCS but it supports to/from 16 and 32 bit Unicode code units.

    Sorry for random drive-by comment, I didn't want to pester the whole mailing list :)