Scintilla / Bugs / #1483 StyleContext::GetRelative() should use character offset, not use byte offset

Kein-Hong Man - 2013-06-02

I'll wait for Neil to weigh in, then I'll dig in and see what bits of the forward and backward scanning need to be changed. Backward scanning that involve checking styles may still work.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2013-06-02

Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.

Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems. Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.

There are a few idioms that use GetRelative and it may be better to directly support these idioms. One is to seek for a string or set of strings and it would be possible to add a search method that took an encoded string (byte string in the document's encoding) and found the byte offset of this text. Another idiom is Match(byte string) and then examine the character after that byte string. Some byte-oriented calls like ForwardBytes may also be useful for lexers that want to continue working in bytes.

The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Colomban Wendling - 2013-06-02
  
  Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.
  
  Yeah, seeking over non-constant-sized bytes has a cost indeed. However, my point here is that (almost) everything in StyleContext is now character-based, but GetRelative() is byte based, which is inconsistent -- and did lead to breakage.
  
  BTW, looking at LexTxt2tags makes me guess that this lexer could probably suffer of the same problem. BTW, this lexer seems to overuse GetRelative() a bit -- but OK, I don't know Txt2Tags and won't update the lexer.
  
  Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems.
  
  Indeed, that's the reason I pointed out it might not be trivial to fix.
  
  Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.
  
  Would it be that expensive? OK, it would require storing a tuple array for position/character (or maybe character is enough?), but Forward() would benefit from it and also fill it, so the added cost would roughly only be maintaining the array. I mean, if you call GetRelative(3) and then Forward(3), the GetRelative() call would fill the cache and the Forward() call would only be a cache fetch. Same for backwards, the cache would already be filled by previous Forward()s.
  
  It would probably be a bit of work to implement this, but I doubt it would have a important cost in performances. Maybe to avoid a huge cache, it could default to a small size and a caller could ask for more if it uses large backward/forward searches (like StyleContext::SetCacheSize() or something). Maybe a circular buffer could also avoid most memory movements.
  
  There are a few idioms that use GetRelative and it may be better to directly support these idioms. [...]
  
  Indeed. It might even render caller code simpler and would probably allow to avoid a lot of encoding issues, since generally what callers are searching for is plain ASCII. However again, some caller use the search to get a position later passed to Forward(), and then needs a character offset -- which comes back to the character issue.
  
  The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.
  
  Ah sorry, maybe my SciTE is configured to load UTF-8 or something (and yes, Geany always loads UTF-8). But try any MB encoding and you'll see the string styles breaks.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Neil Hodgson - 2013-06-10
    
    A circular buffer sounds reasonable. It should be implemented in a way that performs well for a wide set of situations from holding a couple of lines in a normal file with decent handling of the inevitable Megabyte line cases.
    
    If no one else starts work, I'll probably look at it some time. The code could be divided into a generic circular buffer class and the adaptation of that into StyleContext or a subclass of StyleContext.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2013-06-02

One way to avoid quadratic costs in GetRelative (Character) would be to cache the previous (relative character -> byte offset) access and start from there when the new request is same or later.

Could also store the current position active for last GetRelative so that the cached position is invalidated correctly without adding to the cost of GetNextChar.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Colomban Wendling - 2013-06-02
  
  Indeed, this could be a solution that would avoid maintaining a complete cache. Its usefulness depends on how the callers use GetRelative(), but it looks kinda sensible (but for backward searches, where it wouldn't help).
  
  But instead of maintaining the old GetRelative() position I'd probably rather update the cached offset in GetNextChar() -- it is quite cheap and probably would make it possible not to invalidate the cache without adding more complex code. But that's a detail anyway.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Kein-Hong Man - 2013-06-02

Okay, I'll attack the thing during the next week.
I have converted the sample to utf-8 and I can see the breakage. It also breaks some of my old utf-8 test cases.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2013-06-07

Attached is a patch that fixes a small part of this. A ForwardBytes method is added to StyleContext and this is used inside the lexer where bytes have been counted. Another change in the lexer retrieves the number of bytes moved by Forward to keep things in sync.

The BackwardChar method in the original patch does not handle DBCS. It is difficult to move backwards through DBCS since trail byte values can also be lead byte values. See Document::NextPosition.

Edited to be explicit about which patch doesn't do DBCS.

Last edit: Neil Hodgson 2013-06-07

PerlPartFix.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2013-06-08

labels: --> scintilla, lexer

assigned_to: Neil Hodgson
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2013-06-08

Attached is an attempt to classify the uses of GetRelative, into direction (forwards or backwards) and whether the use is in a loop with a variable position or a simple constant.

Since moving backwards is more difficult, particularly for DBCS, it would be better to avoid it. Most of the uses appear safe with GetRelative using byte offsets. When GetRelative is used to extract strings, its safer to leave these as byte strings, especially if they are to be compared, than it is to convert to 32bit-character strings.

The constant offsets to GetRelative range from -4 to +3. Some negative cases look for line ends so can't be satisfied by remembering a small buffer of recently current characters.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

StyleContext::GetRelative() should use character offset, not use byte offset

Group

Searches

Help

#1483 StyleContext::GetRelative() should use character offset, not use byte offset

Related

Discussion