Now StyleContext
reports characters rather than bytes, its GetRelative()
method should follow:
'1' breaks at least the Perl lexer which uses GetRelative()
and the passes an offset found with it to Forward()
-- Forward()
uses characters, so it will forward too much. For example this code will not highlight as expected:
" ü " keyword
(see https://sourceforge.net/p/geany/bugs/962/)
Attached is an initial patch showing a possible fix. Not tested very well, and not so beautiful, but it seems to work fine.
On the implementation side though, there currently is a conflict between avoiding code duplication, avoiding possibly different invalid character handling and avoiding unnecessary data access. I mean, currently StyleContext
has code to get the next character, but it requires to know the full value of the character, so to read the whole set of bytes (1-4). When fast-forwarding to find the character at pos+n
, one could rely on some encoding properties to avoid fetching some data, e.g. the first byte of UTF-8 contains the information on how many sub-bytes it contains, so one could do:
if ((ch_ & 0xf8) == 0xf0) // 0b11110xxx pos += 4; else if ((ch_ & 0xf0) == 0xe0) // 0b1110xxxx pos += 3; else if ((ch_ & 0xe0) == 0xc0) // 0b110xxxxx pos += 2; else pos++;
This only requires the first byte for UTF-8, and it seems like there already is some code to determine if a DBCS byte is a leading one. However, this may give different results on invalid data than what StyleContext::GetNextChar() does, so it might not be wanted -- and we need to get the last character anyway, so for GetRelative(1)
it would not change anything.
Also, going backwards is a little tricky so it probably won't give the exact same result than the current GetNextChar()
on invalid data anyway (since to go backward we pretty much have no other solution than seek backward until we find the previous lead byte). I mean:
int c = sc.ch; sc.Forward(); assert(c == sc.GetRelative(-1));
might perhaps fail on invalid input.
I'll wait for Neil to weigh in, then I'll dig in and see what bits of the forward and backward scanning need to be changed. Backward scanning that involve checking styles may still work.
Building a Unicode character stream abstraction over a UTF-8 representation can be expensive or complex. The modified GetRelative now loops, causing code such as "while (i < len && (sc.GetRelative(++i) != ']'" from LexTxt2tags to rise in cost to be quadratic. It may be better to allow lexers to opt-in to this expense by using a different name for the character-oriented GetRelative.
Scintilla allows invalid UTF-8 and having different calls see the character stream in different ways could lead to more problems. Caching the UTF-32 representation of portions of the document may be possible but is also likely to require retaining byte position information (to allow styling the bytes up to a character discovered through relative access) and may be expensive so require opt-in.
There are a few idioms that use GetRelative and it may be better to directly support these idioms. One is to seek for a string or set of strings and it would be possible to add a search method that took an encoded string (byte string in the document's encoding) and found the byte offset of this text. Another idiom is Match(byte string) and then examine the character after that byte string. Some byte-oriented calls like ForwardBytes may also be useful for lexers that want to continue working in bytes.
The example file from the Geany issue is Latin-1, not UTF-8. Its possible that Geany has converted the encoding to UTF-8.
Yeah, seeking over non-constant-sized bytes has a cost indeed. However, my point here is that (almost) everything in
StyleContext
is now character-based, butGetRelative()
is byte based, which is inconsistent -- and did lead to breakage.BTW, looking at LexTxt2tags makes me guess that this lexer could probably suffer of the same problem. BTW, this lexer seems to overuse
GetRelative()
a bit -- but OK, I don't know Txt2Tags and won't update the lexer.Indeed, that's the reason I pointed out it might not be trivial to fix.
Would it be that expensive? OK, it would require storing a tuple array for position/character (or maybe character is enough?), but
Forward()
would benefit from it and also fill it, so the added cost would roughly only be maintaining the array. I mean, if you callGetRelative(3)
and thenForward(3)
, theGetRelative()
call would fill the cache and theForward()
call would only be a cache fetch. Same for backwards, the cache would already be filled by previousForward()
s.It would probably be a bit of work to implement this, but I doubt it would have a important cost in performances. Maybe to avoid a huge cache, it could default to a small size and a caller could ask for more if it uses large backward/forward searches (like StyleContext::SetCacheSize() or something). Maybe a circular buffer could also avoid most memory movements.
Indeed. It might even render caller code simpler and would probably allow to avoid a lot of encoding issues, since generally what callers are searching for is plain ASCII. However again, some caller use the search to get a position later passed to
Forward()
, and then needs a character offset -- which comes back to the character issue.Ah sorry, maybe my SciTE is configured to load UTF-8 or something (and yes, Geany always loads UTF-8). But try any MB encoding and you'll see the string styles breaks.
A circular buffer sounds reasonable. It should be implemented in a way that performs well for a wide set of situations from holding a couple of lines in a normal file with decent handling of the inevitable Megabyte line cases.
If no one else starts work, I'll probably look at it some time. The code could be divided into a generic circular buffer class and the adaptation of that into StyleContext or a subclass of StyleContext.
One way to avoid quadratic costs in GetRelative (Character) would be to cache the previous (relative character -> byte offset) access and start from there when the new request is same or later.
Could also store the current position active for last GetRelative so that the cached position is invalidated correctly without adding to the cost of GetNextChar.
Indeed, this could be a solution that would avoid maintaining a complete cache. Its usefulness depends on how the callers use
GetRelative()
, but it looks kinda sensible (but for backward searches, where it wouldn't help).But instead of maintaining the old
GetRelative()
position I'd probably rather update the cached offset inGetNextChar()
-- it is quite cheap and probably would make it possible not to invalidate the cache without adding more complex code. But that's a detail anyway.Okay, I'll attack the thing during the next week.
I have converted the sample to utf-8 and I can see the breakage. It also breaks some of my old utf-8 test cases.
Attached is a patch that fixes a small part of this. A ForwardBytes method is added to StyleContext and this is used inside the lexer where bytes have been counted. Another change in the lexer retrieves the number of bytes moved by Forward to keep things in sync.
The BackwardChar method in the original patch does not handle DBCS. It is difficult to move backwards through DBCS since trail byte values can also be lead byte values. See Document::NextPosition.
Edited to be explicit about which patch doesn't do DBCS.
Last edit: Neil Hodgson 2013-06-07
Attached is an attempt to classify the uses of GetRelative, into direction (forwards or backwards) and whether the use is in a loop with a variable position or a simple constant.
Since moving backwards is more difficult, particularly for DBCS, it would be better to avoid it. Most of the uses appear safe with GetRelative using byte offsets. When GetRelative is used to extract strings, its safer to leave these as byte strings, especially if they are to be compared, than it is to convert to 32bit-character strings.
The constant offsets to GetRelative range from -4 to +3. Some negative cases look for line ends so can't be satisfied by remembering a small buffer of recently current characters.