From: Ulya F. <skv...@gm...> - 2013-12-13 15:21:58
On 12/13/2013 06:21 PM, Ulya Fokanova wrote:

Hi,

here's a patch that adds UTF-8 support to re2c. Please consider pulling this patch into HEAD. :)

My approach relies on the algorithm used by Ken Thompson, the inventor of UTF-8, in his 'grep'. Below is a detailed explanation of the algorithm and a description of the changes in the re2c code. If you feel this patch is not good enough, or you need some tests to prove that it works, or anything else, please contact me. As a regular user of re2c, I really miss UTF-8 support in it.

As I understand it, re2c already supports ASCII, EBCDIC, UTF-16 and UTF-32. All of these are treated as constant-byte-length encodings, so re2c assumes that a code point is a scalar value (1 byte for ASCII and EBCDIC, 2 bytes for UTF-16 and 4 bytes for UTF-32).

UTF-8 is a different case: code points have variable length (from 1 to 4 bytes). So the straightforward way to add UTF-8 support is inconvenient: it would complicate code generation, because re2c would have to operate on values of different sizes depending on the length of the current code point. A less obvious but much better way is to compile Unicode code points to sequences of bytes, so that re2c always operates on bytes. For example, the Unicode code point 0x430 (the Cyrillic letter 'а', which maps to the UTF-8 string "\xD0\xB0") would compile to the byte sequence 0xD0, 0xB0, just as if it had been manually encoded as a 2-byte string.

It's straightforward to compile UTF-8 strings to byte sequences. The main complication lies in Unicode ranges. The only simple case is the Unicode range that represents the ASCII symbols:

    [0x00 - 0x7F]

It is the same as a normal ASCII range: any byte from 0x00 to 0x7F matches, so it compiles to a plain byte range. For a wider range:

    [0x00 - 0x7FF]

things are quite different. The Unicode code point 0x7FF maps to the 2-byte UTF-8 string "\xDF\xBF".
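As an aside, the code-point-to-bytes mapping itself is easy to sketch. Here is a minimal encoder (my own illustration for this mail, not code from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode a Unicode code point as UTF-8. Returns the number of bytes
 * written to buf (1 to 4), or 0 for code points above 0x10FFFF. */
static size_t utf8_encode(uint32_t c, uint8_t *buf)
{
    if (c < 0x80) {
        buf[0] = (uint8_t)c;
        return 1;
    } else if (c < 0x800) {
        buf[0] = 0xC0 | (uint8_t)(c >> 6);
        buf[1] = 0x80 | (uint8_t)(c & 0x3F);
        return 2;
    } else if (c < 0x10000) {
        buf[0] = 0xE0 | (uint8_t)(c >> 12);
        buf[1] = 0x80 | (uint8_t)((c >> 6) & 0x3F);
        buf[2] = 0x80 | (uint8_t)(c & 0x3F);
        return 3;
    } else if (c <= 0x10FFFF) {
        buf[0] = 0xF0 | (uint8_t)(c >> 18);
        buf[1] = 0x80 | (uint8_t)((c >> 12) & 0x3F);
        buf[2] = 0x80 | (uint8_t)((c >> 6) & 0x3F);
        buf[3] = 0x80 | (uint8_t)(c & 0x3F);
        return 4;
    }
    return 0;
}
```

For example, utf8_encode(0x430, buf) yields exactly the 2 bytes 0xD0, 0xB0 mentioned above.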
In order to compile this range to a DFA that operates on bytes, we need to split it into two sub-ranges. The first sub-range contains all values from the initial range that are represented with 1 byte in UTF-8. The second sub-range contains all values that map to 2-byte UTF-8 strings:

    [0x00 - 0x7F], [0x80 - 0x7FF]

The first is a simple byte range. The second contains all UTF-8 strings (taken in lexicographic order) from "\xC2\x80" (the UTF-8 string for code point 0x80) to "\xDF\xBF" (the UTF-8 string for code point 0x7FF). It can be represented as a concatenation of two byte ranges: [0xC2 - 0xDF], [0x80 - 0xBF].

So the idea is to split a Unicode range into sub-ranges, each of which maps to UTF-8 strings of the same length, and to represent each sub-range as a concatenation of byte ranges. This approach, however, is still slightly incorrect. The problem is that some of the sub-ranges contain byte sequences that are not valid UTF-8 strings, because the UTF-8 encoding is not continuous: it consists of a number of continuous ranges.

So in fact we need to split the initial range into continuous sub-ranges. These sub-ranges are:

    [0x00 - 0x7F]
    [0x80 - 0x7FF]
    [0x800 - 0xFFF]
    [0x1000 - 0xFFFF]
    [0x10000 - 0x3FFFF]
    [0x40000 - 0xFFFFF]
    [0x100000 - 0x10FFFF]

Then each sub-range is represented as a concatenation of byte ranges. Finally, the initial range is an alternation of these concatenations.
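The splitting step can be sketched as follows (again my own illustration, not the patch itself; the boundary values are the lower bounds of the continuous sub-ranges listed above):

```c
#include <stdint.h>
#include <stddef.h>

/* Lower bounds of the continuous UTF-8 sub-ranges, plus the end
 * sentinel 0x110000 (one past the last Unicode code point). */
static const uint32_t utf8_bounds[] = {
    0x80, 0x800, 0x1000, 0x10000, 0x40000, 0x100000, 0x110000
};

/* Split [lo, hi] at the continuity boundaries. Writes (lo, hi) pairs
 * into out (at most 7 of them) and returns the number of sub-ranges. */
static size_t split_range(uint32_t lo, uint32_t hi, uint32_t out[][2])
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof(utf8_bounds) / sizeof(utf8_bounds[0]); ++i) {
        if (lo > hi)
            break;
        if (lo < utf8_bounds[i]) {
            out[n][0] = lo;
            /* clip the sub-range at the boundary */
            out[n][1] = hi < utf8_bounds[i] - 1 ? hi : utf8_bounds[i] - 1;
            lo = utf8_bounds[i];
            ++n;
        }
    }
    return n;
}
```

split_range(0x00, 0x7FF, out) yields the two sub-ranges [0x00 - 0x7F] and [0x80 - 0x7FF] from the example above, and the full range [0x00 - 0x10FFFF] yields all seven.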
For example, the whole Unicode range [0x00 - 0x10FFFF] (the meaning of [^] in UTF-8 mode) compiles to:

    Alt
    (
        Cat ([0x00 - 0x7F])
        Cat ([0xC2 - 0xDF], [0x80 - 0xBF])
        Cat ([0xE0], [0xA0 - 0xBF], [0x80 - 0xBF])
        Cat ([0xE1 - 0xEF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF0], [0x90 - 0xBF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF1 - 0xF3], [0x80 - 0xBF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF4], [0x80 - 0x8F], [0x80 - 0xBF], [0x80 - 0xBF])
    )

It looks rather scary, but notice the common suffix [0x80 - 0xBF]. If we factor it out, we get a 2x smaller range:

    Alt
    (
        Cat ([0x00 - 0x7F])
        Cat
        (
            Alt
            (
                [0xC2 - 0xDF]
                Cat ([0xE0], [0xA0 - 0xBF])
                Cat
                (
                    Alt
                    (
                        [0xE1 - 0xEF]
                        Cat ([0xF0], [0x90 - 0xBF])
                        Cat ([0xF1 - 0xF3], [0x80 - 0xBF])
                        Cat ([0xF4], [0x80 - 0x8F])
                    )
                    [0x80 - 0xBF]
                )
            )
            [0x80 - 0xBF]
        )
    )

So this is the final algorithm:
1) Split the initial range into continuous sub-ranges.
2) Factor out common suffixes.

If we don't factor out common suffixes (which is an optional space optimisation), then the implementation in re2c is very simple. We only need to:
1) add a new command-line option (I called it "-z" or "--utf-8");
2) change the functions that compile byte strings and byte ranges to re2c::RegExp's;
and it works.

If we want to factor out common suffixes (which greatly reduces the size of the generated scanners), then we have to interfere with the way re2c compiles the RegExp graph to an instruction sequence. This is necessary because current re2c traverses the RegExp graph like a tree: it doesn't check whether the current RegExp node was visited before. Such an approach is unacceptable if we don't want to duplicate code for common suffixes: if nodes 'r1' and 'r2' both have references to the same node 'r3', the instructions for 'r3' must be generated only once, even though 'r3' will be traversed (at least) twice.
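The compile-once idea can be sketched like this (names and types are illustrative, not re2c's actual ones): each node caches a pointer to its compiled instructions, so a shared suffix such as 'r3' emits code only on the first visit.

```c
#include <stddef.h>

/* A node in the RegExp DAG (illustrative, not re2c's actual type). */
struct node {
    struct node *left, *right;  /* children; may be NULL */
    const int *ins;             /* NULL until this node is compiled */
    int ins_len;                /* length of the compiled sequence */
};

static int blocks_emitted;  /* counts instruction blocks actually generated */

/* Compile a node; a node that was compiled before returns its cached
 * instructions instead of emitting duplicate code. */
static const int *compile_node(struct node *n)
{
    static const int stub[1] = {0};  /* stands in for real instructions */
    if (n->ins != NULL)
        return n->ins;               /* shared suffix: reuse, don't re-emit */
    if (n->left != NULL)
        compile_node(n->left);
    if (n->right != NULL)
        compile_node(n->right);
    ++blocks_emitted;                /* emit this node's instructions once */
    n->ins = stub;
    n->ins_len = 1;
    return n->ins;
}
```

With 'r1' and 'r2' both pointing at 'r3' and an alternation on top, compiling the root emits 4 instruction blocks rather than 5: 'r3' is traversed twice but compiled once.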
So I had to change this aspect a bit: I added a pointer to instructions to every RegExp, which is NULL if the RegExp hasn't been compiled yet, and otherwise points to the compiled instructions. I also had to track the length of the compiled sequence, because it may differ from the size of the RegExp.

There are some other minor changes in this patch, which are cosmetic.

Ulya.