From: Ulya F. <skv...@gm...> - 2013-12-13 15:21:58
On 12/13/2013 06:21 PM, Ulya Fokanova wrote:

Hi,

here's a patch that adds UTF-8 support to re2c. Please consider pulling this patch into HEAD. :)

My approach relies on the algorithm used by Ken Thompson, the inventor of UTF-8, in his 'grep'. Below is a detailed explanation of the algorithm and a description of the changes in the re2c code. If you feel this patch is not good enough, or you need some tests to prove that it works, or anything else, please contact me. As a regular user of re2c, I really miss UTF-8 support in it.

As I understand it, re2c already supports ASCII, EBCDIC, UTF-16 and UTF-32. All of these are treated as constant-byte-length encodings, so re2c assumes that a code point is a scalar value (1 byte for ASCII and EBCDIC, 2 bytes for UTF-16 and 4 bytes for UTF-32).

UTF-8 is a different case: code points have variable length (from 1 to 4 bytes). So the straightforward way to add UTF-8 support is inconvenient: it would complicate code generation, because re2c would have to operate on values of different sizes depending on the length of the current code point. A less obvious but much better way is to compile Unicode code points to sequences of bytes, so that re2c always operates on bytes. For example, the Unicode code point 0x430 (the Cyrillic letter 'а', which maps to the UTF-8 string "\xD0\xB0") would compile to the byte sequence 0xD0, 0xB0, just as if it had been manually encoded as a 2-byte string.

It's straightforward to compile UTF-8 strings to byte sequences. The main complication lies in Unicode ranges. The only simple case is the Unicode range that represents the ASCII symbols:

    [0x00 - 0x7F]

It is the same as a normal ASCII range: any byte from 0x00 to 0x7F matches, so it compiles to a plain byte range. For a wider range:

    [0x00 - 0x7FF]

things are quite different. The Unicode code point 0x7FF maps to the 2-byte UTF-8 string "\xDF\xBF".
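As an aside, the code-point-to-bytes mapping itself is easy to sketch. Here is a minimal encoder (my own illustration for this mail, not code from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode a Unicode code point as UTF-8. Returns the number of bytes
 * written to buf (1 to 4), or 0 for code points above 0x10FFFF. */
static size_t utf8_encode(uint32_t c, uint8_t *buf)
{
    if (c < 0x80) {
        buf[0] = (uint8_t)c;
        return 1;
    } else if (c < 0x800) {
        buf[0] = 0xC0 | (uint8_t)(c >> 6);
        buf[1] = 0x80 | (uint8_t)(c & 0x3F);
        return 2;
    } else if (c < 0x10000) {
        buf[0] = 0xE0 | (uint8_t)(c >> 12);
        buf[1] = 0x80 | (uint8_t)((c >> 6) & 0x3F);
        buf[2] = 0x80 | (uint8_t)(c & 0x3F);
        return 3;
    } else if (c <= 0x10FFFF) {
        buf[0] = 0xF0 | (uint8_t)(c >> 18);
        buf[1] = 0x80 | (uint8_t)((c >> 12) & 0x3F);
        buf[2] = 0x80 | (uint8_t)((c >> 6) & 0x3F);
        buf[3] = 0x80 | (uint8_t)(c & 0x3F);
        return 4;
    }
    return 0;
}
```

For example, utf8_encode(0x430, buf) yields exactly the 2 bytes 0xD0, 0xB0 mentioned above.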
In order to compile this range to a DFA that operates on bytes, we need to split it into two sub-ranges. The first sub-range contains all values from the initial range that are represented with 1 byte in UTF-8. The second sub-range contains all values that map to 2-byte UTF-8 strings:

    [0x00 - 0x7F], [0x80 - 0x7FF]

The first is a simple byte range. The second contains all UTF-8 strings (taken in lexicographic order) from "\xC2\x80" (the UTF-8 string for code point 0x80) to "\xDF\xBF" (the UTF-8 string for code point 0x7FF). It can be represented as a concatenation of two byte ranges: [0xC2 - 0xDF], [0x80 - 0xBF].

So the idea is to split a Unicode range into sub-ranges, each of which maps to UTF-8 strings of the same length, and to represent each sub-range as a concatenation of byte ranges. This approach, however, is still slightly incorrect. The problem is that some of the sub-ranges contain byte sequences that are not valid UTF-8 strings, because the UTF-8 encoding is not continuous: it consists of a number of continuous ranges.

So in fact we need to split the initial range into continuous sub-ranges. These sub-ranges are:

    [0x00 - 0x7F]
    [0x80 - 0x7FF]
    [0x800 - 0xFFF]
    [0x1000 - 0xFFFF]
    [0x10000 - 0x3FFFF]
    [0x40000 - 0xFFFFF]
    [0x100000 - 0x10FFFF]

Then each sub-range is represented as a concatenation of byte ranges. Finally, the initial range is an alternation of these concatenations.
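The splitting step can be sketched as follows (again my own illustration, not the patch itself; the boundary values are the lower bounds of the continuous sub-ranges listed above):

```c
#include <stdint.h>
#include <stddef.h>

/* Lower bounds of the continuous UTF-8 sub-ranges, plus the end
 * sentinel 0x110000 (one past the last Unicode code point). */
static const uint32_t utf8_bounds[] = {
    0x80, 0x800, 0x1000, 0x10000, 0x40000, 0x100000, 0x110000
};

/* Split [lo, hi] at the continuity boundaries. Writes (lo, hi) pairs
 * into out (at most 7 of them) and returns the number of sub-ranges. */
static size_t split_range(uint32_t lo, uint32_t hi, uint32_t out[][2])
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof(utf8_bounds) / sizeof(utf8_bounds[0]); ++i) {
        if (lo > hi)
            break;
        if (lo < utf8_bounds[i]) {
            out[n][0] = lo;
            /* clip the sub-range at the boundary */
            out[n][1] = hi < utf8_bounds[i] - 1 ? hi : utf8_bounds[i] - 1;
            lo = utf8_bounds[i];
            ++n;
        }
    }
    return n;
}
```

split_range(0x00, 0x7FF, out) yields the two sub-ranges [0x00 - 0x7F] and [0x80 - 0x7FF] from the example above, and the full range [0x00 - 0x10FFFF] yields all seven.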
For example, the whole Unicode range [0x00 - 0x10FFFF] (the meaning of [^] in UTF-8 mode) compiles to:

    Alt
    (
        Cat ([0x00 - 0x7F])
        Cat ([0xC2 - 0xDF], [0x80 - 0xBF])
        Cat ([0xE0], [0xA0 - 0xBF], [0x80 - 0xBF])
        Cat ([0xE1 - 0xEF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF0], [0x90 - 0xBF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF1 - 0xF3], [0x80 - 0xBF], [0x80 - 0xBF], [0x80 - 0xBF])
        Cat ([0xF4], [0x80 - 0x8F], [0x80 - 0xBF], [0x80 - 0xBF])
    )

It looks rather scary, but notice the common suffix [0x80 - 0xBF]. If we factor it out, we get a 2x smaller range:

    Alt
    (
        Cat ([0x00 - 0x7F])
        Cat
        (
            Alt
            (
                [0xC2 - 0xDF]
                Cat ([0xE0], [0xA0 - 0xBF])
                Cat
                (
                    Alt
                    (
                        [0xE1 - 0xEF]
                        Cat ([0xF0], [0x90 - 0xBF])
                        Cat ([0xF1 - 0xF3], [0x80 - 0xBF])
                        Cat ([0xF4], [0x80 - 0x8F])
                    )
                    [0x80 - 0xBF]
                )
            )
            [0x80 - 0xBF]
        )
    )

So this is the final algorithm:
1) Split the initial range into continuous sub-ranges.
2) Factor out common suffixes.

If we don't factor out common suffixes (which is an optional space optimisation), then the implementation in re2c is very simple. We only need to:
1) add a new command-line option (I called it "-z" or "--utf-8");
2) change the functions that compile byte strings and byte ranges to re2c::RegExp's;
and it works.

If we want to factor out common suffixes (which greatly reduces the size of the generated scanners), then we have to interfere with the way re2c compiles the RegExp graph to an instruction sequence. This is necessary because current re2c traverses the RegExp graph like a tree: it doesn't check whether the current RegExp node was visited before. Such an approach is unacceptable if we don't want to duplicate code for common suffixes: if nodes 'r1' and 'r2' both have references to the same node 'r3', the instructions for 'r3' must be generated only once, even though 'r3' will be traversed (at least) twice.
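The compile-once idea can be sketched like this (names and types are illustrative, not re2c's actual ones): each node caches a pointer to its compiled instructions, so a shared suffix such as 'r3' emits code only on the first visit.

```c
#include <stddef.h>

/* A node in the RegExp DAG (illustrative, not re2c's actual type). */
struct node {
    struct node *left, *right;  /* children; may be NULL */
    const int *ins;             /* NULL until this node is compiled */
    int ins_len;                /* length of the compiled sequence */
};

static int blocks_emitted;  /* counts instruction blocks actually generated */

/* Compile a node; a node that was compiled before returns its cached
 * instructions instead of emitting duplicate code. */
static const int *compile_node(struct node *n)
{
    static const int stub[1] = {0};  /* stands in for real instructions */
    if (n->ins != NULL)
        return n->ins;               /* shared suffix: reuse, don't re-emit */
    if (n->left != NULL)
        compile_node(n->left);
    if (n->right != NULL)
        compile_node(n->right);
    ++blocks_emitted;                /* emit this node's instructions once */
    n->ins = stub;
    n->ins_len = 1;
    return n->ins;
}
```

With 'r1' and 'r2' both pointing at 'r3' and an alternation on top, compiling the root emits 4 instruction blocks rather than 5: 'r3' is traversed twice but compiled once.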
So I had to change this aspect a bit: I added a pointer to instructions to every RegExp, which is NULL if the RegExp hasn't been compiled yet, and otherwise points to the compiled instructions. I also had to track the length of the compiled sequence, because it may differ from the size of the RegExp.

There are some other minor changes in this patch, which are cosmetic.

Ulya.