[Flex-devel] Unicode food for thought
From: Peter M. <pet...@gm...> - 2012-07-20 04:32:55
Unicode has just released the newest version of UTS-18, which deals with regular expressions in Unicode:
http://www.unicode.org/reports/tr18/tr18-15.html

I think that, as far as our implementation of Unicode handling is concerned, we should decide which sections of Level 1 matter most to us and aim for those. Some of the Level 2 features would be nice down the road, but I don't think we'd want to go for full support of Level 2 or anything beyond it.

The elements of Level 1 that stick out the most to me are:

"Some caseless matches may match one character against two: for example, U+00DF "ß" matches the two characters "SS". And case matching may vary by locale. However, because many implementations are not set up to handle this, at Level 1 only simple case matches are necessary. To correctly implement a caseless match, see *Chapter 3, Conformance* of [Unicode <http://www.unicode.org/reports/tr18/tr18-15.html#Unicode>]. The data file supporting caseless matching is [CaseData <http://www.unicode.org/reports/tr18/tr18-15.html#CaseData>]."

The definition of a newline character is:

"\u{A} | \u{B} | \u{C} | \u{D} | \u{85} | \u{2028} | \u{2029} | \u{D A}"

"It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])"

"RL1.7 <http://www.unicode.org/reports/tr18/tr18-15.html#RL1.7> Supplementary Code Points

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.
*Note:* It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings. See Unicode String <http://www.unicode.org/glossary/#unicode_string> in the Unicode glossary."
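
To make the difference between simple and full caseless matching in the first quote concrete, here is a rough Python sketch. It is only an illustration, not how flex does or should implement it: the function names are made up, and per-character lower() is used as a stand-in for the simple case mappings in the Unicode data files, which it only approximates.

```python
def simple_caseless_equal(a, b):
    """One-to-one caseless comparison (approximation of Level 1 behaviour).

    Each character is compared against exactly one character, so the
    lengths must agree and "ß" can never match "SS".
    Assumption: str.lower() on a single character stands in for the
    simple case mappings; it is not identical to them for every code point.
    """
    if len(a) != len(b):
        return False
    return all(x == y or x.lower() == y.lower() for x, y in zip(a, b))


def full_caseless_equal(a, b):
    """Caseless comparison using full case folding (str.casefold).

    Full folding may expand one character into several, e.g. "ß" -> "ss".
    """
    return a.casefold() == b.casefold()
```

With these definitions, "Flex" and "FLEX" compare equal under both functions, while "ß" and "SS" compare equal only under full_caseless_equal. Locale-dependent matching is not covered by either function.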
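
The newline definition and the "\R" recommendation quoted above can be sketched without lookahead by simply trying the two-character sequence first. This is a minimal Python illustration under that assumption; match_line_ending and split_lines are hypothetical names, not anything in flex.

```python
# Single code points listed as line endings in the quoted definition:
# U+000A through U+000D, U+0085, U+2028 and U+2029.
LINE_ENDINGS = frozenset("\u000A\u000B\u000C\u000D\u0085\u2028\u2029")


def match_line_ending(text, pos):
    """Return the length of the line ending starting at text[pos], or 0."""
    # Try CR LF first so the pair is consumed as one unit, never as two.
    if text.startswith("\r\n", pos):
        return 2
    if pos < len(text) and text[pos] in LINE_ENDINGS:
        return 1
    return 0


def split_lines(text):
    """Split text on every line ending recognised by match_line_ending."""
    lines, start, pos = [], 0, 0
    while pos < len(text):
        n = match_line_ending(text, pos)
        if n:
            lines.append(text[start:pos])
            pos += n
            start = pos
        else:
            pos += 1
    lines.append(text[start:])  # trailing piece after the last line ending
    return lines
```

For example, split_lines("a\r\nb\u2028c\rd") gives ['a', 'b', 'c', 'd']: the CR LF pair counts as one ending, and the lone CR still counts as an ending on its own.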
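
For the RL1.7 requirement quoted above, the core operation is combining a leading and a trailing surrogate into one code point before matching. A small Python sketch of that step follows; it assumes the input is already available as a list of 16-bit code units, and decode_utf16_units is a hypothetical helper name. Isolated surrogates are passed through unchanged, which the note permits but does not require.

```python
def decode_utf16_units(units):
    """Convert a list of 16-bit UTF-16 code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Leading + trailing surrogate: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code point, or an isolated surrogate passed through as-is.
            out.append(u)
            i += 1
    return out
```

For example, decode_utf16_units([0xD83D, 0xDE00]) gives [0x1F600], so a matcher working on the decoded list sees a single code point rather than two surrogates.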