[Flex-devel] Unicode food for thought

The Unicode Consortium has just released the newest version of UTS #18,
which deals with regular expressions in Unicode.

http://www.unicode.org/reports/tr18/tr18-15.html

As far as our implementation of Unicode handling is concerned, I think we
should decide which sections of Level 1 matter most to us and aim for
those. Some of the Level 2 features would be nice down the road, but I
don't think we'd want to go for full Level 2 support or anything beyond it.

The elements of Level 1 that stick out the most to me are:

"Some caseless matches may match one character against two: for example,
U+00DF "ß" matches the two characters "SS". And case matching may vary by
locale. However, because many implementations are not set up to handle
this, at Level 1 only simple case matches are necessary. To correctly
implement a caseless match, see Chapter 3, Conformance of [Unicode]
<http://www.unicode.org/reports/tr18/tr18-15.html#Unicode>. The data file
supporting caseless matching is [CaseData]
<http://www.unicode.org/reports/tr18/tr18-15.html#CaseData>."

The definition of a newline character is:
"\u{A} | \u{B} | \u{C} | \u{D} | \u{85} | \u{2028} | \u{2029} | \u{D A}"

"It is strongly recommended that there be a regular expression
meta-character, such as "\R", for matching all line ending characters and
sequences listed above (for example, in #1). This would correspond to
something equivalent to the following expression. That expression is
slightly complicated by the need to avoid backup.

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])"
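As a quick way to experiment with that expression, here is a sketch using Python's re module. Python does not accept the \u{...} notation, so the pattern below is my own transcription into \r, \n, \x85 and \uXXXX escapes, with the character class closed by "]". The names LINE_ENDING and split_lines are only for illustration.

```python
import re

# Transcription of the UTS #18 "\R" equivalent: CRLF is tried first as a
# unit, then any single line-ending character from U+000A through U+000D,
# U+0085, U+2028 or U+2029.
LINE_ENDING = re.compile(r"(?:\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029])")


def split_lines(text: str) -> list:
    """Split text at every line ending matched by LINE_ENDING."""
    return LINE_ENDING.split(text)
```

For example, split_lines("a\r\nb\nc") treats the CRLF pair as one separator rather than two, so no empty string appears between "a" and "b".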

"RL1.7 Supplementary Code Points
<http://www.unicode.org/reports/tr18/tr18-15.html#RL1.7>

To meet this requirement, an implementation shall handle the
full range of Unicode code points, including values from U+FFFF to
U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a
leading surrogate followed by a trailing surrogate shall be handled as a
single code point in matching.

Note: It is permissible, but not required, to match an isolated surrogate
code point (such as \u{D800}), which may occur in Unicode Strings. See
Unicode String <http://www.unicode.org/glossary/#unicode_string> in the
Unicode glossary."
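To illustrate what RL1.7 asks for when the input is UTF-16, here is a small Python sketch that combines a leading and trailing surrogate into one code point before matching. The function name and the choice to pass isolated surrogates through unchanged are assumptions for illustration (the note above says matching them is permitted but not required); none of this reflects existing flex code.

```python
def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points.

    A leading surrogate (D800..DBFF) followed by a trailing surrogate
    (DC00..DFFF) becomes one supplementary code point. Anything else,
    including an isolated surrogate, is passed through as is.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points
```

A scanner that works on code points rather than code units would run its tables over the output of a step like this, so that a pair such as D83D DE00 is matched as the single code point U+1F600.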
