From: Aaron S. <aa...@se...> - 2012-03-13 00:16:23
On Mon, Mar 12, 2012 at 4:46 PM, Peter Martini <pet...@gm...> wrote:
> A few things to keep in mind for flex to support Unicode.
>
> Broadly, Unicode support can be split into three parts - parsing bytes
> into code points, assigning one (or more!) of those code points to
> characters, and assigning properties to those characters.
>
> The first part is much simpler: Unicode is generally encoded in either
> UTF-8 (a variable width encoding scheme, optimized for backwards
> compatibility with ASCII, and to a lesser extent, Latin-1) or
> UTF-16/UTF-16LE/UTF-16BE (most notably Windows and Mac). UTF-16
> includes a BOM (0xFEFF) at the start of the text to allow the parser
> to infer whether the text was written with little-endian or big-endian
> tools; the UTF-16BE and UTF-16LE variants, as their names imply, are
> specifically not supposed to have the BOM since the name of the
> variant identifies which encoding to use. SPARC and PowerPC are very
> common big-endian server architectures. It's worth noting that Mac OS
> X made the transition from a big-endian to a little-endian platform,
> and does quite a lot to hide those details from the programmer, but
> flex would be operating at a level where that could be significant. I
> don't recall their file encoding, and don't have my Mac OS X / PowerPC
> machine handy to test.
>
> Supporting any one encoding isn't too difficult; we've just seen the
> work to change from a single byte to a double byte encoding on this
> list, and I've done some work separately
> (https://github.com/PeterMartini/flex) to support UTF-8. Even
> supporting a compiler flag is pretty straightforward. Supporting an
> option in the lexer though could get a little hairy; do we want to
> support transitioning from one encoding to another?
>
> There's also the issue of what to do about the BOM, which I was able
> to side-step in my UTF-8 work, since as far as UTF-8 is concerned, it's
> a noncharacter.

I'd like to get this part handled for sure.
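To make the BOM question above concrete, here is a minimal sketch of sniffing the first bytes of an input buffer. The names (bom_kind, sniff_bom) are hypothetical and not part of flex, and the sketch assumes only UTF-8 and UTF-16 matter, so it does not try to tell a UTF-16LE BOM apart from a UTF-32LE one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; nothing here exists in flex. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/*
 * Look at the start of buf and report which byte order mark, if any,
 * is present. *skip receives the number of bytes the BOM occupies, so
 * the caller can step over it before scanning.
 * Assumption: UTF-32 input is out of scope, so FF FE is always treated
 * as UTF-16LE.
 */
static enum bom_kind sniff_bom(const unsigned char *buf, size_t len,
                               size_t *skip)
{
    assert(buf != NULL || len == 0);
    assert(skip != NULL);

    *skip = 0;
    /* U+FEFF encoded in UTF-8 is the three bytes EF BB BF. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3;
        return BOM_UTF8;
    }
    /* U+FEFF stored little-endian: low byte first. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *skip = 2;
        return BOM_UTF16LE;
    }
    /* U+FEFF stored big-endian: high byte first. */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *skip = 2;
        return BOM_UTF16BE;
    }
    return BOM_NONE;
}
```

A scanner that was told the encoding explicitly (UTF-16LE or UTF-16BE) would skip this step entirely; the sniffing only applies to plain "UTF-16" or unlabeled input.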
At this point, I think the state of Unicode has settled down a bit,
with the major winners being UTF-8 and UTF-16. The UCS encodings are
mercifully dead. (If anybody on the list knows of other encodings with
an important constituency, please speak up.)

The wchar_t patch posted earlier is probably not an ideal approach.
Rather than hoping for a system-provided 16-bit wchar_t, I think flex
should define its own 16-bit type. That way, you know you're working
with 16 bits.

http://icu-project.org/docs/papers/unicode_wchar_t.html

> So, that's part 1, parsing text into codepoints (with the additional
> complication that in UTF-16, a single codepoint above U+FFFF must be
> encoded as a pair of surrogates). What I'm calling part 2 is combining
> character sequences into graphemes. A grapheme is one or more
> codepoints visually represented as one unit on your screen / page. The
> canonical example of this, and one that shows where it can get
> complex, is á - it can be stored as either U+00E1 (a-acute) or the two
> codepoints U+0061,U+0301 (a followed by combining acute). It's up to
> the application to determine whether the two are considered
> equivalent; something which flex could legitimately leave to the
> application developer, but would be a useful thing to have.

IIRC, the rule tables are fairly sizable, and subject to change. I'd
prefer to punt on this. Recommending ICU seems to be the way to go:

http://icu-project.org/

> Finally, part 3, applying Unicode properties. This is the moving
> target that makes which version of the Unicode standard an application
> supports relevant. The simplest properties are character names - you
> could reference WHITE SMILING FACE instead of U+263A and mean the same
> thing. Case sensitivity is actually a fairly complicated property;
> one of the canonical examples here is the German Eszett, ß, which is
> equivalent to ss when matched case insensitively.
> While flex could get away with not supporting many properties,
> handling case insensitivity in some manner should be addressed.

Almost certainly has to be external libraries here; however, flex does
have the "-i" flag for case-insensitive scanning. I think it'd be
reasonable to say that applies only to ASCII, and recommend that
applications perform their own case matching with a Unicode library in
the future.

> Anyway, this is just a brain dump, please feel free to pick at the
> details or ask questions; I'm hardly an expert.
>
> Regards,
> Peter Martini
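For what it's worth, the "ASCII only" reading of "-i" suggested above could look roughly like the sketch below. The function names are made up for illustration and are not flex internals; the point is only that bytes outside A-Z are compared exactly, so multi-byte UTF-8 sequences are never folded.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex: fold A-Z to a-z and leave
 * every other byte value (including UTF-8 lead and continuation
 * bytes) untouched.
 */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/*
 * Compare two byte strings of equal length, ignoring case for ASCII
 * letters only. Returns 1 on a match, 0 otherwise.
 */
static int ascii_caseless_equal(const unsigned char *a,
                                const unsigned char *b, size_t len)
{
    assert((a != NULL && b != NULL) || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (ascii_fold(a[i]) != ascii_fold(b[i]))
            return 0;
    }
    return 1;
}
```

Under this rule "FLEX" matches "flex", but the UTF-8 bytes for ß do not match "ss", and É does not match é. Anything beyond that would be left to the application and a Unicode library such as ICU.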