Re: [Flex-devel] Unicode support


On Mon, Mar 12, 2012 at 4:46 PM, Peter Martini <pet...@gm...> wrote:
> A few things to keep in mind for flex to support Unicode.
>
> Broadly, Unicode support can be split into three parts - parsing bytes
> into code points, assigning one (or more!) of those code points to
> characters, and assigning properties to those characters.
>
> The first part is much simpler: Unicode is generally encoded in either
> UTF-8 (a variable width encoding scheme, optimized for backwards
> compatibility with ASCII, and to a lesser extent, Latin-1) or
> UTF-16/UTF-16LE/UTF-16BE (most notably Windows and Mac).  UTF-16
> includes a BOM (0xFEFF) at the start of the text to allow the parser
> to infer whether the text was written with little-endian or big-endian
> tools; the UTF-16BE and UTF-16LE variants, as their names imply, are
> specifically not supposed to have the BOM since the name of the
> variant identifies which encoding to use.  SPARC and PowerPC are very
> common big-endian server architectures.  It's worth noting that Mac OS
> X made the transition from a big-endian to a little-endian platform,
> and does quite a lot to hide those details from the programmer, but
> flex would be operating at a level where that could be significant.  I
> don't recall their file encoding, and don't have my Mac OS X / PowerPC
> machine handy to test.
>
> Supporting any one encoding isn't too difficult; we've just seen the
> work to change from a single byte to a double byte encoding on this
> list, and I've done some work separately
> (https://github.com/PeterMartini/flex) to support UTF-8.  Even
> supporting a compiler flag is pretty straightforward.  Supporting an
> option in the lexer though could get a little hairy; do we want to
> support transitioning from one encoding to another?
>
> There's also the issue of what to do about the BOM, which I was able
> to side-step in my UTF-8 work, since as far as UTF-8 is concerned, it's
> a noncharacter.

I'd like to get this part handled for sure. At this point, I think the
state of Unicode has settled down a bit, with the major winners being
UTF-8 and UTF-16. The UCS encodings are mercifully dead. (If anybody on
the list knows of other encodings with an important constituency,
please speak up).
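
To make the BOM question concrete, here is a minimal sketch of sniffing
the first bytes of an input buffer. The names (bom_kind, detect_bom) are
hypothetical and not part of flex; it assumes only UTF-8 and UTF-16 matter
and does not try to tell a UTF-16LE BOM apart from a UTF-32LE one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of flex. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Report which byte order mark, if any, the buffer starts with.
 * *skip receives the number of bytes the BOM occupies (0 if none). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *skip)
{
    assert(buf != NULL || len == 0);
    assert(skip != NULL);

    *skip = 0;
    /* U+FEFF encoded in UTF-8 is the three bytes EF BB BF. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3;
        return BOM_UTF8;
    }
    /* U+FEFF stored big-endian: FE FF. */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *skip = 2;
        return BOM_UTF16_BE;
    }
    /* U+FEFF stored little-endian: FF FE. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *skip = 2;
        return BOM_UTF16_LE;
    }
    return BOM_NONE;
}
```

A scanner could call something like this once on the first buffer fill and
skip that many bytes before matching; whether it should also switch
encodings based on the result is the open question above.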

The wchar_t patch posted earlier is probably not an ideal approach.
Rather than hoping for system provided 16-bit wchar_t, I think flex
should define its own 16-bit type. That way, you know you're working
with 16 bits.

http://icu-project.org/docs/papers/unicode_wchar_t.html
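
As a rough sketch of what a flex-owned 16-bit type might look like, the
following uses uint16_t from <stdint.h> and decodes one code point from
UTF-16 code units. The names flex_u16 and utf16_decode are made up for
this example, the input is assumed to already be in host byte order, and
replacing an unpaired surrogate with U+FFFD is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: a code unit that is exactly 16 bits on every platform,
 * unlike wchar_t, whose width varies between systems. */
typedef uint16_t flex_u16;

/* Decode one code point from UTF-16 code units in host byte order.
 * Returns the number of units consumed (1 or 2), or 0 on empty input.
 * An unpaired surrogate yields U+FFFD and consumes one unit. */
size_t utf16_decode(const flex_u16 *in, size_t len, uint32_t *cp)
{
    assert(cp != NULL);
    if (len == 0)
        return 0;

    flex_u16 hi = in[0];
    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2) {
        flex_u16 lo = in[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            /* Each surrogate carries 10 bits of the value above U+FFFF. */
            *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
            return 2;
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF) {
        *cp = 0xFFFD;   /* lone surrogate: substitute REPLACEMENT CHARACTER */
        return 1;
    }
    *cp = hi;           /* ordinary Basic Multilingual Plane code unit */
    return 1;
}
```

With a fixed-width unit like this, the same code behaves identically
whether the platform's wchar_t is 16 or 32 bits.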

> So, that's part 1, parsing text into codepoints (with the additional
> complication that in UTF-16, a codepoint above U+FFFF must be encoded
> as a pair of surrogates). What I'm calling part 2 is combining character
> sequences into graphemes.  A grapheme is multiple codepoints visually
> represented as one unit on your screen / page.  The canonical example,
> and one that shows where it can get complex, is á - it can
> be stored as either U+00E1 (a-acute) or the two codepoints
> U+0061,U+0301 (a followed by combining acute).  It's up to the
> application to determine whether the two are considered equivalent;
> something which flex could legitimately leave to the application
> developer, but would be a useful thing to have.

IIRC, the rule tables are fairly sizable, and subject to change. I'd
prefer to punt on this.

Recommending ICU seems to be the way to go: http://icu-project.org/
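
For anyone wondering why a plain byte comparison isn't enough here, this
small sketch compares the two UTF-8 encodings of á: the precomposed
U+00E1 and the decomposed U+0061 U+0301. The helper same_bytes is a
made-up name, and the example deliberately does no normalization; that
is the job a library such as ICU would take on.

```c
#include <stddef.h>
#include <string.h>

/* UTF-8 bytes for U+00E1 (a with acute, precomposed). */
const unsigned char precomposed[] = {0xC3, 0xA1};
/* UTF-8 bytes for U+0061 U+0301 (a followed by combining acute). */
const unsigned char decomposed[] = {0x61, 0xCC, 0x81};

/* Hypothetical helper: byte-for-byte equality, which is all a scanner
 * sees unless something normalizes the text first. Returns 1 if equal. */
int same_bytes(const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Here same_bytes(precomposed, sizeof precomposed, decomposed, sizeof
decomposed) returns 0 even though both sequences render as the same
character, so a rule written against one form will not match the other.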

> Finally, part 3, applying Unicode properties.  This is the moving
> target that makes which version of the Unicode standard an application
> supports relevant.  The simplest properties are character names - you
> could reference WHITE SMILING FACE instead of U+263A and mean the same
> thing.  Case sensitivity is actually a fairly complicated property;
> one of the canonical examples here is the German Eszett, ß, which is
> equivalent to ss when matched case insensitively.  While flex could
> get away with not supporting many properties, handling case
> insensitivity in some manner should be addressed.

This almost certainly has to be left to external libraries; however, flex
does have the "-i" flag for case-insensitive scanning. I think it'd be
reasonable to say that applies only to ASCII, and recommend that
applications perform their own case matching with a Unicode library in
the future.
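
To illustrate what "applies only to ASCII" could mean in practice, here
is a sketch of an ASCII-only caseless comparison. The function names are
hypothetical and this is not how flex implements "-i"; it only shows that
bytes outside A-Z pass through untouched, so non-ASCII letters are
compared exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Fold only the 26 ASCII capitals; every other byte, including all
 * bytes of multi-byte UTF-8 sequences, is returned unchanged. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: returns 1 if the two NUL-terminated strings are
 * equal once ASCII letters are folded, 0 otherwise. */
int ascii_caseless_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

Under this rule "FLEX" and "flex" compare equal, while a UTF-8 "Straße"
and "STRASSE" do not; an application that needs the latter would run its
own Unicode case folding, as suggested above.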

> Anyway, this is just a brain dump, please feel free to pick at the
> details or ask questions; I'm hardly an expert.
>
> Regards,
> Peter Martini
>
