Re: [Flex-devel] %option utf16 utf16le & utf16be
From: Peter M. <pet...@gm...> - 2012-07-10 20:00:00
Do you have a git repository (or something) where I can follow along with the changes?

By the way, thanks for setting the options to --utf16le and --utf16be; is there anything we can do about -U though? I do a lot of web and UNIX work, so my default frame of mind for Unicode is UTF-8.

Going to the official Unicode FAQ at http://unicode.org/faq/utf_bom.html, I think for --utf16 it should only ever stay in utf16 mode for a single character. If that character is a BOM, switch to the appropriate BE or LE; otherwise switch to the native encoding (although technically plain UTF-16 is specified as BE, I kind of think most UTF-16 processing will be done on Wintel, and native encoding makes more sense for us).

Is the byte swap check done at program startup time or compile time? It may be a bit uglier to generate, but for performance, if we do the check at compile time and use macros for the byte swapping functions, there will be no penalty at all when running the code if it's using native endianness, and we can muck around with using htons etc. on platforms where it's available (if that makes it faster, which I don't know).

I have a branch on GitHub with my UTF-8 work, which is a little stale (from 2011), and I'd like to try to merge our efforts. The pattern classes are compiled as a list of ranges in the first phase, and then translated to a byte sequence in the second phase, so the things I was working on with pattern matching (particularly having '.' exclude the surrogates but still match characters above 0xFFFF) would automatically work with double byte. An important test is to try to match "0xD800 0xDC00": "." should match, and ".." should not.

When I get some time, I want to add an optional dependency on libicu, which we could use to turn named characters, properties, and classes into character ranges for use in character classes.
After all, just ask any Perl guy who's worked with Unicode: '\w' and '\d' have a severely different meaning when Unicode is in effect ...

On Tue, Jul 10, 2012 at 3:36 PM, Paul <pa...@pr...> wrote:
> The flex unicode version has now been changed so that:
> %option utf16 generates a scanner that accepts the native utf of the
> machine.
> %option utf16le generates a scanner that accepts UTF-16LE regardless of
> the machine byte order.
> %option utf16be generates a scanner that accepts UTF-16BE regardless of
> the machine byte order.
>
> This means that when utf16le or utf16be is an option the scanner tests
> the byte order of the machine. This code is not generated otherwise or
> for utf16. The test is done at startup, but checked at each read to
> decide to swap bytes. The byte swap is also only generated with utf16le
> or utf16be. Thus a Flex scanner.c may be moved between machines with
> hopefully least astonishment.
>
> There are separate tests for C, C++, reentrant and non-reentrant
> scanners, with options utf16le & utf16be.
> There are now 105 tests all of which pass.
>
> The flag -U or --utf is the same as %option utf16.
> The flag --utf16le is the same as %option utf16le.
> The flag --utf16be is the same as %option utf16be.
>
> Suggestions?
>
> Paul Neelands
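
To make the BOM rule, the compile-time byte-order check, and the surrogate test from my message above a bit more concrete, here is a rough C sketch. It is only an illustration and not code from flex or from either branch: every name in it (utf16_order, HOST_UTF16_ORDER, SWAP16, TO_HOST16, sniff_bom, decode_utf16) is made up for this example, and the compile-time test assumes the GCC/Clang predefined macro __BYTE_ORDER__; other compilers would need their own check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of them come from flex. */
enum utf16_order { UTF16_LE, UTF16_BE };

/* Host byte order decided at compile time.
 * Assumption: __BYTE_ORDER__ is predefined (GCC and Clang do this);
 * when it is missing we fall back to little-endian. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#  define HOST_UTF16_ORDER UTF16_BE
#else
#  define HOST_UTF16_ORDER UTF16_LE
#endif

/* Swap the two bytes of a 16-bit code unit. */
#define SWAP16(x) ((uint16_t)((((x) & 0xFFu) << 8) | (((x) >> 8) & 0xFFu)))

/* Convert a code unit stored in `order` to host order. When `order` is a
 * constant equal to HOST_UTF16_ORDER the compiler can fold the branch away. */
#define TO_HOST16(x, order) \
    ((order) == HOST_UTF16_ORDER ? (uint16_t)(x) : SWAP16(x))

/* Look at the first two bytes only. A BOM selects LE or BE and is skipped;
 * anything else falls back to the host order, as proposed for --utf16. */
enum utf16_order sniff_bom(const unsigned char *buf, size_t len, size_t *skip)
{
    assert(buf != NULL && skip != NULL);
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) { *skip = 2; return UTF16_LE; }
        if (buf[0] == 0xFE && buf[1] == 0xFF) { *skip = 2; return UTF16_BE; }
    }
    return HOST_UTF16_ORDER;
}

/* Decode one code point from host-order code units.
 * Returns the number of units consumed (0 if n == 0). A well-formed
 * surrogate pair consumes 2 units; a lone surrogate is passed through
 * unchanged as a single unit, which a real scanner may want to reject. */
size_t decode_utf16(const uint16_t *u, size_t n, uint32_t *cp)
{
    assert(u != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (u[0] >= 0xD800 && u[0] <= 0xDBFF &&
        n >= 2 && u[1] >= 0xDC00 && u[1] <= 0xDFFF) {
        *cp = 0x10000u + (((uint32_t)u[0] - 0xD800u) << 10)
                       + ((uint32_t)u[1] - 0xDC00u);
        return 2;
    }
    *cp = u[0];
    return 1;
}
```

With this, the input 0xD800 0xDC00 decodes as one code point (U+10000) that consumes both units, so a single "." has something to match and a second "." has nothing left. That is the behaviour I'd want the test to pin down.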