From: Marcus B. <ma...@ma...> - 2008-07-25 16:45:10
Hello Gregg,

Friday, July 25, 2008, 5:29:20 PM, you wrote:

> On Fri, Jul 25, 2008 at 5:23 AM, Marcus Boerger <ma...@ma...> wrote:
>> Hello Gregg,
>>
>> The testing would be the harder work. What is required is to have re2c read
>> characters from the input stream into ints rather than into chars (bytes as
>> you called it). However, you can already do so if you provide the layer
>> doing so and just pass along the int array. Anyway, at this point re2c development

> Does anybody have some sample specs to work with Unicode input? I'm
> having trouble getting -u to work.
>
> Summary: we have two encodings, one for the re2c spec and one for the
> input to the generated scanner. I'm stuck using cygwin for the time
> being, so I use emacs to write my files in UTF-8, then use iconv to
> convert to UTF-16 or UTF-32.
>
> My YYCTYPE is unsigned int.
>
> Only UTF-8 specs work. With UTF-16 or UTF-32 specs re2c runs to
> completion but generates humongous C files that provoke zillions of
> "warning: null character(s) ignored". I'm not sure if compilation
> would work; it was taking so long I killed it.

C/C++ compilers only accept ASCII, maybe UTF-8 if you're lucky.

> With a UTF-8 encoded spec, I can either use UTF-8 encoded chars in my
> regexes (e.g. كتاب, lègére, etc.) or \u notation (e.g. \u0628 = ب).
>
> With UTF-8 encoded chars in a UTF-8 spec, the re2c command works, but
> re2c -u results in a re2c segfault. (I think I can make a usable scanner
> with this method, but I want to understand how "proper" Unicode
> support works.)
>
> With \u encoded chars in a UTF-8 spec, the re2c command produces
> "Illegal unicode character, out of range" for e.g. "\u0628", but
> re2c -u works.
>
> A UTF-8 spec, without -u, with UTF-8 encoded chars produces a
> scanner that recognizes UTF-8 encoded input, but only by recognizing
> byte codes, not characters (in the Unicode sense). It reads UTF-16
> and UTF-32 input but doesn't recognize the (non-ASCII) chars.
>
> A UTF-8 spec, with -u, with \u encoded chars produces a scanner that
> does not recognize non-ASCII input regardless of input encoding.
> However, it does seem to recognize ASCII regexes.
>
> So I'm not understanding something. I'm also confused about the
> difference between -w and -u.

The difference is the character space. While -w supports 16-bit characters
only, as in UCS-2, -u supports full Unicode as UTF-32.

> Any help would be greatly appreciated. Also, I'd be happy to write up
> some documentation if somebody can get me started with an example or
> two.
>
> -Gregg

--
Best regards,
Marcus
mailto:ma...@ma...
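The "layer" mentioned in the quoted reply, one that reads the input into ints and passes the int array along, could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name utf8_to_code_points is hypothetical and not part of re2c, the input is assumed to be UTF-8, and the decoder checks only the basic byte structure (it does not reject overlong forms or surrogate values).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of re2c.
 * Decodes the UTF-8 bytes src[0..len) into code points stored in dst.
 * dst must have room for len entries (one code point per byte at most).
 * Returns the number of code points written, or (size_t)-1 if the input
 * has an invalid lead byte, a bad continuation byte, or is truncated.
 * Simplification: overlong encodings and surrogates are not rejected. */
size_t utf8_to_code_points(const unsigned char *src, size_t len, uint32_t *dst)
{
    size_t i = 0; /* read position in src */
    size_t n = 0; /* number of code points written to dst */

    assert(dst != NULL);

    while (i < len) {
        unsigned char b = src[i++];
        uint32_t cp;  /* code point being assembled */
        size_t extra; /* continuation bytes still expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1; /* stray continuation or invalid lead byte */

        if (extra > len - i)
            return (size_t)-1;  /* sequence runs past the end of the input */

        for (; extra > 0; extra--) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        dst[n++] = cp;
    }
    return n;
}
```

With something like this in front of the scanner, the bytes 0xD8 0xA8 arrive as the single value 0x0628 (ب), so a scanner whose YYCTYPE is a 32-bit unsigned type can be pointed at the resulting array and match characters rather than byte codes.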