Re: [re2c-general] unicode; javascript


Hello Gregg,

Friday, July 25, 2008, 5:29:20 PM, you wrote:

> On Fri, Jul 25, 2008 at 5:23 AM, Marcus Boerger <ma...@ma...> wrote:
>> Hello Gregg,

>> The testing would be the harder work. What is required is to have re2c read
>> chars from the input stream into ints rather than into chars (bytes as you
>> called it). However you can already do so if you provide the layer doing so
>> and just pass along the int array. Anyway, at this point re2c development

> Does anybody have some sample specs to work with Unicode input?  I'm
> having trouble getting -u to work

> Summary: we have two encodings, one for the re2c spec and one for the
> input to the generated scanner.  I'm stuck using cygwin for the time
> being, so I use emacs to write my files in utf-8, then use iconv to
> convert to UTF-16 or UTF-32.

> My YYCTYPE is unsigned int.
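
For reference, the "layer" I mentioned that reads input into ints could look roughly like the sketch below. It decodes UTF-8 bytes into an array of unsigned int, one element per code point, which is what a scanner with YYCTYPE = unsigned int would then walk over. The name utf8_to_codepoints is made up for this mail and is not part of re2c, and the decoder is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of re2c): decode UTF-8 bytes into code
 * points so that each array element holds one character.
 * Returns the number of code points written, or (size_t)-1 if the input
 * is malformed or truncated, or if the output buffer is too small.
 * Simplified: overlong encodings and surrogates are not rejected. */
size_t utf8_to_codepoints(const unsigned char *in, size_t in_len,
                          unsigned int *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < in_len) {
        unsigned char b = in[i];
        unsigned int cp;
        size_t extra;   /* number of continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;                 /* invalid lead byte */

        if (extra > in_len - i - 1 || n >= out_cap)
            return (size_t)-1;                  /* truncated input or full buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;              /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

You would call this once on the raw file contents and point the scanner's cursor at the resulting unsigned int buffer instead of at the byte buffer.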

> Only utf-8 specs work.  With utf-16 or utf-32 specs re2c runs to
> completion but generates humongous c files that provoke zillions of
> "warning: null character(s) ignored".  I'm not sure if compilation
> would work, it was taking so long I killed it.

C/C++ compilers generally accept only ASCII source files, maybe UTF-8 if
you're lucky.

> With a utf-8 encoded spec, I can either use utf-8 encoded chars in my
> regexes (e.g. كتاب, lègére, etc.) or \u notation (e.g. \u0628 = ب).

> With utf-8 encoded chars in a utf-8 spec, the re2c command works, but re2c
> -u results in a re2c segfault.  (I think I can make a usable scanner
> with this method, but I want to understand how "proper" unicode
> support works.)

> With \u encoded chars in a utf-8 spec, the re2c command produces
> "Illegal unicode character, out of range" for e.g. "\u0628", but re2c -u works.

> A utf-8 spec, without -u, with utf-8 encoded chars produces a
> scanner that recognizes utf-8 encoded input, but only by recognizing
> byte codes, not characters (in the unicode sense).  It reads utf-16
> and utf-32 input but doesn't recognize the (non-ascii) chars.
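
To make the "byte codes, not characters" point concrete: a byte-oriented scanner never sees U+0628 as one unit, it sees the two bytes that UTF-8 uses for it. The sketch below is a plain UTF-8 encoder in C so you can check which byte sequence a given code point turns into. The function name utf8_encode is only for illustration and is not something re2c provides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 into buf.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned int cp, unsigned char buf[4])
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                           /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+0628 this yields the two bytes 0xD8 0xA8, and that byte pair is what the scanner actually matches when it runs over utf-8 input. The same character in utf-16 or utf-32 input is a different byte sequence, which would explain why those inputs are read but not recognized.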

> A utf-8 spec, with -u, with \u encoded chars produces a scanner that
> does not recognize non-ascii input regardless of input encoding.
> However, it does seem to recognize ascii regexes.

> So I'm not understanding something.  I'm also confused about the
> difference between -w and -u.

The difference is the character space. While -w supports 16 bit only as in
UCS-2, -u supports full Unicode as UTF-32.
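
In practical terms, a 16-bit character space covers only U+0000 to U+FFFF, while a 32-bit one covers everything up to U+10FFFF. If your input is UTF-16 and you want one array element per character, surrogate pairs have to be combined first. Below is a rough sketch of that step in C. The function name utf16_to_codepoints is hypothetical and not part of re2c, and I assume the code units are already in host byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-16 code units (host byte order) into
 * 32-bit code points. Returns the number of code points written, or
 * (size_t)-1 on an unpaired surrogate or a too-small output buffer. */
size_t utf16_to_codepoints(const unsigned short *in, size_t in_len,
                           unsigned int *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < in_len) {
        unsigned int u = in[i++];

        if (u >= 0xD800 && u <= 0xDBFF) {       /* high surrogate */
            unsigned int lo;
            if (i >= in_len)
                return (size_t)-1;              /* pair cut off at end of input */
            lo = in[i++];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;              /* high surrogate without a low one */
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (size_t)-1;                  /* stray low surrogate */
        }

        if (n >= out_cap)
            return (size_t)-1;                  /* output buffer full */
        out[n++] = u;
    }
    return n;
}
```

Characters such as U+0628 fit in a single 16-bit unit and pass through unchanged. Only code points above U+FFFF need the pair handling, and those are the ones a 16-bit scanner cannot represent as a single character.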

> Any help would be greatly appreciated.  Also, I'd be happy to write up
> some documentation if somebody can get me started with an example or
> two.

> -Gregg

-- 
Best regards,
 Marcus                            mailto:ma...@ma...