[Flex-devel] Unicode support for Flex


After investing time in a rewrite of the skeleton file processing for
flex, I have been looking into how to add support for Unicode. I
compared flex to Sun's open-source lex, available at
http://heirloom.sourceforge.net/. It is based on the original AT&T lex,
but adds support for wide character sets, either by making yytext a
wchar_t array or by leaving yytext as char and using a multi-byte
encoding.

There is also an existing "unicode" patch for flex, but it is
something of a hack. I think it must have been written by an MS Windows
user, because it only handles 16-bit characters, which is the size of
wchar_t on Windows, and therefore does not cover the full Unicode range.
It also uses raw 2-byte file I/O instead of the proper C library wcs
functions. It is also very inefficient, because it just sets CSIZE to
65536, which creates rather large tables.
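
To illustrate what I mean by using the C library instead of raw 2-byte
reads, here is a minimal sketch. The function name count_wide_chars is
made up for this example and is not taken from the patch or from flex.
It assumes the caller has already selected a locale, e.g. with
setlocale(LC_CTYPE, ""), so that the library knows how to decode the
input.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: read wide characters from fp until fgetwc()
 * returns WEOF, and return how many were read.  The C library decodes
 * the multibyte input according to the current LC_CTYPE locale, so no
 * byte-level assumptions about the encoding appear here.  Note that
 * WEOF is returned for read or decoding errors as well as end of file;
 * a real scanner would check ferror() and errno to tell them apart. */
size_t count_wide_chars(FILE *fp)
{
    size_t n = 0;

    assert(fp != NULL);
    while (fgetwc(fp) != WEOF)
        n++;
    return n;
}
```

This only addresses the I/O side. Where wchar_t is 16 bits wide, the
values returned are still limited to 16 bits.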

The right way to handle large character sets is to build a sparse table
of only the characters that are actually referenced in the grammar. It
turns out that this is essentially what flex does with the ECS option,
so it will be fairly easy to implement large character sets correctly by
making ECS mode a requirement.
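
As a rough sketch of the sparse-table idea (this is not flex's actual
ECS code, and all of the names below are made up for illustration),
suppose the grammar only refers to characters through a handful of
inclusive ranges. Code points that fall in exactly the same set of
ranges can share one class number, and everything the grammar never
mentions falls into class 0. The table size then depends on the number
of ranges in the grammar, not on the size of the character set.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_RANGES 32                 /* one bit per range in a uint32_t */

struct range { uint32_t lo, hi; };    /* inclusive code point range */

struct ecs_map {
    size_t   n;                       /* number of interval start points */
    uint32_t start[2 * MAX_RANGES];   /* sorted, unique start points */
    int      cls[2 * MAX_RANGES];     /* class of [start[i], start[i+1]) */
    int      nclasses;                /* includes class 0 (unreferenced) */
};

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Build the map from nr ranges.  Returns 0 on success, -1 if nr is
 * larger than MAX_RANGES. */
int ecs_build(struct ecs_map *m, const struct range *r, size_t nr)
{
    uint32_t seen[2 * MAX_RANGES];    /* distinct membership masks */
    size_t nseen = 0, i, j, k;

    assert(m != NULL && r != NULL);
    if (nr > MAX_RANGES)
        return -1;

    /* Every range contributes two boundaries: its start and one past
     * its end. */
    m->n = 0;
    for (j = 0; j < nr; j++) {
        m->start[m->n++] = r[j].lo;
        m->start[m->n++] = r[j].hi + 1;
    }
    qsort(m->start, m->n, sizeof m->start[0], cmp_u32);

    /* Drop duplicate boundaries. */
    for (i = 0, k = 0; i < m->n; i++)
        if (k == 0 || m->start[i] != m->start[k - 1])
            m->start[k++] = m->start[i];
    m->n = k;

    /* Label each interval by the set of ranges that cover it. */
    for (i = 0; i < m->n; i++) {
        uint32_t mask = 0;
        for (j = 0; j < nr; j++)
            if (r[j].lo <= m->start[i] && m->start[i] <= r[j].hi)
                mask |= (uint32_t)1 << j;
        if (mask == 0) {
            m->cls[i] = 0;            /* covered by no range */
            continue;
        }
        for (k = 0; k < nseen && seen[k] != mask; k++)
            ;
        if (k == nseen)
            seen[nseen++] = mask;
        m->cls[i] = (int)k + 1;
    }
    m->nclasses = (int)nseen + 1;
    return 0;
}

/* Map a code point to its class by binary search over the boundaries. */
int ecs_lookup(const struct ecs_map *m, uint32_t c)
{
    size_t lo = 0, hi = m->n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->start[mid] <= c)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? 0 : m->cls[lo - 1];
}
```

With ranges such as [a-z], [0-9], [a-f] and U+4E00 to U+9FFF, this needs
only seven boundaries and five classes, no matter whether the character
set has 256, 65536, or more than a million code points.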

OTOH, the lower 16-bit range of Unicode covers all of the common
written languages in modern use, so a 16-bit limited version as
implemented in the current "unicode" patch would be a good start. It
will also work for other 16-bit encodings that are still widely used.
So, I have implemented a variation of the 16-bit patch to provide a
general-purpose --16bit mode. The main difference is that I converted
all of the flex tables sized by CSIZE to dynamic allocation, so that the
large tables required for this approach will not make flex a memory hog
for all of the 8-bit users.
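
The idea behind the dynamic allocation is roughly the following. This is
a simplified sketch and not the code from my patch: the struct and the
function names are invented for the example, and only two tables are
shown. The point is that the character-set size becomes a run-time value,
so 8-bit users only pay for 257 entries per table.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for tables that used to be declared with a
 * fixed size of CSIZE + 1. */
struct char_tables {
    size_t csize;      /* 256 in 8-bit mode, 65536 in 16-bit mode */
    int   *ecgroup;    /* csize + 1 entries, zero-initialized */
    int   *nextecm;    /* csize + 1 entries, zero-initialized */
};

/* Allocate the tables for the selected mode.  Returns 0 on success and
 * -1 if memory could not be allocated. */
int char_tables_init(struct char_tables *t, int use_16bit)
{
    assert(t != NULL);
    t->csize   = use_16bit ? 65536u : 256u;
    t->ecgroup = calloc(t->csize + 1, sizeof *t->ecgroup);
    t->nextecm = calloc(t->csize + 1, sizeof *t->nextecm);
    if (t->ecgroup == NULL || t->nextecm == NULL) {
        free(t->ecgroup);
        free(t->nextecm);
        t->ecgroup = t->nextecm = NULL;
        t->csize = 0;
        return -1;
    }
    return 0;
}

/* Release the tables and reset the struct. */
void char_tables_free(struct char_tables *t)
{
    assert(t != NULL);
    free(t->ecgroup);
    free(t->nextecm);
    t->ecgroup = t->nextecm = NULL;
    t->csize = 0;
}
```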

I also added a "--with-wchar" option to configure.in so it can be an 
experimental feature. By 'wchar' I mean general wide-character support, 
and not 'wchar_t'. Maybe another name would avoid possible confusion?

Using hints from Sun's lex code, it was actually fairly easy to get a 
quick initial implementation. The problem is that I use American 
English, and am not a Unicode or NLS user, so I am somewhat clueless 
about actually using wide-character support, other than inserting 
special characters into some of my print messages. I mainly did it 
because I know a lot of other users have expressed a strong interest in 
getting Unicode support. I suspect there must be some test examples that 
use wide-char support in Sun's lex, or maybe examples that work with the 
existing unicode patch.

Joe Krahn
