[Flex-devel] Unicode support for Flex
From: Joe K. <kr...@ni...> - 2008-11-11 17:12:35
After investing time in a rewrite of the skeleton file processing for flex, I have been looking into how to add support for Unicode. I compared flex to Sun's open-source lex, available at http://heirloom.sourceforge.net/. It is based on the original AT&T lex, but adds support for wide character sets, either by using wchar_t for yytext or by leaving yytext as char and using a multi-byte encoding.

There is also an existing "unicode" patch for flex, but it is somewhat of a hack. I think it must have been written by an MS Windows user, because it is only 16-bit, which is the size of wchar_t on Windows, and therefore not really Unicode. It also uses raw 2-byte file I/O instead of the proper C library wcs functions. It is also very inefficient, because it just sets CSIZE to 65536, creating rather large tables.

The right way to handle a large character set is to build a sparse table of the characters that are actually referenced in the grammar. It turns out that this is essentially what flex does with the ECS option, so it will be fairly easy to implement large character sets correctly by making ECS mode a requirement.

OTOH, the lower 16-bit part of Unicode covers all of the common written languages in modern use, so a 16-bit limited version, as implemented in the current "unicode" patch, would be a good start. It will also work for other 16-bit encodings that are still widely used.

So, I have implemented a variation of the 16-bit patch to provide a general-purpose --16bit mode. The main difference is that I converted all of the flex CSIZE tables to dynamic allocation, so that the large tables required for this approach will not make flex a memory hog for all of the 8-bit users. I also added a "--with-wchar" option to configure.in so it can be an experimental feature. By 'wchar' I mean general wide-character support, and not 'wchar_t'. Maybe another name would avoid possible confusion?
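As a rough illustration of the sparse-table idea (mapping only the characters a grammar actually references, so that table size depends on the grammar rather than on CSIZE), here is a minimal sketch in Python. It is not taken from flex and does not reproduce flex's ECS code; the names `build_ecs` and `ec_of`, the range-list representation of character classes, and the sample classes are all assumptions made for the example. Two code points land in the same equivalence class when they belong to exactly the same set of grammar character classes.

```python
from bisect import bisect_right

MAX_CP = 0x10FFFF  # highest Unicode code point


def build_ecs(classes):
    """Partition 0..MAX_CP into equivalence classes (hypothetical helper).

    `classes` is a list of character classes, each a list of inclusive
    (lo, hi) code point ranges. Returns (starts, ecs): the sorted start of
    each span of code points and that span's equivalence class number.
    """
    # Class membership can only change at a range boundary.
    cuts = {0}
    for ranges in classes:
        for lo, hi in ranges:
            cuts.add(lo)
            if hi < MAX_CP:
                cuts.add(hi + 1)

    starts, ecs, seen = [], [], {}
    for cut in sorted(cuts):
        # Signature: the set of classes that contain this code point.
        sig = frozenset(
            i for i, ranges in enumerate(classes)
            if any(lo <= cut <= hi for lo, hi in ranges)
        )
        ec = seen.setdefault(sig, len(seen))
        if ecs and ecs[-1] == ec:
            continue  # same class as the previous span, so merge them
        starts.append(cut)
        ecs.append(ec)
    return starts, ecs


def ec_of(starts, ecs, cp):
    """Look up the equivalence class of code point `cp` by binary search."""
    return ecs[bisect_right(starts, cp) - 1]


# Sample classes, standing in for what a grammar might reference.
classes = [
    [(ord("a"), ord("z"))],  # [a-z]
    [(ord("0"), ord("9"))],  # [0-9]
    [(0x0370, 0x03FF)],      # Greek and Coptic block
    [(ord("a"), ord("f"))],  # [a-f], overlaps [a-z]
]
starts, ecs = build_ecs(classes)
print(len(starts), "spans,", len(set(ecs)), "equivalence classes")
```

With these four sample classes the whole 21-bit code point range collapses to a handful of spans and equivalence classes, so a scanner's transition tables could be indexed by class number instead of by raw character value. A real implementation would also need to account for characters that appear as literals in patterns, which this sketch would simply treat as one-character classes.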
Using hints from Sun's lex code, it was actually fairly easy to get a quick initial implementation. The problem is that I use American English and am not a Unicode or NLS user, so I am somewhat clueless about actually using wide-character support, other than inserting special characters into some of my print messages. I mainly did it because I know a lot of other users have expressed a strong interest in getting Unicode support. I suspect there must be some test examples that use wide-char support in Sun's lex, or maybe examples that work with the existing unicode patch.

Joe Krahn