[Flex-devel] Generating UTF-8 regular expressions


Hi,

I've written a program that might be useful for people writing
scanners with UTF-8 input.

(See attachment)

/*
 * This program produces lex(1)-compatible regular expressions that match
 * any of the specified Unicode code points encoded in UTF-8.  Code points
 * can be either enumerated or specified by their General_Category.
 *
 * lex(1) doesn't know anything about Unicode, but it is 8-bit clean.  So we
 * construct regular expressions that match the UTF-8 byte sequences encoding
 * the needed Unicode code points.  This way we get a lexer that works with
 * UTF-8 input without knowing anything about Unicode.
 *
 * To make the lexer handle any other Unicode Transformation Format, we need
 * to write a wrapper that converts that UTF to UTF-8 and feeds the result to
 * the UTF-8 lexer.
 *
 * This program is intended to conform to Unicode 5.2.0 (but I don't have
 * a proof of that).
 */
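
To illustrate the idea described in the comment above, here is a minimal
sketch of turning a single code point into the byte-level escapes that lex
understands.  It is not taken from the attached program: the helper name
utf8_escapes is hypothetical, and the sketch ignores General_Category lookup
entirely.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the attached program): encode one Unicode
// code point as UTF-8 and return the bytes as lex-style \xNN escapes.
std::string utf8_escapes(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    unsigned char bytes[4];
    int n = 0;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        bytes[n++] = static_cast<unsigned char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        bytes[n++] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        bytes[n++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        bytes[n++] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        bytes[n++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        bytes[n++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        bytes[n++] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        bytes[n++] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        bytes[n++] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        bytes[n++] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    }

    // Render every byte as \xNN so the result can be pasted into a lex rule.
    std::string out;
    for (int i = 0; i < n; ++i) {
        char buf[5];                       // "\xNN" plus terminating NUL
        std::snprintf(buf, sizeof buf, "\\x%02x", bytes[i]);
        out += buf;
    }
    return out;
}
```

For U+2192 this yields the three escapes \xe2\x86\x92.  The sketch returns
only the bare byte sequence, without the grouping parentheses and brackets
that the real program prints around its output.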

/*
 * To compile: install libicu-dev.  Then run:
 * g++ -W -Wall -std=c++0x -o gen_utf-8_regexp gen_utf-8_regexp.cc $(icu-config --ldflags)
 */

For example:
A specified category:
$ ./gen_utf-8_regexp Space_Separator
([\x20]|(\xc2\xa0)|(\xe1\x9a\x80|\xe1\xa0\x8e|\xe2\x80[\x80-\x8a]|\xe2\x80\xaf|\xe2\x81\x9f|\xe3\x80\x80))

A specified code point:
$ ./gen_utf-8_regexp -c 2192
((\xe2\x86\x92))

Multiple categories or code points can be specified.
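
The Space_Separator output above collapses U+2000..U+200A into a single
bracket expression on the last byte.  The following is a rough sketch of how
such a collapse can be done for the simple case where every code point in the
range shares the same leading bytes.  The names utf8_bytes, escape_byte and
utf8_range_regex are hypothetical, and this is not necessarily how the
attached program does it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Raw UTF-8 bytes of a code point (assumes a valid Unicode scalar value).
static std::string utf8_bytes(char32_t cp)
{
    std::string s;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}

// One byte rendered as a lex-style \xNN escape.
static std::string escape_byte(char c)
{
    char buf[5];
    std::snprintf(buf, sizeof buf, "\\x%02x", static_cast<unsigned char>(c));
    return buf;
}

// Hypothetical helper: regex for the inclusive range [first, last].
// Precondition (an assumption of this sketch): both ends encode to the same
// number of bytes and differ only in the final byte.
std::string utf8_range_regex(char32_t first, char32_t last)
{
    const std::string a = utf8_bytes(first);
    const std::string b = utf8_bytes(last);
    assert(first <= last && a.size() == b.size());
    assert(a.compare(0, a.size() - 1, b, 0, b.size() - 1) == 0);

    std::string out;
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        out += escape_byte(a[i]);          // shared leading bytes
    if (a.back() == b.back())
        out += escape_byte(a.back());      // single code point
    else
        out += "[" + escape_byte(a.back()) + "-" + escape_byte(b.back()) + "]";
    return out;
}
```

With first = 0x2000 and last = 0x200A this gives \xe2\x80[\x80-\x8a], matching
the corresponding alternative in the Space_Separator output.  Ranges that
cross a change in the leading bytes would have to be split into several such
pieces first, which this sketch does not attempt.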

It could be useful for flex users if this program were included in
the flex distribution.

I'm open to suggestions.

Best regards,
Dmitri Gribenko

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gri...@gm...>*/
