[Flex-devel] Generating UTF-8 regular expressions
From: Dmitri G. <gri...@gm...> - 2009-10-19 04:47:23
Hi,

I've written a program that might be useful for people writing scanners
with UTF-8 input (see attachment).

/*
 * This program produces lex(1)-compatible regular expressions that will match
 * any of the specified Unicode code points in UTF-8. Code points can be either
 * enumerated or specified by their General_Category.
 *
 * lex(1) doesn't know anything about Unicode, but it is 8-bit clean. So we
 * construct regular expressions that match the UTF-8 byte sequences that
 * encode the needed Unicode code points. In this way we get a lexer that works
 * with UTF-8 input but doesn't know anything about Unicode.
 *
 * To make the lexer aware of any other Unicode Transformation Format, we need
 * to write a wrapper that converts that UTF to UTF-8 and feeds the result to
 * the UTF-8 lexer.
 *
 * This program is intended to conform to Unicode 5.2.0 (but I don't have
 * a proof of that).
 */

/*
 * To compile: install libicu-dev. Then run:
 * g++ -W -Wall -std=c++0x gen_utf-8_regexp.cc $(icu-config --ldflags)
 */

For example:

A specified category:

$ ./gen_utf-8_regexp Space_Separator
([\x20]|(\xc2\xa0)|(\xe1\x9a\x80|\xe1\xa0\x8e|\xe2\x80[\x80-\x8a]|\xe2\x80\xaf|\xe2\x81\x9f|\xe3\x80\x80))

A specified code point:

$ ./gen_utf-8_regexp -c 2192
((\xe2\x86\x92))

Multiple categories or code points can be specified.

It could be useful for flex users if this program were included in the
flex distribution. I'm open to suggestions.

Best regards,
Dmitri Gribenko

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gri...@gm...>*/
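
To illustrate the byte-level idea for a single code point, here is a minimal sketch. It is not the attached program: it does not use ICU and does not handle categories or ranges. The function names encode_utf8 and utf8_pattern are hypothetical, chosen only for this example. It encodes one code point as UTF-8 and prints each byte as a \xNN escape of the kind lex(1) accepts.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
// Precondition: cp is at most U+10FFFF and is not a surrogate.
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render the UTF-8 bytes of cp as \xNN escapes,
// giving a pattern fragment that matches exactly that byte sequence.
std::string utf8_pattern(std::uint32_t cp) {
    std::string pattern;
    for (unsigned char byte : encode_utf8(cp)) {
        char buf[5];                       // "\xNN" plus terminating NUL
        std::snprintf(buf, sizeof buf, "\\x%02x", static_cast<unsigned>(byte));
        pattern += buf;
    }
    return pattern;
}
```

Calling utf8_pattern(0x2192) gives \xe2\x86\x92, the same byte sequence that appears in the -c 2192 example. Merging neighbouring sequences into bracket ranges such as \xe2\x80[\x80-\x8a] is a further step that this sketch leaves out.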