[Flex-help] Using serialized tables and a single template for a general recognizer: can this work?
flex is a tool for generating scanners
Brought to you by:
wlestes
From: Allyn D. <all...@gm...> - 2010-08-04 19:50:45
|
I am interested in solving the following problem: I have a file full of regular expressions. The file contents can change, both the regular expressions themselves and the number of regular expressions. I want a program that given a string can determine quickly which regular expression from the file provides the longest / first match. A file of regexps always ends with a default rule so there is never a question of an expression not matching. By "quickly" I mean that checking the string against one regexp at a time is too slow: I need a combined DFA. Is there a standard solution to this problem? If so please let me know. My idea is to run the file through flex generating a serialized table. I would like my program to be linked with a single template or small predetermined number of templates. Given that the number of rules changes, I cannot use the standard lex.yy.c file since, even for serialized tables, the value YY_END_OF_BUFFER is hard coded but in my case the number of rules can change. So I need to hack the template lex.yy.c file to make this a dynamic parameter that I can set in yylex_init_extra. It would be nice to be able to determine the value for YY_END_OF_BUFFER from the tables file, but I can also use a pair of <tables file, metadata>. I also need to be able to determine which regexp matches in the combined DFA. I propose to do this by in a single action { yylval = yy_act; }. I believe that this will give me the (1-based) number of the matched regexp in my input file. If this is wrong please let me know. Again I need to hack into a lex.yy.c file to put this into the default case. In beginning to test out this idea with a few files of regexps I notice that I get struct yy_trans_info { flex_int16_t yy_verify; flex_int16_t yy_nxt; }; in some cases and struct yy_trans_info { flex_int32_t yy_verify; flex_int32_t yy_nxt; }; in others. This would seem to require two (at least) different scanners linked into my program. Are there combinations of flex_int16_t with flex_int32_t, or is it possible to get flex_int8_t which would require more scanners. I notice that the lex.yy.c files for some regexp sequences include a back-up action on yy_act == 0 and others do not. I assume that there is no harm to always having the back-up action on yy_act == 0 and not having it is just an optimization in the scanner generator. Does the approach outlined above appear doable? Would I do better to steal from the flex source to make a specialized tool rather than trying to generalize the lex.yy.c files put out by vanilla flex? Right now, I am looking at hacking a lex.yy.c file output from flex rather than using --skel, but please let me know if using --skel would make my coding job easier or more reliable. If all this doesn't work, I can fall back on not using serialized tables but instead compiling a scanner and linking it as dll. I would prefer to avoid the dll route as it is architecture dependent and it may require some significant amount of time to change over between dlls. Thanks in advance for your advice, -- Allyn P.S. The lex.yy.c files that I generated while looking at the problem were generated with flex 2.5.34 as flex -CF --tables-file --reentrant --noline --batch --never-interactive --7bit --noyywrap --bison-bridge I am using an unmodified flex: I tend to be using < 2000 regexps in a file and do not seem to be getting too close to the flex limit of 32000 NFA states. |