From: Hans-Bernhard B. <br...@ph...> - 2000-09-27 12:11:29
|
Hello, all, let me show you what a flex-optimized scanner.l might look like. It's the one I mentioned in yesterday's mail, already. Today, I remembered to bring its source in, so I can make diffs and show them to you, to give you an idea what we would be talking about, and a quick benchmark on another platform than my Linux box at home. So here's a diff, heavily edited to isolate the scanner rewrite (The full 'diff' is attached, for reference): Index: scanner.l =================================================================== RCS file: /cvsroot/cscope/cscope/src/scanner.l,v retrieving revision 1.4 diff -u -w -B -r1.4 scanner.l --- scanner.l 2000/05/05 17:35:00 1.4 +++ scanner.l 2000/09/27 10:54:59 @@ -85,11 +86,13 @@ static int templateparens; /* function template outer parentheses count */ static int typedefbraces = -1; /* initial typedef brace count */ static int token; /* token found */ +static int ident_start; /* begin of preceding identifier */ That's a new status variable I had to introduce. See below. [...] @@ -113,6 +128,16 @@ %start SDL %a 4000 %o 7000 +/* flex only options, to avoid incompatibilities with AT&T lex: */ +%array +/* 'flex -B' (batch mode) doesn't work, for this scanner. Probably + * a usage bug, but let's stay conservative: */ +%option always-interactive + +/* Exclusive initial states (not available in 'lex'!) */ +%x WAS_ENDIF WAS_IDENTIFIER + + %% Flex-specific flags and options. 'lex' can just ignore %array and %option. %x could be emulated by %s, and prefixing all rules with a non-default initial state, to eliminate the difference between exclusive and inclusive initial states. [...] -\#[ \t]*endif/.*\n[ \t\n]*#[ \t]*if { +<WAS_ENDIF>.*\n[ \t\n]*#[ \t]*ifn?(def)? { This is one of the original 'variable trailing context rules'. To get rid of this performance killer (according to flex'es messages and docs), I delayed handling of '#endif' lines, and use the initial state <WAS_ENDIF> to flag this. @@ -203,10 +228,11 @@ /* NOTREACHED */ } pseudoelif = YES; + BEGIN(INITIAL); goto more; /* NOTREACHED */ } This goes back to normal operation after <WAS_ENDIF> has been taken care of. -\#[ \t]*ifn?(def)? { /* #if, #ifdef or #ifndef */ +<INITIAL>\#[ \t]*ifn?(def)? { /* #if, #ifdef or #ifndef */ Don't use this rule if in <WAS_ENDIF> status. -\#[ \t]*endif { /* #endif */ +<WAS_ENDIF>.*\n { /* an #endif with no #if right after it */ And endif with no #if right after it now needs special treatment (an #endif with nothing between it and the EOF causes a little glitch because of this rule --> will be echoed to stdout) @@ -249,6 +275,16 @@ braces = maxifbraces[iflevel]; } } + BEGIN(INITIAL); + yyless(0); /* rescan */ + goto more; + /* NOTREACHED */ + } + +\#[ \t]*endif { /* #endif */ + /* delay treatment of #endif depending on whether an + * #if comes right after it, or not */ + BEGIN(WAS_ENDIF); goto more; /* NOTREACHED */ } And this is the rule that detect #endif, and does nothing but move the scanner into <WAS_ENDIF> mode. The original treatment of #endif is now in the <WAS_ENDIF>.|\n rule, above. @@ -359,16 +395,39 @@ class[ \t]+{identifier}[ \t\na-zA-Z0-9_():]*\{ { /* class definition */ classdef = YES; tagdef = 'c'; - REJECT; + /* RE!JECT; */ + yyless(5); /* eat up 'class', and re-scan */ + goto more; /* NOTREACHED */ } Uses yyless() to avoid use of REJECT. I'm not 100% certain that this does what it's supposed to, but it seemed to behave the same as the original scanner, if used on a C++ source. -(enum|struct|union)/([ \t\n]+{identifier})?[ \t\n]*\{ { /* enum/struct/union definition */ - tagdef = *(yytext + first); + + /* Notice: three copies of essentially same rule may seem + * strange, but they help flex to avoid slowing down the + * scan (--> 'variable trailing context rules') */ +enum/([ \t\n]+{identifier})?[ \t\n]*\{ { /* enum definition */ + tagdef = 'e'; goto ident; /* NOTREACHED */ } +struct/([ \t\n]+{identifier})?[ \t\n]*\{ { /* struct definition */ + tagdef = 's'; + goto ident; + /* NOTREACHED */ + } +union/([ \t\n]+{identifier})?[ \t\n]*\{ { /* union definition */ + tagdef = 'u'; + goto ident; + /* NOTREACHED */ + } This resolves another variable trailing context, by splitting up the rule into three. That was simple. -{identifier}/[ \t]*\([ \t\na-zA-Z0-9_*&[\]=,.]*\)[ \t\n()]*[:a-zA-Z_#{] { +{identifier} { /* identifier found: do nothing, yet. (!) */ + BEGIN(WAS_IDENTIFIER); + ident_start = first; + goto more; + /* NOTREACHED */ + } +<WAS_IDENTIFIER>[ \t]*\([ \t\na-zA-Z0-9_*&[\]=,.]*\)[ \t\n()]*[:a-zA-Z_#{] { + Again, a slew of variable trailing context rules. Treated by first reading the identifier, and transforming the trailing contexts into (non-accepting) actual rules. 'ident_start' is needed as a new status variable to store where the {identifier} is, within the current yytext[]. -{identifier}/[ \t]*\( { /* if a function call */ +<WAS_IDENTIFIER>[ \t]*\( { /* if a function call */ -{identifier}/[* \t\n]+[a-zA-Z0-9_] { /* typedef name use */ +<WAS_IDENTIFIER>[* \t\n]+[a-zA-Z0-9_] { /* typedef name use */ -{identifier} { +<WAS_IDENTIFIER>.|\n { /* the old {identifier} rule */ If the none of the trailing contexts gave a match, treat it like the original {identifier} rule did. @@ -420,6 +480,14 @@ ident: token = IDENT; } fcn: + if (YYSTATE == WAS_IDENTIFIER) { + /* Position back to the actual identifier: */ + yyleng = first; + first = ident_start; + BEGIN(INITIAL); + yyless(0); + } This if(YYSTATE == WAS_IDENTIFIER) block is needed because other rules 'goto' right into the middle of the {identifier} rule, and not all of them belong to the set of variable context matchers that have <WAS_IDENTIFIER> as their initial state. Those that do need yyleng and first modified to simulate the original behaviour, where {identifier} would have been the match, not the trailing context, as we come here. We could duplicate the handling code, instead, to avoid all that 'goto'ing. @@ -622,6 +690,7 @@ template = NO; /* function template */ templateparens = -1; /* function template outer parentheses count */ typedefbraces = -1; /* initial typedef braces count */ + ident_start = 0; BEGIN 0; Phew. That was that. Now, let me show a quick'n'dirty benchmark: cscope current-cvs with scanner.l modified as described, and compiled in different 'flex' modes. Benchmark is 'cscope -ulR' in its own src dir (build directory is elsewhere, so no .o and no scanner.c or egrep.c are in this directry) plus my local gnuplot sources tree, to make it a bit bigger. Two consecutive runs each time, to allow for caching effects to settle. 'flex -l', original scanner: 11.337u 0.430s 0:16.87 69.7% 0+2k 623+382io 0pf+0w 11.336u 0.422s 0:16.43 71.5% 0+2k 179+346io 0pf+0w 'flex -l' mode, but with the modified scanner: #endif 10.831u 0.428s 0:15.53 72.4% 0+2k 644+367io 0pf+0w #endif 10.851u 0.278s 0:12.73 87.3% 0+2k 0+293io 0pf+0w (That '#endif' in the output is from the scanning glitch I mentioned earlier.) 'flex' without any flags (except those silently activated by '%array' and %option I mentioned), with the modified scanner. #endif 9.552u 0.403s 0:14.47 68.7% 0+2k 603+374io 0pf+0w #endif 9.471u 0.318s 0:12.06 81.0% 0+2k 57+342io 0pf+0w 'flex -f' mode ('-F' and the other -C{F|f}... modes are similar): #endif 6.254u 0.416s 0:11.81 56.3% 0+2k 699+351io 0pf+0w #endif 6.282u 0.265s 0:07.50 87.2% 0+3k 0+281io 0pf+0w I.e. roughly 45% reduction of execution time from first to last example. This would bring the factor of 5 to 6 reported by Darryl down to about 3, then. But currently, it comes at the cost of errors: the #endif glitch is easily visible above. The other one may be a heisenbug: -f mode silently forces -B mode, which caused errors on my Linux box at home, before. But on the Alpha, it seems to function properly. Hans-Bernhard Broeker (br...@ph...) Even if all the snow were burnt, ashes would remain. |