[Cscope-devel] A flex-able scanner.l

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hello, all,

let me show you what a flex-optimized scanner.l might look like. It's the
one I mentioned in yesterday's mail, already. Today, I remembered to bring
its source in, so I can make diffs and show them to you, to give you an
idea what we would be talking about, and a quick benchmark on another
platform than my Linux box at home.

So here's a diff, heavily edited to isolate the scanner rewrite (The
full 'diff' is attached, for reference):

Index: scanner.l
===================================================================
RCS file: /cvsroot/cscope/cscope/src/scanner.l,v
retrieving revision 1.4
diff -u -w -B -r1.4 scanner.l

--- scanner.l	2000/05/05 17:35:00	1.4
+++ scanner.l	2000/09/27 10:54:59
@@ -85,11 +86,13 @@
 static	int	templateparens;		/* function template outer parentheses count */
 static	int	typedefbraces = -1;	/* initial typedef brace count */
 static	int	token;			/* token found */
+static	int 	ident_start;		/* begin of preceding identifier */

That's a new status variable I had to introduce. See below.

[...]
@@ -113,6 +128,16 @@
 %start SDL
 %a 4000
 %o 7000
+/* flex only options, to avoid incompatibilities with AT&T lex: */
+%array
+/* 'flex -B' (batch mode) doesn't work, for this scanner. Probably 
+ * a usage bug, but let's stay conservative: */
+%option always-interactive
+
+/* Exclusive initial states (not available in 'lex'!) */
+%x WAS_ENDIF WAS_IDENTIFIER
+
+
 %%

Flex-specific flags and options. 'lex' can just ignore %array and %option.
%x could be emulated by %s, and prefixing all rules with a non-default
initial state, to eliminate the difference between exclusive and inclusive
initial states.

[...]
-\#[ \t]*endif/.*\n[ \t\n]*#[ \t]*if	{
+<WAS_ENDIF>.*\n[ \t\n]*#[ \t]*ifn?(def)?	{

This is one of the original 'variable trailing context rules'. To get
rid of this performance killer (according to flex'es messages and docs),
I delayed handling of '#endif' lines, and use the initial state
<WAS_ENDIF> to flag this.

@@ -203,10 +228,11 @@
 				/* NOTREACHED */
 			}
 			pseudoelif = YES;
+			BEGIN(INITIAL);
 			goto more;
 			/* NOTREACHED */
 		}
This goes back to normal operation after <WAS_ENDIF> has been taken care
of.

-\#[ \t]*ifn?(def)?	{ /* #if, #ifdef or #ifndef */
+<INITIAL>\#[ \t]*ifn?(def)?	{ /* #if, #ifdef or #ifndef */

Don't use this rule if in <WAS_ENDIF> status.

-\#[ \t]*endif	{	/* #endif */
+<WAS_ENDIF>.*\n		{ 	/* an #endif with no #if right after it */

And endif with no #if right after it now needs special treatment (an
#endif with nothing between it and the EOF causes a little glitch because
of this rule --> will be echoed to stdout)
 				
@@ -249,6 +275,16 @@
 					braces = maxifbraces[iflevel];
 				}
 			}
+			BEGIN(INITIAL);
+			yyless(0); /* rescan */
+			goto more;
+			/* NOTREACHED */
+		}
+			
+\#[ \t]*endif	{	/* #endif */
+			/* delay treatment of #endif depending on whether an
+			 * #if comes right after it, or not */
+			BEGIN(WAS_ENDIF);
 			goto more;
 			/* NOTREACHED */
 		}

And this is the rule that detect #endif, and does nothing but move the
scanner into <WAS_ENDIF> mode. The original treatment of #endif is now in
the <WAS_ENDIF>.|\n rule, above.

@@ -359,16 +395,39 @@
 class[ \t]+{identifier}[ \t\na-zA-Z0-9_():]*\{	{	/* class definition */
 			classdef = YES;
 			tagdef =  'c';
-			REJECT;
+			/* RE!JECT; */
+			yyless(5); /* eat up 'class', and re-scan */
+			goto more;
 			/* NOTREACHED */
 		}

Uses yyless() to avoid use of REJECT. I'm not 100% certain that this does
what it's supposed to, but it seemed to behave the same as the original
scanner, if used on a C++ source.

-(enum|struct|union)/([ \t\n]+{identifier})?[ \t\n]*\{	{ /* enum/struct/union definition */
-			tagdef = *(yytext + first);
+
+		/* Notice: three copies of essentially same rule may seem
+		 * strange, but they help flex to avoid slowing down the
+	         * scan (--> 'variable trailing context rules') */
+enum/([ \t\n]+{identifier})?[ \t\n]*\{	{ /* enum definition */
+			tagdef = 'e';
			goto ident;
 			/* NOTREACHED */
		}
+struct/([ \t\n]+{identifier})?[ \t\n]*\{	{ /* struct definition */
+			tagdef = 's';
+			goto ident;
+			/* NOTREACHED */
+		}
+union/([ \t\n]+{identifier})?[ \t\n]*\{	{ /* union definition */
+			tagdef = 'u';
+			goto ident;
+			/* NOTREACHED */
+		}

This resolves another variable trailing context, by splitting up the rule
into three. That was simple.

-{identifier}/[ \t]*\([ \t\na-zA-Z0-9_*&[\]=,.]*\)[ \t\n()]*[:a-zA-Z_#{]	{
+{identifier}	{	/* identifier found: do nothing, yet. (!) */
+			BEGIN(WAS_IDENTIFIER);
+			ident_start = first;
+			goto more;
+			/* NOTREACHED */
+		}
+<WAS_IDENTIFIER>[ \t]*\([ \t\na-zA-Z0-9_*&[\]=,.]*\)[ \t\n()]*[:a-zA-Z_#{]	{
+

Again, a slew of variable trailing context rules. Treated by first reading
the identifier, and transforming the trailing contexts into
(non-accepting) actual rules. 'ident_start' is needed as a new status
variable to store where the {identifier} is, within the current yytext[].

-{identifier}/[ \t]*\(	{ /* if a function call */
+<WAS_IDENTIFIER>[ \t]*\(	{ /* if a function call */

-{identifier}/[* \t\n]+[a-zA-Z0-9_]	{	/* typedef name use */
+<WAS_IDENTIFIER>[* \t\n]+[a-zA-Z0-9_]	{	/* typedef name use */

-{identifier}	{
+<WAS_IDENTIFIER>.|\n	{		/* the old {identifier} rule */

If the none of the trailing contexts gave a match, treat it like the
original {identifier} rule did.

@@ -420,6 +480,14 @@
 		ident:		token = IDENT;
 			}
 		fcn:
+			if (YYSTATE == WAS_IDENTIFIER) {
+				/* Position back to the actual identifier: */
+				yyleng = first; 
+				first = ident_start;
+				BEGIN(INITIAL);
+				yyless(0);
+			}

This if(YYSTATE == WAS_IDENTIFIER) block is needed because other rules
'goto' right into the middle of the {identifier} rule, and not all of them
belong to the set of variable context matchers that have <WAS_IDENTIFIER>
as their initial state. Those that do need yyleng and first modified
to simulate the original behaviour, where {identifier} would have been the
match, not the trailing context, as we come here.

We could duplicate the handling code, instead, to avoid all that
'goto'ing.

@@ -622,6 +690,7 @@
 	template = NO;		/* function template */
 	templateparens = -1;	/* function template outer parentheses count */
 	typedefbraces = -1;	/* initial typedef braces count */
+	ident_start = 0;
 
 	BEGIN 0;
 

Phew. That was that.

Now, let me show a quick'n'dirty benchmark: cscope current-cvs with
scanner.l modified as described, and compiled in different 'flex' modes.
Benchmark is 'cscope -ulR' in its own src dir (build directory is
elsewhere, so no .o and no scanner.c or egrep.c are in this directry) plus
my local gnuplot sources tree, to make it a bit bigger. Two consecutive
runs each time, to allow for caching effects to settle.

'flex -l', original scanner:
11.337u 0.430s 0:16.87 69.7%    0+2k 623+382io 0pf+0w
11.336u 0.422s 0:16.43 71.5%    0+2k 179+346io 0pf+0w

'flex -l' mode, but with the modified scanner:
#endif 10.831u 0.428s 0:15.53 72.4%     0+2k 644+367io 0pf+0w
#endif 10.851u 0.278s 0:12.73 87.3%     0+2k 0+293io 0pf+0w

(That '#endif' in the output is from the scanning glitch I mentioned
earlier.)

'flex' without any flags (except those silently activated by '%array'
and %option I mentioned), with the modified scanner.
#endif 9.552u 0.403s 0:14.47 68.7%      0+2k 603+374io 0pf+0w
#endif 9.471u 0.318s 0:12.06 81.0%      0+2k 57+342io 0pf+0w

'flex -f' mode ('-F' and the other -C{F|f}... modes are similar):
#endif 6.254u 0.416s 0:11.81 56.3%      0+2k 699+351io 0pf+0w
#endif 6.282u 0.265s 0:07.50 87.2%      0+3k 0+281io 0pf+0w

I.e. roughly 45% reduction of execution time from first to last example.
This would bring the factor of 5 to 6 reported by Darryl down to about 3,
then. But currently, it comes at the cost of errors: the #endif glitch is
easily visible above. The other one may be a heisenbug: -f mode silently
forces -B mode, which caused errors on my Linux box at home, before. But
on the Alpha, it seems to function properly.

Hans-Bernhard Broeker (br...@ph...)
Even if all the snow were burnt, ashes would remain.


[Cscope-devel] A flex-able scanner.l

Efficient, text-only browser for C sources

[Cscope-devel] A flex-able scanner.l