Thread: [Flex-help] Python Lexical Analysis

Hey guys

I am writing a flex and bison implementation of the Python parser and
have run into a problem. I think I may need to fall back to a
hand-written lexer, but I would like the insight of more experienced
flex guys first.

The problem is the suite grammar, which is the grammar for a block. For example:

def foo ( .. ) :
 <code_block>

I am not going to talk about the grammar itself; what I want to
illustrate is the problem of figuring out indentation. The grammar for
something like this is as follows:

http://docs.python.org/release/2.5.2/ref/grammar.txt

funcdef ::=
             [decorators] "def" funcname "(" [parameter_list] ")"
              ":" suite

suite ::=
             stmt_list NEWLINE
              | NEWLINE INDENT statement+ DEDENT

The problem is on the lexical side of things: figuring out what counts
as INDENT and DEDENT. For this parser I am requiring 4 spaces per
indent, because that's what emacs python-mode is doing for me, for now
anyway. So, reading up on:

http://docs.python.org/reference/lexical_analysis.html

For the indentation part they use a stack to figure out the indentation
levels. First, 0 is pushed onto the stack as a kind of initializer or
baseline. Then, at the start of each new logical line, the line's
indentation is compared with the top of the stack. If it is larger, it
is pushed and one INDENT is generated. If it is smaller, it must match
a level already on the stack, and every larger level is popped, with
one DEDENT generated per pop. You get the idea, but you need to read
that little paragraph. This is all to figure out when to generate a
DEDENT token, which is the real crux of the problem.
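
To make the stack idea concrete, here is a rough sketch in plain C of
how I read that paragraph. The names (indent_stack, indent_update,
MAX_DEPTH) are made up for illustration and are not part of flex or of
Python; the caller is assumed to have already measured the width of the
leading whitespace on the new logical line.

```c
#include <assert.h>

#define MAX_DEPTH 64            /* arbitrary limit for this sketch */

/* Indentation stack; slot 0 holds the baseline level 0. */
static int indent_stack[MAX_DEPTH] = { 0 };
static int indent_top = 0;

/*
 * Feed the indentation width of a new logical line.
 * Returns +1 if one INDENT should be generated,
 *         -n if n DEDENTs should be generated,
 *          0 if the level is unchanged.
 * *error is set to 1 when a dedent does not land on a level
 * that is already on the stack (inconsistent dedent).
 */
static int indent_update(int width, int *error)
{
    int dedents = 0;

    assert(width >= 0);
    *error = 0;

    if (width > indent_stack[indent_top]) {
        /* Deeper than the current block: push the new level. */
        assert(indent_top + 1 < MAX_DEPTH);
        indent_stack[++indent_top] = width;
        return 1;
    }

    /* Shallower: pop every larger level, one DEDENT per pop.
     * The baseline 0 is never popped because width >= 0. */
    while (width < indent_stack[indent_top]) {
        indent_top--;
        dedents++;
    }

    if (width != indent_stack[indent_top])
        *error = 1;

    return -dedents;
}
```

With 4-space indents, feeding the widths 4, 8, 8, 0 in turn would give
+1, +1, 0 and then -2, i.e. two DEDENTs for the single line that drops
back to column 0.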

The problem I am having implementing this is that really everything
revolves around these two flex rules:

"\n"                    { return NEWLINE; }
"    "                  { return INDENT; }

The problem is that there is no lexical token that we actually read in
the file for DEDENT, so my idea so far is to either create a
hand-written lexer or do something like:

"\n"                    { vec_push( 0 ); return NEWLINE; }
"    "                  { vec_head->indent++; return INDENT; }

Then on a newline I can do some if checks to figure out whether there
was a dedent. The problem is that I will need to do things like return
DEDENT and then, immediately after, return INDENT or NEWLINE, and a
flex action can only return one token per call, since C won't allow
multiple returns in one code block ;)

So I could then make a general token stack holding what to actually
return from flex, but this all sounds very complicated, with lots of
vector work. I think I could do it, but it's not the most pleasant of
solutions. Maybe you guys have some insight.
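
For what it's worth, the buffering idea might look roughly like the
sketch below. It uses a small FIFO queue rather than a stack, so that
tokens come out in the order they were queued. Everything here is
hypothetical: the token codes, queue_token(), next_token() and the
demo_scan() stand-in for the real scanner are invented for
illustration, and none of it is flex API.

```c
#include <assert.h>

/* Hypothetical token codes; bison would normally supply these. */
enum { TOK_EOF = 0, TOK_NEWLINE, TOK_INDENT, TOK_DEDENT, TOK_NAME };

#define QUEUE_CAP 64            /* arbitrary limit for this sketch */

/* Ring buffer of tokens waiting to be handed to the parser. */
static int queue[QUEUE_CAP];
static int q_head = 0, q_tail = 0, q_len = 0;

/* Called by the scanner to defer a token to a later call. */
static void queue_token(int tok)
{
    assert(q_len < QUEUE_CAP);
    queue[q_tail] = tok;
    q_tail = (q_tail + 1) % QUEUE_CAP;
    q_len++;
}

/* What the parser calls: drain queued tokens before scanning again. */
static int next_token(int (*raw_scan)(void))
{
    if (q_len > 0) {
        int tok = queue[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_len--;
        return tok;
    }
    return raw_scan();
}

/* Stand-in scanner: the first line ends two blocks at once, so it
 * returns NEWLINE and queues two DEDENTs for the following calls. */
static int demo_calls = 0;
static int demo_scan(void)
{
    switch (demo_calls++) {
    case 0:
        queue_token(TOK_DEDENT);
        queue_token(TOK_DEDENT);
        return TOK_NEWLINE;
    case 1:
        return TOK_NAME;
    default:
        return TOK_EOF;
    }
}
```

Calling next_token(demo_scan) repeatedly would yield NEWLINE, DEDENT,
DEDENT, NAME and then EOF. In a real flex scanner the same "drain the
queue first" check could probably live in the code placed before the
first rule in the rules section, since flex runs that code each time
yylex() is entered, but I have not tried that here.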

--Phil
