[Flex-help] Python Lexical Analysis
From: Philip H. <her...@go...> - 2010-08-03 05:20:07
Hey guys,

I am writing a flex and bison implementation of the Python parser and have run into a problem. I think I may need to fall back to a handwritten lexer, but I would like the insight of more experienced flex users first.

The problem is the suite grammar, which is the grammar for a block, for example:

    def foo ( .. ) : <code_block>

I am not going to talk about the grammar as such; what I want to illustrate is the problem of figuring out indentation. The grammar for something like this is as follows (from http://docs.python.org/release/2.5.2/ref/grammar.txt):

    funcdef ::= [decorators] "def" funcname "(" [parameter_list] ")" ":" suite
    suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT

The problem is on the lexical side of things: figuring out what is an INDENT and what is a DEDENT. For this parser I am requiring 4 spaces per indent, because that's what emacs python-mode does for me, for now anyway.

Reading up on http://docs.python.org/reference/lexical_analysis.html, the indentation section uses a stack to track indentation levels. First, 0 is pushed onto the stack as a kind of initializer or baseline. Then, at the start of each new logical line, the line's indentation is compared with the top of the stack: if it is larger, it is pushed and an INDENT is generated; if it is smaller, it must match a level already on the stack, and every larger level is popped off with one DEDENT generated per pop. You get the idea, but that little paragraph is worth reading. All of this is to figure out when to generate a DEDENT token, which is the real crux of the problem.
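As a rough sketch of the stack bookkeeping described above, something like the following C could work. All names here (indent_stack, indent_update, INDENT_ERROR, MAX_DEPTH) are made up for illustration and come from neither flex nor CPython, and col is assumed to be the number of leading spaces on the new logical line.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum { MAX_DEPTH = 64 };          /* assumed nesting limit for this sketch */
#define INDENT_ERROR INT_MIN      /* hypothetical error sentinel */

struct indent_stack {
    int levels[MAX_DEPTH];        /* indentation columns, innermost last */
    size_t depth;                 /* number of entries in use */
};

/* Push the baseline level 0 before the first line is read. */
static void indent_init(struct indent_stack *s) {
    s->levels[0] = 0;
    s->depth = 1;
}

/*
 * Compare the indentation of a new logical line with the top of the stack.
 * Returns  1 if one INDENT token is due,
 *          0 if the level is unchanged,
 *         -n if n DEDENT tokens are due,
 *          INDENT_ERROR if the level matches nothing on the stack
 *          or the stack is full.
 */
static int indent_update(struct indent_stack *s, int col) {
    assert(s->depth >= 1);        /* the baseline entry is never popped */
    int top = s->levels[s->depth - 1];

    if (col == top)
        return 0;

    if (col > top) {
        if (s->depth == MAX_DEPTH)
            return INDENT_ERROR;
        s->levels[s->depth++] = col;
        return 1;
    }

    /* col < top: pop every larger level, counting one DEDENT per pop. */
    int popped = 0;
    while (s->depth > 1 && s->levels[s->depth - 1] > col) {
        s->depth--;
        popped++;
    }
    if (s->levels[s->depth - 1] != col)
        return INDENT_ERROR;      /* inconsistent dedent */
    return -popped;
}
```

A scanner would call indent_update once per logical line, after counting the leading whitespace, and turn the result into the matching number of INDENT or DEDENT tokens. Blank lines, tabs, and comment-only lines are not handled in this sketch.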
The problem I am having implementing this is that everything revolves around these two flex rules:

    "\n" { return NEWLINE; }
    " " { return INDENT; }

There is no lexeme in the file that we actually read for DEDENT, so my idea so far is either to write a handwritten lexer or to do something like:

    "\n" { vec_push( 0 ); return NEWLINE; }
    " " { vec_head->indent++; return INDENT; }

Then on a newline I can do some if checks to figure out whether there was a dedent. The problem is that I will need to do things like return DEDENT and then immediately afterwards return INDENT or NEWLINE, and C won't allow multiple returns in one code block ;) So I could make a general token stack holding what flex should actually return, but this all sounds very complicated, with lots of vector work. I think I could do it, but it's not the most pleasant of solutions. Maybe you guys have some insight.

--Phil
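One way to get around the "one return per action" limit is a small queue of pending tokens that is drained before the scanner is asked for more input. The C below is only a sketch of that idea under stated assumptions: the token codes, token_queue, next_token, and stub_scan are all hypothetical names, and stub_scan merely stands in for the flex-generated scanner.

```c
#include <assert.h>

enum { QUEUE_CAP = 64 };   /* assumed upper bound on queued tokens */

/* Hypothetical token codes; a real build would take these from the
 * header that bison generates. */
enum token { TOK_EOF = 0, NEWLINE = 258, INDENT, DEDENT, NAME };

/* Fixed-size ring buffer of tokens waiting to be handed to the parser. */
struct token_queue {
    enum token items[QUEUE_CAP];
    int head;              /* index of the oldest queued token */
    int count;             /* number of queued tokens */
};

static void queue_init(struct token_queue *q) {
    q->head = 0;
    q->count = 0;
}

/* Returns 0 on success, -1 if the queue is full. */
static int queue_push(struct token_queue *q, enum token t) {
    if (q->count == QUEUE_CAP)
        return -1;
    q->items[(q->head + q->count) % QUEUE_CAP] = t;
    q->count++;
    return 0;
}

static enum token queue_pop(struct token_queue *q) {
    assert(q->count > 0);
    enum token t = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return t;
}

typedef enum token (*scan_fn)(void *state);

/* What the parser would call: queued tokens first, then the real scanner. */
static enum token next_token(struct token_queue *q, scan_fn scan, void *state) {
    if (q->count > 0)
        return queue_pop(q);
    return scan(state);
}

/* Stand-in for the flex-generated scanner: NAME once, then end of input. */
static enum token stub_scan(void *state) {
    int *calls = state;
    return (*calls)++ == 0 ? NAME : TOK_EOF;
}
```

With this shape, a newline action can return NEWLINE directly and push however many DEDENT tokens are due; the following calls hand them out one at a time before any new input is scanned. In a flex build, the generated function can be given a different name through the YY_DECL macro so that a wrapper like next_token is what bison sees as yylex, though the exact wiring depends on the setup.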