Thread: [Flex-help] Python Lexical Analysis
flex is a tool for generating scanners
Brought to you by:
wlestes
From: Philip H. <her...@go...> - 2010-08-03 05:20:07
|
Hey guys I am writing a flex and bison implementation of the python parser, and coming up on a problem, thinking i may need to revert to a hand written lexer, but i would like the insight of more experienced flex guys. The problem being is the suite grammar which is the grammar for a block example: def foo ( .. ) : <code_block> I am not going to talk about grammar but what i want to illustrate is the problem in figuring out indentation the grammar for something like this is as follows: http://docs.python.org/release/2.5.2/ref/grammar.txt funcdef ::= [decorators] "def" funcname "(" [parameter_list] ")" ":" suite suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT The problem being is on the lexical side of things figuring out what is INDENT and DEDENT, so for this parser i am requiring 4 spaces for an indent because that's what emacs python-mode is doing for me for now anyways. So reading up on: http://docs.python.org/reference/lexical_analysis.html The indentation part they use a stack to figure out the indentation levels, so first off 0 is pushed onto a stack as a kind of initializer or baseline for the system, then if we find an indentation on a new logical line we push 1 onto the stack if we find multiple we need to check that level of indentation exists on the stack and so on you get the idea you need to read that little paragraph, this is all to figure out when to generate a DEDENT token which is the real crux of the problem. The problem i am having implementing this is really everything revolves arount these to flex rules: "\n" { return NEWLINE; } " " { return INDENT; } The problm being there is no lexical token that we actually read in the file for DEDENT, so my idea is so far either create a handwritten lexer or do somthing like: "\n" { vec_push( 0 ); return NEWLINE; } " " { vec_head->indent++; return INDENT; } Then with newline i can do some if checks to figure out if there was a dedent, but the problem is i will need things like return DEDENT then immediately after return INDENT or NEWLINE, and C wont allow multiple returns in one code block ;) So then i could then make a general token stack for what to actually return in flex but this all sounds very complicated with lots of vector work which i think i could do but its not the most pleasant of solutions maybe you guys would have some insight. --Phil |
From: Marcel L. <ma...@la...> - 2010-08-03 06:36:24
|
Can't you have a global stack in your scanner to keep track of the indentation levels? Something like: [ \t]+ { size_t top_len = yyextra->indent_stack.top(); if (top_len < yyleng) { yyextra->indent_stack.push(yyleng); return INDENT; } else if (top_len > yyleng) { yyextra->indent_stack.pop(); return DEDENT; } } You will need to add proper start conditions to only let this fire after you return a NEWLINE. Additionally you need to handle cases like this: foo bar etc var I think after scanning "etc" it should return DEDENT, DEDENT, INDENT; but the sample lex example above won't handle that correctly. You should be able to pull this off with clever start conditions, or YY_USER_ACTION however. On Tue, 3 Aug 2010 06:19:59 +0100, Philip Herron <her...@go...> wrote: > Hey guys > > I am writing a flex and bison implementation of the python parser, and > coming up on a problem, thinking i may need to revert to a hand > written lexer, but i would like the insight of more experienced flex > guys. > > The problem being is the suite grammar which is the grammar for a block > example: > > def foo ( .. ) : > <code_block> > > I am not going to talk about grammar but what i want to illustrate is > the problem in figuring out indentation the grammar for something like > this is as follows: > > http://docs.python.org/release/2.5.2/ref/grammar.txt > > funcdef ::= > [decorators] "def" funcname "(" [parameter_list] ")" > ":" suite > > suite ::= > stmt_list NEWLINE > | NEWLINE INDENT statement+ DEDENT > > The problem being is on the lexical side of things figuring out what > is INDENT and DEDENT, so for this parser i am requiring 4 spaces for > an indent because that's what emacs python-mode is doing for me for > now anyways. So reading up on: > > http://docs.python.org/reference/lexical_analysis.html > > The indentation part they use a stack to figure out the indentation > levels, so first off 0 is pushed onto a stack as a kind of initializer > or baseline for the system, then if we find an indentation on a new > logical line we push 1 onto the stack if we find multiple we need to > check that level of indentation exists on the stack and so on you get > the idea you need to read that little paragraph, this is all to figure > out when to generate a DEDENT token which is the real crux of the > problem. > > The problem i am having implementing this is really everything > revolves arount these to flex rules: > > "\n" { return NEWLINE; } > " " { return INDENT; } > > The problm being there is no lexical token that we actually read in > the file for DEDENT, so my idea is so far either create a handwritten > lexer or do somthing like: > > "\n" { vec_push( 0 ); return NEWLINE; } > " " { vec_head->indent++; return INDENT; } > > Then with newline i can do some if checks to figure out if there was a > dedent, but the problem is i will need things like return DEDENT then > immediately after return INDENT or NEWLINE, and C wont allow multiple > returns in one code block ;) > > So then i could then make a general token stack for what to actually > return in flex but this all sounds very complicated with lots of > vector work which i think i could do but its not the most pleasant of > solutions maybe you guys would have some insight. > > --Phil > > ------------------------------------------------------------------------------ > The Palm PDK Hot Apps Program offers developers who use the > Plug-In Development Kit to bring their C/C++ apps to Palm for a share > of $1 Million in cash or HP Products. Visit us here for more details: > http://p.sf.net/sfu/dev2dev-palm > _______________________________________________ > Flex-help mailing list > Fle...@li... > https://lists.sourceforge.net/lists/listinfo/flex-help |
From: Daniel Varga-H. <var...@ov...> - 2010-08-03 12:09:49
|
I am not, by all means, an expert on Flex and without giving much thought to it, I would possibly check every line how many time it's been indented. Whenever it is indented I would enter a start condition and when that happens increment a global, like YY_EXTRA. Something like: <ST_INDENT> " " { YY_EXTRA++; BEGIN(ST_INDENT); } This way YY_EXTRA would contain the number of indentions and whenever your LEXER returns a token to BISON, you can examine the YY_EXTRA's value in the grammar file to see the level of indention. This is quite barbaric I suppose and might not even work. It's been like a year since I last used Flex, but I hope I could give you another perspective even if the practical things aren't accurate as they are now. As well I have found some cookies for you :) http://docs.python.org/release/2.5.2/ref/grammar.txt Regards, Daniel On Tue, Aug 3, 2010 at 8:16 AM, Marcel Laverdet <ma...@la...> wrote: > > > Can't you have a global stack in your scanner to keep track of the > indentation levels? Something like: > > [ \t]+ { > size_t top_len = yyextra->indent_stack.top(); > if (top_len < yyleng) { > yyextra->indent_stack.push(yyleng); > return INDENT; > } else if (top_len > yyleng) { > yyextra->indent_stack.pop(); > return DEDENT; > } > } > > You will need to add proper start conditions to only let this fire after > you return a NEWLINE. Additionally you need to handle cases like this: > > foo > bar > etc > var > > I think after scanning "etc" it should return DEDENT, DEDENT, INDENT; but > the sample lex example above won't handle that correctly. You should be > able to pull this off with clever start conditions, or YY_USER_ACTION > however. > > On Tue, 3 Aug 2010 06:19:59 +0100, Philip Herron > <her...@go...> wrote: >> Hey guys >> >> I am writing a flex and bison implementation of the python parser, and >> coming up on a problem, thinking i may need to revert to a hand >> written lexer, but i would like the insight of more experienced flex >> guys. >> >> The problem being is the suite grammar which is the grammar for a block >> example: >> >> def foo ( .. ) : >> <code_block> >> >> I am not going to talk about grammar but what i want to illustrate is >> the problem in figuring out indentation the grammar for something like >> this is as follows: >> >> http://docs.python.org/release/2.5.2/ref/grammar.txt >> >> funcdef ::= >> [decorators] "def" funcname "(" [parameter_list] ")" >> ":" suite >> >> suite ::= >> stmt_list NEWLINE >> | NEWLINE INDENT statement+ DEDENT >> >> The problem being is on the lexical side of things figuring out what >> is INDENT and DEDENT, so for this parser i am requiring 4 spaces for >> an indent because that's what emacs python-mode is doing for me for >> now anyways. So reading up on: >> >> http://docs.python.org/reference/lexical_analysis.html >> >> The indentation part they use a stack to figure out the indentation >> levels, so first off 0 is pushed onto a stack as a kind of initializer >> or baseline for the system, then if we find an indentation on a new >> logical line we push 1 onto the stack if we find multiple we need to >> check that level of indentation exists on the stack and so on you get >> the idea you need to read that little paragraph, this is all to figure >> out when to generate a DEDENT token which is the real crux of the >> problem. >> >> The problem i am having implementing this is really everything >> revolves arount these to flex rules: >> >> "\n" { return NEWLINE; } >> " " { return INDENT; } >> >> The problm being there is no lexical token that we actually read in >> the file for DEDENT, so my idea is so far either create a handwritten >> lexer or do somthing like: >> >> "\n" { vec_push( 0 ); return NEWLINE; } >> " " { vec_head->indent++; return INDENT; } >> >> Then with newline i can do some if checks to figure out if there was a >> dedent, but the problem is i will need things like return DEDENT then >> immediately after return INDENT or NEWLINE, and C wont allow multiple >> returns in one code block ;) >> >> So then i could then make a general token stack for what to actually >> return in flex but this all sounds very complicated with lots of >> vector work which i think i could do but its not the most pleasant of >> solutions maybe you guys would have some insight. >> >> --Phil >> >> > ------------------------------------------------------------------------------ >> The Palm PDK Hot Apps Program offers developers who use the >> Plug-In Development Kit to bring their C/C++ apps to Palm for a share >> of $1 Million in cash or HP Products. Visit us here for more details: >> http://p.sf.net/sfu/dev2dev-palm >> _______________________________________________ >> Flex-help mailing list >> Fle...@li... >> https://lists.sourceforge.net/lists/listinfo/flex-help > > ------------------------------------------------------------------------------ > The Palm PDK Hot Apps Program offers developers who use the > Plug-In Development Kit to bring their C/C++ apps to Palm for a share > of $1 Million in cash or HP Products. Visit us here for more details: > http://p.sf.net/sfu/dev2dev-palm > _______________________________________________ > Flex-help mailing list > Fle...@li... > https://lists.sourceforge.net/lists/listinfo/flex-help > |
From: Philip H. <her...@go...> - 2010-08-03 13:40:49
|
On 3 August 2010 07:16, Marcel Laverdet <ma...@la...> wrote: > > > Can't you have a global stack in your scanner to keep track of the > indentation levels? Something like: > > [ \t]+ { > size_t top_len = yyextra->indent_stack.top(); > if (top_len < yyleng) { > yyextra->indent_stack.push(yyleng); > return INDENT; > } else if (top_len > yyleng) { > yyextra->indent_stack.pop(); > return DEDENT; > } > } > Ah that makes more sense i just was having problems seeing how it could look within flex and i thought it could be too messy even for a first pass basic setup. > You will need to add proper start conditions to only let this fire after > you return a NEWLINE. Additionally you need to handle cases like this: > > foo > bar > etc > var Yeah handling the case after etc could be tricky but I'll play with this now and let you guys know. > > I think after scanning "etc" it should return DEDENT, DEDENT, INDENT; but > the sample lex example above won't handle that correctly. You should be > able to pull this off with clever start conditions, or YY_USER_ACTION > however. I have never needed to use any big flex trickery like YY_USER_ACTION before but i will read up on the manual for it now. Some people have expressed interest in the lexer and parser code already and its available over at http://code.redbrain.co.uk/cgit.cgi/gcc-dev/ in the Python branch. I will however extract out the parser code and post it to comp.compilers and host it on my blog for people to do whatever they want with it when its ready. --Phil |
From: Philip H. <red...@gc...> - 2010-08-10 19:54:56
|
Hey guys I've actually got a lot of stuff going at the moment now, but one problem is, say i am using a rule like: "\n" { return NEWLINE; } " "+ { int top_len = VEC_index( gpy_int, gpy_indent_stack, VEC_length( gpy_int, gpy_indent_stack )-1 ); if( top_len <= yyleng ) { if( top_len != yyleng ) VEC_safe_push( gpy_int, gc, gpy_indent_stack, yyleng ); return INDENT; } else if( top_len > yyleng ) { VEC_pop( gpy_int, gpy_indent_stack ); return DEDENT; } } <<EOF>> { int top_len = VEC_index( gpy_int, gpy_indent_stack, VEC_length( gpy_int, gpy_indent_stack )-1 ); if( top_len != 0 ) { VEC_pop( gpy_int, gpy_indent_stack ); return DEDENT; } else return 0; } So far thats what i do but its not right since, i need some mechanism for counting the number of " " rules are matched since yytext will return the text of the single rule. There is probably some simple way of doing that this works fine for simple things like: x = 2 y = x def foo( .. ): ... ... .. But fails on more indentation, any suggestions will be great when the parsing code is more concrete i will put up a separate a flex and bison parser code to simply parse python code so you can do what ever you want with it :) --Phil |
From: Philip H. <red...@gc...> - 2010-08-10 21:34:10
|
Hey Ignore my previous post now using: "\n" { return NEWLINE; } " "+ { int top_len = VEC_index( gpy_int, gpy_indent_stack, VEC_length( gpy_int, gpy_indent_stack )-1 ); int indent = yyleng/4; if( top_len <= indent ) { if( top_len != indent ) { VEC_safe_push( gpy_int, gc, gpy_indent_stack, indent ); } return INDENT; } else if( top_len > indent ) { VEC_pop( gpy_int, gpy_indent_stack ); return DEDENT; } } <<EOF>> { int top_len = VEC_index( gpy_int, gpy_indent_stack, VEC_length( gpy_int, gpy_indent_stack )-1 ); if( top_len != 0 ) { VEC_pop( gpy_int, gpy_indent_stack ); return DEDENT; } else return 0; } Didnt realise yyleng records the entire length of matched rules when using <something>+ it recorded the entire length of it so all i needed to do was divide. --Phil |