From: Duncan C. <dun...@wo...> - 2005-05-30 18:17:05
On Thu, 2005-05-26 at 16:46 +1000, Manuel M T Chakravarty wrote:
> Duncan Coutts:
> > On Wed, 2005-05-25 at 11:44 +1000, Manuel M T Chakravarty wrote:
> > >
> > > An Alex/Happy parser would be an option if it improves matters
> > > significantly. If you or anybody else has a go at it, please
> > > follow the C language definition closely, as does the existing
> > > lexer and parser (with comments stating to which sections in K&R
> > > the individual productions relate).
> >
> > Interim status update:

Another status update: I started with a yacc C grammar and added the
semantic actions by translating them from the existing parser. I
reordered the clauses to match the order in the original parser and
kept the comments (but removed the bits that are no longer true). I
think I've added back in all the syntactic extensions supported by
c2hs. There is one shift/reduce conflict, to do with the "if then
else" syntax, which I believe is benign. It's not quite finished yet
(in particular I've not done allocation of attributes), but it does
already parse the Gtk+ headers.

It is quite quick and, importantly, uses very little heap space. On
my old 500MHz SPARC, parsing the Gtk+ headers takes 8 seconds and
requires 28MB of heap. There appears to be essentially no allocation
apart from that used by the AST, since the parser does not get slower
when the heap size is only very slightly larger than the minimum
required: depending on the heap limit it either runs at full speed or
runs out of heap.

Going back to the lexer, it now produces exactly the same output as
the original lexer (including positions and unique names). Sadly it
seems to have got quite a bit slower, for reasons I don't quite
understand. In particular, making it monadic (which we need to do)
seems to make it rather slower. It is now taking 6 seconds rather
than 2, and so is now only a little faster than the original lexer.
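As an aside, by a combined lexer/parser monad I mean roughly the
following sketch. All the names here are my own invention, loosely
modelled on the Happy documentation's monadic examples and GHC's P
monad; the real c2hs code differs.

```haskell
import qualified Data.Set as Set

-- Hypothetical shared state for lexer and parser: the remaining
-- input, plus the names introduced by typedef so far.
data PState = PState
  { psInput    :: String
  , psTypedefs :: Set.Set String
  }

-- A simple state-plus-failure monad. Both the Alex lexer and the
-- Happy parser would run in this same monad, so the lexer can see
-- typedef names that the parser records.
newtype P a = P { runP :: PState -> Either String (a, PState) }

instance Functor P where
  fmap f (P g) = P $ \s -> fmap (\(a, s') -> (f a, s')) (g s)

instance Applicative P where
  pure a = P $ \s -> Right (a, s)
  P f <*> P g = P $ \s -> do
    (h, s')  <- f s
    (a, s'') <- g s'
    Right (h a, s'')

instance Monad P where
  P g >>= k = P $ \s -> do
    (a, s') <- g s
    runP (k a) s'

getState :: P PState
getState = P $ \s -> Right (s, s)

setState :: PState -> P ()
setState s = P $ \_ -> Right ((), s)

failP :: String -> P a
failP msg = P $ \_ -> Left msg
```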
Though on the positive side, if the lexer is taking 6 out of the
8-second total then the parser is only taking 2 seconds, which is
quite good.

One important speedup was to change the identifier vs. reserved-word
lookup so that it does not use a linear search over the list of
reserved-word strings. That cut overall parsing time down from 10
seconds to 8.

I've slightly generalised the grammar for GNU C __attribute__s after
looking at the GCC manual.

So the remaining things to be done before posting the stuff for
review are to get the attribute allocation going and to check that
the AST produced by this parser is exactly the same as for the
existing parser. Then: integrate this parser into c2hs and see what
the overall memory requirements turn out to be, taking the name
analysis phase into account.

> > > Moreover, the module c2hs/base/syntax/ParserMonad.hs already
> > > provides a monad suitable for Happy.
> >
> > I'll take a look.

Unfortunately, since the monad used by the parser has to be the same
as the one used by the lexer, the existing parser monad is not
suitable. I've based another one on the Happy monadic-parser example
and the GHC lexer/parser monad.

> > As for the lexer/parser interaction required for C, I guess the
> > way to do this is to make the lexer monad keep the set of
> > 'typedefed' identifiers and return the appropriate token type to
> > the parser depending on membership of that set. The lexer monad
> > should also expose an operation to the parser to add identifiers
> > to the 'typedefed' set, which the parser would use after parsing
> > the appropriate kind of declaration.
>
> Sounds right to me.

I got this scheme working after encountering one interesting problem
along the way...
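A minimal sketch of that scheme (hypothetical names, not the actual
c2hs code): the parser adds names to the typedef set after reducing a
typedef declaration, and the lexer classifies each identifier against
the reserved-word table and the current typedef set. The reserved
words live in a Data.Map rather than a list, which is the lookup
speedup mentioned above.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Hypothetical token type; the real c2hs tokens differ.
data Token = KeywordTok String   -- reserved word
           | TypeNameTok String  -- identifier known to be typedef'ed
           | IdentTok String     -- ordinary identifier
  deriving (Eq, Show)

-- Reserved words in a Map, giving O(log n) lookup instead of a
-- linear search over a list of strings.
reserved :: Map.Map String ()
reserved = Map.fromList [ (kw, ()) | kw <-
  [ "auto","break","case","char","const","continue","default","do"
  , "double","else","enum","extern","float","for","goto","if","int"
  , "long","register","return","short","signed","sizeof","static"
  , "struct","switch","typedef","union","unsigned","void","volatile"
  , "while" ] ]

-- The parser would call this (via the shared monad) after reducing
-- a typedef declaration.
addTypedefName :: String -> Set.Set String -> Set.Set String
addTypedefName = Set.insert

-- The lexer classifies an identifier against the reserved-word
-- table and the set of typedef'ed names.
classify :: Set.Set String -> String -> Token
classify tdefs s
  | Map.member s reserved = KeywordTok s
  | Set.member s tdefs    = TypeNameTok s
  | otherwise             = IdentTok s
```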
I had to refactor the C grammar slightly, moving the reduction rule
for typedefs (which adds the typedef'ed names to the lexer/parser
monad state) below the ';' terminal in declarations, so that
constructs like the following work:

  typedef int foo;
  foo bar (int baz);

Otherwise the 'foo' token on the second line is not recognised as a
type name, since the reduction rule for the previous declaration is
not performed until the 'foo' token has already been seen. With the
slightly refactored grammar/semantic actions, the reduction rule is
invoked as soon as the ';' token is encountered, and so it all works.

(With thanks to Heffalump and skew on #haskell for help with this
issue.)

Duncan
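P.S. For anyone curious, here is a very rough, hypothetical Happy
sketch of the idea (not the real c2hs grammar; the production names,
`addTypedefName`, and the helper functions are all made up). The
point is that the monadic action recording the typedef names hangs
off a production that already contains the ';', so it runs as soon as
that ';' is reduced and before the lexer is asked for the next token:

```
declaration :: { CDecl }
  : declaration_specifiers init_declarator_list ';'
      {% do when (containsTypedef $1) $
              mapM_ addTypedefName (declaredIdents $2)
            return (CDecl $1 $2) }
```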