From: Duncan C. <dun...@wo...> - 2005-07-23 00:45:54
On Tue, 2005-07-19 at 12:39 +1000, Manuel M T Chakravarty wrote:
> > > I know that the gtk2hs guys built a new parser/lexer, and it appears
> > > indeed to be a parse error. C
> >
> > Yes, neither the old nor the new C lexers understand the __asm symbol.
> >
> > This is something we obviously need to fix. I'll look into the gcc __asm
> > feature as soon as I have a free moment. Any fix will appear in Gtk2Hs
> > 0.9.8.1 and patches will be sent to the mainline c2hs.
>
> Thanks, Duncan.

So I've got something working.

I've added the asm/__asm/__asm__ keyword to the lexer. Simple.

I've referred to the gcc GNU C grammar. Apparently there can be asm
top level (translation unit level) things. I've extended the grammar to
accept and ignore these:

 translation_unit :: { [CExtDecl] }
 translation_unit
   : {- empty -}                                 { [] }
   | translation_unit external_declaration       { $2 : $1 }
+  | translation_unit asm '(' expression ')' ';' { $1 }

I'm not sure what this sort of thing actually does; it might be a good
idea to at least record that the thing is there, even if we don't record
the whole thing. (The expression will typically be a constant string but
it can also be a slightly more complex expression that evaluates to a
constant string. gcc validates the thing after parsing.)

Then there are the asm statements:

 -- parse C statement (K&R A9)
 --
 statement :: { CStat }
 statement
   : labeled_statement     { $1 }
   | compound_statement    { $1 }
   | expression_statement  { $1 }
   | selection_statement   { $1 }
   | iteration_statement   { $1 }
   | jump_statement        { $1 }
+  | asm_statement         { $1 }

These come in a few different forms (including gcc extended asm syntax).
For these asm statements I do record a placeholder rather than just
ignoring the thing entirely:

   | CReturn (Maybe CExpr) Attrs
+  | CAsm Attrs    -- a chunk of assembly code (which is
+                  -- not itself recorded)

I think it's less confusing this way (and it'll be easy to extend to
record the full details of the asm statement if required).
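To make the two cases above concrete, here is a minimal sketch of the kind of input involved, assuming GCC or Clang. The function names are made up for illustration, and the asm templates are left empty so that nothing depends on the target architecture:

```c
/* Assumes GCC or Clang; basic_asm and extended_asm are made-up names. */

/* Translation-unit-level asm: the kind of thing the new translation_unit
   rule accepts and ignores. */
__asm__("");

/* Basic asm statement: just a template string. */
void basic_asm(void)
{
    __asm__("");
}

/* Extended asm statement: template : outputs : inputs : clobbers.
   With an empty template this emits no instructions; it only tells the
   compiler that x and memory may have been touched. */
int extended_asm(int x)
{
    __asm__ __volatile__("" : "+r"(x) : : "memory");
    return x;
}
```

With the grammar changes, the first form is dropped at the translation-unit level, while each of the two statements would show up as a CAsm placeholder.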

Is there any code that needs changing due to adding another constructor
to the CStat type?

When I was testing this change it exposed an infidelity in the current
(and previous) lexer/parser. Apparently in C (I'd not seen this feature
before) a sequence of string literals is concatenated to become a single
string literal. I bumped into this because asm chunks are often written:

  asm ( "first line\n\t"
        "second line\n\t" )

gcc does this string literal concatenation in between the lexer and
parser. It's tricky to do in the lexer because you're allowed white space
and comments between the strings to be concatenated. I decided to do it
in the parser rather than an intermediate layer because doing it in the
parser does not require token lookahead/pushback in the lexer (which the
current lexer does not have).

So I've done it like this:

 literal_expression
   : cint   {% withAttrs $1 $ case $1 of CTokILit _ i -> CIntConst i }
   | cchar  {% withAttrs $1 $ case $1 of CTokCLit _ c -> CCharConst c }
   | cfloat {% withAttrs $1 $ case $1 of CTokFLit _ f -> CFloatConst f }
-  | cstr   {% withAttrs $1 $ case $1 of CTokSLit _ s -> CStrConst s }
+  | string {% withAttrs $1 $ CStrConst (unL $1) }
+
+-- deal with C string literal concatenation
+--
+string :: { Located String }
+string
+  : cstr         { case $1 of CTokSLit _ s -> L s (posOf $1) }
+  | cstr string_ { case $1 of CTokSLit _ s ->
+                     let s' = concat (s : reverse $2)
+                     in L s' (posOf $1) }
+
+string_ :: { [String] }
+string_
+  : cstr         { case $1 of CTokSLit _ s -> [s] }
+  | string_ cstr { case $2 of CTokSLit _ s -> s : $1 }

So it's (hopefully) optimised for the case of a single string.

Finally we have the case of asm decorators on function/var
declaration/initialisation.

 -- parse C init declarator (K&R A8)
 --
 init_declarator :: { (CDeclr, Maybe CInit) }
 init_declarator
-  : declarator                           { ($1, Nothing) }
-  | declarator '=' initializer           { ($1, Just $3) }
+  : declarator maybe_asm                 { ($1, Nothing) }
+  | declarator maybe_asm '=' initializer { ($1, Just $4) }
+
+maybe_asm :: { () }
+maybe_asm
+  : {- empty -}          { () }
+  | asm '(' string ')'   { () }

So altogether these changes allow the following sort of thing to be
parsed (from the MacOS X 10.4 libc headers
http://www.mattcox.ca/misc/gtk.i ):

  int fprintf(FILE * , const char * , ...) __asm("_" "fprintf" "$LDBLStub");

Manuel, if you think this is an ok approach I'll send the darcs patch,
otherwise I await your suggestions.

Duncan
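
P.S. For anyone who has not met the two C features that the fprintf declaration combines, here is a minimal sketch, assuming GCC or Clang. The names demo_add, joined and joined_matches, and the label "demo_add_impl", are made up for illustration and do not come from any real header:

```c
#include <string.h>

/* Adjacent string literals are joined into a single literal during
   translation; white space and comments between the pieces are allowed. */
static const char joined[] = "first line\n\t"   /* a comment in between */
                             "second line\n\t";

/* An asm label on a declaration chooses the symbol name used for the
   function in the object file. The label is itself written as adjacent
   literals, as in the fprintf declaration. */
int demo_add(int a, int b) __asm__("demo_" "add" "_impl");

int demo_add(int a, int b)
{
    return a + b;
}

/* Returns 1 when the concatenated literal equals the same text written
   as one literal. */
int joined_matches(void)
{
    return strcmp(joined, "first line\n\tsecond line\n\t") == 0;
}
```

Callers still write demo_add(2, 3); only the linker-level name changes, which is why the parser can accept the label in maybe_asm and discard it.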