
User Extension of the C/Cpp Grammar

2020-11-21
2020-12-03
  • Eckard Klotz

    Eckard Klotz - 2020-11-21

    Dear Users.

    This topic continues the discussion of bug #10 (preprocessor directive issue) reported by Wasilios Goutas.

    In his code some preprocessor definitions are used to insert compiler-specific attributes into data declarations and data definitions. The preprocessor definitions are used instead of the real attribute syntax to clean up the code. But since Moritz has no preprocessor that replaces the preprocessor definitions with the real attribute syntax, the parser struggles with the unexpected tokens.

    It is not possible to extend the parser grammar with a standard solution that works for all thinkable cases. Thus I now try to provide a possibility of an extension by the user.

    For this I attach here 9 files.

    5 files have to be inserted into the folder "*[MoritzRoot]\LangPack\ansi_c\a2x*"

    • abc2xml_Process_ansi_c_only.xml
    • abc2xml_Process_ansi_cpp.xml
    • ANSI_C_Source_C_only_grm.a2x
    • ANSI_C_Source_Cpp_grm.a2x
    • UserGrammar.a2x

    2 files have to be inserted into the folder "*[MoritzRoot]\LangPack\ansi_c\x2a_nsd*"

    • nsd_ansi_c_cfg.x2a
    • nsd_ansi_c_mrtz_cmd.x2a

    2 files have to be inserted into the folder "*[MoritzRoot]\LangPack\ansi_c\x2a_uad*"

    • uad_ansi_c_cfg.x2a
    • uad_ansi_c_mrtz_cmd.x2a

    To solve the issue from Wasilios Goutas I have extended the syntax-rule declaration_specifier by an optional detection of the new syntax-rule userAttribute. The new grammar-file UserGrammar.a2x contains this rule so that it is easier for the user to access.

    Currently it looks like this:

    userAttribute =
    ("__" >> "VARIABLE_PREPROCESSOR")
    | ("__" >> "FUNC_RET_PREPROCESSOR")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR");

    The syntax rule specifies a collection of alternatives separated by the bit-wise or operator (|).
    Each alternative preprocessor identifier is split into a sequence of 2 strings joined by the right-shift operator (>>). This is necessary to prevent problems while detecting the associated preprocessor definitions like these:

    #define __VARIABLE_PREPROCESSOR
    #define __FUNC_RET_PREPROCESSOR
    #define __FUNC_PARAM_PREPROCESSOR

    If we defined the alternatives with monolithic strings instead of string sequences, these define lines would not be detectable.
    Currently the syntax-rule userAttribute contains just dummies; the user now has to insert the tokens he has defined as preprocessor definitions.
    The number of alternatives may be more or fewer than 3. Don't forget the semicolon at the end of the syntax-rule.

    Please try it out and report your results and/or doubts.

    Stay well and healthy,
    Eckard Klotz.

     

    Last edit: Eckard Klotz 2020-12-30
  • Wasilios Goutas

    Wasilios Goutas - 2020-11-24

    Hello Eckard,

    thank you for providing the possibility for user grammar, and sorry for answering that late.
    Meanwhile I was able to run first tests.
    I merged the changes of the above files with the changes you provided based on issue
    #9 (unnamed fields in struct/union not handled).
    I realized that the user grammar added in the file UserGrammar.a2x is also part of the file ANSI_C_Source_C_only_grm.a2x. I expect that it was added while doing the proof of concept and forgotten to be removed.
    I removed it and the reproducer sources were handled correctly :)

    When using it on the project I try to use Moritz on I still got issues, and I identified that it was related to 'const' statements used in combination with the user grammar.
    Code like this worked:
    static __FUNC_RET_PREPROCESSOR int test0( __FUNC_PARAM_PREPROCESSOR int param1,
    __FUNC_PARAM_PREPROCESSOR int param2,
    __FUNC_PARAM_PREPROCESSOR int param3,
    __FUNC_PARAM_PREPROCESSOR int param4);

    whereas code like this failed:
    static __FUNC_RET_PREPROCESSOR int test2( __FUNC_PARAM_PREPROCESSOR const int param1,
    __FUNC_PARAM_PREPROCESSOR const int param2,
    __FUNC_PARAM_PREPROCESSOR const int param3,
    __FUNC_PARAM_PREPROCESSOR const int param4);

    I tried to adapt the UserGrammar to handle the const expression as well and failed while trying these rules:
    /* not working
    | ("__VARIABLE_PREPROCESSOR" >> "const")
    | ("__FUNC_PARAM_PREPROCESSOR" >> "const")
    | ("__FUNC_PARAM_PREPROCESSOR" >> "const");
    */
    but I succeeded with these rules:

    | ("__" >> "VARIABLE_PREPROCESSOR const")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR const")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR const")

    Unfortunately this will not handle expressions where the white-space between the special user grammar and the const statement differs from one space.
    Maybe you can give me a hint on how to handle multiple spaces, tabs, newlines, or if there is another way of handling const user grammar.
    BR
    Wasili

     
  • Wasilios Goutas

    Wasilios Goutas - 2020-11-24

    the strange thing is that meanwhile the 'const' expression is not working any more. I think I need to start reviewing my changes to identify why it worked earlier today.

     
  • Wasilios Goutas

    Wasilios Goutas - 2020-11-24

    Unfortunately I'm not able to get positive results when having a user grammar followed by a const. I expect it never worked so far and I was just wrong in my statement earlier.
    My grammar meanwhile looks as follows:
    userAttribute =
    ("__" >> "VARIABLE_PREPROCESSOR")
    | ("__" >> "FUNC_RET_PREPROCESSOR")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR" >> " const")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR" >> *" " >> "const")
    ;

    and the code I'm testing it with is this:
    static __FUNC_RET_PREPROCESSOR int test0( __FUNC_PARAM_PREPROCESSOR int param1 );
    const static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_t test1( __FUNC_PARAM_PREPROCESSOR int param1 );
    static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_t test2( __FUNC_PARAM_PREPROCESSOR myUnnamedBitFields_t param1 );
    static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_t test3( __FUNC_PARAM_PREPROCESSOR const myUnnamedBitFields_t param1 );

    The resulting NSD html file ends at function 'test2'
    static __FUNC_RET_PREPROCESSOR int test0 ( __FUNC_PARAM_PREPROCESSOR int param1 )
    const static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_ttest1 ( __FUNC_PARAM_PREPROCESSOR int param1 )
    static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_ttest2 ( __FUNC_PARAM_PREPROCESSOR myUnnamedBitFields_t param1 )

     
  • Eckard Klotz

    Eckard Klotz - 2020-11-25

    Hello Wasili.

    Thanks for your intensive testing.

    "I realized that the user grammar added in the file UserGrammar.a2x is also part of the file ANSI_C_Source_C_only_grm.a2x. I expect that it was added while doing the proof of concept and forgotten to be removed.
    I removed it and the reproducer sources were handled correctly :)"

    You are right, the syntax-rule was defined 2 times. But this is not a problem, since while scanning the grammar the second definition overwrites the first one.
    The xml file you use to configure the sources to analyse defines at its bottom another xml file that configures the details of the parsing process.
    As long as you have not changed this, it is "abc2xml_Process_ansi_cpp.xml". But using "abc2xml_Process_ansi_c_only.xml" is also possible.
    • In this file you will find, in the section for the source parsing, a configuration like this:

    <grammar>
    <file root="LangPackPath" value="./LangPack/ansi_c/a2x/ANSI_C_Source_C_only_grm.a2x"/>
    <file root="LangPackPath" value="./LangPack/ansi_c/a2x/ANSI_C_Source_Cpp_grm.a2x"/>
    <file root="LangPackPath" value="./LangPack/ansi_c/a2x/ANSI_C_Source_C_Pragma_grm.a2x"/>
    <file root="LangPackPath" value="./LangPack/ansi_c/a2x/UserGrammar.a2x"/>
    </grammar>

    skip = space_p;
    
    pass = +(  extern_compiled
             | namespace_def
             | using_namespace
             | preprocedure
             | class_definition
             | function_prototype
             | declaration
             | function_definition
             | statement
            );
    

    • The parser loads all a2x files in the order of the configuration lines and concatenates them together with the direct definition of the skip and pass rules.
    • Together this forms the grammar text the parsing process is based on.
    • As already mentioned, if a syntax-rule is defined more than once, the last definition is used.

    Thus it is not really necessary to remove the first definition. By keeping it as a kind of default definition, it remains possible to comment out the user-grammar configuration-line.
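
    The "last definition wins" behaviour can be pictured with a minimal Python sketch. This is not the Moritz implementation: the two grammar texts, the rule name DEFAULT_ATTRIBUTE and the helper collect_rules are hypothetical and only mimic concatenating the grammar files in configuration order.

```python
import re

# Hypothetical stand-ins for the contents of two grammar files; the real
# a2x files are much larger.
BASE_GRAMMAR = '''
userAttribute = ("__" >> "DEFAULT_ATTRIBUTE");
declaration   = declaration_specifier >> declarator;
'''

USER_GRAMMAR = '''
userAttribute = ("__" >> "FUNC_RET_PREPROCESSOR")
              | ("__" >> "FUNC_PARAM_PREPROCESSOR");
'''


def collect_rules(*grammar_texts):
    """Concatenate grammar texts in order and map rule names to bodies.

    A later definition of the same rule name replaces an earlier one.
    """
    rules = {}
    combined = "\n".join(grammar_texts)
    # Simplification: a rule is "name = body;" and the body contains
    # neither ';' nor '=' (not true for every real rule).
    for name, body in re.findall(r"(\w+)\s*=\s*(.*?);", combined, flags=re.S):
        rules[name] = " ".join(body.split())
    return rules


rules = collect_rules(BASE_GRAMMAR, USER_GRAMMAR)
print(rules["userAttribute"])
```

    Because UserGrammar.a2x is listed last in the configuration, its userAttribute replaces the default one in this sketch; swapping the argument order would let the default win instead.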

    I have attached here now 3 files

    You will see that the attached user-grammar is the same as the one attached last time.
    I have not added strings like const or static.
    Please do the same in your user-grammar.

    What I have really changed is how the syntax-rule is used in the syntax-rule of the declaration_specifier.

    • This is the older example for the C-version

    declaration_specifier = ( * (storage_class_specifier | type_qualifier)
    >> ! userAttribute
    >> (type_specifier | USER_TYPE)
    >> * type_qualifier
    >> * (('&'-("&&"|"&=")) | '*')
    >> * type_qualifier
    );

    • The new syntax rule is a little bit extended

    declaration_specifier = ( ! userAttribute
    >> * (storage_class_specifier | type_qualifier)
    >> ! userAttribute
    >> (type_specifier | USER_TYPE)
    >> * type_qualifier
    >> * (('&'-("&&"|"&=")) | '*')
    >> * type_qualifier
    );

    • Tokens like const or static are already part of the syntax-rules storage_class_specifier and type_qualifier.
    • Using them again as you did will confuse the parsing process.
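
    Why the additional leading `! userAttribute` helps can be shown with a small token-level model in Python. It is only a sketch under simplifying assumptions: the token sets and function names are hypothetical, and the trailing type qualifiers and the pointer/reference part of the real rule are left out.

```python
# Hypothetical token classes; the real rules cover many more keywords.
STORAGE_OR_QUALIFIER = {"static", "extern", "const", "volatile"}
USER_ATTRIBUTES = {"__VARIABLE_PREPROCESSOR", "__FUNC_RET_PREPROCESSOR",
                   "__FUNC_PARAM_PREPROCESSOR"}
TYPES = {"int", "char", "myUnnamedBitFields_t"}


def skip_optional_attribute(tokens, pos):
    """Model of `! userAttribute`: consume one attribute if present."""
    if pos < len(tokens) and tokens[pos] in USER_ATTRIBUTES:
        return pos + 1
    return pos


def skip_specifiers(tokens, pos):
    """Model of `* (storage_class_specifier | type_qualifier)`."""
    while pos < len(tokens) and tokens[pos] in STORAGE_OR_QUALIFIER:
        pos += 1
    return pos


def matches(tokens, leading_attribute):
    """Return True if the tokens form a (simplified) declaration_specifier."""
    pos = 0
    if leading_attribute:          # present only in the newer rule
        pos = skip_optional_attribute(tokens, pos)
    pos = skip_specifiers(tokens, pos)
    pos = skip_optional_attribute(tokens, pos)
    # Exactly one token must remain, and it must be a type.
    return pos == len(tokens) - 1 and tokens[pos] in TYPES


param = "__FUNC_PARAM_PREPROCESSOR const int".split()
ret = "const static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_t".split()

print("older rule:", matches(param, False), matches(ret, False))
print("newer rule:", matches(param, True), matches(ret, True))
```

    In this model the older rule rejects the attribute-before-const parameter because const can no longer be consumed once the attribute has been seen, while the newer rule accepts both orders.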

    With this change I was able to parse your examples except the last 3.
    But from my view the last 3 function declarations will be rejected by the compiler also.
    The declaration lines contain no semicolon at the end.
    The space between the return type and the function name is missing with the result that both together are one single identifier.
    Thus either the return type or the function name is missing.
    • Please check your last 3 examples and provide me the corrected ones, since I want to ensure that the additional grammar I implement for you is for a valid code-syntax.

    In your result report you mention where the diagram generation terminates.
    Please take a look into the xml folder, where you will find for every source an xml file generated by the parser.
    By evaluating the xml file instead of the diagrams we can really see where the parser is struggling.
    • If you recognize problems in the diagrams only, this may be an additional issue in the xml2abc processes.

    Please provide us a feedback with your results.

    Stay healthy and best regards,
    Eckard.

     
  • Wasilios Goutas

    Wasilios Goutas - 2020-11-25

    Thank you Eckard, I will test it and give feedback

     

    Last edit: Wasilios Goutas 2020-11-25
  • Wasilios Goutas

    Wasilios Goutas - 2020-11-25

    Hello Eckard,

    the results I have seen so far look good.
    Thanks for the fix :)
    BR
    Wasili

     
  • Eckard Klotz

    Eckard Klotz - 2020-12-02

    Hello Wasili.

    In your bug-report #12 (different user grammars in function parameter list) you described an issue you faced with an extension of your user-grammar.

    One point at the beginning:

    • I know now that the strings in your example which looked empty actually contained underscores.
    • The forum does not always show them correctly. I face the same issue with my own posts.
    • Please note: wherever a string in a syntax-rule in this thread appears empty, it should actually contain 2 underscores.

    If my understanding is correct, you extended the syntax-rule of the user-attributes like this:

    userAttribute =
    ( "__" >> "VARIABLE_PREPROCESSOR")
    | ( "__" >> "FUNC_RET_PREPROCESSOR")
    | ( "__" >> "FUNC_PARAM_PREPROCESSOR")
    | ( "__" >> "FUNC_PARAM_PREPROCESSOR_NEW");

    The result is that the parser is failing to detect the following declaration correctly:

    static __FUNC_RET_PREPROCESSOR myUnnamedBitFields_t test4( __FUNC_PARAM_PREPROCESSOR const myUnnamedBitFields_t param1,
    __FUNC_PARAM_PREPROCESSOR_NEW myUnnamedBitFields_t param2 );

    By reordering the alternatives in the rule like this, the parsing should work.

    userAttribute =
    ("__" >> "VARIABLE_PREPROCESSOR")
    | ("__" >> "FUNC_RET_PREPROCESSOR")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR_NEW")
    | ("__" >> "FUNC_PARAM_PREPROCESSOR");

    I think you would now like to know why.

    • The rule provides alternative possibilities which should be accepted by the parser, but only one is really the correct choice.
    • The token __FUNC_PARAM_PREPROCESSOR_NEW starts with the sequence ("__" >> "FUNC_PARAM_PREPROCESSOR") and also matches the sequence ("__" >> "FUNC_PARAM_PREPROCESSOR_NEW").
    • The parser used by Moritz takes the first alternative that creates a positive hit, and with your order this would be ("__" >> "FUNC_PARAM_PREPROCESSOR").
    • The longer alternative ("__" >> "FUNC_PARAM_PREPROCESSOR_NEW") would not be tested any more.
    • But then the remaining part of the token, _NEW, in the source-file becomes a problem, since the syntax-rule declaration_specifier, which was calling the rule userAttribute, is not expecting a token like this, and so the parsing finally fails.

    The solution I suggest (turning the order) ensures that the longer alternative is tested first, and if it succeeds, the shorter alternative will not be tested any more.
    • Please do so whenever 2 alternatives have the same beginning and the only difference between them is that the longer one contains additional parts while the shorter one is just the beginning sub-set of the longer one.
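
    The effect of the order can be reproduced with a minimal Python sketch of an ordered choice. The helpers literal and first_match are hypothetical and only imitate the "first positive hit wins" behaviour described above; they are not the parser used by Moritz.

```python
def literal(text):
    """Parser for a fixed string: returns the new position or None."""
    def parse(source, pos):
        return pos + len(text) if source.startswith(text, pos) else None
    return parse


def first_match(*alternatives):
    """Ordered choice: the first alternative that matches wins."""
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end
        return None
    return parse


SOURCE = "__FUNC_PARAM_PREPROCESSOR_NEW"
short = literal("__FUNC_PARAM_PREPROCESSOR")
long_ = literal("__FUNC_PARAM_PREPROCESSOR_NEW")

for label, rule in (("short first", first_match(short, long_)),
                    ("long first", first_match(long_, short))):
    end = rule(SOURCE, 0)
    print(f"{label}: consumed {SOURCE[:end]!r}, left over {SOURCE[end:]!r}")
```

    With the shorter alternative first, the sketch stops in front of _NEW, which is exactly the left-over that the calling rule cannot handle; with the longer alternative first nothing is left over.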

    Another possibility to solve your issue without exchanging the order would be the definition of an exception (by using the minus operator) like this:

    userAttribute =
    ( "__" >> "VARIABLE_PREPROCESSOR")
    | ( "__" >> "FUNC_RET_PREPROCESSOR")
    | ( ( "__" >> "FUNC_PARAM_PREPROCESSOR")
    - ( "__" >> "FUNC_PARAM_PREPROCESSOR_NEW")
    )
    | ( "__" >> "FUNC_PARAM_PREPROCESSOR_NEW");

    • In this case the shorter alternative will only result in a detection if the longer version does not result in a detection.
    • But to be honest, we should come back to this solution only in more complex cases, don't you think so?

    Please try it out and give us a feedback.

    Stay healthy and best regards,
    Eckard.

     
  • Eckard Klotz

    Eckard Klotz - 2020-12-02

    PS.:
    I forgot to add the extended user-grammar example.
    Here it is.

     
  • Wasilios Goutas

    Wasilios Goutas - 2020-12-03

    Hello Eckard,

    the solution of defining the longer name first works.
    The way you mentioned by the '-' (minus operator) I haven't tested as I do not understand it, and it seems not needed for what I'm trying.
    Thanks
    Wasili

     
  • Eckard Klotz

    Eckard Klotz - 2020-12-03

    Hello Wasili.

    "The way you mentioned by the '-' (minus operator) I haven't tested as I do not understand it, and it seems not needed for what I'm trying."

    That's why I wrote :

    "But to be honest, we should come back to this solution only in more complex cases, don't you think so."

    Thus please keep the simple approach.

    Nevertheless I will try to explain it in more detail:

    You could understand the alternatives as independent syntax-rules:

    a_long  = "__" >> "FUNC_PARAM_PREPROCESSOR_NEW";
    a_short = "__" >> "FUNC_PARAM_PREPROCESSOR";
    

    The alternative rule could then be reduced to

    a = a_long | a_short;
    

    For parsing the string __FUNC_PARAM_PREPROCESSOR_NEW the alternative rule tries a_long first, which already succeeds, and thus a_short will not be tested any more.

    If we defined the alternative rule in the following way, we would have the problem you have reported:

    a = a_short | a_long;
    

    For parsing the string __FUNC_PARAM_PREPROCESSOR_NEW the alternative rule tries a_short first, which also succeeds, and thus a_long will not be tested any more.

    • But the parsing ends before the end of the string, at the beginning of _NEW, and this results in the problem you faced.

    Now we could introduce this rule:

    a_short_but_not_long = a_short - a_long;
    

    This rule defines an exclusion: it detects only strings which follow the rule a_short but do not follow the rule a_long.

    With this exclusion we could modify our alternative:

    a = a_short_but_not_long | a_long;
    

    Since the string __FUNC_PARAM_PREPROCESSOR_NEW is detected by a_long, the rule a_short_but_not_long will not detect it and the alternative will go on to try a_long.

    • The rule a_long will detect the complete string as wanted and the parsing will not fail.

    Now you would like to tell me that the introduction of the exclusion results in a much more complex design, which is actually always a bad idea if you have no good reasons for it.

    I totally agree with this, but the exclusion also makes the following definition of the alternative rule possible.

    a = a_long | a_short_but_not_long;
    

    Now a_long already succeeds as the first alternative and consumes the complete string.

    The important point is that this way the order of the single alternatives doesn't matter.

    • This may be very essential if your alternative rule contains a lot of single parts and it becomes more complicated to define an always fitting order.
    • In this situation the exclusion gives you the possibility to define exceptions which should not be detected by the single sub-parser.
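
    As a rough check of this reasoning, here is a minimal Python sketch of the exclusion. The helpers literal, choice and difference are hypothetical, and difference uses a simplified reading of the minus operator: it rejects a match whenever the excluded rule also matches at the same position, which is sufficient for the prefix case discussed here.

```python
def literal(text):
    """Parser for a fixed string: returns the new position or None."""
    def parse(source, pos):
        return pos + len(text) if source.startswith(text, pos) else None
    return parse


def choice(*alternatives):
    """Ordered choice: the first alternative that matches wins."""
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end
        return None
    return parse


def difference(wanted, excluded):
    """Match `wanted` only where `excluded` does not match (simplified)."""
    def parse(source, pos):
        if excluded(source, pos) is not None:
            return None
        return wanted(source, pos)
    return parse


a_long = literal("__FUNC_PARAM_PREPROCESSOR_NEW")
a_short = literal("__FUNC_PARAM_PREPROCESSOR")
a_short_but_not_long = difference(a_short, a_long)

LONG_SOURCE = "__FUNC_PARAM_PREPROCESSOR_NEW myUnnamedBitFields_t param2"
SHORT_SOURCE = "__FUNC_PARAM_PREPROCESSOR const myUnnamedBitFields_t param1"

for rule in (choice(a_short_but_not_long, a_long),
             choice(a_long, a_short_but_not_long)):
    # Both orders consume the whole attribute token, long or short.
    print(LONG_SOURCE[:rule(LONG_SOURCE, 0)],
          SHORT_SOURCE[:rule(SHORT_SOURCE, 0)])
```

    Both orderings consume the full attribute in each sample declaration, which is the order independence described above, at the price of one extra rule.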

    I hope that this gives you a rough idea about my comment above.

    However, currently the simple reordering will be the better choice.

    Stay healthy,
    Eckard.

     
