Menu

problems using AST in template simplifier

2020-11-25
2020-11-29
  • Robert Reif

    Robert Reif - 2020-11-25

    I'm trying to use the AST in the template simplifier but I'm running into problems. Just creating the AST causes testrunner to almost immediately fail:

    ~/cppcheck$ ./testrunner
    Test64BitPortability::novardecl
    Test64BitPortability::functionpar
    Test64BitPortability::structmember
    Test64BitPortability::ptrcompare
    Test64BitPortability::ptrarithmetic
    Test64BitPortability::returnIssues
    terminate called after throwing an instance of 'InternalError'
    Aborted (core dumped)
    
    diff --git a/lib/templatesimplifier.cpp b/lib/templatesimplifier.cpp
    index 36514601b..b6102badb 100644
    --- a/lib/templatesimplifier.cpp
    +++ b/lib/templatesimplifier.cpp
    @@ -3657,6 +3657,9 @@ void TemplateSimplifier::simplifyTemplates(
    
         mTokenizer->calculateScopes();
    
    +    mTokenList.createAst();
    +    mTokenList.validateAst();
    +
         unsigned int passCount = 0;
         const unsigned int passCountMax = 10;
         for (; passCount < passCountMax; ++passCount) {
    

    This helps a little:

    diff --git a/lib/tokenlist.cpp b/lib/tokenlist.cpp
    index df754288c..8ee5c7611 100644
    --- a/lib/tokenlist.cpp
    +++ b/lib/tokenlist.cpp
    @@ -1655,9 +1655,17 @@ void TokenList::validateAst() const
             }
    
             // Don't check templates
    -        if (tok->str() == "<" && tok->link()) {
    -            tok = tok->link();
    -            continue;
    +        if (tok->str() == "<") {
    +            if (tok->link()) {
    +                tok = tok->link();
    +                continue;
    +            } else {
    +                const Token *end = tok->findClosingBracket();
    +                if (end) {
    +                    tok = end;
    +                    continue;
    +                }
    +            }
             }
    
             // Check binary operators
                                                                                                                              82,1          Bot
    

    but I'm also getting endless recursion from return statements.

    The AST is a mystery to me so some help would be welcomed.

     
  • Daniel Marjamäki

    I am not sure what problem exactly you have .. but I do not think it will be easy to combine ast, symboldatabase, templatesimplifier, valuetype.

    If there is a specific expression that you want to create ast for I believe it would be possible by executing createAstAtToken for that specific expression. The proper way would be; locate the first token in the expression and pass that to createAstAtToken.

    I do think we should continue to work on the builtin parser. However in parallell I also try to implement clang import. In theory I hope we'll be able to someday handle sophisticated c++ code much better with the clang import (as long as it's fully clang compliant). And I envision the builtin parser will be preferable when the code is not clang compliant or for performance reasons.
    The clang import is still far from ready though.

     

    Last edit: Daniel Marjamäki 2020-11-25
  • Robert Reif

    Robert Reif - 2020-11-25

    I tried createAstAtToken on the argument and createAstAtTokenInner for the argument list and neither set the value type. It looks like the symbol database sets the value type.

    I don't think using the AST will work for getting function call argument types in the template simplifier. Running setVarID() and using the varid looks like a better short term approach.

     
  • Daniel Marjamäki

    I tried createAstAtToken on the argument and createAstAtTokenInner for the argument list and neither set the value type. It looks like the symbol database sets the value type.

    yes. it only creates the tree (parents,operands). it might help determining argument types if we have parents and operands but feel free to skip that for now.

     
  • Paul Fultz

    Paul Fultz - 2020-11-27

    Is it not possible to run createAst and symbol database? We really need both for any robust type deduction.

     
  • Daniel Marjamäki

    It seems to me that in theory we could run createAst/setValueType/createSymbolDatabase before templatesimplifier. Then the information would not be complete. After each instantiation we could rerun these so we would have as much type info as possible.

    I do not know how gcc/clang works but I would assume that ast/valuetype/templatesimplifier/symboldatabase work together rather than separately.

     
  • Robert Reif

    Robert Reif - 2020-11-27

    The preprocessor doesn't tokenize the code properly and expects the tokenizer to fix it up because it has more context to fix ambiguous code. The AST can't handle the unsimplified or uncorrected code present before the template simplifier and I expect the symbol database will have the same problem. There will be a significant performance hit because they will have to be run after every template simplifier pass and possibly after every instantiation. This is a lot of work just for argument type deduction. I think this will be low priority for the short term now because there are many more serious bugs that need fixing but I will give it a try again in the future.

    You need a symbol table available while parsing the code because you need context to parse ambiguous code. This has to be very efficient or mutable because instantiating templates is essentially self modifying code.

     
  • Paul Fultz

    Paul Fultz - 2020-11-28

    The AST can't handle the unsimplified or uncorrected code present before the template simplifier

    I assume it needs things like createLinks2(), which relies on the template simplifier.

    There will be a significant performance hit because they will have to be run after every template simplifier pass and possibly after every instantiation.

    Ideally, we need a way run the AST and symboldatabase incrementally over a subset of tokens. This seems somewhat easy to do for the AST, but the symboldatabase would need lots of changes.

    Alternatively, I think we can just create the AST and symbol database at most three times(perhaps only twice). First, we would update the symbol database to add the ValueType for template parameters(this can be useful for valueflow analysis as well). Then run a pass(perhaps it could be in the template simplifer), that transforms function calls to use explicit types even for unexpanded templates. So foo(x) would be transformed into foo<T>(x). After this we would create the AST and symbol database again but there could be a way to transform the function while preserving the symbol database so this might be unnecessary. Then next run templatesimplifer. The template simplifier doesn't need to do type deduction as all the types should be explicit.

    So this basically would run AST/SymbolDatabase -> explictTypeDeduction -> AST/SymbolDatabase(optional) -> simplifyTemplates -> AST/SymbolDatabase. If its not feasible to move simplifyTemplates to this late, we could run simplifyTemplates twice.

    This may not be as efficient as incrementally updating the symbol database, but it seems easier to implement as we dont need to do major refactoring to the SymbolDatabase. These are the tasks I see that need to be done:

    • Update SymbolDatabase to add ValueType for template parameters.
    • Add the explictTypeDeduction to transform template function calls from foo(x) to foo<T>(x).
    • Update the pipeline to run AST/SymbolDatabase multiple times.
     
  • Paul Fultz

    Paul Fultz - 2020-11-28

    I do not know how gcc/clang works but I would assume that ast/valuetype/templatesimplifier/symboldatabase work together rather than separately.

    Clang uses a real AST. It lexes and then parses the source code to an AST, and then runs semantic analysis which adds the type information. The semantic analysis also instantiates the templates. This is usually done by inserting into the AST directly, so it doesn't create new tokens.

     
  • Daniel Marjamäki

    I do not think it will be enough with 2-3 runs of symboldatabase if we want to have a full type deduction..

    template<class T> T foo() {}
    void bar() {
         auto a = foo<int>();
         auto b = foo<decltype(a)>();
         auto c = foo<decltype(b)>();
         auto d = foo<decltype(c)>();
    }
    

    before templatesimplifier is executed, symboldatabase+setValueType can only determine the type for the first template argument. if we then run templatesimplifier+symboldatabase+setValueType.. we will only know the type of a not b..

    I can not say what the performance impact would be to just run these naively until we are finished..

    Ideally, we need a way run the AST and symboldatabase incrementally over a subset of tokens. This seems somewhat easy to do for the AST, but the symboldatabase would need lots of changes.

    Yes.

     
  • Robert Reif

    Robert Reif - 2020-11-28

    C++Insight https://cppinsights.io/ uses clang to generate this from the above example:

    template<class T> T foo() {}
    
    /* First instantiated from: insights.cpp:3 */
    #ifdef INSIGHTS_USE_TEMPLATE
    template<>
    int foo<int>()
    {
    }
    #endif
    
    void bar()
    {
      int a = foo<int>();
      int b = foo<int>();
      int c = foo<int>();
      int d = foo<int>();
    }
    

    Unfortunately it requires code that compiles.

     
  • Paul Fultz

    Paul Fultz - 2020-11-28

    I do not think it will be enough with 2-3 runs of symboldatabase if we want to have a full type deduction..

    We could have a pass to replace decltype with the actual type in the template parameters. That would get us a similar output to C++ insights, and it wouldn't require more than 3 runs of symboldatabase.

    Of course, there are probably some cases where this will miss the type deduction, and we might consider doing this in a loop in the future.

    However, I dont see cppcheck's built-in parser doing full type deduction. Overload resolution and type deduction are very complicated(and even more so with requires), if we want full type deduction we should use clang as our parser. In fact, in clang 11, there is Recovery AST:

    https://releases.llvm.org/11.0.0/tools/clang/docs/ReleaseNotes.html#recovery-ast

    This improves its handling of incomplete code.

    I know our clang import parses the ast dump, but this seems like a very fragile approach as this is not stable between versions. It would be better to use libclang to parse the AST. This is more stable but may include nodes which are 'Unexposed', so we will need to fill in these unexposed nodes. It also can evaluate constexpr values.

    To use libclang we will need to either link against it at build time or use dlopen to load it dynamically. I also have C++ bindings to make it easier to use libclang's API:

    https://github.com/pfultz2/libclangpp/blob/master/include/clangpp.hpp

     
  • Daniel Marjamäki

    I know our clang import parses the ast dump, but this seems like a very fragile approach as this is not stable between versions. It would be better to use libclang to parse the AST. This is more stable but may include nodes which are 'Unexposed', so we will need to fill in these unexposed nodes. It also can evaluate constexpr values.

    I wanted to use libclang to start with. That is the recommended approach. However I thought it was really unusable:
    * First the API is very difficult and the documentation is very limited. I believe writing a clang import using libclang will require a lot more work in cppcheck.
    * Then you have the problem with "Unexposed". Even basic code examples have "Unexposed" nodes. So lots of work is needed in Clang also.

    They used to have a xml output of the ast in Clang but they dropped it because it was unmaintained. I don't know how it was implemented but probably in some separated output code.

    I have thought about tweaking the clang ast output a bit. I want a import-friendly format based on minor tweaks in the existing output. If we only need to make minor tweaks in the output I hope that the Clang developers can accept that. So if the same code is used to dump normal debug output and "import-friendly" then it will not require extra maintenance.
    Well.. I intend to make a proof of concept and suggest it.. and let's see what they think about that.

     
  • Daniel Marjamäki

    here is some test code to use libclang. The name "exprengine.cpp" is very confusing but please ignore that.

    The intention is to parse a file v1.cpp and write some kind of AST on the screen..

    For this code:

    int foo(int x) {
        return -10000 / x;
    }
    
    int bar(int x) {
        return 10000 / x;
    }
    

    The output is:

    function foo
    Cursor ParmDecl:x
    Cursor CompoundStmt:
    Cursor ReturnStmt:
    ReturnStmt
        BinaryOperator
            UnaryOperator
                IntegerLiteral
            UnexposedExpr:x
                DeclRefExpr:x
    function bar
    Cursor ParmDecl:x
    Cursor CompoundStmt:
    Cursor ReturnStmt:
    ReturnStmt
        BinaryOperator
            IntegerLiteral
            UnexposedExpr:x
                DeclRefExpr:x
    

    Trying to add detailed info such as what type of binaryoperator it is , is very difficult the documentation is very cryptic. As far as I remember I made honest attempts to write such info.

    And even for this trivial code there is UnexposedExpr. So we will not even be able to import the AST for this code without updated clang.

    It took quite some effort to write exprengine.cpp even though it does not do much.

    Still.. if you think that libclang would be an better alternative then I am not in theory against it. I envision that it would be possible to more or less replace the ast-dump reader (that is not much code) with a libclang interface.

     
  • Daniel Marjamäki

    If we rewrite the builtin parser.. I suggest: the setValueType() is called first and that will delegate tasks to SymbolDatabase(set function pointer)/TemplateSimplifier when it needs.

    Example:

    int x,y,z;
    x = foo(y, 1.0 + bar(z));
    

    The setValueType() sets the types from the bottom and up (as today):
    1. First the type for z is set.
    2. Function call "bar": setValueType() should call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "bar(int)". If TemplateSimplifier finds a match it should instantiate "bar" and change the name to something like "bar<int>". Then the SymbolDatabase add "bar<int>" function if necessary and can set the function pointer for bar.
    3. It sets the ValueType for 1.0
    4. The ValueType for + is set.
    5. The ValueType for y is set.
    6. Function call "foo": setValueType() will call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "foo(int, double)".. if there is a match in symboldatabase then the function pointer is just set.. otherwise the templatesimplifier will instantiate and then the symboldatabase must add the function and set the function pointer...</int></int>

    I don't think this sounds 100% complex but it will require some rewriting..

     

    Last edit: Daniel Marjamäki 2020-11-29

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.