Menu

AST matchers

2018-11-23
2018-11-25
  • Daniel Marjamäki

    Do you have some ideas how we can create a flexible AST matcher?

    One idea is:

    astMatch(token, pattern)
    

    where the pattern can be a string literal that describes what to look for. I don't know how to write this pattern in a elegant way. Any ideas? If we somehow reuse %str% and those stuff from Token::Match I think that would be good.

    A + that adds a string and number in RPN notation:

    "%str% %num% +"
    

    Another idea is to use some clang-tidy inspired pattern matching. we might be able to specify more sophisticated matchers then that checks valuetype and stuff.

     astMatch(tok, binaryOperator("+", stringLiteral(), number()));
    

    What do you think?

     
    • Markus Elfring

      Markus Elfring - 2018-11-24

      I don't know how to write this pattern in a elegant way.

      • Why are your imaginations limited in this area so far?
      • How would the mentioned matching approach be different from techniques which are supported by the current software?

      What do you think?

      Will your software development interest increase for the technology “computation tree logic”?

       
  • Daniel Marjamäki

    after some thinking I am thinking that a clang-tidy inspired pattern matching would be nice. we will not be able to directly copy code to/from clang-tidy. But if anybody knows clang-query or something then looking at cppcheck code will be familiar. and vice versa.

     

    Last edit: Daniel Marjamäki 2018-11-23
  • Paul Fultz

    Paul Fultz - 2018-11-23

    The nice thing about clang's matchers is they provide a general-purpose way to compose a bunch of predicates. These matchers can match more than just tokens, it could also match ValueFlow values or symbols. I see matching something like this:

    astMatch(tok, pattern("+")(binary(pattern("%str%"), pattern("%num%"))));
    

    More importantly, we can support binding tokens as well:

    MatchResult result = astMatch(tok, pattern("+")(binary(pattern("%str%").bind("string"), pattern("%num%").bind("number"))));
    const Token * strTok = result.tokens["string"];
    const Token * numTok = result.tokens["number"];
    

    And the above could be implemented with something like this:

    struct MatcherContext
    {
        std::unordered_map<std::string, const Token*> tokens;
        template<class T>
        void bind(std::string name, const T* x)
        {
            throw std::runtime_exception("Type is not bindable");
        }
        void bind(std::string name, const Token* tok)
        {
            tokens[name] = tok;
        }
    };
    
    // General puporse matcher object, this can change the type its traversing to
    // support traversing more than tokens
    template<class T, class R=T>
    struct Matcher
    {
        std::function<const R*(MatcherContext&, const T*)> match;
    
        // Setup binding tokens
        Matcher bind(std::string name) const 
        {
            auto m = match;
            auto mf = [m, name](MatcherContext& ctx, const T* x) -> const R* {
                const R* result = m(ctx, x);
                if(result)
                    ctx.bind(name, x);
                return result;
            };
            return Matcher{mf};
        }
    
        // Compose a submatch
        template<class U>
        Matcher<T, R> operator()(Matcher<R, U> subMatch) const
        {
            auto m = match;
            auto mf = [m, subMatch](MatcherContext& ctx, const T* x) -> const R* {
                const R* result = m(ctx, x);
                if(result && subMatch(ctx, result))
                    return result;
                else
                    return nullptr;
            };
        }
    };
    
    struct MatchResult
    {
        const Token* tok;
        std::unordered_map<std::string, const Token*> tokens;
    };
    
    MatchResult astMatch(const Token* tok, Matcher<Token> m)
    {
        MatchResult result;
        MatcherContext ctx;
        const Token * tok = m.match(ctx, tok);
        result.tok = tok;
        result.tokens = ctx.tokens;
        return result;
    }
    
    // The predicates:
    Matcher<Token> pattern(std::string p)
    {
        return Matcher<Token>{[=](MatcherContext& ctx, const Token* tok) {
            if(Token::Match(tok, p.c_str()))
                return tok;
            else
                return nullptr;
        }};
    }
    
    Matcher<Token> binary(Matcher<Token> op1, Matcher<Token> op2)
    {
        return Matcher<Token>{[=](MatcherContext& ctx, const Token* tok) {
            if(!tok->astOperand1() && !tok->astOperand2())
                return nullptr;
            if(op1.match(ctx, tok->astOperand1()) && op2.match(ctx, tok->astOperand2()))
                return tok;
        }};
    }
    

    More can be built on top of this to improve it more. We can add allOf matcher so we can match several predicates for the same token. We could also register the matchers and build a parser so users can run them with custom rules or straight cppcheck CLI.

     
  • Daniel Marjamäki

    The nice thing about clang's matchers is they provide a general-purpose way to compose a bunch of predicates. These matchers can match more than just tokens, it could also match ValueFlow values or symbols.

    I agree. That is a huge advantage.

    We could also register the matchers and build a parser so users can run them with custom rules or straight cppcheck CLI.

    Yes imho that should be our goal. The rule files will be a lot more interesting with this functionality.

     
  • Daniel Marjamäki

    I think your example is a very good start.

    I envision that the ast matcher will be a hotspot so we will probably need to optimise it as much as possible, especially in release builds.

    I am guessing that with your code there will be overhead for such code:

    for (tok..) {
        if (astMatch(tok, pattern("str"))) {}
    }
    

    I assume a new Matcher<token> object will be created every loop?</token>

    I believe we can optimise it with MatchCompiler so the object is created before the loop starts.

    Feel free to create a PR with your code and we can start using it!

     
  • Paul Fultz

    Paul Fultz - 2018-11-24

    I envision that the ast matcher will be a hotspot so we will probably need to optimise it as much as possible, especially in release builds.

    There is a bottleneck with std::function. Instead we could make it a template parameter so the compiler can more easily inline it.

    I assume a new Matcher<token> object will be created every loop?</token>

    Possibly. Perhaps with a template the compiler might optimize it out. Alternatively, we could support finders so we can search for multiple matchers at the same time, and CSE can optimize the duplicate conditions across several matchers.

     
  • Daniel Marjamäki

    I am not sure about using templates to force inlines, it increases the translation unit size. We can look at it also later and see what the difference will be.

     
  • Daniel Marjamäki

    I suggest that you create a working solution first. And we worry about optimisations later.

    I do not see why Matcher needs to be a template. It seems to me that a Scope matcher and Token matcher will be quite different. But we can look into that later.

     
  • Daniel Marjamäki

    btw,, I also made a POC. In my approach the binaryOperator() etc are creating a AstMatcher struct. But it does not perform any matching.

    class AstMatcher {
    public:
        typedef AstMatcher * AstMatcherPtr; // TODO: use std::unique_ptr?
        AstMatcher(const char *match, AstMatcherPtr op1, AstMatcherPtr op2)
            : match(match), op1(op1), op2(op2) {}
        ~AstMatcher() { delete op1; delete op2; }
    
        bool match(const Token *tok) {
            if (!Token::Match(tok, match))
                return false;
            if (op1 && !op1->match(tok->astOperand1()))
                return false;
            if (op2 && !op2->match(tok->astOperand2()))
                return false;
            return true;
        }
    
    private:
        const char *match;
        AstMatcherPtr op1;
        AstMatcherPtr op2;
    };
    
    namespace Matcher {
        AstMatcher::AstMatcherPtr createAstMatcher(const char *m, AstMatcher::AstMatcherPtr op1, AstMatcher::AstMatcherPtr op2) {
            return new AstMatcher(m,op1,op2);
        }
        AstMatcher::AstMatcherPtr createNullMatcher() {
            return nullptr;
        }
    
        AstMatcher *binaryOperator(const char *str, AstMatcher *op1, AstMatcher *op2) {
            return createAstMatcher(str, op1, op2);
        }
        AstMatcher *stringLiteral() {
            return createAstMatcher("%str%", createNullMatcher(), createNullMatcher());
        }
        AstMatcher *integerLiteral() {
            return createAstMatcher("%num%", createNullMatcher(), createNullMatcher());
        }
    }
    

    The AstMatcher structs should be created once and reused.

     
  • Paul Fultz

    Paul Fultz - 2018-11-25

    I am not sure about using templates to force inlines

    It doesn't force inline, it just allows inline.

    it increases the translation unit size.

    The goal is to improve runtime performance not compile-time performance. As shown by Vittorio Romeo, a template parameter(over std::function and a function pointer) will produce the smallest assembly when compiled with -O2 or greater:

    https://vittorioromeo.info/index/blog/passing_functions_to_functions.html

    (here is the website from google cache: https://webcache.googleusercontent.com/search?q=cache:-leiPhnPQTYJ:https://vittorioromeo.info/index/blog/passing_functions_to_functions.html+&cd=1&hl=en&ct=clnk&gl=us&client=firefox-b-1)

    But more importantly than the zero overhead function call of a template parameter, is that if we combine multiple matchers together CSE can remove the duplicate conditions. This is not possible with std::function or a function pointer.

    I do not see why Matcher needs to be a template. It seems to me that a Scope matcher and Token matcher will be quite different.

    They should be the same except the types. The uniformity is necessary to be able to use the matchers together. For example, if we want to check for variables declared as a class or struct we could write something like this:

    pattern("%var%")(variable(typeScope(classDef(pattern("class|struct")))))
    

    This starts by matching a Token then a Variable then a Scope and then a Token again. This is just one example of course.

     
  • Daniel Marjamäki

    ok!

    I am hoping you will open a PR.

    Personally I would like to have a proof of concept soonish that we can try out.

    I do think it is more important that it's a clean, simple and intuitive framework than a fast framework. Sorry if I "side tracked" these discussions.

    I spontaneously think your suggestion seems to be clean and intuitive. The checker code should become much simpler.

     

    Last edit: Daniel Marjamäki 2018-11-25

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.