I'm trying to use the AST in the template simplifier but I'm running into problems. Just creating the AST causes testrunner to almost immediately fail:
I am not sure what problem exactly you have .. but I do not think it will be easy to combine ast, symboldatabase, templatesimplifier, valuetype.
If there is a specific expression that you want to create ast for I believe it would be possible by executing createAstAtToken for that specific expression. The proper way would be; locate the first token in the expression and pass that to createAstAtToken.
I do think we should continue to work on the builtin parser. However in parallell I also try to implement clang import. In theory I hope we'll be able to someday handle sophisticated c++ code much better with the clang import (as long as it's fully clang compliant). And I envision the builtin parser will be preferable when the code is not clang compliant or for performance reasons.
The clang import is still far from ready though.
Last edit: Daniel Marjamäki 2020-11-25
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I tried createAstAtToken on the argument and createAstAtTokenInner for the argument list and neither set the value type. It looks like the symbol database sets the value type.
I don't think using the AST will work for getting function call argument types in the template simplifier. Running setVarID() and using the varid looks like a better short term approach.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I tried createAstAtToken on the argument and createAstAtTokenInner for the argument list and neither set the value type. It looks like the symbol database sets the value type.
yes. it only creates the tree (parents,operands). it might help determining argument types if we have parents and operands but feel free to skip that for now.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It seems to me that in theory we could run createAst/setValueType/createSymbolDatabase before templatesimplifier. Then the information would not be complete. After each instantiation we could rerun these so we would have as much type info as possible.
I do not know how gcc/clang works but I would assume that ast/valuetype/templatesimplifier/symboldatabase work together rather than separately.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The preprocessor doesn't tokenize the code properly and expects the tokenizer to fix it up because it has more context to fix ambiguous code. The AST can't handle the unsimplified or uncorrected code present before the template simplifier and I expect the symbol database will have the same problem. There will be a significant performance hit because they will have to be run after every template simplifier pass and possibly after every instantiation. This is a lot of work just for argument type deduction. I think this will be low priority for the short term now because there are many more serious bugs that need fixing but I will give it a try again in the future.
You need a symbol table available while parsing the code because you need context to parse ambiguous code. This has to be very efficient or mutable because instantiating templates is essentially self modifying code.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The AST can't handle the unsimplified or uncorrected code present before the template simplifier
I assume it needs things like createLinks2(), which relies on the template simplifier.
There will be a significant performance hit because they will have to be run after every template simplifier pass and possibly after every instantiation.
Ideally, we need a way run the AST and symboldatabase incrementally over a subset of tokens. This seems somewhat easy to do for the AST, but the symboldatabase would need lots of changes.
Alternatively, I think we can just create the AST and symbol database at most three times(perhaps only twice). First, we would update the symbol database to add the ValueType for template parameters(this can be useful for valueflow analysis as well). Then run a pass(perhaps it could be in the template simplifer), that transforms function calls to use explicit types even for unexpanded templates. So foo(x) would be transformed into foo<T>(x). After this we would create the AST and symbol database again but there could be a way to transform the function while preserving the symbol database so this might be unnecessary. Then next run templatesimplifer. The template simplifier doesn't need to do type deduction as all the types should be explicit.
So this basically would run AST/SymbolDatabase -> explictTypeDeduction -> AST/SymbolDatabase(optional) -> simplifyTemplates -> AST/SymbolDatabase. If its not feasible to move simplifyTemplates to this late, we could run simplifyTemplates twice.
This may not be as efficient as incrementally updating the symbol database, but it seems easier to implement as we dont need to do major refactoring to the SymbolDatabase. These are the tasks I see that need to be done:
Update SymbolDatabase to add ValueType for template parameters.
Add the explictTypeDeduction to transform template function calls from foo(x) to foo<T>(x).
Update the pipeline to run AST/SymbolDatabase multiple times.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I do not know how gcc/clang works but I would assume that ast/valuetype/templatesimplifier/symboldatabase work together rather than separately.
Clang uses a real AST. It lexes and then parses the source code to an AST, and then runs semantic analysis which adds the type information. The semantic analysis also instantiates the templates. This is usually done by inserting into the AST directly, so it doesn't create new tokens.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
before templatesimplifier is executed, symboldatabase+setValueType can only determine the type for the first template argument. if we then run templatesimplifier+symboldatabase+setValueType.. we will only know the type of a not b..
I can not say what the performance impact would be to just run these naively until we are finished..
Ideally, we need a way run the AST and symboldatabase incrementally over a subset of tokens. This seems somewhat easy to do for the AST, but the symboldatabase would need lots of changes.
Yes.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I do not think it will be enough with 2-3 runs of symboldatabase if we want to have a full type deduction..
We could have a pass to replace decltype with the actual type in the template parameters. That would get us a similar output to C++ insights, and it wouldn't require more than 3 runs of symboldatabase.
Of course, there are probably some cases where this will miss the type deduction, and we might consider doing this in a loop in the future.
However, I dont see cppcheck's built-in parser doing full type deduction. Overload resolution and type deduction are very complicated(and even more so with requires), if we want full type deduction we should use clang as our parser. In fact, in clang 11, there is Recovery AST:
I know our clang import parses the ast dump, but this seems like a very fragile approach as this is not stable between versions. It would be better to use libclang to parse the AST. This is more stable but may include nodes which are 'Unexposed', so we will need to fill in these unexposed nodes. It also can evaluate constexpr values.
To use libclang we will need to either link against it at build time or use dlopen to load it dynamically. I also have C++ bindings to make it easier to use libclang's API:
I know our clang import parses the ast dump, but this seems like a very fragile approach as this is not stable between versions. It would be better to use libclang to parse the AST. This is more stable but may include nodes which are 'Unexposed', so we will need to fill in these unexposed nodes. It also can evaluate constexpr values.
I wanted to use libclang to start with. That is the recommended approach. However I thought it was really unusable:
* First the API is very difficult and the documentation is very limited. I believe writing a clang import using libclang will require a lot more work in cppcheck.
* Then you have the problem with "Unexposed". Even basic code examples have "Unexposed" nodes. So lots of work is needed in Clang also.
They used to have a xml output of the ast in Clang but they dropped it because it was unmaintained. I don't know how it was implemented but probably in some separated output code.
I have thought about tweaking the clang ast output a bit. I want a import-friendly format based on minor tweaks in the existing output. If we only need to make minor tweaks in the output I hope that the Clang developers can accept that. So if the same code is used to dump normal debug output and "import-friendly" then it will not require extra maintenance.
Well.. I intend to make a proof of concept and suggest it.. and let's see what they think about that.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Trying to add detailed info such as what type of binaryoperator it is , is very difficult the documentation is very cryptic. As far as I remember I made honest attempts to write such info.
And even for this trivial code there is UnexposedExpr. So we will not even be able to import the AST for this code without updated clang.
It took quite some effort to write exprengine.cpp even though it does not do much.
Still.. if you think that libclang would be an better alternative then I am not in theory against it. I envision that it would be possible to more or less replace the ast-dump reader (that is not much code) with a libclang interface.
If we rewrite the builtin parser.. I suggest: the setValueType() is called first and that will delegate tasks to SymbolDatabase(set function pointer)/TemplateSimplifier when it needs.
Example:
intx,y,z;x=foo(y,1.0+bar(z));
The setValueType() sets the types from the bottom and up (as today):
1. First the type for z is set.
2. Function call "bar": setValueType() should call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "bar(int)". If TemplateSimplifier finds a match it should instantiate "bar" and change the name to something like "bar<int>". Then the SymbolDatabase add "bar<int>" function if necessary and can set the function pointer for bar.
3. It sets the ValueType for 1.0
4. The ValueType for + is set.
5. The ValueType for y is set.
6. Function call "foo": setValueType() will call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "foo(int, double)".. if there is a match in symboldatabase then the function pointer is just set.. otherwise the templatesimplifier will instantiate and then the symboldatabase must add the function and set the function pointer...</int></int>
I don't think this sounds 100% complex but it will require some rewriting..
Last edit: Daniel Marjamäki 2020-11-29
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm trying to use the AST in the template simplifier but I'm running into problems. Just creating the AST causes testrunner to almost immediately fail:
This helps a little:
but I'm also getting endless recursion from return statements.
The AST is a mystery to me so some help would be welcomed.
I am not sure what problem exactly you have .. but I do not think it will be easy to combine ast, symboldatabase, templatesimplifier, valuetype.
If there is a specific expression that you want to create ast for I believe it would be possible by executing createAstAtToken for that specific expression. The proper way would be; locate the first token in the expression and pass that to
createAstAtToken
.I do think we should continue to work on the builtin parser. However in parallell I also try to implement clang import. In theory I hope we'll be able to someday handle sophisticated c++ code much better with the clang import (as long as it's fully clang compliant). And I envision the builtin parser will be preferable when the code is not clang compliant or for performance reasons.
The clang import is still far from ready though.
Last edit: Daniel Marjamäki 2020-11-25
I tried createAstAtToken on the argument and createAstAtTokenInner for the argument list and neither set the value type. It looks like the symbol database sets the value type.
I don't think using the AST will work for getting function call argument types in the template simplifier. Running setVarID() and using the varid looks like a better short term approach.
yes. it only creates the tree (parents,operands). it might help determining argument types if we have parents and operands but feel free to skip that for now.
Is it not possible to run createAst and symbol database? We really need both for any robust type deduction.
It seems to me that in theory we could run createAst/setValueType/createSymbolDatabase before templatesimplifier. Then the information would not be complete. After each instantiation we could rerun these so we would have as much type info as possible.
I do not know how gcc/clang works but I would assume that ast/valuetype/templatesimplifier/symboldatabase work together rather than separately.
The preprocessor doesn't tokenize the code properly and expects the tokenizer to fix it up because it has more context to fix ambiguous code. The AST can't handle the unsimplified or uncorrected code present before the template simplifier and I expect the symbol database will have the same problem. There will be a significant performance hit because they will have to be run after every template simplifier pass and possibly after every instantiation. This is a lot of work just for argument type deduction. I think this will be low priority for the short term now because there are many more serious bugs that need fixing but I will give it a try again in the future.
You need a symbol table available while parsing the code because you need context to parse ambiguous code. This has to be very efficient or mutable because instantiating templates is essentially self modifying code.
I assume it needs things like
createLinks2()
, which relies on the template simplifier.Ideally, we need a way run the AST and symboldatabase incrementally over a subset of tokens. This seems somewhat easy to do for the AST, but the symboldatabase would need lots of changes.
Alternatively, I think we can just create the AST and symbol database at most three times(perhaps only twice). First, we would update the symbol database to add the
ValueType
for template parameters(this can be useful for valueflow analysis as well). Then run a pass(perhaps it could be in the template simplifer), that transforms function calls to use explicit types even for unexpanded templates. Sofoo(x)
would be transformed intofoo<T>(x)
. After this we would create the AST and symbol database again but there could be a way to transform the function while preserving the symbol database so this might be unnecessary. Then next run templatesimplifer. The template simplifier doesn't need to do type deduction as all the types should be explicit.So this basically would run AST/SymbolDatabase -> explictTypeDeduction -> AST/SymbolDatabase(optional) -> simplifyTemplates -> AST/SymbolDatabase. If its not feasible to move
simplifyTemplates
to this late, we could runsimplifyTemplates
twice.This may not be as efficient as incrementally updating the symbol database, but it seems easier to implement as we dont need to do major refactoring to the SymbolDatabase. These are the tasks I see that need to be done:
SymbolDatabase
to addValueType
for template parameters.explictTypeDeduction
to transform template function calls fromfoo(x)
tofoo<T>(x)
.Clang uses a real AST. It lexes and then parses the source code to an AST, and then runs semantic analysis which adds the type information. The semantic analysis also instantiates the templates. This is usually done by inserting into the AST directly, so it doesn't create new tokens.
I do not think it will be enough with 2-3 runs of symboldatabase if we want to have a full type deduction..
before templatesimplifier is executed, symboldatabase+setValueType can only determine the type for the first template argument. if we then run templatesimplifier+symboldatabase+setValueType.. we will only know the type of a not b..
I can not say what the performance impact would be to just run these naively until we are finished..
Yes.
C++Insight https://cppinsights.io/ uses clang to generate this from the above example:
Unfortunately it requires code that compiles.
We could have a pass to replace
decltype
with the actual type in the template parameters. That would get us a similar output to C++ insights, and it wouldn't require more than 3 runs of symboldatabase.Of course, there are probably some cases where this will miss the type deduction, and we might consider doing this in a loop in the future.
However, I dont see cppcheck's built-in parser doing full type deduction. Overload resolution and type deduction are very complicated(and even more so with
requires
), if we want full type deduction we should use clang as our parser. In fact, in clang 11, there is Recovery AST:https://releases.llvm.org/11.0.0/tools/clang/docs/ReleaseNotes.html#recovery-ast
This improves its handling of incomplete code.
I know our clang import parses the ast dump, but this seems like a very fragile approach as this is not stable between versions. It would be better to use libclang to parse the AST. This is more stable but may include nodes which are 'Unexposed', so we will need to fill in these unexposed nodes. It also can evaluate constexpr values.
To use libclang we will need to either link against it at build time or use
dlopen
to load it dynamically. I also have C++ bindings to make it easier to use libclang's API:https://github.com/pfultz2/libclangpp/blob/master/include/clangpp.hpp
I wanted to use libclang to start with. That is the recommended approach. However I thought it was really unusable:
* First the API is very difficult and the documentation is very limited. I believe writing a clang import using libclang will require a lot more work in cppcheck.
* Then you have the problem with "Unexposed". Even basic code examples have "Unexposed" nodes. So lots of work is needed in Clang also.
They used to have a xml output of the ast in Clang but they dropped it because it was unmaintained. I don't know how it was implemented but probably in some separated output code.
I have thought about tweaking the clang ast output a bit. I want a import-friendly format based on minor tweaks in the existing output. If we only need to make minor tweaks in the output I hope that the Clang developers can accept that. So if the same code is used to dump normal debug output and "import-friendly" then it will not require extra maintenance.
Well.. I intend to make a proof of concept and suggest it.. and let's see what they think about that.
here is some test code to use libclang. The name "exprengine.cpp" is very confusing but please ignore that.
The intention is to parse a file
v1.cpp
and write some kind of AST on the screen..For this code:
The output is:
Trying to add detailed info such as what type of binaryoperator it is , is very difficult the documentation is very cryptic. As far as I remember I made honest attempts to write such info.
And even for this trivial code there is UnexposedExpr. So we will not even be able to import the AST for this code without updated clang.
It took quite some effort to write
exprengine.cpp
even though it does not do much.Still.. if you think that libclang would be an better alternative then I am not in theory against it. I envision that it would be possible to more or less replace the ast-dump reader (that is not much code) with a libclang interface.
If we rewrite the builtin parser.. I suggest: the
setValueType()
is called first and that will delegate tasks to SymbolDatabase(set function pointer)/TemplateSimplifier when it needs.Example:
The
setValueType()
sets the types from the bottom and up (as today):1. First the type for
z
is set.2. Function call "bar":
setValueType()
should call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "bar(int)". If TemplateSimplifier finds a match it should instantiate "bar" and change the name to something like "bar<int>". Then the SymbolDatabase add "bar<int>" function if necessary and can set the function pointer for bar.3. It sets the ValueType for 1.0
4. The ValueType for + is set.
5. The ValueType for y is set.
6. Function call "foo":
setValueType()
will call the "SymbolDatabase" and "TemplateSimplifier" and ask them to lookup "foo(int, double)".. if there is a match in symboldatabase then the function pointer is just set.. otherwise the templatesimplifier will instantiate and then the symboldatabase must add the function and set the function pointer...</int></int>I don't think this sounds 100% complex but it will require some rewriting..
Last edit: Daniel Marjamäki 2020-11-29