From: Stefan S. <se...@sy...> - 2006-06-29 01:05:09
Andreas Sæbjørnsen wrote:

>> Interesting. As you may have gathered from the online docs, Synopsis was
>> originally designed as a more flexible and powerful alternative to
>> doxygen. However, over the years I realized that what I came up with has
>> the potential to do quite a bit more. The 'AST' I originally designed
>> (and which, I believe, should be renamed, as it doesn't have anything to
>> do with syntax) is just one particular representation of the code.
>> Another one is the parse tree (now available as the C++ Synopsis::PTree
>> namespace), which provides very fine-grained control over the code,
>> including formatting.
>> I hope that I will be able to reuse the 'processor pipeline' design
>> (http://synopsis.fresco.org/docs/Tutorial/pipeline.html) to not only
>> process the AST, but in fact any representation synopsis is or will be
>> able to construct and expose (call graphs come to mind).
>
> That is great! So if what you call an AST is not an Abstract Syntax Tree,
> does it have anything in common with an Abstract Syntax Tree?

That depends on what 'Abstract Syntax Tree' stands for in a particular
context. I think the way it is presently used in synopsis is relatively
common (though we only represent declarations, not arbitrary statements,
and no expressions either). However, there really isn't anything related to
syntax in this representation, i.e. specific syntactic constructs are
abstracted away (or else I wouldn't be able to use the same representation
for languages as syntactically different as python and C++). So maybe
'Abstract Semantic Graph' would be a better description of what this is
about.

> Also I have a question regarding why you have chosen to work with a C++
> parse tree as an intermediate representation instead of an Abstract
> Syntax Tree?

I simply follow a processing model where the first thing obtained from the
parser is a parse tree.
That, together with a symbol table and some type repository, can be
processed into higher-level representations such as the AST that is
presently at the heart of synopsis. The parse tree is simply a different
representation, exposing much more detail at a much lower level. There are
tasks that are best done at that low level, and others that work better at
a higher level.

The PTree is currently only available as a C++ API / representation, but I
plan to expose it to python (via boost.python) as soon as it stabilizes
sufficiently (and is complemented by the two other representations: symbol
table and type repository). In fact, a first draft of such a python module
is already in the snapshots, within the sandbox/ directory.

[...]

> The feature which we think separates us the most from other
> source-to-source translators is that we can do whole-program analysis,
> e.g. we can merge the ASTs of all the separate compilations in a program
> into one big AST.

Right, 'linking' different ASTs is done in synopsis, too, so it is
straightforward to generate a source code navigation tool that is able to
link the correct symbols across different source files.

>> One notable feature of the PTree is that it is non-lossy, i.e. the leaf
>> nodes ('atoms') point into a buffer, and so I'm able to serialize it
>> again, using the exact same formatting as the original input file (I can
>> even 'un-preprocess' the code, since I retain all macro expansion
>> information from the preprocessor, which is just another processor
>> generating part of a representation of the code).
>
> That is very interesting. Is this done by copying the lines from the
> original source file when you have not done any changes to the
> corresponding nodes in the AST? We are interested in the same kind of
> problems, as we can not just translate somebody's source code and expect
> them to simply replace their source code with the output from ROSE.
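To illustrate the kind of AST 'linking' described above, here is a small
hypothetical sketch (the names `Declaration` and `link_asts` are made up
for illustration and are not the Synopsis API): per-file declaration lists
are merged into one table keyed by qualified name, so a navigation tool can
resolve a symbol to its definition regardless of which file refers to it.

```python
# Hypothetical sketch of 'linking' per-file ASTs into one whole-program
# table; Declaration and link_asts are illustrative names, not Synopsis API.

class Declaration:
    def __init__(self, qualified_name, file, kind):
        self.qualified_name = qualified_name  # e.g. 'Foo::bar'
        self.file = file                      # defining source file
        self.kind = kind                      # 'function', 'class', ...

def link_asts(per_file_decls):
    """Merge declaration lists from separate compilations into one
    table keyed by qualified name."""
    merged = {}
    for decls in per_file_decls:
        for d in decls:
            # First definition wins; later duplicates (e.g. the same
            # class seen again via a shared header) are folded into it.
            merged.setdefault(d.qualified_name, d)
    return merged

file_a = [Declaration('Foo', 'foo.hh', 'class')]
file_b = [Declaration('Foo', 'foo.hh', 'class'),
          Declaration('main', 'main.cc', 'function')]
table = link_asts([file_a, file_b])
print(sorted(table))      # ['Foo', 'main']
print(table['Foo'].file)  # foo.hh
```

Real linking of course has to cope with overloads, forward declarations and
duplicate definitions; the point here is only the merge-by-qualified-name
idea.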
> In order to achieve this we are working on a copy unparser, which will
> only unparse the portions of the AST that we have changed and copy the
> rest from the original source file. This enables us to diff the original
> source file and the translated output from ROSE in order to create a
> patch file. It is much more likely that people will be comfortable with a
> minimal patch file rather than a whole source file. We preserve
> preprocessor directives and comments by annotating the ROSE AST, but we
> still have some work to do in order to rewrap macro calls (replace
> expanded macros in the unparsed AST with the macro call). Have you
> addressed the problem of rewrapping macros, and if so, how?

Right. Synopsis uses a 'Cpp' parser to preprocess source files. This
processor will not only preprocess the source file into some '.i' temporary
file, but will also record all macro definitions, macro calls, file
inclusions, etc. (I currently use ucpp (http://pornin.nerim.net/ucpp/) and
boost.wave (http://www.boost.org/libs/wave/index.html) as possible Cpp
backends.) This information is available through the internal
representation, so later processors in the pipeline can use it to
'un-preprocess' the file.

The actual C/C++ parser will read the (potentially preprocessed) file into
a memory buffer, and then generate parse tree nodes such that leaf nodes
point into that buffer (lexical tokens are actually tuples (type, pointer,
length)). Thus, modifying the source is best done by replacing a specific
node in the parse tree with some new code. These 'Replacements' are
registered with the 'Buffer' instance, so when writing it out into a source
file the buffer iterator can take care to generate the correct stream. I
have used that technique, for example, to generate a cross-referenced
html-formatted source view of the original source code.
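A minimal sketch of that buffer/replacement technique (with hypothetical
names; this is not the actual Synopsis::PTree API): all untouched text is
copied verbatim from the original buffer, and only the registered
replacements are spliced in when the buffer is written back out, so
original formatting survives unchanged.

```python
# Hypothetical sketch of a non-lossy buffer with registered
# 'Replacements'; names are illustrative, not the Synopsis API.

class Buffer:
    def __init__(self, text):
        self.text = text
        self.replacements = []   # list of (start, end, new_text)

    def replace(self, start, end, new_text):
        """Register a replacement for the character range [start, end)."""
        self.replacements.append((start, end, new_text))

    def write(self):
        """Emit the buffer, splicing in replacements and copying all
        untouched text verbatim -- preserving the original formatting."""
        out, pos = [], 0
        for start, end, new_text in sorted(self.replacements):
            out.append(self.text[pos:start])  # verbatim copy
            out.append(new_text)              # spliced replacement
            pos = end
        out.append(self.text[pos:])
        return ''.join(out)

buf = Buffer('int  foo( void );\n')
buf.replace(5, 8, 'bar')    # replace the token 'foo' with 'bar'
print(buf.write())          # int  bar( void );  -- odd spacing kept
```

Diffing `buf.write()` against the original text then yields exactly the
kind of minimal patch the copy unparser aims for.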
>> And finally, the original C++ parser was adapted from OpenC++, a project
>> to provide some metaprogramming facilities (by embedding some extension
>> language into the input, which is detected by the parser). I think that
>> with Synopsis I'm able to provide alternate techniques to generate
>> custom source-to-source translators, for example by providing external
>> mapper scripts (to generate language bindings, say), or even embedding
>> processing instructions into comments, which are then recognized by the
>> parser and associated with the respective nodes.
>
> From what I remember, the OpenC++ parser, at least when I looked at it
> last time, was not able to parse all of C++ (I think there were some
> problems with e.g. templates). I must admit, though, that my knowledge of
> this is spotty. Correct me if I am wrong, but from what I understand from
> our conversation this is something you have worked on. Do you have a list
> of the improvements that you have made to OpenC++?

Unfortunately, no. Synopsis started as a fork of OpenC++ about seven years
ago, and I have been applying various patches to both the lexer and the
parser over the years to fix or work around multiple issues. However, there
are a number of issues that can't be worked around with the OpenC++ parser,
as it still uses heuristics to guess whether some token sequence is a
'name' or an expression (this becomes apparent with more recent versions of
g++, i.e. I think >= 3.4). This is why I started last summer to rewrite the
C++ parser from scratch. The new implementation uses a technique very
similar to gcc's new C++ frontend, where (non-dependent) symbol lookup is
done during parsing to resolve all ambiguities upfront. This required a new
symbol table implementation, and a new type repository (which is still very
incomplete right now).
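The classic example of why such lookup-during-parse is needed is a
statement like `a * b;`, which is a pointer declaration if `a` names a type
and a multiplication otherwise. A toy sketch of that disambiguation (purely
illustrative, far simpler than any real C++ parser):

```python
# Toy sketch of disambiguation-by-lookup: 'X * Y ;' parses differently
# depending on whether X is known to name a type. Hypothetical code,
# not related to the actual Synopsis or gcc implementation.

def classify_statement(tokens, symbol_table):
    """Return how a statement of the shape 'X * Y ;' should be parsed,
    based on what the symbol table says X names."""
    name = tokens[0]
    if symbol_table.get(name) == 'type':
        return 'declaration'   # T * p;  -- declares p as pointer to T
    return 'expression'        # a * b;  -- multiplies a by b

symbols = {'T': 'type', 'a': 'variable', 'b': 'variable'}
print(classify_statement(['T', '*', 'p', ';'], symbols))  # declaration
print(classify_statement(['a', '*', 'b', ';'], symbols))  # expression
```

Without a symbol table the parser can only guess at such sequences, which
is exactly the heuristic weakness of the old OpenC++ parser described above.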
Once this rewrite is complete, synopsis will be able to analyze things as
complex as partial template specializations, and correctly implement
overload resolution. There is still quite some way to go, though. :-)

>> I really think that the python language provides an ideal ground for
>> this kind of introspection.
>
> That is something I would be very interested in discussing. First, I am
> curious about your definition of the term introspection. If I interpret
> the term "introspection" as a way to specify patterns in the AST, we are
> researching whether this can be done with a GLR parser generator like
> Elkhound. Basically we want to look for patterns in source code: security
> patterns, bug patterns for multi-threaded codes, poor use of MPI, or even
> poor programming-practice patterns. Our research into how we can use
> Elkhound to achieve this goal is still in its infancy, though. How do you
> imagine the use of python to specify search patterns in the AST? We think
> that having a tool to specify search patterns in the AST would help us
> to:
>
> * more easily identify places in the AST where a translation should be
>   performed
> * enable all sorts of source code analysis which are now difficult to
>   specify
> * help us with the macro rewrap work

I'm not sure I understand where the actual parser framework (e.g. GLR)
comes into play here. Are you suggesting that some patterns you are looking
for should have a direct representation in the GLR grammar?

The way I use 'introspection' is very generic: a means to analyze source
code, i.e. a set of layers of representations that can be used to search
for patterns, perform some actions (such as translation), etc. I'm arguing
in favor of python here because introspection typically involves a number
of steps to be performed in a specific workflow (a 'processor pipeline', as
I call it in synopsis).
While the individual processors may be written in C++ or python, the wiring
into a specific pipeline is easily done in little python scripts, by means
of a domain-specific language (just a subset of python) that makes it easy
to fine-tune the behavior of the tool.

There are other tools written for a more specific task in the same domain,
such as pyste or pyplusplus, which use python scripting to define language
bindings from C++ APIs to python APIs (using boost.python). While these
currently still use gccxml as their C++ frontend, I hope that once the new
C++ frontend is sufficiently complete, synopsis can offer similar
facilities.

Another domain this is useful in is source-to-source translation, as
originally done by OpenC++ (see
http://citeseer.ist.psu.edu/chiba95metaobject.html), or nowadays AspectC++
(see http://www.aspectc.org/). These tools operate on an augmented C++
grammar, i.e. the processing instructions are embedded right into the
source language. I believe it is easier to keep the target language and the
language the processing instructions are expressed in separate, both for
the developer of the code and for the developer of the tool that is to
translate it. I have hopes that here, too, Synopsis could become useful.

Regards,
Stefan

--
...ich hab' noch einen Koffer in Berlin...
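The sort of pipeline wiring described above might look roughly like this in
a little python script. The processor classes here are made up for
illustration; they are not the actual synopsis API.

```python
# Illustrative sketch of wiring processors into a pipeline in python;
# the processor classes are invented, not the synopsis API.

class Processor:
    def process(self, ir):
        return ir  # default: pass the representation through unchanged

class Comments(Processor):
    """Attach comments to the declarations they precede."""
    def process(self, ir):
        ir['comments_attached'] = True
        return ir

class Linker(Processor):
    """Merge per-file representations into one (cf. AST 'linking')."""
    def process(self, ir):
        ir['linked'] = True
        return ir

def run_pipeline(ir, processors):
    """Feed the intermediate representation through each processor in
    turn; each stage receives the previous stage's output."""
    for p in processors:
        ir = p.process(ir)
    return ir

# The wiring itself is plain python, so a user script can reorder
# stages or tune parameters without touching the processors' code.
result = run_pipeline({'decls': []}, [Comments(), Linker()])
print(result['comments_attached'], result['linked'])  # True True
```

The appeal of this design is exactly what the text argues: processors may
be heavyweight C++ under the hood, while the workflow stays a few lines of
scriptable python.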