
Some ideas on Parsing

Developers
Created: 2002-01-25 | Last updated: 2004-02-02
  • John Edward Judd

    Over the Christmas break I checked out some C++ parsers and parser generators (see ANTLR, www.antlr.org).

    One of the things that I determined during my investigations [apart from the fact that C++ is very hard to parse] was that we need something a little out of the ordinary. A C++ compiler builds a single compilation unit that incorporates all #include files, producing a single object file that is then linked into a single executable [or lib]. Somehow we need to change [or build] a parser that works on a file-by-file basis, building a description of an entire project rather than a single compilation unit.

    A normal C++ compiler also breaks down the act of parsing a C++ file into preprocessing and then processing. In our case I reckon we need a preprocessor that builds a structure of the project as well as logically "linking" the C++ files together. The parser then parses the individual files into a set of data that can be used when refactoring or reporting on metrics.
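    To make that concrete, here's a rough sketch of the kind of project-wide structure I'm imagining the preprocessor/parser pair could build [all names hypothetical, just to illustrate the idea, not a design]:

    --------------------------------------------------
    // Sketch only: one parse result per file, linked into a
    // project-wide description instead of one compilation unit.
    #include <map>
    #include <string>
    #include <vector>

    struct Symbol {
        std::string name;   // e.g. "MyClass::doStuff"
        std::string kind;   // "class", "function", "variable", ...
        int         line;   // where it is declared in this file
    };

    // What the parser produces for a single file, in isolation.
    struct FileInfo {
        std::vector<std::string> includes;  // headers named, not expanded
        std::vector<Symbol>      symbols;
    };

    // The whole project: path -> per-file data, stitched together
    // by following the 'includes' links.
    using ProjectDatabase = std::map<std::string, FileInfo>;
    --------------------------------------------------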

    Not having done much in the way of parsing before I'm not sure if this is the best way of doing this. Any ideas?

     
    • David Barker

      David Barker - 2002-01-25

      I'm not sure what you mean by a preprocessor that builds structure...
      I agree that the parsing should be done on a 'per file' basis. However, this is not a serious problem because (at least with gcc and msvc++) the preprocessor leaves # directives in its output that tell the parser / compiler what file and line the source came from (so that errors can be reported accurately). We will need to use this information. The compiler's parser otherwise ignores these directives, using them only to keep its 'current_file' and 'line_number' variables up to date (as I understand it).
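      For illustration, here's a sketch of that bookkeeping [untested, and the exact linemarker format varies by compiler - gcc emits lines like '# 42 "foo.h" 1' in its -E output]:

      --------------------------------------------------
      // Sketch: recognise gcc-style linemarkers in preprocessed
      // output and keep current_file / line_number up to date.
      #include <cstdio>
      #include <string>

      struct Position {
          std::string current_file;
          int         line_number = 0;
      };

      // Returns true if 'line' is a linemarker (the parser should then
      // skip it); the *next* line of output comes from 'line_number'.
      bool consume_linemarker(const std::string& line, Position& pos) {
          char file[1024];
          int  num;
          if (std::sscanf(line.c_str(), "# %d \"%1023[^\"]\"", &num, file) == 2) {
              pos.current_file = file;
              pos.line_number  = num;
              return true;
          }
          return false;
      }
      --------------------------------------------------

      The caller would bump line_number after each ordinary source line it hands on to the parser.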

      I think we need to build a program database of sorts; however, it must have the actual statements recorded in some form. I think that pdb files have only type info - I could be wrong; yet more reading to do, therefore: http://www.cs.uoregon.edu/research/paracomp/pdtoolkit/Docs/converter.txt

      (I'm not sure how this will deal with templates, which are buggers for a compiler anyway)

      I'm worried about submerging myself in eBooks and HTML docs. The ANTLR link was interesting; I skimmed it, and will read the docs a bit more thoroughly. I _think_ that it will give us more control over parsing, and perhaps (I'm guessing) a more expressive parse tree (than OpenC++) at the end. It's all good.

      What of templates? Certainly the code must accept and 'understand' them; however, do you want the template instantiation to be fully figured out? This must be done if we're to have the complete picture about the code.

      Do you know how ANTLR would deal with bitchin code like this:

      template <unsigned x>
      struct CompileTimeFactorial
      {
          enum { Result = x * CompileTimeFactorial<x - 1>::Result };
      };

      // base case: the specialization for 0 stops the recursion
      template <> struct CompileTimeFactorial<0> { enum { Result = 1 }; };

      (Like, I've not compiled this, so the details might be off - I don't do this in real code.)

      Do you want us to deal with this? Partial template specialization will be harder. Dagnabit.

       
      • John Edward Judd

        Sorry, I wasn't clear on that. I meant that the preprocessor's #include statement provides a link to the program database entry [symbol table] for the include file rather than actually including the file and parsing it. Since a header (or other) include file can be used many times by various compilation units, it has to be recompiled each time (unless a precompiled header is used), which wastes processing time. For a compiler it might be OK to recompile headers, but the refactorer needs to be somewhat more responsive.
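        Something like this is what I'm picturing [rough sketch, names made up]: parse each header once, and let every subsequent #include just refer to the cached entry.

        --------------------------------------------------
        // Sketch: a program database that parses a header only the
        // first time it is #included.  (Macro state ignored for now.)
        #include <map>
        #include <memory>
        #include <string>

        struct HeaderInfo {
            // symbols declared, nested includes, ... (omitted)
        };

        class ProgramDatabase {
        public:
            const HeaderInfo& entry_for(const std::string& header_path) {
                std::unique_ptr<HeaderInfo>& slot = cache_[header_path];
                if (!slot)
                    slot = parse_header(header_path);  // first use only
                return *slot;
            }
        private:
            static std::unique_ptr<HeaderInfo> parse_header(const std::string&) {
                return std::make_unique<HeaderInfo>();  // real parsing goes here
            }
            std::map<std::string, std::unique_ptr<HeaderInfo>> cache_;
        };
        --------------------------------------------------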

        True. The PDB does need additional info. Some projects use an Abstract Syntax Tree (AST) to store the code structure and a PDB or symbol table to store identifiers and the like. This would almost certainly be needed for more complex refactorings like Extract Method, or for metrics that report on the amount of duplication in the code.
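        Roughly, the split might look like this [hypothetical names, just to illustrate the AST/symbol-table division]:

        --------------------------------------------------
        // Sketch: the AST keeps structure, the symbol table keeps
        // identifiers and points back into the AST.
        #include <map>
        #include <string>
        #include <vector>

        struct AstNode {
            std::string           kind;   // "class", "function", "block", ...
            std::string           text;   // underlying source text
            std::vector<AstNode*> children;
        };

        // identifier -> declaring node.  Extract Method walks subtrees;
        // a duplication metric compares their 'text'.
        using SymbolTable = std::multimap<std::string, AstNode*>;
        --------------------------------------------------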

        One thing we don't necessarily want to do is to actually validate the code the way a compiler does. I'm not sure we care whether the code compiles or not. I say this because the developer may want to do a refactoring on broken code for one reason or another.

        One of the C++ grammars used by ANTLR has template support. You are right though; they seem like real buggers to parse.

        Eventually, I want to deal with as much as possible. However, for the moment, let's start small. I mentioned in another post that I'd like to develop this highly iteratively, so things like ASTs and full template processing can wait until later, when we understand the problem a bit better and when we need to implement the refactorings that depend on them. There's also no point in doing something now that we might not need, or that might have to be done entirely differently in the future.

         
    • David Barker

      David Barker - 2002-01-26

      I agree with keeping it simple, doing only what is requisite for each refactoring we pursue. I'm not so sure about dealing with broken code, but I'm willing to give it a bash and see how that pans out. I like the precompiled headers idea, and you're right about the tool needing to be responsive. I think we must decide which parser to work with. They both (presumably) have published APIs; let's decide on one and find out what we can do with it - i.e. start making a symbol lookup table. (Again, I'm worried about how the parser we choose will cope with broken code...)

      It's good to talk about all this stuff and feel out the problem space a bit. However, this is (I think for us both) not an area we're familiar with, so the best thing to do (IMHO) is to start some prototype-style coding (that we'll definitely throw away) and, for example, attempt to implement one single refactoring with a nasty command-line-driven interface, not concerned with speed (unless it takes like two minutes). Then we'll have a more useful understanding of what we're dealing with.

      Do you agree with this in principle?
      P.S. I do throw my prototypes away.

       
      • John Edward Judd

        I certainly do. Prototypes can be useful for understanding the problem space. Generally they shouldn't be used for production code, but more and more these days they are.

        I wouldn't throw away the prototype code; I'd keep it for reference. There will probably be parts that could be extracted and used in the main project.

         
        • John Edward Judd

          I forgot to mention: I also tend to do spikes - little side projects that experiment with different ways of doing things.

          These tend not to be prototypes, since they don't provide a functional 'product' the way a prototype does; they just experiment with one aspect of it. And often there are a bunch of them for a single aspect [like the preprocessor]. These are throwaway projects in the sense that you can even delete them [if you want] once you have the results of the experiment and have decided what you are going to do.

           
    • David Barker

      David Barker - 2002-01-26

      If we have the #include statement essentially point to a precompiled data structure for that file, then we must be careful that the macro definitions in effect at the point of the #include statement are known, because code like this would otherwise cause problems:

      --------------------------------------------------
      #ifndef MY_HEADER
      #define MY_HEADER

      #ifndef MINCING_MACRO
      #error MINCING_MACRO must be defined in MyHeader.
      #endif

      #if (MINCING_MACRO == 1)

      class ClassA
      {
      };

      #else

      class ClassB
      {
      };

      #endif // #if (MINCING_MACRO == 1)
      #endif // #ifndef MY_HEADER
      --------------------------------------------------

      So, I reckon the precompiled data structure should include the macro definitions in effect that are of significance (i.e. mentioned) in the precompiled file. This will mean there is >= 1 precompiled data structure for each precompiled header - one per distinct macro state.
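      Concretely, I imagine the cache being keyed on that macro state; something like this [sketch, untested, names made up]:

      --------------------------------------------------
      // Sketch: one cached parse per (header, significant macros) pair.
      #include <map>
      #include <string>
      #include <utility>

      // Values of the macros the header actually mentions at the point
      // of #include; an absent entry means "not defined".  For MyHeader
      // above, the key would record MINCING_MACRO's value.
      using MacroState = std::map<std::string, std::string>;

      struct HeaderData { /* parsed contents */ };

      using CacheKey = std::pair<std::string /*header path*/, MacroState>;
      std::map<CacheKey, HeaderData> precompiled_cache;
      --------------------------------------------------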

      Do you want to do it like this, or mandate that such macro shenanigans are not going to be handled, so these headers won't be precompiled?

      On code that is portable across platforms this is going to be widespread.  E.g.

      #ifdef WIN32
      // define my own thread class this way
      #endif

      #ifdef UNIX
      // define my own thread class that way...
      #endif

       
      • John Edward Judd

        The other option is to ignore macros completely and cater for code that is multiply defined.

        #ifdef WINDOWS

        class A { /* the Windows variant of this class */ };

        #else // LINUX

        class A { /* the Linux variant of this class */ };

        #endif

        That's nasty code AFAIAC, but it illustrates the point. If we rename class A in the Windows branch and the tool renames every reference, it should also rename the class in the Linux version. This applies equally to method definitions and bodies.

        Unfortunately, C++'s preprocessor, while it can be handy for the programmer, complicates things quite a lot.

        Hmmm. Maybe we can treat #if blocks like mini includes: create a block for each one that exists under the main file definition... Hmmm, bears some thought.
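        Rough sketch of what I mean [made-up names, nothing settled]:

        --------------------------------------------------
        // Sketch: each #if/#else branch becomes its own block under
        // the file, so a rename in one branch can visit the others.
        #include <string>
        #include <vector>

        struct ConditionalBlock {
            std::string              condition;     // e.g. "defined(WINDOWS)"
            std::vector<std::string> declarations;  // parsed nodes, later
        };

        struct FileBlocks {
            std::string                   path;
            std::vector<ConditionalBlock> blocks;   // one per branch, in order
        };
        --------------------------------------------------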

        For the time being though, in the interests of simplicity, let's just ignore all preprocessor statements other than #include.

         
    • David Barker

      David Barker - 2002-01-27

      Right, first off, I'm mailing this after being out drinking, and it's 03:26 for me. Right now, I reckon that the #if thing is only of theoretical significance (that's good words for me in this condition), because most of the time refactorings are going to be done on classes that are at a higher level of abstraction than the base 'system' classes - the ones that have to worry about system-level details and whose implementation (& therefore definition) is controlled by the preprocessor. So I agree that this is a bridge we'll cross when we come to it.

      Today I managed to download ANTLR and compile the grammar that maintains whitespace characters - which is significant for us - and read about this example. All very good. Will look more deeply into ASTs and the ANTLR C++ grammar tomorrow - well, today really. Au revoir.....

       
    • Anonymous

      Anonymous - 2003-05-24

      Hello, I wanted to bring some input to this project.
      I would be interested in joining it, although I'm not sure I have the time for it.

      In particular, here is a list of available tools you may find useful (if you don't know them already):

      - GEN++, a tool for creating C++ code parsers and tools,
         developed at http://www.cs.ucdavis.edu/~devanbu/genp/
         Two refactoring tools have been developed with it: http://www.cs.wayne.edu/~vip/RefactoringTools/

      - Artistic Style, a configurable C++/Java code beautifier that I use daily at work. The code is automatically beautified with astyle when doing a ClearCase checkin (or a CVS commit), so that it is always clean. Beautified code is also probably easier to parse.
         It is available here on SourceForge.

      - Doxygen, a powerful hyperlinked doc generator for C++, which can handle very large projects; used daily at work too.
         Generates HTML, Windows help format, LaTeX, hyperlinked PDF, and man pages: http://www.stack.nl/~dimitri/doxygen/

       
    • Anonymous

      Anonymous - 2003-08-12

      I found the following thesis about parsing C++. I haven't seen it mentioned before, so I thought it might be of interest.
      http://www.computing.surrey.ac.uk/research/dsrg/fog/FogThesis.pdf

       
    • Andre Baresel

      Andre Baresel - 2004-01-02

      There's no documentation yet - but the C++ parsing problem is solvable, as shown by the currently implemented parser.

      However, the parser is getting difficult to read because of all the symbol scanning and the various decisions that have to be made within a handwritten parser. Baptiste introduced the idea of using Python as a parsing description language and has played around with it - it seems to be a more readable solution.

      Preprocessing has been introduced - currently I am (Andre) working on #include. I decided to implement it like a real preprocessor, so that 'ifdef' can hide code. This is needed because of cycles in header includes.
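      The cycle case is the usual include-guard pattern; the guards only work if 'ifdef' really hides the nested #include:

      --------------------------------------------------
      // Two headers that include each other; only the guards stop
      // the expansion from recursing forever.
      // --- a.h ---
      #ifndef A_H
      #define A_H
      #include "b.h"
      struct A { };
      #endif
      // --- b.h ---
      #ifndef B_H
      #define B_H
      #include "a.h"
      struct B { };
      #endif
      --------------------------------------------------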

      Happy coding.

      Andre

       
    • Jay Kint

      Jay Kint - 2004-02-02

      I haven't seen anyone mention Keystone. It's an ISO-compliant C++ parser, probably the most feature-complete this side of gcc and EDG. It does no code generation, but does allow for output of an AST.

       

