From: Noel O'B. <bao...@gm...> - 2007-05-03 08:23:33
If you are going to think about efficiency, I'd like to see some timings. That means something like timing the parsing of a particular very large file (several times, taking the minimum). I don't like to speculate too much on whether certain changes will make everything more efficient. For instance, what was the effect of the recent change where you avoided calling extract() when the line was empty? It seems reasonable that this would speed things up, but did it in fact? What's the fastest way of testing whether a line is empty (it must be cross-platform)? And so on. How we test each line has a large effect on efficiency. I point out again that using line[x:y]=="jklj" is much faster than using "word in line" or line.find(), so these should be some of the first targets for improving efficiency.

On 03/05/07, Karol Langner <kar...@kn...> wrote:
> Some thoughts about more refactoring to the parser...
>
> If you take a look at the parsers after the recent refactoring, it is now
> more evident that they are quite inefficient. That isn't a problem, since
> cclib isn't about efficiency, but it would be nice. For example, even
> something as simple as putting a 'return' statement at the end of each
> parsing block would speed things up (the following conditions are not
> evaluated). Anyway, this already suggests that it would be useful to break
> up the extract() method into pieces, one for each block of parsed output.

I think that there is one case where the block shouldn't return, but in general it would be fine. However, it wouldn't speed things up that much, so I feel it is not worth doing. If you think about it, most lines don't match any of the 'if' statements. If each block is executed once and there are 10 blocks, then the number of wasteful 'if' statements will only be 9 + 8 + 7 + ... + 1 = 45.

> I've been hovering around this subject for some time, and turning it around
> in my mind. A dictionary of functions seems appropriate (with regexps or
> something as keys), and easier to manage than the current "long" function.
> I don't think we can do away with the functions, since sometimes pretty
> complicated operations are done with the parsed output. The problem I see
> is where to define all these functions (30-40 separate parsed blocks)?

I don't think defining the functions is the problem - just define them in the Gaussian parser, for example. We could do this already without affecting anything, and leave the dictionary-of-functions idea until a later date.

> How about this: the functions would be defined in a different class, not
> LogFile. What I'm suggesting is to separate a class that represents the
> parser from the class that represents a parsed log file. Currently, they
> are one. An instance of the parser class would be an attribute of the log
> file class, say "_parser". This object would hold all the parsing functions
> and a dict used by the parse() method of LogFile, and any other stuff
> needed for parsing. An additional advantage is that the parser becomes less
> visible to the casual user, leaving only parsed attributes in the log file
> object.
>
> Summarizing, I propose two layers of classes:
> LogFile - subclasses Gaussian, GAMESS, ...
> LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> The first remains as is (at least for the user), except that everything
> related to parsing is put in the second. Of course, instances of the latter
> class should be attributes of the instances of the former.
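Just so I'm sure I follow, is the idea something like the sketch below? The names, the handler signatures and the prefix dispatch are only my illustration of what you describe, not existing cclib code:

class LogFile(object):
    """Stand-in for the existing logfile class; holds the parsed attributes."""
    def __init__(self, filename):
        self.filename = filename


class GaussianParser(object):
    """Parser class: one function per parsed block, plus a dispatch dict."""

    def __init__(self):
        # Map a recognisable line fragment to the function that handles
        # that block (regexps would work just as well as keys).
        self.handlers = {
            "Charge =": self.charge_and_mult,
            "Standard orientation": self.atomcoords,
        }

    def charge_and_mult(self, data, inputfile, line):
        # Illustrative only; sets attributes on the logfile object.
        parts = line.split()
        data.charge = int(parts[2])
        data.mult = int(parts[5])

    def atomcoords(self, data, inputfile, line):
        # ...read the coordinate table from inputfile...
        pass


class Gaussian(LogFile):
    def __init__(self, filename):
        LogFile.__init__(self, filename)
        self._parser = GaussianParser()   # parsing machinery kept out of sight

    def extract(self, inputfile, line):
        # Dispatch on a fixed slice of the line (cheap), then return so no
        # further handlers are tried; real markers would need the right
        # offsets into the line.
        for fragment, handler in self._parser.handlers.items():
            if line[:len(fragment)] == fragment:
                handler(self, inputfile, line)
                return

If so, the casual user would only ever see the Gaussian object and its parsed attributes, with everything under _parser kept internal.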
I think you'll have to explain this some more... I'm not sure what the advantage is in doing this. I guess I don't have enough time right now to think this through fully...

> Waiting to hear what you think about this idea,

I think I need some more time....

Noel
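P.S. On the timings: something along these lines with timeit would settle which line test is cheapest (the sample line and marker are invented); the same idea, run over the parsing of a whole large file and taking the minimum of a few runs, would answer the extract() question.

import timeit

# Invented sample: a marker line padded out to a realistic length.
setup = 'line = " Standard orientation:" + " " * 60'

tests = [
    'line[1:21] == "Standard orientation"',    # fixed-slice comparison
    '"Standard orientation" in line',          # substring test
    'line.find("Standard orientation") >= 0',  # find()
]

for stmt in tests:
    # One million evaluations per run, three runs, keep the minimum.
    best = min(timeit.Timer(stmt, setup).repeat(3, 1000000))
    print("%-42s %.3f s" % (stmt, best))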