Re: [cclib-devel] further parser refactoring

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On 03/05/07, Karol Langner <kar...@kn...> wrote:
> On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> > For instance, what was the effect of the recent change where you
> > avoided calling extract() when the line was empty? It seems reasonable
> > that this would speed things up, but did it in fact? What's the
> > fastest way of testing whether a line is empty (must be
> > cross-platform)? And so on.
>
> Below, "parse_slower" is the same method as "parse" from the trunk without the
> condition that checks if the line is empty.
>
> langner@slim:~/tmp/python/cclib/trunk/data/Gaussian/Gaussian03$ python
> Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
> [GCC 3.4.6 (Debian 3.4.6-5)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import cclib
> >>> a = cclib.parser.ccopen("chn1.log.gz")
> >>> import timeit
> >>> t = timeit.Timer("a.clean(); a.parse()", "from __main__ import a")
> >>> min(t.repeat(repeat=10,number=5))
> .... logger output ....
> 0.92677688598632812
> >>> t_slower = timeit.Timer("a.clean();a.parse_slower()", "from __main__
> import a")
> >>> min(t_slower.repeat(repeat=10,number=5))
> ... logger output ...
> 0.92177586353772345
>
> I tried a bigger file and it also had no visible effect. So... what seemed
> reasonable to me was wrong. I guess that revision can be reverted :)

Maybe there's a quicker way of testing for an empty line. Or even
better, all lines less than 4 characters or something (although
clearly we need to be careful not to skip important lines).

> > How we test each line has a large effect on efficiency. I point out
> > again that using line[x:y]=="jklj" is much faster than using "word in
> > line", or line.find(), and so these should be some of the first
> > targets for improving efficiency.
>
> Good point, confirmed by a profiling run:
>
> langner@slim:~/tmp/python/cclib/trunk/data/GAMESS/basicGAMESS-US$ python
> Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
> [GCC 3.4.6 (Debian 3.4.6-5)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import cclib
> >>> a = cclib.parser.ccopen("C_bigbasis.out")
> >>> import profile
> >>> profile.run("a.parse()", "parse.prof")
> >>> import pstats
> >>> s = pstats.Stats("parse.prof")
> >>> s.sort_stats("time")
> >>> s.print_stats(.12)
> Thu May  3 14:43:04 2007    parse.prof
>
>          199815 function calls in 9.069 CPU seconds
>
>    Ordered by: internal time
>    List reduced from 96 to 12 due to restriction <0.12>
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      8581    4.548    0.001    8.625
> 0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/gamessparser.py:90
> (extract)
>    137355    3.080    0.000    3.080    0.000 :0(find)
>     20310    0.480    0.000    0.480    0.000 :0(len)
>         1    0.316    0.316    9.069
> 9.069 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:165
> (parse)
>      8600    0.184    0.000    0.184    0.000 :0(rstrip)
>      2143    0.140    0.000    0.140    0.000 :0(split)
>      2055    0.124    0.000    0.124    0.000 :0(range)
>      9145    0.076    0.000    0.076    0.000 :0(strip)
>      8868    0.060    0.000    0.060
> 0.000 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:375
> (updateprogress)
>       370    0.016    0.000    0.016    0.000 :0(append)
>       218    0.004    0.000    0.004    0.000 :0(replace)
>        31    0.004    0.000    0.032
> 0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:153
> (__setattr__)

I've never used the profiler. Can you interpret this for me in simple language?

> > On 03/05/07, Karol Langner <kar...@kn...> wrote:
> > > Some thoughts about more refactoring to the parser...
> > >
> > > If you take a look at the parsers after the recent refactoring, it is now
> > > more evident that they are quite inefficient. That isn't a problem, since
> > > cclib isn't about efficiency, but it would be nice. For example, even
> > > something as simple as putting a 'return' statement at the end of each
> > > parsing block would speed things up (the following conditions are not
> > > evaluated). Anyway, this already suggests that it would be useful to
> > > break up the extract() method into pieces, one for each block of parsed
> > > output.
> >
> > I think that there is one case where the block shouldn't return, but
> > in general it would be fine. However, it wouldn't speed things up that
> > much, so I feel it is not worth doing. If you think about it, most
> > lines don't match any of the 'if' statements. If each block is
> > executed once, and there are 10 blocks, then the number of wasteful
> > 'if' statements will be 9 + 8 + 7 + ...  + 1 = 45.
>
> There are also the lines between blocks that don't match any condition, but in
> principle you're right, it's not worth it.

>  > I've been hovering around this subject for some time, and turning it
> > > around in my mind. A dictionary of functions seems appropriate (with
> > > regexps or something as keys), and more easy to manage that the current
> > > "long" function. I don't think we can do away with the functions, since
> > > sometimes pretty complicated operations are done with the parsed output.
> > > The problem I see is where to define all these functions (30-40 separate
> > > parsed blocks)?
> >
> > I don't think defining the functions is the problem - just define them
> > in the gaussian parser for example. We could do this already without
> > affecting anything, and leave the dictionary of functions idea till a
> > later date.
>
> What do you mean by 'gaussian parser' - the file gaussianparser.py or the
> class? I think I didn't make myself clear - my worry is that if we define all
> these functions in the parser class, then when you go "a =
> ccopen("....").parse(); print dir(a)" you will get flooded by function names.

Ah, ok. I see. I think I would prefer users to use help(a), rather
than dir(a). I think that if you use function names starting with _ it
shouldn't appear in this list, although I haven't tested this.

> > > How about this: the functions would be defined in a different class, not
> > > LogFile. What I'm suggesting, is to separate from the class that
> > > represents a parsed log file a class that represents the parser.
> > > Currently, they are one. An instance of the parser class would be an
> > > attribute of the log file class, say "_parser". This object would hold
> > > all the parsing functions and a dict used by the parse() method of
> > > LogFile, and any other stuff needed for parsing. An additional advantage
> > > is that the parser becomes less visible to the casual user, leaving only
> > > parsed attributes in the log file object.
> > >
> > > Summarizing, I propose two layers of classes:
> > > LogFile - subclasses Gaussian, GAMES, ...
> > > LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> > > The first remains as is (at least for the user), except that everything
> > > related to parsing is put in the second. Of course, instances of the
> > > latter class should be attributes of the instances of the former.
> >
> > I think you'll have to explain this some more...I'm not sure what the
> > advantage is in doing this. I guess I don't have enough time right now
> > to think this through fully...
>
> Let me sketch out the idea. Snippets of the parse class (second layer):
>
> def GaussianParser(LogFileParser):
>         (...)
>         def parse_charge(self, inputfile, line):
>                 super(GaussianParser, self).charge = (...)
>         def parse_scfenergy(self, intputfile, line):
>                 super(GaussianParser, self).scfenergies.append(...)
>         (...)
>         self.parse_dict = {
>                 <regexp_charge>:                self.parse_charge,
>                 <regexp_scfenergy>:     self.parse_scfenergy,
>                 (...)
>         }
>
> Now the first layer, the log file class:
>
> def Gaussian(LogFile):
>         self._parser = GaussianParser(...)
>         (...)
>         def parse(self, ...):
>                 (...)
>                 for line in inputfile:
>                         for regexp in self._parser.parse_dict:
>                                 if re.match(regexp, line):
>                                         self._parser.parse_dict[regexp](line, inputfile)
>
> I hope that's clearer.

OK, thanks for that. As I said, it needs more time...

> No problem, I'm just brainstorming on a holiday.

If you want to do something neat, you can see if you can integrate
cclib with PyVib2, or whether you can get cclib to run under Jython
(there has been some interest in this, more later...)...I would hate
for you not to have something to do on your holiday :-)

Noel