From: Karol L. <kar...@kn...> - 2007-05-03 11:11:28
On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> For instance, what was the effect of the recent change where you
> avoided calling extract() when the line was empty? It seems reasonable
> that this would speed things up, but did it in fact? What's the
> fastest way of testing whether a line is empty (must be
> cross-platform)? And so on.

Below, "parse_slower" is the same method as "parse" from the trunk without
the condition that checks if the line is empty.

langner@slim:~/tmp/python/cclib/trunk/data/Gaussian/Gaussian03$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("chn1.log.gz")
>>> import timeit
>>> t = timeit.Timer("a.clean(); a.parse()", "from __main__ import a")
>>> min(t.repeat(repeat=10, number=5))
.... logger output ....
0.92677688598632812
>>> t_slower = timeit.Timer("a.clean(); a.parse_slower()", "from __main__ import a")
>>> min(t_slower.repeat(repeat=10, number=5))
... logger output ...
0.92177586353772345

I tried a bigger file and it also had no visible effect. So... what seemed
reasonable to me was wrong. I guess that revision can be reverted :)

> How we test each line has a large effect on efficiency. I point out
> again that using line[x:y]=="jklj" is much faster than using "word in
> line", or line.find(), and so these should be some of the first
> targets for improving efficiency.

Good point, confirmed by a profiling run:

langner@slim:~/tmp/python/cclib/trunk/data/GAMESS/basicGAMESS-US$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("C_bigbasis.out")
>>> import profile
>>> profile.run("a.parse()", "parse.prof")
>>> import pstats
>>> s = pstats.Stats("parse.prof")
>>> s.sort_stats("time")
>>> s.print_stats(.12)
Thu May 3 14:43:04 2007    parse.prof

         199815 function calls in 9.069 CPU seconds

   Ordered by: internal time
   List reduced from 96 to 12 due to restriction <0.12>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8581    4.548    0.001    8.625    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/gamessparser.py:90(extract)
   137355    3.080    0.000    3.080    0.000 :0(find)
    20310    0.480    0.000    0.480    0.000 :0(len)
        1    0.316    0.316    9.069    9.069 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:165(parse)
     8600    0.184    0.000    0.184    0.000 :0(rstrip)
     2143    0.140    0.000    0.140    0.000 :0(split)
     2055    0.124    0.000    0.124    0.000 :0(range)
     9145    0.076    0.000    0.076    0.000 :0(strip)
     8868    0.060    0.000    0.060    0.000 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:375(updateprogress)
      370    0.016    0.000    0.016    0.000 :0(append)
      218    0.004    0.000    0.004    0.000 :0(replace)
       31    0.004    0.000    0.032    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:153(__setattr__)
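For reference, here is a small self-contained script in the spirit of the
timings above, comparing a fixed slice against "in" and find() on a single
line. The sample line and the slice offsets are made up just for
illustration (they are not the ones used in the parsers), so treat it as a
sketch rather than a benchmark:

import timeit

# Made-up sample line and offsets, purely for illustration.
line = " SCF Done:  E(RB3LYP) =  -1234.5678901     A.U. after   10 cycles\n"
setup = "from __main__ import line"

for stmt in ('line[1:10] == "SCF Done:"',
             '"SCF Done:" in line',
             'line.find("SCF Done:") >= 0'):
    t = timeit.Timer(stmt, setup)
    print("%-32s %f" % (stmt, min(t.repeat(repeat=5, number=1000000))))

The slice only ever looks at a fixed number of characters, while "in" and
find() have to scan the line, which is part of why :0(find) shows up so
prominently in the profile above.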
> On 03/05/07, Karol Langner <kar...@kn...> wrote:
> > Some thoughts about more refactoring to the parser...
> >
> > If you take a look at the parsers after the recent refactoring, it is now
> > more evident that they are quite inefficient. That isn't a problem, since
> > cclib isn't about efficiency, but it would be nice. For example, even
> > something as simple as putting a 'return' statement at the end of each
> > parsing block would speed things up (the following conditions are not
> > evaluated). Anyway, this already suggests that it would be useful to
> > break up the extract() method into pieces, one for each block of parsed
> > output.
>
> I think that there is one case where the block shouldn't return, but
> in general it would be fine. However, it wouldn't speed things up that
> much, so I feel it is not worth doing. If you think about it, most
> lines don't match any of the 'if' statements. If each block is
> executed once, and there are 10 blocks, then the number of wasteful
> 'if' statements will be 9 + 8 + 7 + ... + 1 = 45.

There are also the lines between blocks that don't match any condition, but
in principle you're right, it's not worth it.

> > I've been hovering around this subject for some time, and turning it
> > around in my mind. A dictionary of functions seems appropriate (with
> > regexps or something as keys), and easier to manage than the current
> > "long" function. I don't think we can do away with the functions, since
> > sometimes pretty complicated operations are done with the parsed output.
> > The problem I see is where to define all these functions (30-40 separate
> > parsed blocks)?
>
> I don't think defining the functions is the problem - just define them
> in the gaussian parser for example. We could do this already without
> affecting anything, and leave the dictionary of functions idea till a
> later date.

What do you mean by 'gaussian parser' - the file gaussianparser.py or the
class? I think I didn't make myself clear - my worry is that if we define
all these functions in the parser class, then when you go
'a = ccopen("....").parse(); print dir(a)' you will get flooded by function
names.

> > How about this: the functions would be defined in a different class, not
> > LogFile. What I'm suggesting is to separate the class that represents
> > the parser from the class that represents a parsed log file. Currently,
> > they are one. An instance of the parser class would be an attribute of
> > the log file class, say "_parser". This object would hold all the
> > parsing functions and a dict used by the parse() method of LogFile, and
> > any other stuff needed for parsing. An additional advantage is that the
> > parser becomes less visible to the casual user, leaving only parsed
> > attributes in the log file object.
> >
> > Summarizing, I propose two layers of classes:
> >   LogFile - subclasses Gaussian, GAMESS, ...
> >   LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> > The first remains as is (at least for the user), except that everything
> > related to parsing is put in the second. Of course, instances of the
> > latter class should be attributes of the instances of the former.
>
> I think you'll have to explain this some more... I'm not sure what the
> advantage is in doing this. I guess I don't have enough time right now
> to think this through fully...

Let me sketch out the idea. Snippets of the parser class (second layer):

class GaussianParser(LogFileParser):
    (...)
    def parse_charge(self, inputfile, line):
        super(GaussianParser, self).charge = (...)
    def parse_scfenergy(self, inputfile, line):
        super(GaussianParser, self).scfenergies.append(...)
    (...)
    self.parse_dict = { <regexp_charge>: self.parse_charge,
                        <regexp_scfenergy>: self.parse_scfenergy,
                        (...) }

Now the first layer, the log file class:

class Gaussian(LogFile):
    self._parser = GaussianParser(...)
    (...)
    def parse(self, ...):
        (...)
        for line in inputfile:
            for regexp in self._parser.parse_dict:
                if re.match(regexp, line):
                    self._parser.parse_dict[regexp](inputfile, line)

I hope that's clearer.
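To make the dispatch idea a bit more concrete, here is a minimal,
self-contained version of the two layers. This is just an illustration,
not code from the trunk: the trigger regexps and split indices are
simplified stand-ins, and the parser writes values directly onto the log
file object it is handed rather than going through super() as in the
snippet above.

import re

class LogFileParser(object):
    """Second layer: holds the parsing functions and the dispatch dict."""
    def __init__(self, logfile):
        self.logfile = logfile
        self.parse_dict = {}

class GaussianParser(LogFileParser):
    def __init__(self, logfile):
        LogFileParser.__init__(self, logfile)
        # Simplified stand-ins for the real trigger lines.
        self.parse_dict = {
            r" Charge =": self.parse_charge,
            r" SCF Done:": self.parse_scfenergy,
        }

    def parse_charge(self, inputfile, line):
        # Parsed values end up on the log file object, not on the parser.
        self.logfile.charge = int(line.split()[2])

    def parse_scfenergy(self, inputfile, line):
        if not hasattr(self.logfile, "scfenergies"):
            self.logfile.scfenergies = []
        self.logfile.scfenergies.append(float(line.split()[4]))

class LogFile(object):
    """First layer: what the user sees; only parsed attributes live here."""
    parser_class = LogFileParser

    def __init__(self, filename):
        self.filename = filename
        self._parser = self.parser_class(self)

    def parse(self):
        inputfile = open(self.filename)
        for line in inputfile:
            for regexp, func in self._parser.parse_dict.items():
                if re.match(regexp, line):
                    # The file object is passed along so a handler can read
                    # ahead over a multi-line block if it needs to.
                    func(inputfile, line)
        inputfile.close()
        return self

class Gaussian(LogFile):
    parser_class = GaussianParser

With a layout like this, something like Gaussian("chn1.log").parse()
(hypothetical file name) returns an object whose dir() is not flooded with
parse_* helpers - those stay hidden on the _parser object, and only the
parsed attributes show up on the log file instance.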
> > Waiting to hear what you think about this idea,
>
> I think I need some more time....

No problem, I'm just brainstorming on a holiday.

- Karol

--
written by Karol Langner
Thu May 3 14:15:39 CEST 2007