From: Karol L. <kar...@kn...> - 2007-05-03 11:11:28
On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> For instance, what was the effect of the recent change where you
> avoided calling extract() when the line was empty? It seems reasonable
> that this would speed things up, but did it in fact? What's the
> fastest way of testing whether a line is empty (must be
> cross-platform)? And so on.
Below, "parse_slower" is the same method as "parse" from the trunk without the
condition that checks if the line is empty.
langner@slim:~/tmp/python/cclib/trunk/data/Gaussian/Gaussian03$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("chn1.log.gz")
>>> import timeit
>>> t = timeit.Timer("a.clean(); a.parse()", "from __main__ import a")
>>> min(t.repeat(repeat=10,number=5))
.... logger output ....
0.92677688598632812
>>> t_slower = timeit.Timer("a.clean();a.parse_slower()", "from __main__ import a")
>>> min(t_slower.repeat(repeat=10,number=5))
... logger output ...
0.92177586353772345
I tried a bigger file and it also had no visible effect. So... what seemed
reasonable to me was wrong. I guess that revision can be reverted :)
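To answer the question about the fastest cross-platform blank-line test, here is one way to microbenchmark it; the sample data and function names below are invented for illustration, not taken from cclib:

```python
import timeit

# Illustrative sample: lines as read from a file, each ending in '\n'.
# One line in a hundred is blank, roughly like a real log file.
lines = ["  some output text\n"] * 99 + ["\n"]

def by_equality():
    # Compares against the bare newline; cheap, but misses lines that
    # contain only spaces, and assumes '\n' endings after universal
    # newline translation.
    return sum(1 for line in lines if line == "\n")

def by_strip():
    # Handles any whitespace-only line, at the cost of a string copy per line.
    return sum(1 for line in lines if not line.strip())

for fn in (by_equality, by_strip):
    print(fn.__name__, min(timeit.repeat(fn, repeat=3, number=1000)))
```

Both variants should count the same single blank line here; only the timing differs.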
> How we test each line has a large effect on efficiency. I point out
> again that using line[x:y]=="jklj" is much faster than using "word in
> line", or line.find(), and so these should be some of the first
> targets for improving efficiency.
Good point, confirmed by a profiling run:
langner@slim:~/tmp/python/cclib/trunk/data/GAMESS/basicGAMESS-US$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("C_bigbasis.out")
>>> import profile
>>> profile.run("a.parse()", "parse.prof")
>>> import pstats
>>> s = pstats.Stats("parse.prof")
>>> s.sort_stats("time")
>>> s.print_stats(.12)
Thu May 3 14:43:04 2007 parse.prof
199815 function calls in 9.069 CPU seconds
Ordered by: internal time
List reduced from 96 to 12 due to restriction <0.12>
ncalls tottime percall cumtime percall filename:lineno(function)
8581 4.548 0.001 8.625 0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/gamessparser.py:90(extract)
137355 3.080 0.000 3.080 0.000 :0(find)
20310 0.480 0.000 0.480 0.000 :0(len)
1 0.316 0.316 9.069 9.069 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:165(parse)
8600 0.184 0.000 0.184 0.000 :0(rstrip)
2143 0.140 0.000 0.140 0.000 :0(split)
2055 0.124 0.000 0.124 0.000 :0(range)
9145 0.076 0.000 0.076 0.000 :0(strip)
8868 0.060 0.000 0.060 0.000 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:375(updateprogress)
370 0.016 0.000 0.016 0.000 :0(append)
218 0.004 0.000 0.004 0.000 :0(replace)
31 0.004 0.000 0.032 0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:153(__setattr__)
> On 03/05/07, Karol Langner <kar...@kn...> wrote:
> > Some thoughts about more refactoring to the parser...
> >
> > If you take a look at the parsers after the recent refactoring, it is now
> > more evident that they are quite inefficient. That isn't a problem, since
> > cclib isn't about efficiency, but it would be nice. For example, even
> > something as simple as putting a 'return' statement at the end of each
> > parsing block would speed things up (the following conditions are not
> > evaluated). Anyway, this already suggests that it would be useful to
> > break up the extract() method into pieces, one for each block of parsed
> > output.
>
> I think that there is one case where the block shouldn't return, but
> in general it would be fine. However, it wouldn't speed things up that
> much, so I feel it is not worth doing. If you think about it, most
> lines don't match any of the 'if' statements. If each block is
> executed once, and there are 10 blocks, then the number of wasteful
> 'if' statements will be 9 + 8 + 7 + ... + 1 = 45.
There are also the lines between blocks that don't match any condition, but in
principle you're right, it's not worth it.
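The quoted count is easy to sanity-check:

```python
# With 10 blocks, each matching exactly once, the conditions evaluated
# after the matching block are the wasted ones: 9 for the first block,
# 8 for the second, and so on.
NBLOCKS = 10
wasted = sum(NBLOCKS - k for k in range(1, NBLOCKS + 1))
print(wasted)  # 9 + 8 + ... + 1 + 0 = 45
```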
> > I've been hovering around this subject for some time, and turning it
> > around in my mind. A dictionary of functions seems appropriate (with
> > regexps or something as keys), and more easy to manage that the current
> > "long" function. I don't think we can do away with the functions, since
> > sometimes pretty complicated operations are done with the parsed output.
> > The problem I see is where to define all these functions (30-40 separate
> > parsed blocks)?
>
> I don't think defining the functions is the problem - just define them
> in the gaussian parser for example. We could do this already without
> affecting anything, and leave the dictionary of functions idea till a
> later date.
What do you mean by 'gaussian parser' - the file gaussianparser.py or the
class? I think I didn't make myself clear - my worry is that if we define all
these functions in the parser class, then when you do "a =
ccopen("....").parse(); print dir(a)" you will get flooded with function names.
> > How about this: the functions would be defined in a different class, not
> > LogFile. What I'm suggesting, is to separate from the class that
> > represents a parsed log file a class that represents the parser.
> > Currently, they are one. An instance of the parser class would be an
> > attribute of the log file class, say "_parser". This object would hold
> > all the parsing functions and a dict used by the parse() method of
> > LogFile, and any other stuff needed for parsing. An additional advantage
> > is that the parser becomes less visible to the casual user, leaving only
> > parsed attributes in the log file object.
> >
> > Summarizing, I propose two layers of classes:
> > LogFile - subclasses Gaussian, GAMES, ...
> > LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> > The first remains as is (at least for the user), except that everything
> > related to parsing is put in the second. Of course, instances of the
> > latter class should be attributes of the instances of the former.
>
> I think you'll have to explain this some more...I'm not sure what the
> advantage is in doing this. I guess I don't have enough time right now
> to think this through fully...
Let me sketch out the idea. Snippets of the parse class (second layer):
class GaussianParser(LogFileParser):
    (...)
    def parse_charge(self, inputfile, line):
        super(GaussianParser, self).charge = (...)
    def parse_scfenergy(self, inputfile, line):
        super(GaussianParser, self).scfenergies.append(...)
    (...)
    self.parse_dict = {
        <regexp_charge>: self.parse_charge,
        <regexp_scfenergy>: self.parse_scfenergy,
        (...)
    }
Now the first layer, the log file class:
class Gaussian(LogFile):
    self._parser = GaussianParser(...)
    (...)
    def parse(self, ...):
        (...)
        for line in inputfile:
            for regexp in self._parser.parse_dict:
                if re.match(regexp, line):
                    self._parser.parse_dict[regexp](inputfile, line)
I hope that's clearer.
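For what it's worth, here is a self-contained toy version of the two layers that actually runs; the class names (ToyParser, ToyLogFile), regexps, and sample lines are all invented to illustrate the dispatch, not the real cclib API:

```python
import re

class ToyParser(object):
    """Second layer: holds the parsing functions and the dispatch dict."""
    def __init__(self):
        self.charge = None
        self.scfenergies = []
        # Map regexps to bound methods; parse() below drives this dict.
        self.parse_dict = {
            r"Charge\s*=\s*(-?\d+)": self.parse_charge,
            r"SCF Done:\s*E\S*\s*=\s*(-?\d+\.\d+)": self.parse_scfenergy,
        }

    def parse_charge(self, match, line):
        self.charge = int(match.group(1))

    def parse_scfenergy(self, match, line):
        self.scfenergies.append(float(match.group(1)))

class ToyLogFile(object):
    """First layer: the user-facing object; delegates parsing to _parser."""
    def __init__(self):
        self._parser = ToyParser()

    def parse(self, lines):
        for line in lines:
            for regexp, handler in self._parser.parse_dict.items():
                match = re.search(regexp, line)
                if match:
                    handler(match, line)

log = ToyLogFile()
log.parse([" Charge =  0 Multiplicity = 1",
           " SCF Done:  E(RHF) =  -76.0107465 A.U. after 9 cycles"])
print(log._parser.charge, log._parser.scfenergies)
```

A real implementation would pass the inputfile along as in the sketch above (some blocks span many lines), but the dispatch mechanics are the same.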
> > Waiting to hear what you think about this idea,
>
> I think I need some more time....
No problem, I'm just brainstorming on a holiday.
- Karol
--
written by Karol Langner
Thu May 3 14:15:39 CEST 2007