From: Karol L. <kar...@kn...> - 2007-05-03 11:11:28
On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> For instance, what was the effect of the recent change where you
> avoided calling extract() when the line was empty? It seems reasonable
> that this would speed things up, but did it in fact? What's the
> fastest way of testing whether a line is empty (must be
> cross-platform)? And so on.

Below, "parse_slower" is the same method as "parse" from the trunk without
the condition that checks if the line is empty.

langner@slim:~/tmp/python/cclib/trunk/data/Gaussian/Gaussian03$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("chn1.log.gz")
>>> import timeit
>>> t = timeit.Timer("a.clean(); a.parse()", "from __main__ import a")
>>> min(t.repeat(repeat=10, number=5))
.... logger output ....
0.92677688598632812
>>> t_slower = timeit.Timer("a.clean(); a.parse_slower()", "from __main__ import a")
>>> min(t_slower.repeat(repeat=10, number=5))
... logger output ...
0.92177586353772345

I tried a bigger file and it also had no visible effect. So... what seemed
reasonable to me was wrong. I guess that revision can be reverted :)

> How we test each line has a large effect on efficiency. I point out
> again that using line[x:y]=="jklj" is much faster than using "word in
> line", or line.find(), and so these should be some of the first
> targets for improving efficiency.

Good point, confirmed by a profiling run:

langner@slim:~/tmp/python/cclib/trunk/data/GAMESS/basicGAMESS-US$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("C_bigbasis.out")
>>> import profile
>>> profile.run("a.parse()", "parse.prof")
>>> import pstats
>>> s = pstats.Stats("parse.prof")
>>> s.sort_stats("time")
>>> s.print_stats(.12)
Thu May 3 14:43:04 2007    parse.prof

         199815 function calls in 9.069 CPU seconds

   Ordered by: internal time
   List reduced from 96 to 12 due to restriction <0.12>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8581    4.548    0.001    8.625    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/gamessparser.py:90(extract)
   137355    3.080    0.000    3.080    0.000 :0(find)
    20310    0.480    0.000    0.480    0.000 :0(len)
        1    0.316    0.316    9.069    9.069 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:165(parse)
     8600    0.184    0.000    0.184    0.000 :0(rstrip)
     2143    0.140    0.000    0.140    0.000 :0(split)
     2055    0.124    0.000    0.124    0.000 :0(range)
     9145    0.076    0.000    0.076    0.000 :0(strip)
     8868    0.060    0.000    0.060    0.000 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:375(updateprogress)
      370    0.016    0.000    0.016    0.000 :0(append)
      218    0.004    0.000    0.004    0.000 :0(replace)
       31    0.004    0.000    0.032    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:153(__setattr__)
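For reference, here is a small self-contained script in the spirit of the
timings above, comparing a fixed slice against "in" and find() on a single
line. The sample line and the slice offsets are made up just for
illustration (they are not the ones used in the parsers), so treat it as a
sketch rather than a benchmark:

import timeit

# Made-up sample line and offsets, purely for illustration.
line = " SCF Done:  E(RB3LYP) =  -1234.5678901     A.U. after   10 cycles\n"
setup = "from __main__ import line"

for stmt in ('line[1:10] == "SCF Done:"',
             '"SCF Done:" in line',
             'line.find("SCF Done:") >= 0'):
    t = timeit.Timer(stmt, setup)
    print("%-32s %f" % (stmt, min(t.repeat(repeat=5, number=1000000))))

The slice only ever looks at a fixed number of characters, while "in" and
find() have to scan the line, which is part of why :0(find) shows up so
prominently in the profile above.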
> On 03/05/07, Karol Langner <kar...@kn...> wrote:
> > Some thoughts about more refactoring to the parser...
> >
> > If you take a look at the parsers after the recent refactoring, it is now
> > more evident that they are quite inefficient. That isn't a problem, since
> > cclib isn't about efficiency, but it would be nice. For example, even
> > something as simple as putting a 'return' statement at the end of each
> > parsing block would speed things up (the following conditions are not
> > evaluated). Anyway, this already suggests that it would be useful to
> > break up the extract() method into pieces, one for each block of parsed
> > output.
>
> I think that there is one case where the block shouldn't return, but
> in general it would be fine. However, it wouldn't speed things up that
> much, so I feel it is not worth doing. If you think about it, most
> lines don't match any of the 'if' statements. If each block is
> executed once, and there are 10 blocks, then the number of wasteful
> 'if' statements will be 9 + 8 + 7 + ... + 1 = 45.

There are also the lines between blocks that don't match any condition, but
in principle you're right, it's not worth it.

> > I've been hovering around this subject for some time, and turning it
> > around in my mind. A dictionary of functions seems appropriate (with
> > regexps or something as keys), and easier to manage than the current
> > "long" function. I don't think we can do away with the functions, since
> > sometimes pretty complicated operations are done with the parsed output.
> > The problem I see is where to define all these functions (30-40 separate
> > parsed blocks)?
>
> I don't think defining the functions is the problem - just define them
> in the gaussian parser for example. We could do this already without
> affecting anything, and leave the dictionary of functions idea till a
> later date.

What do you mean by 'gaussian parser' - the file gaussianparser.py or the
class? I think I didn't make myself clear - my worry is that if we define
all these functions in the parser class, then when you go
'a = ccopen("....").parse(); print dir(a)' you will get flooded by function
names.

> > How about this: the functions would be defined in a different class, not
> > LogFile. What I'm suggesting is to separate the class that represents
> > the parser from the class that represents a parsed log file. Currently,
> > they are one. An instance of the parser class would be an attribute of
> > the log file class, say "_parser". This object would hold all the
> > parsing functions and a dict used by the parse() method of LogFile, and
> > any other stuff needed for parsing. An additional advantage is that the
> > parser becomes less visible to the casual user, leaving only parsed
> > attributes in the log file object.
> >
> > Summarizing, I propose two layers of classes:
> >   LogFile - subclasses Gaussian, GAMESS, ...
> >   LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> > The first remains as is (at least for the user), except that everything
> > related to parsing is put in the second. Of course, instances of the
> > latter class should be attributes of the instances of the former.
>
> I think you'll have to explain this some more... I'm not sure what the
> advantage is in doing this. I guess I don't have enough time right now
> to think this through fully...

Let me sketch out the idea. Snippets of the parser class (second layer):

class GaussianParser(LogFileParser):
    (...)
    def parse_charge(self, inputfile, line):
        super(GaussianParser, self).charge = (...)
    def parse_scfenergy(self, inputfile, line):
        super(GaussianParser, self).scfenergies.append(...)
    (...)
    self.parse_dict = { <regexp_charge>: self.parse_charge,
                        <regexp_scfenergy>: self.parse_scfenergy,
                        (...) }

Now the first layer, the log file class:

class Gaussian(LogFile):
    self._parser = GaussianParser(...)
    (...)
    def parse(self, ...):
        (...)
        for line in inputfile:
            for regexp in self._parser.parse_dict:
                if re.match(regexp, line):
                    self._parser.parse_dict[regexp](inputfile, line)

I hope that's clearer.
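To make the dispatch idea a bit more concrete, here is a minimal,
self-contained version of the two layers. This is just an illustration,
not code from the trunk: the trigger regexps and split indices are
simplified stand-ins, and the parser writes values directly onto the log
file object it is handed rather than going through super() as in the
snippet above.

import re

class LogFileParser(object):
    """Second layer: holds the parsing functions and the dispatch dict."""
    def __init__(self, logfile):
        self.logfile = logfile
        self.parse_dict = {}

class GaussianParser(LogFileParser):
    def __init__(self, logfile):
        LogFileParser.__init__(self, logfile)
        # Simplified stand-ins for the real trigger lines.
        self.parse_dict = {
            r" Charge =": self.parse_charge,
            r" SCF Done:": self.parse_scfenergy,
        }

    def parse_charge(self, inputfile, line):
        # Parsed values end up on the log file object, not on the parser.
        self.logfile.charge = int(line.split()[2])

    def parse_scfenergy(self, inputfile, line):
        if not hasattr(self.logfile, "scfenergies"):
            self.logfile.scfenergies = []
        self.logfile.scfenergies.append(float(line.split()[4]))

class LogFile(object):
    """First layer: what the user sees; only parsed attributes live here."""
    parser_class = LogFileParser

    def __init__(self, filename):
        self.filename = filename
        self._parser = self.parser_class(self)

    def parse(self):
        inputfile = open(self.filename)
        for line in inputfile:
            for regexp, func in self._parser.parse_dict.items():
                if re.match(regexp, line):
                    # The file object is passed along so a handler can read
                    # ahead over a multi-line block if it needs to.
                    func(inputfile, line)
        inputfile.close()
        return self

class Gaussian(LogFile):
    parser_class = GaussianParser

With a layout like this, something like Gaussian("chn1.log").parse()
(hypothetical file name) returns an object whose dir() is not flooded with
parse_* helpers - those stay hidden on the _parser object, and only the
parsed attributes show up on the log file instance.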
> > Waiting to hear what you think about this idea,
>
> I think I need some more time....

No problem, I'm just brainstorming on a holiday.

- Karol

--
written by Karol Langner
Thu May 3 14:15:39 CEST 2007