From: Karol L. <kar...@kn...> - 2007-05-02 21:50:03
|
Some thoughts about more refactoring to the parser...

If you take a look at the parsers after the recent refactoring, it is now more evident that they are quite inefficient. That isn't a problem, since cclib isn't about efficiency, but it would be nice. For example, even something as simple as putting a 'return' statement at the end of each parsing block would speed things up (the following conditions are not evaluated). Anyway, this already suggests that it would be useful to break up the extract() method into pieces, one for each block of parsed output.

I've been hovering around this subject for some time, turning it around in my mind. A dictionary of functions seems appropriate (with regexps or something as keys), and easier to manage than the current "long" function. I don't think we can do away with the functions, since sometimes pretty complicated operations are done with the parsed output. The problem I see is where to define all these functions (30-40 separate parsed blocks)?

How about this: the functions would be defined in a different class, not LogFile. What I'm suggesting is to separate out, from the class that represents a parsed log file, a class that represents the parser. Currently, they are one. An instance of the parser class would be an attribute of the log file class, say "_parser". This object would hold all the parsing functions, a dict used by the parse() method of LogFile, and any other stuff needed for parsing. An additional advantage is that the parser becomes less visible to the casual user, leaving only parsed attributes in the log file object.

Summarizing, I propose two layers of classes:
  LogFile - subclasses Gaussian, GAMESS, ...
  LogFileParser - subclasses GaussianParser, GAMESSParser, ...
The first remains as is (at least for the user), except that everything related to parsing is put in the second. Of course, instances of the latter class should be attributes of the instances of the former.

Waiting to hear what you think about this idea,
Karol

--
written by Karol Langner
Thu May 3 01:20:44 CEST 2007
|
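A minimal sketch of the early-return idea raised above; the trigger strings and slice offsets here are invented placeholders, not actual cclib code:

    def extract(self, inputfile, line):
        if line[1:7] == "Charge":
            # ... parse the charge block here ...
            return
        if line[1:9] == "SCF Done":
            # ... parse the SCF energy block here ...
            return
        # without the 'return' statements, every remaining test below
        # would still be evaluated even after a block has matched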
From: Noel O'B. <bao...@gm...> - 2007-05-03 08:23:33
|
If you are going to think about efficiency, I'd like to see some timings. That means something like timing the parsing of a particular very large file (several times, taking the minimum). I don't like to speculate too much on whether certain changes will make everything more efficient. For instance, what was the effect of the recent change where you avoided calling extract() when the line was empty? It seems reasonable that this would speed things up, but did it in fact? What's the fastest way of testing whether a line is empty (must be cross-platform)? And so on.

How we test each line has a large effect on efficiency. I point out again that using line[x:y]=="jklj" is much faster than using "word in line" or line.find(), and so these should be some of the first targets for improving efficiency.

On 03/05/07, Karol Langner <kar...@kn...> wrote:
> Some thoughts about more refactoring to the parser...
>
> If you take a look at the parsers after the recent refactoring, it is now
> more evident that they are quite inefficient. That isn't a problem, since
> cclib isn't about efficiency, but it would be nice. For example, even
> something as simple as putting a 'return' statement at the end of each
> parsing block would speed things up (the following conditions are not
> evaluated). Anyway, this already suggests that it would be useful to
> break up the extract() method into pieces, one for each block of parsed
> output.

I think that there is one case where the block shouldn't return, but in general it would be fine. However, it wouldn't speed things up that much, so I feel it is not worth doing. If you think about it, most lines don't match any of the 'if' statements. If each block is executed once, and there are 10 blocks, then the number of wasteful 'if' statements will be 9 + 8 + 7 + ... + 1 = 45.

> I've been hovering around this subject for some time, turning it around
> in my mind. A dictionary of functions seems appropriate (with regexps or
> something as keys), and easier to manage than the current "long"
> function. I don't think we can do away with the functions, since
> sometimes pretty complicated operations are done with the parsed output.
> The problem I see is where to define all these functions (30-40 separate
> parsed blocks)?

I don't think defining the functions is the problem - just define them in the Gaussian parser, for example. We could do this already without affecting anything, and leave the dictionary of functions idea till a later date.

> How about this: the functions would be defined in a different class, not
> LogFile. What I'm suggesting is to separate out, from the class that
> represents a parsed log file, a class that represents the parser.
> Currently, they are one. An instance of the parser class would be an
> attribute of the log file class, say "_parser". This object would hold
> all the parsing functions, a dict used by the parse() method of LogFile,
> and any other stuff needed for parsing. An additional advantage is that
> the parser becomes less visible to the casual user, leaving only parsed
> attributes in the log file object.
>
> Summarizing, I propose two layers of classes:
>   LogFile - subclasses Gaussian, GAMESS, ...
>   LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> The first remains as is (at least for the user), except that everything
> related to parsing is put in the second. Of course, instances of the
> latter class should be attributes of the instances of the former.
I think you'll have to explain this some more... I'm not sure what the advantage is in doing this. I guess I don't have enough time right now to think this through fully...

> Waiting to hear what you think about this idea,

I think I need some more time....

Noel
|
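For concreteness, the three line tests Noel compares would look like this inside an extract()-style method; the sample line and offsets are invented, and the relative costs are Noel's claim, which the timings later in this thread refine:

    line = " SCF Done:  E(RHF) =  -76.4  A.U. after   9 cycles"

    if line[1:9] == "SCF Done":      # fixed slice: cost independent of line length
        pass
    if "SCF Done" in line:           # scans the line until a match (or the end)
        pass
    if line.find("SCF Done") >= 0:   # a method call on top of the same scan
        pass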
From: Karol L. <kar...@kn...> - 2007-05-03 11:11:28
|
On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> For instance, what was the effect of the recent change where you
> avoided calling extract() when the line was empty? It seems reasonable
> that this would speed things up, but did it in fact? What's the
> fastest way of testing whether a line is empty (must be
> cross-platform)? And so on.

Below, "parse_slower" is the same method as "parse" from the trunk, but without the condition that checks if the line is empty.

langner@slim:~/tmp/python/cclib/trunk/data/Gaussian/Gaussian03$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("chn1.log.gz")
>>> import timeit
>>> t = timeit.Timer("a.clean(); a.parse()", "from __main__ import a")
>>> min(t.repeat(repeat=10, number=5))
.... logger output ....
0.92677688598632812
>>> t_slower = timeit.Timer("a.clean(); a.parse_slower()", "from __main__ import a")
>>> min(t_slower.repeat(repeat=10, number=5))
... logger output ...
0.92177586353772345

I tried a bigger file and it also had no visible effect. So... what seemed reasonable to me was wrong. I guess that revision can be reverted :)

> How we test each line has a large effect on efficiency. I point out
> again that using line[x:y]=="jklj" is much faster than using "word in
> line" or line.find(), and so these should be some of the first
> targets for improving efficiency.

Good point, confirmed by a profiling run:

langner@slim:~/tmp/python/cclib/trunk/data/GAMESS/basicGAMESS-US$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cclib
>>> a = cclib.parser.ccopen("C_bigbasis.out")
>>> import profile
>>> profile.run("a.parse()", "parse.prof")
>>> import pstats
>>> s = pstats.Stats("parse.prof")
>>> s.sort_stats("time")
>>> s.print_stats(.12)
Thu May 3 14:43:04 2007    parse.prof

         199815 function calls in 9.069 CPU seconds

   Ordered by: internal time
   List reduced from 96 to 12 due to restriction <0.12>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8581    4.548    0.001    8.625    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/gamessparser.py:90(extract)
   137355    3.080    0.000    3.080    0.000 :0(find)
    20310    0.480    0.000    0.480    0.000 :0(len)
        1    0.316    0.316    9.069    9.069 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:165(parse)
     8600    0.184    0.000    0.184    0.000 :0(rstrip)
     2143    0.140    0.000    0.140    0.000 :0(split)
     2055    0.124    0.000    0.124    0.000 :0(range)
     9145    0.076    0.000    0.076    0.000 :0(strip)
     8868    0.060    0.000    0.060    0.000 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:375(updateprogress)
      370    0.016    0.000    0.016    0.000 :0(append)
      218    0.004    0.000    0.004    0.000 :0(replace)
       31    0.004    0.000    0.032    0.001 /home/langner/apps/python/lib/python2.5/site-packages/cclib/parser/logfileparser.py:153(__setattr__)

> On 03/05/07, Karol Langner <kar...@kn...> wrote:
> > Some thoughts about more refactoring to the parser...
> > [...] For example, even something as simple as putting a 'return'
> > statement at the end of each parsing block would speed things up (the
> > following conditions are not evaluated). Anyway, this already suggests
> > that it would be useful to break up the extract() method into pieces,
> > one for each block of parsed output.
>
> I think that there is one case where the block shouldn't return, but
> in general it would be fine. However, it wouldn't speed things up that
> much, so I feel it is not worth doing. If you think about it, most
> lines don't match any of the 'if' statements. If each block is
> executed once, and there are 10 blocks, then the number of wasteful
> 'if' statements will be 9 + 8 + 7 + ... + 1 = 45.

There are also the lines between blocks that don't match any condition, but in principle you're right, it's not worth it.

> > I've been hovering around this subject for some time [...] The problem
> > I see is where to define all these functions (30-40 separate parsed
> > blocks)?
>
> I don't think defining the functions is the problem - just define them
> in the Gaussian parser, for example. We could do this already without
> affecting anything, and leave the dictionary of functions idea till a
> later date.

What do you mean by 'Gaussian parser' - the file gaussianparser.py or the class? I think I didn't make myself clear - my worry is that if we define all these functions in the parser class, then when you go "a = ccopen("....").parse(); print dir(a)" you will get flooded by function names.

> > How about this: the functions would be defined in a different class,
> > not LogFile. [...] Summarizing, I propose two layers of classes:
> >   LogFile - subclasses Gaussian, GAMESS, ...
> >   LogFileParser - subclasses GaussianParser, GAMESSParser, ...
>
> I think you'll have to explain this some more... I'm not sure what the
> advantage is in doing this. I guess I don't have enough time right now
> to think this through fully...

Let me sketch out the idea. Snippets of the parser class (second layer):

class GaussianParser(LogFileParser):
    (...)
    def parse_charge(self, inputfile, line):
        super(GaussianParser, self).charge = (...)
    def parse_scfenergy(self, inputfile, line):
        super(GaussianParser, self).scfenergies.append(...)
    (...)
    self.parse_dict = {
        <regexp_charge>: self.parse_charge,
        <regexp_scfenergy>: self.parse_scfenergy,
        (...)
    }

Now the first layer, the log file class:

class Gaussian(LogFile):
    self._parser = GaussianParser(...)
    (...)
    def parse(self, ...):
        (...)
        for line in inputfile:
            for regexp in self._parser.parse_dict:
                if re.match(regexp, line):
                    self._parser.parse_dict[regexp](line, inputfile)

I hope that's clearer.

> > Waiting to hear what you think about this idea,
>
> I think I need some more time....

No problem, I'm just brainstorming on a holiday.

- Karol

--
written by Karol Langner
Thu May 3 14:15:39 CEST 2007
|
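To make the sketch above concrete, here is a self-contained toy version of the dispatch idea. The class names follow the proposal, but the regexps, token offsets and attribute handling are illustrative assumptions, not working cclib code:

import re

class GaussianParser(object):
    """Second layer: holds the parsing functions and the dispatch dict."""
    def __init__(self, logfile):
        self.logfile = logfile
        # map compiled regexps to handler methods
        self.parse_dict = {
            re.compile("Charge ="): self.parse_charge,
            re.compile("SCF Done"): self.parse_scfenergy,
        }
    def parse_charge(self, inputfile, line):
        # e.g. " Charge =  0 Multiplicity = 1"
        self.logfile.charge = int(line.split()[2])
    def parse_scfenergy(self, inputfile, line):
        # e.g. " SCF Done:  E(RHF) =  -76.0  A.U. after 10 cycles"
        self.logfile.scfenergies.append(float(line.split()[4]))

class Gaussian(object):
    """First layer: the parsed log file; only parsed attributes are visible."""
    def __init__(self):
        self.scfenergies = []
        self._parser = GaussianParser(self)
    def parse(self, inputfile):
        for line in inputfile:
            for regexp, handler in self._parser.parse_dict.items():
                if regexp.search(line):
                    handler(inputfile, line)
                    break

So g = Gaussian(); g.parse(open("somefile.log")) would leave g.charge and g.scfenergies set, with the parser itself tucked away in g._parser. Note search() rather than the match() in the sketch, since these trigger phrases rarely start at column zero; and with 30-40 patterns tried against every line, this inner loop is exactly where the efficiency questions above come back in.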
From: Noel O'B. <bao...@gm...> - 2007-05-03 11:21:30
|
On 03/05/07, Karol Langner <kar...@kn...> wrote:
> On Thursday 03 May 2007 10:23, Noel O'Boyle wrote:
> > For instance, what was the effect of the recent change where you
> > avoided calling extract() when the line was empty? It seems reasonable
> > that this would speed things up, but did it in fact? What's the
> > fastest way of testing whether a line is empty (must be
> > cross-platform)? And so on.
>
> Below, "parse_slower" is the same method as "parse" from the trunk, but
> without the condition that checks if the line is empty.
>
> [timeit session snipped - see previous message]
>
> I tried a bigger file and it also had no visible effect. So... what
> seemed reasonable to me was wrong. I guess that revision can be
> reverted :)

Maybe there's a quicker way of testing for an empty line. Or even better, skip all lines of less than 4 characters or something (although clearly we need to be careful not to skip important lines).

> > How we test each line has a large effect on efficiency. I point out
> > again that using line[x:y]=="jklj" is much faster than using "word in
> > line" or line.find(), and so these should be some of the first
> > targets for improving efficiency.
>
> Good point, confirmed by a profiling run:
>
> [profiler output snipped - see previous message]

I've never used the profiler. Can you interpret this for me in simple language?
> [...]
>
> > I don't think defining the functions is the problem - just define them
> > in the Gaussian parser, for example. We could do this already without
> > affecting anything, and leave the dictionary of functions idea till a
> > later date.
>
> What do you mean by 'Gaussian parser' - the file gaussianparser.py or
> the class? I think I didn't make myself clear - my worry is that if we
> define all these functions in the parser class, then when you go "a =
> ccopen("....").parse(); print dir(a)" you will get flooded by function
> names.

Ah, OK, I see. I think I would prefer users to use help(a) rather than dir(a). I think that if you use function names starting with _ they shouldn't appear in this list, although I haven't tested this.

> > > How about this: the functions would be defined in a different class,
> > > not LogFile. [...] Summarizing, I propose two layers of classes:
> > >   LogFile - subclasses Gaussian, GAMESS, ...
> > >   LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> >
> > I think you'll have to explain this some more... I'm not sure what the
> > advantage is in doing this. I guess I don't have enough time right now
> > to think this through fully...
>
> Let me sketch out the idea. Snippets of the parser class (second layer):
>
> [code sketch snipped - see previous message]
>
> I hope that's clearer.

OK, thanks for that. As I said, it needs more time...

> No problem, I'm just brainstorming on a holiday.

If you want to do something neat, you can see if you can integrate cclib with PyVib2, or whether you can get cclib to run under Jython (there has been some interest in this, more later...). I would hate for you not to have something to do on your holiday :-)

Noel
|
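One way to settle the empty-line question above would be a quick timeit comparison; the candidate tests below are guesses at plausible contenders, and the setup assumes lines keep a trailing newline (with universal-newline reading, so a blank line is just '\n'):

import timeit

setup = "line = '\\n'"   # a blank line, as returned by file iteration
for stmt in ["line == '\\n'", "not line.strip()",
             "len(line) == 1", "line.isspace()"]:
    t = timeit.Timer(stmt, setup)
    print stmt, min(t.repeat(repeat=5, number=1000000))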
From: Karol L. <kar...@kn...> - 2007-05-07 08:28:09
|
On Thursday 03 May 2007 13:21, Noel O'Boyle wrote:
> > > How we test each line has a large effect on efficiency. I point out
> > > again that using line[x:y]=="jklj" is much faster than using "word in
> > > line" or line.find(), and so these should be some of the first
> > > targets for improving efficiency.
> >
> > Good point, confirmed by a profiling run:
> >
> > [profiler output snipped - see earlier message]
>
> I've never used the profiler. Can you interpret this for me in simple
> language?

The profiler measures the time used for function calls when executing a command. In the columns you have:

  ncalls        - number of times a function was called
  tottime       - time spent in the given function (excluding time in sub-functions)
  percall       - tottime/ncalls
  cumtime       - time spent in the function including sub-functions (from invocation to exit)
  percall (2nd) - cumtime/ncalls

Now that I think about all this, though, statements such as "word in line" and "line[i:j] == word" are not measured here, since they are not function calls (their time is accumulated into the time of extract()). A simple little test shows that find() is in fact the worst, but "word in line" is at least comparable to "line[i:j] == word":

>>> import timeit
>>> t1 = timeit.Timer("'a' in 'abcdefg'")
>>> t2 = timeit.Timer("'abcdefg'[:1] == 'a'")
>>> t3 = timeit.Timer("'abcdefg'.find('a')")
>>> min(t1.repeat(repeat=100, number=1000000))
0.18727612495422363
>>> min(t2.repeat(repeat=100, number=1000000))
0.3044281005859375
>>> min(t3.repeat(repeat=100, number=1000000))
0.7338860034942627

- Karol

--
written by Karol Langner
Mon May 7 11:47:55 CEST 2007
|
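A two-function toy run may make the tottime/cumtime distinction in the explanation above easier to see; the function names are invented and the timings will vary:

import profile

def inner():
    return sum(range(100000))

def outer():
    # tottime for outer excludes the time spent inside inner(),
    # while cumtime for outer includes it
    return inner()

profile.run("outer()")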
From: Noel O'B. <bao...@gm...> - 2007-05-07 08:45:27
|
> A simple little test shows that find() is in fact the worst, but "word in
> line" is at least comparable to "line[i:j] == word":
>
> >>> import timeit
> >>> t1 = timeit.Timer("'a' in 'abcdefg'")
> >>> t2 = timeit.Timer("'abcdefg'[:1] == 'a'")
> >>> t3 = timeit.Timer("'abcdefg'.find('a')")
> >>> min(t1.repeat(repeat=100, number=1000000))
> 0.18727612495422363
> >>> min(t2.repeat(repeat=100, number=1000000))
> 0.3044281005859375
> >>> min(t3.repeat(repeat=100, number=1000000))
> 0.7338860034942627

I was surprised by this, but it's the same for me. However, my earlier experiments showed 'in' to be the worst, and that is because most lines don't match the expression. I will show some timings for a large log file when I get a chance.

Noel
|
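A sketch of what such a timing might look like; the sample line is invented, but it shows why 'in' should fare worse when the word is absent (it must scan the whole line), while a fixed slice costs the same either way:

import timeit

setup = "line = ' SCF Done:  E(RHF) =  -76.4  A.U. after   9 cycles'"
tests = ["'SCF Done' in line",       # matches near the start: fast
         "'XYZZY' in line",          # no match: scans the whole line
         "line[1:9] == 'SCF Done'",  # fixed slice: same cost either way
         "line[1:9] == 'XYZZY'"]
for stmt in tests:
    t = timeit.Timer(stmt, setup)
    print stmt, min(t.repeat(repeat=5, number=1000000))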