From: Noel O'B. <bao...@gm...> - 2007-05-03 08:23:33
If you are going to think about efficiency, I'd like to see some timings. That means something like timing the parsing of a particular very large file (several times, taking the minimum). I don't like to speculate too much on whether certain changes will make everything more efficient. For instance, what was the effect of the recent change where you avoided calling extract() when the line was empty? It seems reasonable that this would speed things up, but did it in fact? What's the fastest way of testing whether a line is empty (it must be cross-platform)? And so on. How we test each line has a large effect on efficiency. I point out again that using line[x:y]=="jklj" is much faster than using "word in line" or line.find(), so these should be some of the first targets for improving efficiency.

On 03/05/07, Karol Langner <kar...@kn...> wrote:
> Some thoughts about more refactoring to the parser...
>
> If you take a look at the parsers after the recent refactoring, it is now
> more evident that they are quite inefficient. That isn't a problem, since
> cclib isn't about efficiency, but it would be nice. For example, even
> something as simple as putting a 'return' statement at the end of each
> parsing block would speed things up (the following conditions are not
> evaluated). Anyway, this already suggests that it would be useful to break
> up the extract() method into pieces, one for each block of parsed output.

I think that there is one case where the block shouldn't return, but in general it would be fine. However, it wouldn't speed things up that much, so I feel it is not worth doing. If you think about it, most lines don't match any of the 'if' statements. If each block is executed once and there are 10 blocks, then the number of wasteful 'if' statements will only be 9 + 8 + 7 + ... + 1 = 45.

> I've been hovering around this subject for some time, and turning it around
> in my mind. A dictionary of functions seems appropriate (with regexps or
> something as keys), and easier to manage than the current "long" function.
> I don't think we can do away with the functions, since sometimes pretty
> complicated operations are done with the parsed output. The problem I see
> is where to define all these functions (30-40 separate parsed blocks)?

I don't think defining the functions is the problem - just define them in the Gaussian parser, for example. We could do this already without affecting anything, and leave the dictionary-of-functions idea until a later date.

> How about this: the functions would be defined in a different class, not
> LogFile. What I'm suggesting is to separate a class that represents the
> parser from the class that represents a parsed log file. Currently, they
> are one. An instance of the parser class would be an attribute of the log
> file class, say "_parser". This object would hold all the parsing functions
> and a dict used by the parse() method of LogFile, and any other stuff
> needed for parsing. An additional advantage is that the parser becomes less
> visible to the casual user, leaving only parsed attributes in the log file
> object.
>
> Summarizing, I propose two layers of classes:
> LogFile - subclasses Gaussian, GAMESS, ...
> LogFileParser - subclasses GaussianParser, GAMESSParser, ...
> The first remains as is (at least for the user), except that everything
> related to parsing is put in the second. Of course, instances of the latter
> class should be attributes of the instances of the former.
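Just so I'm sure I follow, is the idea something like the sketch below? The names, the handler signatures and the prefix dispatch are only my illustration of what you describe, not existing cclib code:

class LogFile(object):
    """Stand-in for the existing logfile class; holds the parsed attributes."""
    def __init__(self, filename):
        self.filename = filename


class GaussianParser(object):
    """Parser class: one function per parsed block, plus a dispatch dict."""

    def __init__(self):
        # Map a recognisable line fragment to the function that handles
        # that block (regexps would work just as well as keys).
        self.handlers = {
            "Charge =": self.charge_and_mult,
            "Standard orientation": self.atomcoords,
        }

    def charge_and_mult(self, data, inputfile, line):
        # Illustrative only; sets attributes on the logfile object.
        parts = line.split()
        data.charge = int(parts[2])
        data.mult = int(parts[5])

    def atomcoords(self, data, inputfile, line):
        # ...read the coordinate table from inputfile...
        pass


class Gaussian(LogFile):
    def __init__(self, filename):
        LogFile.__init__(self, filename)
        self._parser = GaussianParser()   # parsing machinery kept out of sight

    def extract(self, inputfile, line):
        # Dispatch on a fixed slice of the line (cheap), then return so no
        # further handlers are tried; real markers would need the right
        # offsets into the line.
        for fragment, handler in self._parser.handlers.items():
            if line[:len(fragment)] == fragment:
                handler(self, inputfile, line)
                return

If so, the casual user would only ever see the Gaussian object and its parsed attributes, with everything under _parser kept internal.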
I think you'll have to explain this some more... I'm not sure what the advantage is in doing this. I guess I don't have enough time right now to think this through fully...

> Waiting to hear what you think about this idea,

I think I need some more time....

Noel
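P.S. On the timings: something along these lines with timeit would settle which line test is cheapest (the sample line and marker are invented); the same idea, run over the parsing of a whole large file and taking the minimum of a few runs, would answer the extract() question.

import timeit

# Invented sample: a marker line padded out to a realistic length.
setup = 'line = " Standard orientation:" + " " * 60'

tests = [
    'line[1:21] == "Standard orientation"',    # fixed-slice comparison
    '"Standard orientation" in line',          # substring test
    'line.find("Standard orientation") >= 0',  # find()
]

for stmt in tests:
    # One million evaluations per run, three runs, keep the minimum.
    best = min(timeit.Timer(stmt, setup).repeat(3, 1000000))
    print("%-42s %.3f s" % (stmt, best))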