[cclib-devel] Parsing multiple files (Re: Turbomole parser)


On Saturday 11 August 2007 13:45, Noel O'Boyle wrote:
> Chris,
>
> The Turbomole parser seems to be coming along well. I've added a column
> to the wiki for keeping track of progress:
> http://cclib.sourceforge.net/wiki/index.php/Development_parsed_data#Details_of_current_implementation

Yes, I agree. I'd like to expand here on parsing multiple files, since
Turbomole output is the prime example of this. Having updated the files with
the code for parsing multiple output files, I'd like to demonstrate how that
works. This is from my working Turbomole branch, with the dvb_sp data files
copied in manually...

langner@slim:~/cclib/branches/turbomoleparser/src/cclib/parser$ python
Python 2.5 (r25:51908, Apr 30 2007, 15:03:13)
[GCC 3.4.6 (Debian 3.4.6-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from utils import ccopen

All the files were concatenated into turbo.out, I presume...
>>> ccopen("turbo.out").parse()
[Turbomole turbo.out INFO] Creating attribute natom: 20
[Turbomole turbo.out INFO] Creating attribute aonames[]
[Turbomole turbo.out INFO] Creating attribute nbasis: 60
[Turbomole turbo.out INFO] Creating attribute nmo: 60
[Turbomole turbo.out INFO] Creating attribute homos[]
[Turbomole turbo.out INFO] Creating attribute aooverlaps[]
[Turbomole turbo.out INFO] Creating attribute atomcoords[]
[Turbomole turbo.out INFO] Creating attribute atomnos[]
[Turbomole turbo.out INFO] Creating attribute moenergies[]
[Turbomole turbo.out INFO] Creating attribute mocoeffs[]
[Turbomole turbo.out INFO] Creating attribute coreelectrons[]
<cclib.data.ccData object at 0xb63a1dec>

Passing only the basis file does not create any attributes:
>>> ccopen("basis").parse()
<cclib.data.ccData object at 0xb63a1e6c>

To parse two files sequentially, pass them as a list:
>>> ccopen(["basis","control"]).parse()
[Turbomole ['basis', 'control'] INFO] Creating attribute natom: 20
[Turbomole ['basis', 'control'] INFO] Creating attribute aonames[]
[Turbomole ['basis', 'control'] INFO] Creating attribute nbasis: 60
[Turbomole ['basis', 'control'] INFO] Creating attribute nmo: 60
[Turbomole ['basis', 'control'] INFO] Creating attribute homos[]
[Turbomole ['basis', 'control'] INFO] Creating attribute coreelectrons[]
<cclib.data.ccData object at 0xb63a484c>

This will be equivalent to parsing turbo.out in terms of parsed attributes:
>>> ccopen(["basis","control","coord","energy","mos"]).parse()
	...
<cclib.data.ccData object at 0xb63a42cc>
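
For illustration, here is a minimal sketch of what "parsing files sequentially"
can mean at the line level: the files are read one after another as a single
stream, which is why a list of files behaves like one concatenated file. It
uses the standard-library fileinput module. The helper name iter_lines is
hypothetical, and this is not necessarily how ccopen handles a list internally.

```python
import fileinput


def iter_lines(filenames):
    """Yield the lines of each file in turn, in the order given.

    Hypothetical helper for illustration only; not part of cclib.
    """
    # An empty list would make fileinput fall back to stdin, so stop early.
    if not filenames:
        return
    # FileInput chains the files so that they read like one long stream.
    with fileinput.FileInput(files=filenames) as stream:
        for line in stream:
            yield line
```

Reading ["basis", "control"] this way gives the same lines, in the same order,
as reading a file made by concatenating basis and control.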

By the way, there is no need to add extra lines for Turbomole to be
recognized. ccopen can detect it with the condition
(line[0] == "$" and line[1].islower()), which at present is unique to
Turbomole among all the cclib parsers.
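
The detection condition above can be sketched as a small standalone function.
The function name and the choice to scan every line until a match is found are
assumptions made for illustration; only the condition itself comes from this
message. The length check is added so that empty or one-character lines do not
raise an IndexError.

```python
def looks_like_turbomole(lines):
    """Return True if any line starts with '$' followed by a lowercase letter.

    Hypothetical helper; the real check lives somewhere inside ccopen.
    """
    for line in lines:
        # Guard against lines too short to have a second character.
        if len(line) > 1 and line[0] == "$" and line[1].islower():
            return True
    return False
```

A line such as "$basis" matches, while "$END" or a bare "$" does not.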

In my opinion, we should do without concatenated files such as turbo.out, and 
parse the tests using multiple files where required. What do you think?

There are caveats that need to be dealt with: passing the files in the wrong
order crashes the parser. That is equivalent to concatenating in the wrong
order, so it still requires the user to do something right :)
>>> ccopen(["control","basis"]).parse()
[Turbomole ['control', 'basis'] INFO] Creating attribute natom: 20
[Turbomole ['control', 'basis'] INFO] Creating attribute aonames[]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "logfileparser.py", line 140, in parse
    self.extract(inputfile, line)
  File "turbomoleparser.py", line 214, in extract
    for i in range(0, len(self.basis_lib), 1):
AttributeError: 'Turbomole' object has no attribute 'basis_lib'
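
One possible way to take the ordering burden off the user is to sort the file
names before parsing. This is only a sketch: the helper name is hypothetical,
and the order used is simply the one from the five-file example earlier in this
message. The traceback only confirms that basis has to come before control; the
position of the other files is an assumption.

```python
import os

# Assumed order, copied from the five-file example in this message.
# It is not a documented requirement of the parser.
ASSUMED_ORDER = ["basis", "control", "coord", "energy", "mos"]


def sort_turbomole_files(filenames):
    """Return filenames with known Turbomole files first, in ASSUMED_ORDER."""
    def rank(path):
        name = os.path.basename(path)
        if name in ASSUMED_ORDER:
            return ASSUMED_ORDER.index(name)
        # Unknown names all share the last rank; because sorted() is stable,
        # they keep their original relative order at the end.
        return len(ASSUMED_ORDER)

    return sorted(filenames, key=rank)
```

With this, sort_turbomole_files(["control", "basis"]) gives the list in the
order that parsed successfully above.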

One more issue is how to get ccget to parse multiple files into one data
object. What I mean is something equivalent to passing a concatenated file
(turbo.out, for example) to ccget. I propose adding an option to ccget that
will do that. What do you all think?
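
To make the proposal concrete, here is a rough sketch of how such an option
might group the file names. Everything in it is hypothetical: the flag name
--multi, the program name, and both helper functions are invented for
illustration and are not existing ccget features. It also ignores the rest of
the ccget command line.

```python
import argparse


def build_parser():
    """Build a stand-in argument parser; not the real ccget interface."""
    parser = argparse.ArgumentParser(prog="ccget-sketch")
    # "--multi" is a hypothetical flag name, not an existing ccget option.
    parser.add_argument("--multi", action="store_true",
                        help="parse all files into a single data object")
    parser.add_argument("filenames", nargs="+")
    return parser


def group_sources(filenames, multi):
    """Return one source per data object that should be created.

    With multi=True all files form one source (a list of names);
    otherwise each file is its own source.
    """
    if multi:
        return [list(filenames)]
    return list(filenames)
```

Each element of the returned list would then be handed to ccopen, so with the
flag set the call would look like the list-based examples earlier in this
message.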

Cheers,
Karol

-- 
written by Karol Langner
Mon Aug 13 18:49:58 EDT 2007