Re: [cclib-devel] Q-Chem and ORCA parsers

This would certainly be interesting, but I have a few comments that are
probably Q-Chem specific:

1. As Martin already mentioned, unless you modify the `qchem` driver script
or pass the `-save` flag when running, the folder with all of these binary
files gets deleted. This means they usually aren't present unless you asked
for them in advance, whereas the regular plain-text output file is always
produced. This also differs from something like the HESS file, which is
present only when an analytic Hessian calculation is performed. ORCA is even
worse about this, since it deletes scratch/temp files early and often.

2. Since almost all of these files are named with a number, anyone who is
not a Q-Chem developer probably doesn't know what's in these files. For
example, 53 corresponds to the MO coefficients. Even knowing the
number/content mapping doesn't easily tell you the dimensionality of the
file; you have to know what all of the reads and writes to that file look
like.

2b. Once you know the number/content mapping and the dimensionality, there
is still no guarantee about what's in the file, due to nasty bits in the
source code with people using files for things other than what they are
specified as containing.
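
To make the dimensionality point concrete, here is a minimal sketch of
reading one of these files with numpy.fromfile. Everything about the layout
is an assumption for illustration: that the file starts with a flat block of
nmo * nbasis 64-bit floats, and that each MO is stored contiguously. The
function name `read_mo_block` is made up, and the real layout would have to
be checked against the Q-Chem source.

```python
import os

import numpy as np


def read_mo_block(path, nbasis, nmo, dtype=np.float64):
    """Read the leading nmo * nbasis values of a flat binary file.

    Assumed layout (not verified against Q-Chem): raw native-endian
    doubles with no header, one MO after another.
    """
    expected = nbasis * nmo
    nbytes_needed = expected * np.dtype(dtype).itemsize
    nbytes_found = os.path.getsize(path)
    # The file carries no shape information, so the size check is the
    # only cheap way to notice that the caller's dimensions are wrong.
    if nbytes_found < nbytes_needed:
        raise ValueError(
            f"{path}: need {nbytes_needed} bytes for a "
            f"{nmo} x {nbasis} block, found {nbytes_found}"
        )
    data = np.fromfile(path, dtype=dtype, count=expected)
    # One row per MO; this ordering is part of the assumed layout.
    return data.reshape(nmo, nbasis)
```

Calling `read_mo_block("53.0", nbasis, nmo)` would then return an array with
one row per orbital. Any bytes after the first block are ignored here, which
is exactly the kind of thing one has to know about the file in advance.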

If one wanted to make a Q-Chem-specific parser for a bunch of binary files,
it would need to be maintained by someone who is a developer and knows
where all of the reads/writes to that file are, but even then it's quite
risky. You just can't perform a visual spot-check on them to see if there
are problems. It's definitely functionality for expert users.
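
Since a visual spot-check isn't possible, one partial substitute is a
numerical check on whatever was read. A minimal sketch, assuming the MO
coefficients are stored one orbital per row and that the AO overlap matrix
is available from somewhere else (the function name `is_orthonormal` is
made up for illustration): orthonormal MOs must satisfy C S C^T = 1.

```python
import numpy as np


def is_orthonormal(mocoeffs, overlap, atol=1e-8):
    """Return True if the rows of mocoeffs are orthonormal under overlap.

    mocoeffs: array of shape (nmo, nbasis), one MO per row (assumed).
    overlap:  AO overlap matrix of shape (nbasis, nbasis).
    """
    metric = mocoeffs @ overlap @ mocoeffs.T
    return np.allclose(metric, np.eye(mocoeffs.shape[0]), atol=atol)
```

This only catches gross errors such as a wrong shape, a wrong offset, or a
transposed array. It says nothing about whether the file holds the MOs you
think it does.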

It seems as though this is something that should either be in a separate
package that depends on cclib, or at least be kept separate from the
parsers/ directory.

Martin, do you have scripts for parsing any of these that are publicly
viewable? I'd be interested to see how you do it. I'd really love to know
the structure of ORCA's *.gbw files.

Eric

On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...>
wrote:

> Here is the issue I created for this:
> https://github.com/cclib/cclib/issues/120
>
> On Aug 18 2014, Karol M. Langner wrote:
> > Hi Martin,
> >
> > That's true, technically it is not hard. I was thinking more about
> > integration with the rest of cclib. Binary files will not be "parsed"
> > in the sense logfiles are, that is the task is not one of reading a
> > text file line by line and extracting useful information. Rather, it
> > is one of reading data that is stored in a predefined format. This is
> > obvious, but it implies some design choices that have not been
> > discussed here. Above all, the line-by-line paradigm used in the class
> > LogFile (inherited by all parsers) is not useful and we cannot just
> > pass one extra file.
> >
> > Anyway, this can definitely be done, I'm just saying it needs a little
> > bit of forethought.
> >
> > Cheers,
> > Karol
> >
> > On Aug 18 2014, Martin Blood-Forsythe wrote:
> > > I'll have to look at the way that the multiple file parse is
> > > implemented for Molpro, but at least for ORCA and Q-Chem adding a
> > > binary file parser wouldn't require much different than the way a
> > > text based multiple file parsing works. I've had a lot of success
> > > parsing these files using numpy.fromfile and numpy.frombuffer.
> > > Typically the only required knowledge is the number of orbitals (to
> > > determine the array shape to dump into) and the file name.
> > >
> > > -Martin
>
> --
> written by Karol M. Langner
> Tue Aug 19 08:20:53 EDT 2014
>