Re: [cclib-devel] Q-Chem and ORCA parsers

1. Yes, it is true that most codes delete their binary files unless you tell
them not to, which is why I mentioned the need to be careful about
checking for the existence of a file. ORCA is actually a bit unusual in
that it always leaves the *.gbw binary file around.  I don't know
the *entire* contents of the file, but I know how to parse the MO
coefficients/energies, occupation numbers, job name, and perhaps orbital
labels (1pz, 2s, etc.; the offsets to these are a little tricky since it is a
mixed binary/ASCII format). As far as I know, the *.hess file is a text
file, but the point about job-specific binaries, like the *.cis file, is
certainly correct.

2. With regard to Q-Chem's custom binary files:
There are two different categories of binary files: the ones that get
saved if you run 'qchem infile outfile savename', and the more complete set
that is stored if you use the '-save' option.
For the most part I think that the standard ones saved without the '-save'
option are pretty safe to trust. These include the 53.0 [MO coefficients +
eigenvalues] and 54.0 [density matrix].  There are a few other binary files
that I trust: 320.0 [overlap matrix], 601.0 [NBO coefficients].
The general point, though, that binary files are a pain and can require some
developer-level knowledge is well taken. One reason that I thought adding
binary parsing support might be neat is exactly because non-developers have
no idea what is in these files.
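
A minimal sketch of the existence check, using only the file numbers
listed above. The function name `find_known_files` and the `save_dir`
argument are hypothetical; where the saved files actually end up depends on
how Q-Chem was invoked, so treat the directory as something the caller has
to supply.

```python
import os

# File name -> contents, exactly as listed in this message.
KNOWN_FILES = {
    "53.0": "MO coefficients + eigenvalues",
    "54.0": "density matrix",
    "320.0": "overlap matrix",
    "601.0": "NBO coefficients",
}


def find_known_files(save_dir):
    """Return {file name: description} for known files present in save_dir.

    A missing directory or missing file is not an error, since these
    binaries are frequently deleted at the end of a job.
    """
    found = {}
    for name, description in KNOWN_FILES.items():
        if os.path.isfile(os.path.join(save_dir, name)):
            found[name] = description
    return found
```

A caller would only attempt to read a binary file if it shows up in the
returned dictionary, and fall back to the plain-text output otherwise.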

2b. The point about developers misusing the binary files in Q-Chem is
certainly a good one, and is a strong reason not to support parsing any of
the non-vanilla binary files.  I'd argue that the 53.0 and 54.0 files are
quite safe to parse, since they have a very well-defined structure that
hasn't changed across versions 3 through 4.2 as far as I can tell.  Their
content mapping is very easy and only requires knowing the total number of
orbitals.
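
As a rough sketch of what reading such a file with numpy.fromfile could
look like: the layout below (float64 values ordered as alpha coefficients,
beta coefficients, alpha eigenvalues, beta eigenvalues, with as many
orbitals as basis functions) is an assumption made for illustration and is
not spelled out anywhere in this thread, so it should be verified against a
real file. The function name `read_mo_file` is hypothetical as well.

```python
import os

import numpy as np


def read_mo_file(path, norb):
    """Read a 53.0-style file into (C_alpha, C_beta, e_alpha, e_beta).

    ASSUMED layout (unverified): float64 values stored as alpha
    coefficients, beta coefficients, alpha eigenvalues, beta eigenvalues,
    with norb basis functions and norb orbitals.
    """
    if not os.path.isfile(path):
        return None  # the file is often deleted, so absence is not an error

    raw = np.fromfile(path, dtype=np.float64)
    n2 = norb * norb
    expected = 2 * n2 + 2 * norb
    if raw.size != expected:
        # A size mismatch means the assumed layout or norb is wrong.
        raise ValueError(f"expected {expected} values, found {raw.size}")

    coeffs_alpha = raw[:n2].reshape(norb, norb)
    coeffs_beta = raw[n2:2 * n2].reshape(norb, norb)
    energies_alpha = raw[2 * n2:2 * n2 + norb]
    energies_beta = raw[2 * n2 + norb:]
    return coeffs_alpha, coeffs_beta, energies_alpha, energies_beta
```

The size check is the only sanity test available without opening the file
by hand. Whether each row of the reshaped coefficient array is an orbital
or a basis function is also an assumption to confirm before use.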

3. I agree that any package that attempts to unpack a whole bunch of other
binary files should probably be separate from cclib.  Mainly I was
wondering if the MO coefficients & eigenvalues would be of interest for
visualization purposes.  The tricky part there is that a lot of codes use
non-standard basis set definitions that differ from the EMSL specifications
of the same name.

4. My scripts that unpack *.gbw and Q-Chem binaries are currently wrapped
up with some other code.  I will strip them out and add them to
https://github.com/mbloodforsythe/qc-binary-parsers

---------------------------------------------------
Martin Blood-Forsythe
Graduate Student in Physics
Harvard University
Aspuru-Guzik Lab

On Tue, Aug 19, 2014 at 12:57 PM, Eric Berquist <er...@pi...> wrote:

> This would certainly be interesting, but I have a few comments that are
> probably Q-Chem specific:
>
> 1. As Martin already mentioned, unless you modify the `qchem` driver
> script or you pass the flag `-save` when running, the folder with all of
> these binary files gets deleted. This means they usually aren't present,
> except when you want them, in contrast to the regular plain-text output
> file which is always produced. This is in contrast to something like the
> HESS file, which is only present when performing an analytic Hessian
> calculation. ORCA is even worse about this, since it deletes scratch/temp
> files early and often.
>
> 2. Since almost all of these files are named with a number, anyone who is
> not a Q-Chem developer probably doesn't know what's in these files. For
> example, 53 corresponds to the MO coefficients. Even knowing the
> number/content mapping doesn't easily tell you the dimensionality of the
> file; you have to know what all of the reads and writes to that file look
> like.
>
> 2b. Once you know the number/content mapping and the dimensionality, there
> is still no guarantee about what's in the file, due to nasty bits in the
> source code with people using files for things other than what they are
> specified as containing.
>
> If one wanted to make a Q-Chem-specific parser for a bunch of binary
> files, it would need to be maintained by someone who is a developer and
> knows where all of the reads/writes to that file are, but even then it's
> quite risky. You just can't perform a visual spot-check on them to see if
> there are problems. It's definitely functionality for expert users.
>
> It seems as though this is something that should either be in a separate
> package that depends on cclib, or at least is separate from the parsers/
> directory.
>
> Martin, do you have scripts for parsing any of these that are publicly
> viewable? I'd be interested to see how you do it. I'd really love to know
> the structure of ORCA's *.gbw files.
>
> Eric
>
>
> On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...> wrote:
>
>> Here is the issue I created for this:
>> https://github.com/cclib/cclib/issues/120
>>
>> On Aug 18 2014, Karol M. Langner wrote:
>> > Hi Martin,
>> >
>> > That's true, technically it is not hard. I was thinking more about
>> > integration with the rest of cclib. Binary files will not be "parsed"
>> > in the sense logfiles are, that is the task is not one of reading a
>> > text file line by line and extracting useful information. Rather, it
>> > is one of reading data that is stored in a predefined format. This is
>> > obvious, but it implies some design choices that have not been
>> > discussed here. Above all, the line-by-line paradigm used in the class
>> > LogFile (inherited by all parsers) is not useful and we cannot just
>> > pass one extra file.
>> >
>> > Anyway, this can definitely be done, I'm just saying it needs a little
>> > bit of forethought.
>> >
>> > Cheers,
>> > Karol
>> >
>> > On Aug 18 2014, Martin Blood-Forsythe wrote:
>> > > I'll have to look at the way that the multiple file parse is
>> > > implemented for Molpro, but at least for ORCA and Q-Chem adding a
>> > > binary file parser wouldn't require much different than the way a
>> > > text based multiple file parsing works. I've had a lot of success
>> > > parsing these files using numpy.fromfile and numpy.frombuffer.
>> > > Typically the only required knowledge is the number of orbitals (to
>> > > determine the array shape to dump into) and the file name.
>> > >
>> > > -Martin
>>
>> --
>> written by Karol M. Langner
>> Tue Aug 19 08:20:53 EDT 2014
>>
>
>