Re: [cclib-devel] Q-Chem and ORCA parsers

All good counterpoints, the usual show stoppers in this context. And I like
the idea of gathering the tools for reading binary files in a separate code
base, perhaps using/requiring data from the logfile, because cclib is now
somewhat mature and has a well-defined scope. Still, we could keep it
within the cclib project on github, and call it cclib-binary or whatever.

@Noel @Adam -- any thoughts on this topic?

On Aug 19 2014, Martin Blood-Forsythe wrote:
> 1. Yes it is true that most codes delete their binary files unless you tell
> them not to.  Which is why I was mentioning the need to be careful about
> checking for the existence of a file. ORCA is actually a bit unusual in
> that it always leaves the *.gbw binary file around.  I don't know
> the *entire* contents of the file, but I know how to parse the MO
> coefficients/energies, occupation numbers, job name, and perhaps orbital
> labels (1pz 2s etc, the offsets to these are a little tricky since it is a
> mixed binary/ascii format). As far as I know, the *.hess file is a text
> file, but the point about job specific binaries, like the *.cis file, is
> certainly correct.
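
The existence check mentioned in point 1 can be sketched as follows. This is only an illustration: the helper name `find_companion_binary` is hypothetical, and it assumes the binary file shares its base name with the text output, which is a naming convention assumed here rather than something stated in the thread.

```python
from pathlib import Path


def find_companion_binary(logfile, suffix=".gbw"):
    """Return the path of a binary file next to `logfile`, or None.

    ASSUMPTION: the binary shares the base name of the text output,
    e.g. water.out -> water.gbw. Adjust for other naming schemes.
    """
    candidate = Path(logfile).with_suffix(suffix)
    # is_file() is False for missing paths and for directories.
    return candidate if candidate.is_file() else None
```

A caller would treat `None` as "text output only" and skip any binary-derived data instead of failing.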
> 
> 2. With regard to Q-Chem's custom binary files:
> There are two different categories of binary files.  The ones that get
> saved if you do 'qchem infile outfile savename' and the more complete set
> that are stored if you do the '-save' option.
> For the most part I think that the standard ones saved without the '-save'
> option are pretty safe to trust. These include the 53.0 [MO coefficients +
> eigenvalues] and 54.0 [density matrix].  There are a few other binary files
> that I trust: 320.0 [overlap matrix], 601.0 [NBO coefficients].
> The general point though that binary files are a pain and can require some
> developer level knowledge is well taken. One reason that I thought adding
> binary parsing support might be neat is exactly because non-developers have
> no idea what is in these files.
> 
> 2b. The point about developers misusing the binary files in Q-Chem is
> certainly a good one, and is a strong reason to not support parsing any of
> the non-vanilla binary files.  I'd argue that the 53.0 and 54.0 files are
> quite safe to parse since they have a very well defined structure that
> hasn't changed across versions 3 - 4.2 as far as I can tell.  Their content
> mapping is very easy and only requires knowing the total number of orbitals.
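
A minimal sketch of reading a 53.0-style file, given only the number of orbitals. The function name `read_qchem_53` is hypothetical, and the layout is an assumption that should be checked against a real file: native-endian float64 values, stored as alpha coefficients, beta coefficients, alpha eigenvalues, beta eigenvalues, with one MO per row.

```python
import numpy as np


def read_qchem_53(path, nbasis, nmo=None):
    """Read MO coefficients and eigenvalues from a 53.0-style file.

    ASSUMED layout (not confirmed in this thread): float64 values in the
    order alpha coefficients, beta coefficients, alpha eigenvalues,
    beta eigenvalues, with one MO per row of each coefficient block.
    """
    nmo = nbasis if nmo is None else nmo
    data = np.fromfile(path, dtype=np.float64)

    ncoeff = nbasis * nmo
    expected = 2 * ncoeff + 2 * nmo
    if data.size != expected:
        # A size mismatch is the only sanity check a raw file allows.
        raise ValueError(
            f"expected {expected} values for nbasis={nbasis}, nmo={nmo}; "
            f"found {data.size}"
        )

    alpha_c = data[:ncoeff].reshape(nmo, nbasis)
    beta_c = data[ncoeff:2 * ncoeff].reshape(nmo, nbasis)
    alpha_e = data[2 * ncoeff:2 * ncoeff + nmo]
    beta_e = data[2 * ncoeff + nmo:]
    return (alpha_c, beta_c), (alpha_e, beta_e)
```

The size check matters because a raw binary file cannot be spot-checked by eye. A wrong orbital count would otherwise silently produce a misshapen array.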
> 
> 3. I agree that any package that attempts to unpack a whole bunch of other
> binary files should probably be separate from cclib.  Mainly I was
> wondering if the MO coefficients & eigenvalues would be of interest for
> visualization purposes.  The tricky part there is that a lot of codes use
> non-standard basis set definitions that differ from the EMSL specifications
> of the same name.
> 
> 4. My scripts that unpack *.gbw and Q-Chem binaries are currently wrapped
> up with some other code.  I'll strip them out and add them to
> https://github.com/mbloodforsythe/qc-binary-parsers
> 
> 
> ---------------------------------------------------
> Martin Blood-Forsythe
> Graduate Student in Physics
> Harvard University
> Aspuru-Guzik Lab
> 
> 
> On Tue, Aug 19, 2014 at 12:57 PM, Eric Berquist <er...@pi...> wrote:
> 
> > This would certainly be interesting, but I have a few comments that are
> > probably Q-Chem specific:
> >
> > 1. As Martin already mentioned, unless you modify the `qchem` driver
> > script or you pass the flag `-save` when running, the folder with all of
> > these binary files gets deleted. This means they usually aren't present
> > unless you ask for them, unlike the regular plain-text output file, which
> > is always produced, and unlike something like the HESS file, which is only
> > present when performing an analytic Hessian calculation. ORCA is even
> > worse about this, since it deletes scratch/temp files early and often.
> >
> > 2. Since almost all of these files are named with a number, anyone who is
> > not a Q-Chem developer probably doesn't know what's in these files. For
> > example, 53 corresponds to the MO coefficients. Even knowing the
> > number/content mapping doesn't easily tell you the dimensionality of the
> > file; you have to know what all of the reads and writes to that file look
> > like.
> >
> > 2b. Once you know the number/content mapping and the dimensionality, there
> > is still no guarantee about what's in the file, due to nasty bits in the
> > source code with people using files for things other than what they are
> > specified as containing.
> >
> > If one wanted to make a Q-Chem-specific parser for a bunch of binary
> > files, it would need to be maintained by someone who is a developer and
> > knows where all of the reads/writes to that file are, but even then it's
> > quite risky. You just can't perform a visual spot-check on them to see if
> > there are problems. It's definitely functionality for expert users.
> >
> > It seems as though this is something that should either be in a separate
> > package that depends on cclib, or at least is separate from the parsers/
> > directory.
> >
> > Martin, do you have scripts for parsing any of these that are publicly
> > viewable? I'd be interested to see how you do it. I'd really love to know
> > the structure of ORCA's *.gbw files.
> >
> > Eric
> >
> >
> > On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...> wrote:
> >
> >> Here is the issue I created for this:
> >> https://github.com/cclib/cclib/issues/120
> >>
> >> On Aug 18 2014, Karol M. Langner wrote:
> >> > Hi Martin,
> >> >
> >> > That's true, technically it is not hard. I was thinking more about
> >> > integration with the rest of cclib. Binary files will not be "parsed"
> >> > in the sense logfiles are, that is the task is not one of reading a
> >> > text file line by line and extracting useful information. Rather, it
> >> > is one of reading data that is stored in a predefined format. This is
> >> > obvious, but it implies some design choices that have not been
> >> > discussed here. Above all, the line-by-line paradigm used in the class
> >> > LogFile (inherited by all parsers) is not useful and we cannot just
> >> > pass one extra file.
> >> >
> >> > Anyway, this can definitely be done, I'm just saying it needs a little
> >> > bit of forethought.
> >> >
> >> > Cheers,
> >> > Karol
> >> >
> >> > On Aug 18 2014, Martin Blood-Forsythe wrote:
> >> > > I'll have to look at the way that multiple-file parsing is
> >> > > implemented for Molpro, but at least for ORCA and Q-Chem, adding a
> >> > > binary file parser wouldn't require much that differs from the way
> >> > > text-based multiple-file parsing works. I've had a lot of success
> >> > > parsing these files using numpy.fromfile and numpy.frombuffer.
> >> > > Typically the only required knowledge is the number of orbitals (to
> >> > > determine the array shape to dump into) and the file name.
> >> > >
> >> > > -Martin
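
For mixed binary/ASCII files, where an array starts at a known byte offset, numpy.frombuffer can pull out just that block. The sketch below uses a synthetic buffer, not a real program file: the helper name `read_block`, the 8-byte ASCII label, and the float64 dtype are all assumptions made for illustration.

```python
import numpy as np


def read_block(raw, offset, shape, dtype=np.float64):
    """Interpret part of a bytes object as an array of the given shape."""
    count = int(np.prod(shape))
    # frombuffer does not copy; the result is a read-only view of `raw`.
    flat = np.frombuffer(raw, dtype=dtype, count=count, offset=offset)
    return flat.reshape(shape)


# Synthetic stand-in for a mixed file: an ASCII label, then six doubles.
header = b"orbitals"
payload = np.arange(6, dtype=np.float64)
raw = header + payload.tobytes()

block = read_block(raw, offset=len(header), shape=(2, 3))
```

In practice the offset and shape would come from the file's own header or from the number of orbitals in the logfile.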
> >>
> >> --
> >> written by Karol M. Langner
> >> Tue Aug 19 08:20:53 EDT 2014
> >>
> >
> >

-- 
written by Karol M. Langner
Tue Aug 19 17:00:06 EDT 2014