Re: [cclib-devel] Q-Chem and ORCA parsers


I created an issue for this in cclib for future work:
https://github.com/cclib/cclib/issues/120


On Tue, Aug 19, 2014 at 5:06 PM, Karol M. Langner <kar...@gm...>
wrote:

> All good counterpoints, the usual show stoppers in this context. And I like
> the idea of gathering the tools for reading binary files in a separate code
> base, perhaps using/requiring data from the logfile, because cclib is now
> somewhat mature and has a well-defined scope. Still, we could keep it
> within the cclib project on github, and call it cclib-binary or whatever.
>
> @Noel @Adam -- any thoughts on this topic?
>
> On Aug 19 2014, Martin Blood-Forsythe wrote:
> > 1. Yes it is true that most codes delete their binary files unless you
> > tell them not to.  Which is why I was mentioning the need to be careful
> > about checking for the existence of a file. ORCA is actually a bit
> > unusual in that it always leaves the *.gbw binary file around.  I don't
> > know the *entire* contents of the file, but I know how to parse the MO
> > coefficients/energies, occupation numbers, job name, and perhaps orbital
> > labels (1pz 2s etc, the offsets to these are a little tricky since it is
> > a mixed binary/ascii format). As far as I know, the *.hess file is a text
> > file, but the point about job specific binaries, like the *.cis file, is
> > certainly correct.
> >
> > 2. With regard to Q-Chem's custom binary files:
> > There are two different categories of binary files.  The ones that get
> > saved if you do 'qchem infile outfile savename' and the more complete set
> > that are stored if you do the '-save' option.
> >  For the most part I think that the standard ones saved without the
> > '-save' option are pretty safe to trust. These include the 53.0 [MO
> > coefficients + eigenvalues] and 54.0 [density matrix].  There are a few
> > other binary files that I trust: 320.0 [overlap matrix], 601.0 [NBO
> > coefficients].
> > The general point though that binary files are a pain and can require
> > some developer level knowledge is well taken. One reason that I thought
> > adding binary parsing support might be neat is exactly because
> > non-developers have no idea what is in these files.
> >
> > 2b. The point about developers misusing the binary files in Q-Chem is
> > certainly a good one, and is a strong reason to not support parsing any
> > of the non-vanilla binary files.  I'd argue that the 53.0 and 54.0 files
> > are quite safe to parse since they have a very well defined structure
> > that hasn't changed across versions 3 - 4.2 as far as I can tell.  Their
> > content mapping is very easy and only requires knowing the total number
> > of orbitals.
> >
> > 3. I agree that any package that attempts to unpack a whole bunch of
> > other binary files should probably be separate from cclib.  Mainly I was
> > wondering if the MO coefficients & eigenvalues would be of interest for
> > visualization purposes.  The tricky part there is that a lot of codes use
> > non-standard basis set definitions that differ from the EMSL
> > specifications of the same name.
> >
> > 4. My scripts that unpack *.gbw and Q-Chem binaries are currently wrapped
> > up with some other code.  I'll strip them out and add them to
> > https://github.com/mbloodforsythe/qc-binary-parsers
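
The existence check from point 1 and the file numbers listed in point 2 could be combined along the following lines. This is a minimal sketch, not code from the thread: the names `QCHEM_BINARY_FILES` and `find_binary_files`, and the idea of a single saved scratch directory holding the files, are assumptions for illustration. Only the four file numbers and their descriptions come from the message above.

```python
from pathlib import Path

# File numbers and contents as listed in the message above.
QCHEM_BINARY_FILES = {
    "53.0": "MO coefficients and eigenvalues",
    "54.0": "density matrix",
    "320.0": "overlap matrix",
    "601.0": "NBO coefficients",
}


def find_binary_files(scratch_dir):
    """Return {file name: Path} for the known binary files that exist.

    Hypothetical helper: files that are missing or empty are skipped,
    so callers never try to decode something that was deleted at the
    end of the job.
    """
    scratch = Path(scratch_dir)
    found = {}
    for name in QCHEM_BINARY_FILES:
        candidate = scratch / name
        if candidate.is_file() and candidate.stat().st_size > 0:
            found[name] = candidate
    return found
```

A caller would then only attempt to decode the entries returned, and fall back to the text output for everything else.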
> >
> >
> > ---------------------------------------------------
> > Martin Blood-Forsythe
> > Graduate Student in Physics
> > Harvard University
> > Aspuru-Guzik Lab
> >
> >
> > On Tue, Aug 19, 2014 at 12:57 PM, Eric Berquist <er...@pi...> wrote:
> >
> > > This would certainly be interesting, but I have a few comments that are
> > > probably Q-Chem specific:
> > >
> > > 1. As Martin already mentioned, unless you modify the `qchem` driver
> > > script or you pass the flag `-save` when running, the folder with all
> > > of these binary files gets deleted. This means they usually aren't
> > > present, except when you want them, in contrast to the regular
> > > plain-text output file which is always produced. This is in contrast to
> > > something like the HESS file, which is only present when performing an
> > > analytic Hessian calculation. ORCA is even worse about this, since it
> > > deletes scratch/temp files early and often.
> > >
> > > 2. Since almost all of these files are named with a number, anyone who
> > > is not a Q-Chem developer probably doesn't know what's in these files.
> > > For example, 53 corresponds to the MO coefficients. Even knowing the
> > > number/content mapping doesn't easily tell you the dimensionality of
> > > the file; you have to know what all of the reads and writes to that
> > > file look like.
> > >
> > > 2b. Once you know the number/content mapping and the dimensionality,
> > > there is still no guarantee about what's in the file, due to nasty bits
> > > in the source code with people using files for things other than what
> > > they are specified as containing.
> > >
> > > If one wanted to make a Q-Chem-specific parser for a bunch of binary
> > > files, it would need to be maintained by someone who is a developer and
> > > knows where all of the reads/writes to that file are, but even then
> > > it's quite risky. You just can't perform a visual spot-check on them to
> > > see if there are problems. It's definitely functionality for expert
> > > users.
> > >
> > > It seems as though this is something that should either be in a
> > > separate package that depends on cclib, or at least is separate from
> > > the parsers/ directory.
> > >
> > > Martin, do you have scripts for parsing any of these that are publicly
> > > viewable? I'd be interested to see how you do it. I'd really love to
> > > know the structure of ORCA's *.gbw files.
> > >
> > > Eric
> > >
> > >
> > > On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...> wrote:
> > >
> > >> Here is the issue I created for this:
> > >> https://github.com/cclib/cclib/issues/120
> > >>
> > >> On Aug 18 2014, Karol M. Langner wrote:
> > >> > Hi Martin,
> > >> >
> > >> > That's true, technically it is not hard. I was thinking more about
> > >> > integration with the rest of cclib. Binary files will not be "parsed"
> > >> > in the sense logfiles are, that is, the task is not one of reading a
> > >> > text file line by line and extracting useful information. Rather, it
> > >> > is one of reading data that is stored in a predefined format. This is
> > >> > obvious, but it implies some design choices that have not been
> > >> > discussed here. Above all, the line-by-line paradigm used in the
> > >> > class LogFile (inherited by all parsers) is not useful and we cannot
> > >> > just pass one extra file.
> > >> >
> > >> > Anyway, this can definitely be done, I'm just saying it needs a
> > >> > little bit of forethought.
> > >> >
> > >> > Cheers,
> > >> > Karol
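
The distinction drawn above, decoding a predefined layout in one step rather than scanning text line by line, might look like this in outline. Everything here is hypothetical: `LogfileData` and `BinaryMatrixReader` are illustrative names, not cclib classes, and the file is assumed to hold a square nbasis x nbasis matrix of native-endian 8-byte floats with no header.

```python
import struct
from dataclasses import dataclass


@dataclass
class LogfileData:
    """Hypothetical stand-in for values already parsed from the text logfile."""
    nbasis: int


class BinaryMatrixReader:
    """Hypothetical reader that decodes a whole file from a known layout.

    There is no line-by-line loop: the shape comes from the logfile data
    and the bytes are unpacked in a single call.
    """

    def __init__(self, path, logdata):
        self.path = path
        self.logdata = logdata

    def read(self):
        n = self.logdata.nbasis
        with open(self.path, "rb") as handle:
            payload = handle.read()
        expected = 8 * n * n  # assumed: n*n doubles, nothing else
        if len(payload) != expected:
            raise ValueError(
                f"expected {expected} bytes for nbasis={n}, got {len(payload)}"
            )
        flat = struct.unpack(f"={n * n}d", payload)
        # Return the matrix as a list of rows.
        return [list(flat[i * n:(i + 1) * n]) for i in range(n)]
```

The size check is the only validation available without opening the file by eye, which is why the dimension has to come from somewhere else, such as the parsed logfile.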
> > >> >
> > >> > On Aug 18 2014, Martin Blood-Forsythe wrote:
> > >> > > I'll have to look at the way that the multiple file parse is
> > >> > > implemented for Molpro, but at least for ORCA and Q-Chem adding a
> > >> > > binary file parser wouldn't require anything much different from
> > >> > > the way text-based multiple file parsing works. I've had a lot of
> > >> > > success parsing these files using numpy.fromfile and
> > >> > > numpy.frombuffer.  Typically the only required knowledge is the
> > >> > > number of orbitals (to determine the array shape to dump into) and
> > >> > > the file name.
> > >> > >
> > >> > > -Martin
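
A minimal sketch of the numpy.fromfile approach described above, assuming a flat file of native-endian float64 values. The layout used here (alpha coefficients, beta coefficients, alpha energies, beta energies) and the names `read_mo_file`, `nbasis`, and `nmo` are assumptions for illustration and are not confirmed anywhere in this thread; they should be checked against a calculation with known results before being trusted.

```python
import numpy as np


def read_mo_file(path, nbasis, nmo=None):
    """Read MO coefficients and energies from a flat float64 file.

    ASSUMED layout (not confirmed by the thread):
        alpha coefficients (nmo * nbasis values)
        beta coefficients  (nmo * nbasis values)
        alpha energies     (nmo values)
        beta energies      (nmo values)
    """
    if nmo is None:
        nmo = nbasis  # assume no orbitals were dropped
    raw = np.fromfile(path, dtype=np.float64)
    ncoeff = nmo * nbasis
    expected = 2 * ncoeff + 2 * nmo
    if raw.size != expected:
        raise ValueError(
            f"expected {expected} values for nbasis={nbasis}, nmo={nmo}; "
            f"file holds {raw.size}"
        )
    # Index 0 is alpha, index 1 is beta under the assumed layout.
    coeffs = raw[:2 * ncoeff].reshape(2, nmo, nbasis)
    energies = raw[2 * ncoeff:].reshape(2, nmo)
    return coeffs, energies
```

With this shape of function the only inputs are the file name and the orbital counts, which matches the "only required knowledge" noted above; the size check guards against silently reshaping a file with a different layout.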
> > >>
> > >> --
> > >> written by Karol M. Langner
> > >> Tue Aug 19 08:20:53 EDT 2014
> > >>
> > >
> > >
>
> --
> written by Karol M. Langner
> Tue Aug 19 17:00:06 EDT 2014
>