From: Karol L. <kar...@gm...> - 2014-08-25 01:07:43
I created an issue for this in cclib for future work:
https://github.com/cclib/cclib/issues/120

On Tue, Aug 19, 2014 at 5:06 PM, Karol M. Langner <kar...@gm...> wrote:
> All good counterpoints, the usual show stoppers in this context. And I
> like the idea of gathering the tools for reading binary files in a
> separate code base, perhaps using/requiring data from the logfile,
> because cclib is now somewhat mature and has a well-defined scope.
> Still, we could keep it within the cclib project on github, and call it
> cclib-binary or whatever.
>
> @Noel @Adam -- any thoughts on this topic?
>
> On Aug 19 2014, Martin Blood-Forsythe wrote:
> > 1. Yes it is true that most codes delete their binary files unless you
> > tell them not to. Which is why I was mentioning the need to be careful
> > about checking for the existence of a file. ORCA is actually a bit
> > unusual in the regard that it always leaves the *.gbw binary file
> > around. I don't know the *entire* contents of the file, but I know how
> > to parse the MO coefficients/energies, occupation numbers, job name,
> > and perhaps orbital labels (1pz 2s etc, the offsets to these are a
> > little tricky since it is a mixed binary/ascii format). As far as I
> > know, the *.hess file is a text file, but the point about job specific
> > binaries, like the *.cis file, is certainly correct.
> >
> > 2. With regard to Q-Chem's custom binary files:
> > There are two different categories of binary files. The ones that get
> > saved if you do 'qchem infile outfile savename' and the more complete
> > set that are stored if you do the '-save' option.
> > For the most part I think that the standard ones saved without the
> > '-save' command are pretty safe to trust. These include the 53.0 [MO
> > coefficients + eigenvalues] and 54.0 [density matrix]. There are a few
> > other binary files that I trust: 320.0 [overlap matrix], 601.0 [NBO
> > coefficients].
> > The general point though that binary files are a pain and can require
> > some developer level knowledge is well taken. One reason that I
> > thought adding binary parsing support might be neat is exactly because
> > non-developers have no idea what is in these files.
> >
> > 2b. The point about developers misusing the binary files in Q-Chem is
> > certainly a good one, and is a strong reason to not support parsing
> > any of the non-vanilla binary files. I'd argue that the 53.0 and 54.0
> > files are quite safe to parse since they have a very well defined
> > structure that hasn't changed across versions 3 - 4.2 as far as I can
> > tell. Their content mapping is very easy and only requires knowing the
> > total number of orbitals.
> >
> > 3. I agree that any package that attempts to unpack a whole bunch of
> > other binary files should probably be separate from cclib. Mainly I
> > was wondering if the MO coefficients & eigenvalues would be of
> > interest for visualization purposes. The tricky part there is that a
> > lot of codes use non-standard basis set definitions that differ from
> > the EMSL specifications of the same name.
> >
> > 4. My scripts that unpack *.gbw and Q-Chem binaries are currently
> > wrapped up with some other code. I'll strip them out and add them to
> > https://github.com/mbloodforsythe/qc-binary-parsers
> >
> > ---------------------------------------------------
> > Martin Blood-Forsythe
> > Graduate Student in Physics
> > Harvard University
> > Aspuru-Guzik Lab
> >
> > On Tue, Aug 19, 2014 at 12:57 PM, Eric Berquist <er...@pi...> wrote:
> >
> > > This would certainly be interesting, but I have a few comments that
> > > are probably Q-Chem specific:
> > >
> > > 1. As Martin already mentioned, unless you modify the `qchem` driver
> > > script or you pass the flag `-save` when running, the folder with
> > > all of these binary files gets deleted.
> > > This means they usually aren't present, except when you want them,
> > > in contrast to the regular plain-text output file which is always
> > > produced. This is in contrast to something like the HESS file, which
> > > is only present when performing an analytic Hessian calculation.
> > > ORCA is even worse about this, since it deletes scratch/temp files
> > > early and often.
> > >
> > > 2. Since almost all of these files are named with a number, anyone
> > > who is not a Q-Chem developer probably doesn't know what's in these
> > > files. For example, 53 corresponds to the MO coefficients. Even
> > > knowing the number/content mapping doesn't easily tell you the
> > > dimensionality of the file; you have to know what all of the reads
> > > and writes to that file look like.
> > >
> > > 2b. Once you know the number/content mapping and the dimensionality,
> > > there is still no guarantee about what's in the file, due to nasty
> > > bits in the source code with people using files for things other
> > > than what they are specified as containing.
> > >
> > > If one wanted to make a Q-Chem-specific parser for a bunch of binary
> > > files, it would need to be maintained by someone who is a developer
> > > and knows where all of the reads/writes to that file are, but even
> > > then it's quite risky. You just can't perform a visual spot-check on
> > > them to see if there are problems. It's definitely functionality for
> > > expert users.
> > >
> > > It seems as though this is something that should either be in a
> > > separate package that depends on cclib, or at least is separate from
> > > the parsers/ directory.
> > >
> > > Martin, do you have scripts for parsing any of these that are
> > > publicly viewable? I'd be interested to see how you do it. I'd
> > > really love to know the structure of ORCA's *.gbw files.
> > >
> > > Eric
> > >
> > > On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...
> > > > wrote:
> > >
> > >> Here is the issue I created for this:
> > >> https://github.com/cclib/cclib/issues/120
> > >>
> > >> On Aug 18 2014, Karol M. Langner wrote:
> > >> > Hi Martin,
> > >> >
> > >> > That's true, technically it is not hard. I was thinking more about
> > >> > integration with the rest of cclib. Binary files will not be
> > >> > "parsed" in the sense logfiles are, that is the task is not one of
> > >> > reading a text file line by line and extracting useful
> > >> > information. Rather, it is one of reading data that is stored in a
> > >> > predefined format. This is obvious, but it implies some design
> > >> > choices that have not been discussed here. Above all, the
> > >> > line-by-line paradigm used in the class LogFile (inherited by all
> > >> > parsers) is not useful and we cannot just pass one extra file.
> > >> >
> > >> > Anyway, this can definitely be done, I'm just saying it needs a
> > >> > little bit of forethought.
> > >> >
> > >> > Cheers,
> > >> > Karol
> > >> >
> > >> > On Aug 18 2014, Martin Blood-Forsythe wrote:
> > >> > > I'll have to look at the way that the multiple file parse is
> > >> > > implemented for Molpro, but at least for ORCA and Q-Chem adding
> > >> > > a binary file parser wouldn't require much that differs from the
> > >> > > way text based multiple file parsing works. I've had a lot of
> > >> > > success parsing these files using numpy.fromfile and
> > >> > > numpy.frombuffer. Typically the only required knowledge is the
> > >> > > number of orbitals (to determine the array shape to dump into)
> > >> > > and the file name.
> > >> > >
> > >> > > -Martin
> > >> >
> > >> --
> > >> written by Karol M. Langner
> > >> Tue Aug 19 08:20:53 EDT 2014
> > >>
> > >
> >
>
> --
> written by Karol M. Langner
> Tue Aug 19 17:00:06 EDT 2014
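
For reference, the numpy.fromfile approach mentioned in the thread could
look roughly like the sketch below. It is only an illustration: the
function name read_mo_file and the file layout (nmo*nbasis float64
coefficients followed by nmo float64 eigenvalues, native byte order, no
header) are assumptions made for the example, not the documented format of
Q-Chem's 53.0 or ORCA's *.gbw. It also includes the existence check that
was raised above, since these files are often deleted after a run.

```python
import os

import numpy as np


def read_mo_file(path, nbasis, nmo=None, dtype=np.float64):
    """Read MO coefficients and eigenvalues from a flat binary file.

    ASSUMED layout (hypothetical, for illustration only): nmo*nbasis
    coefficients followed by nmo eigenvalues, all stored as `dtype`
    with no header. Real program files may differ.
    """
    if nmo is None:
        nmo = nbasis  # assume a square coefficient matrix by default

    # Binary scratch files are frequently deleted, so check first.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)

    ncoeff = nmo * nbasis
    expected = ncoeff + nmo

    # Read the whole file as a flat array of numbers.
    data = np.fromfile(path, dtype=dtype)
    if data.size < expected:
        raise ValueError(
            "file holds %d values, expected at least %d" % (data.size, expected)
        )

    # The number of orbitals determines the array shape to dump into.
    coeffs = data[:ncoeff].reshape(nmo, nbasis)
    energies = data[ncoeff:expected]
    return coeffs, energies
```

The size check is the only sanity test available here, which is the point
made earlier in the thread: nothing in the file itself says what it
contains. For a mixed binary/ascii file, one option would be to read the
raw bytes and use numpy.frombuffer with an explicit offset and count
instead of numpy.fromfile.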