From: Karol M. L. <kar...@gm...> - 2014-08-19 21:06:31
All good counterpoints, the usual show stoppers in this context. And I like
the idea of gathering the tools for reading binary files in a separate code
base, perhaps using/requiring data from the logfile, because cclib is now
somewhat mature and has a well-defined scope. Still, we could keep it within
the cclib project on github, and call it cclib-binary or whatever.

@Noel @Adam -- any thoughts on this topic?

On Aug 19 2014, Martin Blood-Forsythe wrote:
> 1. Yes it is true that most codes delete their binary files unless you tell
> them not to. Which is why I was mentioning the need to be careful about
> checking for the existence of a file. ORCA is actually a bit unusual in the
> regard that it always leaves the *.gbw binary file around. I don't know
> the *entire* contents of the file, but I know how to parse the MO
> coefficients/energies, occupation numbers, job name, and perhaps orbital
> labels (1pz 2s etc, the offsets to these are a little tricky since it is a
> mixed binary/ascii format). As far as I know, the *.hess file is a text
> file, but the point about job specific binaries, like the *.cis file, is
> certainly correct.
>
> 2. With regard to Q-Chem's custom binary files:
> There are two different categories of binary files. The ones that get
> saved if you do 'qchem infile outfile savename' and the more complete set
> that are stored if you do the '-save' option.
> For the most part I think that the standard ones saved without the '-save'
> command are pretty safe to trust. These include the 53.0 [MO coefficients +
> eigenvalues] and 54.0 [density matrix]. There are a few other binary files
> that I trust: 320.0 [overlap matrix], 601.0 [NBO coefficients].
> The general point though that binary files are a pain and can require some
> developer level knowledge is well taken. One reason that I thought adding
> binary parsing support might be neat is exactly because non-developers have
> no idea what is in these files.
>
> 2b. The point about developers misusing the binary files in Q-Chem is
> certainly a good one, and is a strong reason to not support parsing any of
> the non-vanilla binary files. I'd argue that the 53.0 and 54.0 files are
> quite safe to parse since they have a very well defined structure that
> hasn't changed across versions 3 - 4.2 as far as I can tell. Their content
> mapping is very easy and only requires knowing the total number of orbitals.
>
> 3. I agree that any package that attempts to unpack a whole bunch of other
> binary files should probably be separate from cclib. Mainly I was
> wondering if the MO coefficients & eigenvalues would be of interest for
> visualization purposes. The tricky part there is that a lot of codes use
> non-standard basis set definitions that differ from the EMSL specifications
> of the same name.
>
> 4. My scripts that unpack *.gbw and Q-Chem binaries are currently wrapped
> up with some other code. I'll strip them out and add them to
> https://github.com/mbloodforsythe/qc-binary-parsers
>
>
> ---------------------------------------------------
> Martin Blood-Forsythe
> Graduate Student in Physics
> Harvard University
> Aspuru-Guzik Lab
>
>
> On Tue, Aug 19, 2014 at 12:57 PM, Eric Berquist <er...@pi...> wrote:
>
> > This would certainly be interesting, but I have a few comments that are
> > probably Q-Chem specific:
> >
> > 1. As Martin already mentioned, unless you modify the `qchem` driver
> > script or you pass the flag `-save` when running, the folder with all of
> > these binary files gets deleted. This means they usually aren't present,
> > except when you want them, in contrast to the regular plain-text output
> > file which is always produced. This is in contrast to something like the
> > HESS file, which is only present when performing an analytic Hessian
> > calculation. ORCA is even worse about this, since it deletes scratch/temp
> > files early and often.
> >
> > 2. Since almost all of these files are named with a number, anyone who is
> > not a Q-Chem developer probably doesn't know what's in these files. For
> > example, 53 corresponds to the MO coefficients. Even knowing the
> > number/content mapping doesn't easily tell you the dimensionality of the
> > file; you have to know what all of the reads and writes to that file look
> > like.
> >
> > 2b. Once you know the number/content mapping and the dimensionality, there
> > is still no guarantee about what's in the file, due to nasty bits in the
> > source code with people using files for things other than what they are
> > specified as containing.
> >
> > If one wanted to make a Q-Chem-specific parser for a bunch of binary
> > files, it would need to be maintained by someone who is a developer and
> > knows where all of the reads/writes to that file are, but even then it's
> > quite risky. You just can't perform a visual spot-check on them to see if
> > there are problems. It's definitely functionality for expert users.
> >
> > It seems as though this is something that should either be in a separate
> > package that depends on cclib, or at least is separate from the parsers/
> > directory.
> >
> > Martin, do you have scripts for parsing any of these that are publicly
> > viewable? I'd be interested to see how you do it. I'd really love to know
> > the structure of ORCA's *.gbw files.
> >
> > Eric
> >
> >
> > On Tue, Aug 19, 2014 at 8:21 AM, Karol M. Langner <kar...@gm...> wrote:
> >
> >> Here is the issue I created for this:
> >> https://github.com/cclib/cclib/issues/120
> >>
> >> On Aug 18 2014, Karol M. Langner wrote:
> >> > Hi Martin,
> >> >
> >> > That's true, technically it is not hard. I was thinking more about
> >> > integration with the rest of cclib. Binary files will not be "parsed"
> >> > in the sense logfiles are, that is, the task is not one of reading a
> >> > text file line by line and extracting useful information. Rather, it
> >> > is one of reading data that is stored in a predefined format. This is
> >> > obvious, but it implies some design choices that have not been
> >> > discussed here. Above all, the line-by-line paradigm used in the class
> >> > LogFile (inherited by all parsers) is not useful and we cannot just
> >> > pass one extra file.
> >> >
> >> > Anyway, this can definitely be done, I'm just saying it needs a little
> >> > bit of forethought.
> >> >
> >> > Cheers,
> >> > Karol
> >> >
> >> > On Aug 18 2014, Martin Blood-Forsythe wrote:
> >> > > I'll have to look at the way that the multiple file parse is
> >> > > implemented for Molpro, but at least for ORCA and Q-Chem adding a
> >> > > binary file parser wouldn't require anything much different from the
> >> > > way text-based multiple file parsing works. I've had a lot of
> >> > > success parsing these files using numpy.fromfile and
> >> > > numpy.frombuffer. Typically the only required knowledge is the
> >> > > number of orbitals (to determine the array shape to dump into) and
> >> > > the file name.
> >> > >
> >> > > -Martin
> >> >
> >> --
> >> written by Karol M. Langner
> >> Tue Aug 19 08:20:53 EDT 2014
> >>
> >
>

--
written by Karol M. Langner
Tue Aug 19 17:00:06 EDT 2014
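
For readers following the thread: the approach Martin describes (check that the binary file exists, then read it with numpy.fromfile and shape the array from the number of orbitals) might look roughly like the sketch below. The function name `read_mo_file` and the file layout it assumes (all coefficients as native float64, followed by the eigenvalues) are hypothetical and chosen only for illustration; the thread does not spell out the actual layout of Q-Chem's 53.0 file or ORCA's *.gbw file, so this is not a parser for either.

```python
import os
import tempfile

import numpy as np


def read_mo_file(path, nbasis, nmo):
    """Read MO coefficients and eigenvalues from a flat binary file.

    ASSUMED layout (hypothetical, for illustration only): nmo * nbasis
    native float64 coefficients, followed by nmo float64 eigenvalues.
    Returns None if the file is missing, since most codes delete their
    binary files unless told to keep them.
    """
    if not os.path.isfile(path):
        return None

    expected = nmo * nbasis + nmo
    data = np.fromfile(path, dtype=np.float64)
    if data.size < expected:
        raise ValueError(
            f"{path}: expected at least {expected} values, found {data.size}"
        )

    # The number of orbitals determines the array shape to dump into.
    coeffs = data[: nmo * nbasis].reshape(nmo, nbasis)
    energies = data[nmo * nbasis : expected]
    return coeffs, energies


# Round-trip check on a synthetic file; no real program output is involved.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "53.0")
    ref_coeffs = np.arange(6, dtype=np.float64).reshape(2, 3)
    ref_energies = np.array([-0.5, 0.25])
    np.concatenate([ref_coeffs.ravel(), ref_energies]).tofile(path)

    coeffs, energies = read_mo_file(path, nbasis=3, nmo=2)
    missing = read_mo_file(os.path.join(tmp, "54.0"), nbasis=3, nmo=2)
```

Because a binary file cannot be spot-checked by eye, the size check is about the only cheap sanity test available here; anything beyond it needs knowledge of how the program actually writes the file.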