Re: [cclib-devel] Regarding machine learning to complement cclib

Hi Aditya,

Sorry for the late reply. Yes, we took out the ML project, since it seems
both too research-y and probably too advanced for a GSOC project. I still
think it's worth exploring, but maybe just as a hobby someday.

As for training a model to estimate electronic structure properties,
which is what you're describing, there have already been many studies in
this area, although most aim to produce the electronic structure directly
rather than properties based on other properties, as you suggest. The latter
presumes you already get the electronic structure from somewhere else.

In any case, I don't think (anymore) that these are good projects for GSOC.
If you'd like to go in this direction with a software project in GSOC,
consider looking at DeepChem's project ideas (also part of OpenChemistry).

Hope that helps,
Karol

On Mon, Mar 4, 2019 at 11:20 PM Aditya Kamath <adt...@gm...> wrote:

> Dear Karol,
> I have gone through the features and functions of cclib. I had been trying
> to look into your idea of using machine learning for parsing data. However,
> this seems very ambitious as it would involve NLP as well as complex deep
> neural networks. I have noticed that you have changed your project idea
> towards mining data.
>
> I still believe we can use machine learning to complement cclib. An idea I
> have is to try to search for relations between the parsed data. As
> cclib deals with different software outputs, it has access to a large
> amount of data like electronic transition energies, molecular orbital
> energies, vibrational frequencies, etc., for different kinds of molecular
> systems.
>
> Usually, a computational chemist would have to model the molecule and
> set up numerical calculations to solve for the energies, vibrational
> frequencies, etc. I suspect a neural network is capable of directly
> predicting these quantities given the essential information about the
> molecular system, which is also extracted by cclib.
>
> In essence, we split the list in the table at
> http://cclib.sourceforge.net/wiki/index.php/Development_parsed_data
> into parameters and results. The model will then train on the extracted
> data and can predict results for a new set of parameter data without any
> computation being done for the new cases.
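
A minimal sketch of the parameter/result split described above. The
attribute names are ones cclib parses, but which ones count as parameters
and which as results is an assumption made here for illustration, and
split_record is a hypothetical helper, not part of cclib.

```python
# Assumed grouping of cclib attribute names; the thread does not specify it.
PARAMETER_KEYS = ("natom", "charge", "mult", "atomnos", "atomcoords")
RESULT_KEYS = ("scfenergies", "moenergies", "vibfreqs", "etenergies")


def split_record(parsed):
    """Split one parsed calculation (a dict) into (parameters, results)."""
    parameters = {k: parsed[k] for k in PARAMETER_KEYS if k in parsed}
    results = {k: parsed[k] for k in RESULT_KEYS if k in parsed}
    return parameters, results


# Placeholder values, only to show the shape of the data.
example = {"natom": 3, "charge": 0, "atomnos": [8, 1, 1], "vibfreqs": [1.0, 2.0, 3.0]}
parameters, results = split_record(example)
```

Each (parameters, results) pair would then be one training example for
whatever model is chosen; attributes in neither group are dropped.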
>
> Please let me know if this is something you might be interested in. I will
> be happy to explain this idea further.
>
> Bests,
> Aditya
>
> On Wed, Feb 6, 2019 at 2:02 AM Karol Langner <kar...@gm...>
> wrote:
>
>> Hi guys,
>>
>> I think Tyler is spot on here, generally. Which is why we haven't really
>> explored this path and converted cclib to a machine-learned model.
>>
>> While output files are typically structured and somewhat organized, as a
>> developer of cclib parsers I would qualify that statement significantly...
>> maybe it's fair to say they are
>> usually-well-organized-but-with-many-caveats-and-subtle-changes-over-time-which-require-much-debugging-and-testing-to-get-consistent
>> :)
>>
>> Really, the purpose of this project would be to see how far ML can get
>> you in extracting chemical facts. It would definitely need to leverage the
>> strong prior knowledge we have about the data we want to parse, and use it
>> to enforce invariants/constraints. The goal, however, is not to estimate
>> the values, but extract them from the logfile and classify them as
>> attributes. If the model extracts the wrong data, then it is still exact,
>> but just classified wrong. This is something a human might also do, and
>> I've seen students do before (oops, wrong orbital!). There has been some
>> work along those lines in the past for extraction, although not in this
>> particular domain. And it's not clear that this sort of model is easily
>> achievable. But then again, maybe it is, I haven't tried. On the flip side,
>> if it is possible, then we wouldn't have to write this complicated parsing
>> code, just teach the model and tune it for a specific new program or output
>> type.
>>
>> Anyway, it is true that if you're interested in standard ML methods and
>> chemistry, there are much "hotter" topics out there.
>>
>> - Karol
>>
>>
>> On Tue, Feb 5, 2019 at 12:14 PM Tyler Josephson <jos...@um...> wrote:
>>
>>> Hi Aditya,
>>>
>>> While I think ML is exciting and promising for learning potential energy
>>> surfaces or efficiently interpolating/extrapolating properties of materials
>>> and molecules within a particular domain, I don't see its value for
>>> parsing electronic structure output files.
>>> Because output files from simulation are quite well-organized, values
>>> can be extracted by simply following the logic of the output file and
>>> reading the relevant text into memory. Not only are the values *exact* this
>>> way, rather than estimates, it likely requires fewer operations for the
>>> computer to perform than a DL approach; parsing is quite cheap as far as
>>> computational tasks are concerned.
>>>
>>> DL might also be able to find interesting correlations and patterns in
>>> the data that humans tend to miss; using the whole simulation output, not
>>> just the values we assume to be important, may lead to new insights.
>>>
>>> Best,
>>> Tyler
>>>
>>> On Tue, Feb 5, 2019 at 9:59 AM Karol Langner <kar...@gm...>
>>> wrote:
>>>
>>>> Well, the parsers are not organized by method or whatnot... there's a
>>>> base class with a parse method (
>>>> https://github.com/cclib/cclib/blob/master/cclib/parser/logfileparser.py#L281)
>>>> and the various parsers inherit from that and implement an extract method
>>>> which is called by the parse method (for NWChem, for example:
>>>> https://github.com/cclib/cclib/blob/master/cclib/parser/nwchemparser.py#L42).
>>>> So there's no function really that would be replaced. It's the whole
>>>> parsing that would be imitated by the machine learned thing. Of course, one
>>>> could as a first step try to learn to parse rather easy things like number
>>>> of atoms and charge.
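
A minimal sketch of the layout described above: a base class whose parse
method drives the loop and calls an extract method that each subclass
implements. The class names, the simplified extract(line) signature, and
the log format are all made up for illustration; cclib's actual code is at
the links above and may differ.

```python
import io


class LogfileSketch:
    """Base class: owns the read loop and delegates each line to extract()."""

    def __init__(self, source):
        self.source = source  # any iterable of text lines
        self.data = {}

    def parse(self):
        for line in self.source:
            self.extract(line)
        return self.data

    def extract(self, line):
        raise NotImplementedError("subclasses implement extract()")


class ToyProgramSketch(LogfileSketch):
    """Parser for a made-up log format, not a real program's output."""

    def extract(self, line):
        # The "easy" attributes mentioned above: number of atoms and charge.
        if line.startswith("Number of atoms:"):
            self.data["natom"] = int(line.split(":")[1])
        elif line.startswith("Total charge:"):
            self.data["charge"] = int(line.split(":")[1])


log = io.StringIO("Header\nNumber of atoms: 3\nTotal charge: 0\n")
parsed = ToyProgramSketch(log).parse()
```

A learned model would have to imitate everything the extract method does,
which is why there is no single function to swap out.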
>>>>
>>>> HTH,
>>>> Karol
>>>>
>>>> On Tue, Feb 5, 2019 at 3:32 AM Aditya Kamath <adt...@gm...>
>>>> wrote:
>>>>
>>>>> Hi Karol,
>>>>> Thank you for responding to my message. In that case, the problem
>>>>> becomes information extraction. I think it is possible using Deep Learning.
>>>>> Can you tell me some examples of cclib parsing functions you feel can be
>>>>> replaced with ML?
>>>>> Best Regards,
>>>>> Aditya
>>>>>
>>>>> On Wed, Jan 30, 2019 at 1:12 AM Karol Langner <kar...@gm...>
>>>>> wrote:
>>>>>
>>>>>> Hi Aditya,
>>>>>>
>>>>>> My intention with the idea was solely data extraction from log files,
>>>>>> so parsing. But if you see other applications of ML within the scope of
>>>>>> cclib, we're definitely interested. Please note other projects under the
>>>>>> OpenChemistry umbrella also have ML ideas, and many of those are more
>>>>>> straightforward. Here, with parsing, things will be much more researchy.
>>>>>>
>>>>>>
>>>>>> HTH,
>>>>>> Karol
>>>>>>
>>>>>> On Tue, Jan 29, 2019, 2:13 AM Aditya Kamath <adt...@gm...>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Karol,
>>>>>>> I am Aditya. I read your GSoC project idea to possibly implement
>>>>>>> machine learning to compete with cclib as an efficient data parser. From
>>>>>>> what I understand, you wish to train a machine learning model to handle and
>>>>>>> convert data between various software outputs.
>>>>>>>
>>>>>>> I suggest that the role of machine learning is not to handle or
>>>>>>> parse data but rather to analyze it. cclib can benefit from backend trained
>>>>>>> ML models to do tasks like classify file data, identify and extract
>>>>>>> information from files. It can also perform very accurate regression and
>>>>>>> emulate complex function maps which could benefit any calculation methods
>>>>>>> used by cclib.
>>>>>>>
>>>>>>> We can use algorithms like CRFs to label and identify data in data
>>>>>>> files or use neural networks or any other regression methods to complement
>>>>>>> calculations.
>>>>>>>
>>>>>>> I am a final-year student looking for a prospective GSoC project to
>>>>>>> work on. I have previously worked with a research group implementing
>>>>>>> machine learning for ODE solvers to compete with ab initio calculations
>>>>>>> run in the Gaussian software. I would be happy to discuss further
>>>>>>> how we can work with cclib functionalities. I look forward to hearing
>>>>>>> from you.
>>>>>>>
>>>>>>> Best Wishes,
>>>>>>> Aditya Kamath
>>>>>>>
>>>>>> _______________________________________________
>>>> cclib-devel mailing list
>>>> ccl...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/cclib-devel
>>>>
>>>
>>>
>>> --
>>> Tyler Josephson
>>> PhD Chemical Engineering
>>> Postdoctoral Research Associate, Siepmann Group
>>> University of Minnesota, Twin Cities
>>> 651-269-1433 | LinkedIn <https://www.linkedin.com/in/trjosephson/>
>>>
>>