From: Karol L. <kar...@gm...> - 2019-03-07 03:54:16
Hi Yibo,

Great to hear from you. You've covered a lot of ground there, so let me respond to your questions point by point:

1. You understood correctly: cclib will not parse papers. Its sole purpose is to parse the output of comp chem programs like NWChem.
2. There are projects specifically trying to extract data from papers; this project is not meant to address that at all.
3. There are various sources of comp chem logfiles online, including supplementary information uploaded with papers and bulk repositories.
4. I am not doing anything like this project at the moment, but it seems like a natural extension of what cclib does.

Hopefully that answers your questions. LMK if there's something else.

- Karol

On Wed, Mar 6, 2019 at 2:42 PM Zhang, Yibo <yz...@wi...> wrote:

> Hello Karol,
>
> I’m Yibo Zhang, a first-year PhD student majoring in computational
> chemistry at the University of New Hampshire. I found your project
> (Discovering computational chemistry content online) under GSoC Ideas
> 2019, and it’s really cool. There are several reasons why this project
> appeals to me. The first is that I need to do a machine learning project
> on a chemical sensor, so collecting a large dataset is a prerequisite.
> Another is that crawlers have always interested me, especially combined
> with chemistry. I've learned how to crawl the web and extract
> interesting information, and I’m a fan of RSS (a kind of crawler).
> Learning more good tools like cclib is also helpful, because my group
> uses other software to handle data, which is not very handy for me.
>
> However, I have a question about your project. Having read through
> cclib’s description and learned how to use it, I found that it can read
> output files from programs like NWChem, which I work with. However, if
> we want the crawler to grab information from any published paper, that
> cannot be processed by cclib directly. I think we would need to use
> regex to extract useful information and then store it in a database,
> which is unrelated to cclib. Another possibility is that some published
> papers include the output files, so we would grab only those papers and
> handle them with cclib. However, as far as I know, not many papers
> include output files, if any do. My final guess is that there are
> websites storing a lot of computational models, which we could grab and
> feed into cclib.
>
> Please give me a hint, as I’m really curious about what you are going
> to do.
>
> Thanks,
> Yibo
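
A minimal sketch of how points 1 and 3 above might fit together, assuming a crawler has already downloaded a candidate file: first check whether the text looks like program output rather than a paper, then hand it to cclib. The banner strings and the helper names `guess_program` and `parse_logfile` are hypothetical and chosen only for illustration; `cclib.io.ccread` is cclib's generic reader. This is not necessarily how the GSoC project would approach it.

```python
import re

# Illustrative banner fragments (assumptions, not an exhaustive or verified list).
SIGNATURES = {
    "NWChem": re.compile(r"Northwest Computational Chemistry Package"),
    "Gaussian": re.compile(r"Entering Gaussian System"),
}


def guess_program(text, max_chars=20000):
    """Return the name of the first program whose banner appears near the top, else None."""
    head = text[:max_chars]  # banners normally sit at the start of a logfile
    for name, pattern in SIGNATURES.items():
        if pattern.search(head):
            return name
    return None


def parse_logfile(path):
    """Pass a candidate file to cclib; return None if cclib is not installed."""
    try:
        from cclib.io import ccread  # optional dependency
    except ImportError:
        return None
    return ccread(path)


if __name__ == "__main__":
    sample = "   Northwest Computational Chemistry Package (NWChem) 6.8\n"
    print(guess_program(sample))
    print(guess_program("Abstract: we report DFT calculations on ..."))
```

A pre-filter like this only decides which downloads are worth passing to cclib; the actual parsing, and the decision about whether a file is really supported, would stay with cclib itself.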