From: Karol L. <kar...@gm...> - 2019-03-07 19:29:56
Hi Yibo,

There's definitely the possibility! I think there are indeed a lot of things in this area that could sidetrack you, rabbit holes like exposing the data to the whole world, organizing it, and using it in smart ways. Defining the scope is really your job and one of the main things you need to think about for the application. You need to know what your strengths are and how much you can do in 12 weeks. I would encourage you to be as narrow as possible, and detailed within the scope you think is realistic.

Hope that helps,
Karol

On Thu, Mar 7, 2019 at 11:22 AM Zhang, Yibo <yz...@wi...> wrote:

> Hello Karol,
>
> Cool, that makes sense.
>
> So, I want to know what the next step is. I guess we need to store the
> parsed data in a database (I have some experience with MongoDB in
> JavaScript). And then what kind of application can it be used for? Maybe
> turn it into a small chemistry structure search engine, or use it as
> material for some machine learning, or maybe the project stops there,
> since it's just a 12-week project. Yeah, I know it's kind of overthinking,
> but I want to know the scope of this project.
>
> Also, I can write a proposal that you could give me suggestions on, if
> you think there is some possibility of me joining this project.
>
> Thanks,
> Yibo
>
> On Mar 6, 2019, 22:54 -0500, Karol Langner <kar...@gm...> wrote:
>
> Hi Yibo,
>
> Great to hear from you. You've covered a lot of ground there, so let me
> respond to your questions in points:
> 1. You understood correctly, cclib will not parse "papers"; its sole
> purpose is to parse the output of comp chem programs like NWChem.
> 2. There are projects specifically trying to extract data from papers;
> this project is not meant to address that at all.
> 3. There are various places/sources of comp chem logfiles online,
> including supplementary information uploaded with papers and bulk
> repositories.
> 4. I am not working on anything like this project at the moment, but it
> seems like a natural extension of what cclib does.
>
> Hopefully that answers your questions. LMK if there's something else.
>
> - Karol
>
> On Wed, Mar 6, 2019 at 2:42 PM Zhang, Yibo <yz...@wi...> wrote:
>
>> Hello Karol,
>>
>> I'm Yibo Zhang, a first-year PhD student majoring in computational
>> chemistry at the University of New Hampshire. I found your project
>> (Discovering computational chemistry content online) under GSoC Ideas
>> 2019, and it's really cool. There are several reasons why this project
>> attracts me. The first is that I need to do a machine learning project
>> about a chemical sensor, so collecting a large dataset is a kind of
>> prerequisite. Another reason is that crawlers have always interested me,
>> especially combined with chemistry. I've learned how to crawl the web and
>> extract interesting information, and I'm a fan of RSS (a kind of crawler).
>> Also, learning more good tools like cclib would be helpful, because in my
>> group we use other software to handle data, which is not super handy for me.
>>
>> However, I have a question about your project. Having read through the
>> cclib description and learned how to use it, I found it can read output
>> files from programs like NWChem, which I work with. However, if we want
>> the crawler to grab information from any published paper, that cannot be
>> processed by cclib directly. What I think is that we would need to use
>> regex to extract some useful information and then store it in a database,
>> which is unrelated to cclib.
>> Another possible way is that some published papers may include the output
>> files, in which case we only grab those papers and handle them with cclib.
>> However, as far as I know, not many papers include the output files, if
>> any. My final guess is that there are some websites storing a lot of
>> computational models, and we could grab those and put them into cclib.
>>
>> Please give me a hint, as I'm really curious about what you are going to
>> do.
>>
>> Thanks,
>> Yibo
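
[Editor's note: the thread above sketches a pipeline of parsing downloaded comp chem logfiles with cclib and storing the results in MongoDB. Below is a minimal sketch of that step, assuming cclib and pymongo are installed and a local MongoDB is running; the logfile name, database name, and collection name are hypothetical and not part of the original discussion.]

    # A minimal sketch: parse a comp chem logfile with cclib and store the
    # parsed attributes as one MongoDB document. File/database names are
    # hypothetical.
    import numpy as np
    import cclib
    from pymongo import MongoClient


    def to_jsonable(value):
        # MongoDB cannot store numpy arrays directly, so convert them (and
        # any arrays nested inside dicts or lists) to plain Python lists.
        if isinstance(value, np.ndarray):
            return value.tolist()
        if isinstance(value, dict):
            return {key: to_jsonable(item) for key, item in value.items()}
        if isinstance(value, (list, tuple)):
            return [to_jsonable(item) for item in value]
        return value


    def parse_to_document(logfile_path):
        # cclib's ccread handles output from NWChem, Gaussian, ORCA, etc.
        data = cclib.io.ccread(logfile_path)
        doc = {"source": logfile_path}
        for name, value in data.getattributes().items():
            doc[name] = to_jsonable(value)
        return doc


    if __name__ == "__main__":
        client = MongoClient("mongodb://localhost:27017")
        collection = client["compchem"]["logfiles"]
        collection.insert_one(parse_to_document("water_nwchem.out"))

Storing all cclib attributes as one document per logfile keeps the schema flexible, which fits the open-ended scope discussed in the thread.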