From: Karol L. <kar...@gm...> - 2018-03-20 15:42:12
Hi Alok,

That's a nice first hack. Here are some questions you can consider.

1. How do you actually discover content online? We know NIST and Pitt exist, but a crawler should also be able to find new sources we did not know about before.
2. Do you want to write a crawler for each repository out there (NIST, Pitt, ...)?
3. When you rerun the crawler, do you need to go over the whole repo again?

Also, I'd encourage you to read through PEP8 (https://www.python.org/dev/peps/pep-0008/) and try to apply it to the code you write.

- Karol

On Tue, Mar 20, 2018 at 2:38 AM, Alok Kumar <alo...@gm...> wrote:

> *cclib project - Discovering computational chemistry content online*
>
> I was looking for databases to crawl logfiles and came across NIST database
> https://cccbdb.nist.gov/output.asp
> I wrote a crawler script in python to fetch the Gaussian input/output
> logfiles.
> https://github.com/DCoderGH/ccNIST-crawler
> [Use Python2]
> It is not well documented yet but will add comments soon.
> Please have a look and share some feedback.
>
> I am not fully aware of the policies for crawling NIST databases.
> Also, do let me know about any other open databases which can be mined.
>
> -Alok
>
> On Tue, Mar 6, 2018 at 10:53 PM, Karol Langner <kar...@gm...>
> wrote:
>
>> Hi Alok,
>>
>> No specific issue comes to mind at the moment, you can peruse the various
>> bugs we have filed at https://github.com/cclib/cclib/issues
>>
>> An idea for pre-project work: you could try to write a simplistic parser
>> and try to get it to crawl the github repository and find out test files.
>>
>> As far as the application is concerned, I would suggest putting a lot of
>> focus on what the project would achieve. For example, can you estimate the
>> number of logfiles out there you expect to find with the crawler?
>>
>>
>> On Tue, Mar 6, 2018 at 7:01 AM, Alok Kumar <alo...@gm...> wrote:
>>
>>> #cclib #parser-crawler
>>>
>>> Hello. Had been busy with my mid-semester exams.
>>> Looked into various parser codes.
>>> I wanted to ask what should I start working upon - any specific issue in
>>> parser or building a crawler...
>>>
>>> Regards,
>>> Alok
>>>
>>> On Tue, Feb 20, 2018 at 9:38 PM, Karol Langner <kar...@gm...>
>>> wrote:
>>>
>>>> Done.
>>>>
>>>> On Tue, Feb 20, 2018 at 8:01 AM, ALOK KUMAR <alo...@gm...>
>>>> wrote:
>>>>
>>>>> I wanted to join the slack channel to follow the discussion. Can you
>>>>> please send Open Chemistry slack invite to my email
>>>>> alo...@gm...
>>>>>
>>>>> Regards,
>>>>> Alok
>>>>>
>>>>> On Tue, Feb 20, 2018 at 10:48 AM, Karol Langner <
>>>>> kar...@gm...> wrote:
>>>>>
>>>>>> Hi Alok,
>>>>>>
>>>>>> Good to hear from you. I don't think a deep understanding of
>>>>>> computational chemistry or web crawling is a prerequisite, but of course it
>>>>>> would be an advantage. You need to be familiar enough with these subjects
>>>>>> to be able to propose a project plan that would deliver results during the
>>>>>> summer. The only way we can judge this is your proposal and any
>>>>>> contributions you make before the application period.
>>>>>>
>>>>>> I would mention that this project is a little more research-y than
>>>>>> most others. You'll need to design a crawler/scraper that is appropriate
>>>>>> for this application, and figure out what gaps need to be filled in in the
>>>>>> existing codebase.
>>>>>>
>>>>>> Let us know if you have any other questions.
>>>>>>
>>>>>> - Karol
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 19, 2018 at 1:09 PM, ALOK KUMAR <alo...@gm...>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello everyone.
>>>>>>> I am Alok, a second year undergraduate at Indian Institute of
>>>>>>> Technology Bombay, India. I have interest in computational chemistry and
>>>>>>> had taken up a course this semester in my college.
>>>>>>>
>>>>>>> Among the GSoC-2018 ideas I found projects under cclib quite
>>>>>>> interesting, particularly ##Discovering computational chemistry content
>>>>>>> online##
>>>>>>>
>>>>>>> I am familiar with web scraping using BeautifulSoup and urllib in
>>>>>>> Python and had recently started looking into Scrapy as well.
>>>>>>>
>>>>>>> I wanted to know how much understanding / familiarity is expected
>>>>>>> for contributing to this project. Also, (if required) is deeper
>>>>>>> understanding of computational chemistry or advanced web crawling a
>>>>>>> prerequisite or can it be worked upon during the initial Coding phase of
>>>>>>> GSoC.
>>>>>>>
>>>>>>> Looking ahead to staying in touch with the Open Chemistry community.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Alok Kumar
>>>>>>> alo...@gm...
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Openchemistry-developers mailing list
>>>>>>> Ope...@pu...
>>>>>>> https://public.kitware.com/mailman/listinfo/openchemistry-developers
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
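
Regarding the third question at the top of this thread (whether a rerun has to go over the whole repo again), one possible approach is to keep a small state file between runs and only report pages whose content changed. The sketch below is an illustration only and is not taken from ccNIST-crawler: the names load_state, save_state, crawl_incrementally and the fetch callable are all hypothetical, and it assumes Python 3 rather than the Python 2 the script currently targets. Note that it still downloads every page to hash it; skipping the download itself would need something like HTTP conditional requests, which depends on what the server supports.

```python
import hashlib
import json
from pathlib import Path


def load_state(path):
    """Return the saved {url: sha256 hex digest} mapping, or {} on a first run."""
    state_file = Path(path)
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}


def save_state(path, state):
    """Write the {url: digest} mapping back to disk as JSON."""
    Path(path).write_text(json.dumps(state, indent=2, sort_keys=True))


def crawl_incrementally(urls, fetch, state):
    """Return the URLs whose content is new or changed since the last run.

    `fetch` is any callable mapping a URL to bytes (hypothetical; in a real
    crawler it would wrap an HTTP request). `state` is updated in place.
    """
    changed = []
    for url in urls:
        digest = hashlib.sha256(fetch(url)).hexdigest()
        if state.get(url) != digest:
            state[url] = digest
            changed.append(url)
    return changed
```

A run would then look like: state = load_state("state.json"), pass it to crawl_incrementally together with the URL list and a fetch function, hand only the returned URLs to the parser, and finish with save_state("state.json", state).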