Hi,
I have looked at the page you point to,
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals
And, so far as I can see, the best way to programmatically obtain as many
chemicals as possible is to use these three pages:
http://en.wikipedia.org/wiki/List_of_inorganic_compounds
http://en.wikipedia.org/wiki/List_of_organic_compounds
http://en.wikipedia.org/wiki/List_of_biomolecules
The titles listed on these pages provide links (once edited, which took me
relatively little time using vi) to over 1900 chemicals, many of which have
chemfoboxes. Unfortunately, these aren't accessible through any of the Category:
links I could find, so it's up to us either to write a custom parser just
to get the chemical names, or to manually cut up the pages themselves (as I did;
with vi, enough shortcuts can be taken that it takes me less than 5 minutes per
page).
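A custom parser for the chemical names need not be elaborate. The following is
only a minimal sketch, in Python rather than the PHP of the extractor scripts.
It assumes that each entry on the list pages is a bullet line whose first wiki
link is the chemical's article, and that titles containing a colon are
namespaced links to skip. The names WIKILINK and list_page_titles are made up
for illustration, and the sample text is invented.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# group 1 is the link target without any section anchor.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def list_page_titles(wikitext):
    """Return unique article titles linked from bullet lines, in page order."""
    titles = []
    seen = set()
    for line in wikitext.splitlines():
        # Assumption: only bullet lines are list entries.
        if not line.startswith("*"):
            continue
        match = WIKILINK.search(line)  # first link on the line only
        if match is None:
            continue
        title = match.group(1).strip()
        # Assumption: a colon means a namespaced link (Category:, Image:, ...).
        if not title or ":" in title:
            continue
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


sample = """== A ==
* [[Acetic acid]] - CH3COOH
* [[Aluminium oxide|Alumina]] - Al2O3
* [[Acetic acid]]
[[Category:Chemistry-related lists]]
"""
print(list_page_titles(sample))  # ['Acetic acid', 'Aluminium oxide']
```

On the invented sample this yields the two distinct titles and leaves the
Category: link out. The real list pages may need further rules.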
My next step will be to use your ChemboxExtractor.php to produce a file
containing the bodies of all these 1900 chemicals, and then have a look at all
the data fields that show up inside chemfoboxes, so I can devise a parsing
approach.
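The field survey could be sketched roughly as follows, again in Python rather
than PHP. This is a naive illustration under stated assumptions: it only
recognises "| name = value" parameter lines of the kind Chembox_new uses, it
does not handle the older table-style chemboxes, and it counts such lines
anywhere in an article that passes the guard. The names FIELD, uses_chembox and
field_counts are hypothetical, and the sample articles are invented.

```python
import re
from collections import Counter

# A "| Name = value" parameter line; group 1 is the parameter name.
FIELD = re.compile(r"^[ \t]*\|[ \t]*([^=|{}\[\]\n]+?)[ \t]*=", re.MULTILINE)


def uses_chembox(wikitext):
    """Cheap guard: does the article mention a chembox template at all?"""
    return "{{chembox" in wikitext.lower()


def field_counts(articles):
    """Count in how many chembox articles each parameter name appears."""
    counts = Counter()
    for wikitext in articles:
        if not uses_chembox(wikitext):
            continue  # skip articles without a chembox
        # A set, so a name repeated within one article is counted once.
        names = {m.group(1).strip().lower() for m in FIELD.finditer(wikitext)}
        counts.update(names)
    return counts


water = """{{Chembox new
| Name = Water
| Formula = H2O
| MolarMass = 18.015 g/mol
}}
Water is a chemical substance."""
ethanol = """{{Chembox new
| Name = Ethanol
| Formula = C2H6O
}}"""
innsbruck = """{{Infobox Town AT
| name = Innsbruck
}}"""

counts = field_counts([water, ethanol, innsbruck])
print(counts["formula"], counts["molarmass"])  # 2 1
```

Sorting the resulting counts by frequency would show which fields are common
enough to be worth mapping to triples first.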
I already have a working local Wikipedia installation, so there's no need to
worry about that nightmare at the moment ;-)
Are my messages making it properly to the list? If so, shall I omit your email
addresses when responding on this thread?
Thanks,
Christian
Quoting Georgi Kobilarov <gkob@gmx.de>:
> Hi Christian,
>
> > What I am currently trying to do (and please stop me if you have already
> > gotten beyond what I have, or have a better strategy) is to identify how
> > many chemical subcategories (such as Household_chemicals, Solvents, etc.)
> > have well-defined 'chemfoboxes' that could easily be sought out and parsed
> > to create triples.
>
> I suggest you start not by identifying all chemistry-related
> categories, but by looking for the different mechanisms/templates for
> representing chemicals in Wikipedia, like the chembox tables or
> Chembox_new.
> http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals should
> help.
>
> I created ChemboxExtractor.php for you, so you can start with writing
> code to parse articles that use the chembox "template".
>
> Extractors are usually run on all Wikipedia articles, so I
> included a preg_match("{{chembox header") at the top to ensure the article
> uses that template before starting to parse.
>
> You can use extract.php to get single article sources from
> Live-Wikipedia, but if you want to try a complete extraction of all
> articles, you will need to install a local Wikipedia MySQL database.
>
> That's a bit tricky, but you can use our import.php script. It's
> important to configure MySQL adequately, i.e. give plenty of memory to the
> key buffer and other caches.
> The import script will fail the first time; just try again. MySQL needs
> to load everything into memory, and that will cause a timeout at first.
> Comment out the files in import.php that have been processed successfully.
>
> Our extraction code is not yet documented, but that will hopefully
> happen in the next few days.
>
> Happy coding ;)
>
> Cheers,
>
> Georgi
>
> --
> Georgi Kobilarov
> Freie Universität Berlin
> www.georgikobilarov.com
>
>
> > -----Original Message-----
> > From: dbpedia-discussion-bounces@lists.sourceforge.net
> > [mailto:dbpedia-discussion-bounces@lists.sourceforge.net] On Behalf Of
> > cleger@scs.carleton.ca
> > Sent: Tuesday, July 31, 2007 8:49 PM
> > To: dbpedia-discussion@lists.sourceforge.net
> > Subject: [Dbpedia-discussion] chemistry-related infoboxes
> >
> > Hi list,
> >
> > Like others before me, I am interested in adding to dbpedia's parsing
> > capabilities the power to correctly digest infoboxes that are associated
> > with different types of chemical substances. These infoboxes are not
> > strictly 'infoboxes', in that their names and syntax in the wiki markup are
> > not exactly the same as for the infobox that would show up for, say,
> > Innsbruck. Therefore, these boxes are skipped in the current system.
> >
> > What I am currently trying to do (and please stop me if you have already
> > gotten beyond what I have, or have a better strategy) is to identify how
> > many chemical subcategories (such as Household_chemicals, Solvents, etc.)
> > have well-defined 'chemfoboxes' that could easily be sought out and parsed
> > to create triples.
> >
> > My initial observation is that there are several types of chemfoboxes, so
> > at the moment I am trying to compose a list of which categories of
> > chemicals contain which types of chemfoboxes. When I have enough of these
> > (categories whose members often have chemfoboxes) to cover perhaps a few
> > hundred substances, I will then set to work trying to parse those that have
> > chemfoboxes and ignore gracefully those that don't. I will share this when
> > done.
> >
> > Your suggestions and comments are most welcome.
> >
> > Thanks,
> >
> > Christian
> >
> >
> >
> > _______________________________________________
> > Dbpedia-discussion mailing list
> > Dbpedia-discussion@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>