Re: [Aperture-devel] extracting information for mbox


Jose Gato Luis wrote:
> Qualipso is a project with many activities and work packages. In one of
> these packages, we need to extract information from several sources (svn,
> mailing lists, web pages, forums, etc.). All this information is going to
> define an ontology about an open source project. With this information,
> we could build different tools, like a semantic browser that could be
> useful for developers.

For some of these sources we have crawlers available.

In the past I have played around with our WebCrawler, applying it to
online forums. It turns out that you really, really want a dedicated
forum crawler, one that will probably have to be specific to particular
forum software packages, e.g. phpBB. Such a crawler would create
DataObjects representing individual posts and threads, rather than the
pages they are presented in. It would know how to separate the real
content from the navigation links generated by the software, know how
to prevent crawling of duplicate information (the same thread is
presented in various ways, e.g. using different sorting criteria), skip
the "report abuse" and other administrative pages, etc.
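
To make the duplicate-avoidance and page-skipping part a bit more
concrete, here is a minimal URL-level sketch in Python. It is not
Aperture code and does not use any Aperture API. The parameter names
(sid, sk, sd, st) and script names (report.php etc.) are assumptions
modelled loosely on phpBB-style URLs, and the helper names
canonical_url and crawl_order are made up for this illustration. A real
forum crawler would also have to parse posts and threads out of the
pages, which this sketch does not attempt.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed query parameters that only change presentation (session id,
# sort key, sort direction, time filter). Adjust per forum software.
IGNORED_PARAMS = {"sid", "sk", "sd", "st"}

# Assumed script names of administrative pages without post content.
SKIPPED_SCRIPTS = {"report.php", "ucp.php", "posting.php"}


def canonical_url(url):
    """Return a canonical form of url, or None if the page is skipped."""
    parts = urlsplit(url)
    script = parts.path.rsplit("/", 1)[-1]
    if script in SKIPPED_SCRIPTS:
        return None
    # Drop presentation-only parameters and sort the rest, so that the
    # same thread always yields the same key.
    params = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(params), "")
    )


def crawl_order(urls):
    """Return canonical URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        key = canonical_url(url)
        if key is None or key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result
```

With this, two links to the same thread that differ only in sorting or
session id collapse into one entry, and a "report abuse" link is
dropped before it is ever fetched.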

You may also want to take a look at <http://simile.mit.edu/wiki/RDFizers>.
They have created and collected converters that create RDF
representations of SVN, Jira, Javadoc, etc. Expect a wide range of
technologies here though, e.g. Java, Python, shell scripts, etc. Ports
of some of these to Java and to the Aperture APIs would be a welcome
addition :)

Regards,

Chris
--