From: Christiaan F. <chr...@ad...> - 2008-01-30 17:36:43
Jose Gato Luis wrote:
> Qualipso is a project with many activities and work packages. In one of
> this packages, We need to extract information from several sources (svn,
> mailing list, web pages, forum, etc), all this information is going to
> define an ontology about an open source project. With this information,
> we could make different tools, like a semantic browser that could be
> useful for developers.

For some of these sources we have crawlers available.

In the past I have played around with our WebCrawler, applying it to
online forums. It turns out that you really, really want a dedicated forum
crawler, one that probably has to be specific to a particular forum
software package, e.g. phpBB. Such a crawler would create DataObjects
representing individual posts and threads rather than the pages they are
presented in, know how to separate the real content from the navigation
links generated by the software, know how to prevent crawling of duplicate
information (the same thread is presented in various ways, e.g. using
different sorting criteria), skip the "report abuse" and other
administrative pages, etc.

You may also want to take a look at http://simile.mit.edu/wiki/RDFizers:
they have created and collected converters that create RDF
representations of SVN, Jira, Javadoc, etc. Expect a wide range of
technologies here though, e.g. Java, Python, shell scripts, etc. Ports of
some of these to Java and to the Aperture APIs would be a welcome
addition :)

Regards,

Chris
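
To make the duplicate-thread and administrative-page points a bit more concrete, here is a rough sketch in Python. It is not Aperture code and not an existing crawler. The URL conventions (threads served by `viewtopic.php` and identified by a `t` parameter, administrative pages such as `report.php`) are assumptions about a phpBB-style forum, and the names `thread_key`, `is_administrative` and `select_threads` are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed phpBB-style conventions (not taken from any real installation):
# a thread is served by viewtopic.php and identified by the "t" parameter;
# sorting, paging and session parameters only change how it is presented.
THREAD_SCRIPT = "viewtopic.php"
ADMIN_SCRIPTS = {"report.php", "posting.php", "ucp.php"}


def script_name(url):
    """Return the last path segment of the URL, e.g. 'viewtopic.php'."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


def is_administrative(url):
    """True for pages a forum crawler would not want to fetch at all."""
    return script_name(url) in ADMIN_SCRIPTS


def thread_key(url):
    """Return a (host, topic id) key for a thread URL, or None otherwise."""
    if script_name(url) != THREAD_SCRIPT:
        return None
    parts = urlsplit(url)
    topic_ids = parse_qs(parts.query).get("t")
    if not topic_ids:
        return None
    return (parts.netloc.lower(), topic_ids[0])


def select_threads(urls):
    """Keep the first URL seen for each distinct thread, in input order."""
    seen = {}
    for url in urls:
        if is_administrative(url):
            continue
        key = thread_key(url)
        if key is not None and key not in seen:
            seen[key] = url
    return list(seen.values())
```

Given the same thread linked once plainly and once with sorting parameters, `select_threads` keeps only the first link, and a `report.php` link is dropped before it is ever considered. A real forum crawler would still have to parse the fetched thread page into individual posts, which is where the software-specific knowledge really comes in.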