It would be great to collect and store project mailing lists. Currently we have tables in the database for the very few that James collected (project_lists and list_posts).
Lists are not available anywhere (i.e., not in the dumps that Notre Dame keeps, although Forums and Trackers are).
While the archives are generated from an mbox file, SourceForge, at least, doesn't make that file available, so we'll need to spider. This will be a large project because there is a lot of text for each project. I suggest we do an initial run, which might take 3 months or so, then just do incrementals after that.
The mailing lists are linked from the project home page, e.g. for BibDesk:
http://sourceforge.net/mail/?group_id=61487
There are a number of lists for each project, and each can be viewed as Nested, Flat or Threaded. Only Nested shows the full text and the threading structure, so that's the one to get.
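To make that concrete, here is a rough Ruby sketch of pulling the list-archive links off a project's mail page. The /mailarchive/ link pattern and the sleep interval are guesses, not something checked against the live markup.

  require 'open-uri'

  group_id = 61487  # BibDesk, from the URL above
  index_html = URI.open("http://sourceforge.net/mail/?group_id=#{group_id}").read

  # Guess at the archive link pattern; verify against the real pages.
  archive_links = index_html.scan(/href="([^"]*mailarchive[^"]*)"/i).flatten.uniq

  archive_links.each do |link|
    puts link
    sleep 5  # politeness delay between requests
  end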
There is Perl code that used to work in the OSSmolePerl module in the CVS repository.
We should prioritize collection over parsing, since collection will take ages. The Nested view is a little hard to parse, since the reply structure is shown with nested ordered lists (HTML OL elements).
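If the threading really is expressed as nested ordered lists, the parse could look something like this (Ruby with Nokogiri, purely as a sketch; the selectors are assumptions about markup I haven't checked):

  require 'nokogiri'

  # Walk an <ol>, treating each <li> as a post and any <ol> nested inside
  # that <li> as its replies. Markup assumptions, not verified.
  def walk_thread(ol_node, parent_id, posts)
    ol_node.xpath('./li').each do |li|
      subject = li.at_xpath('.//a')&.text
      post_id = posts.size + 1
      posts << { id: post_id, parent: parent_id, subject: subject }
      li.xpath('./ol').each { |child_ol| walk_thread(child_ol, post_id, posts) }
    end
    posts
  end

  doc  = Nokogiri::HTML(File.read('nested_month_page.html'))
  root = doc.at_xpath('//ol')
  threads = root ? walk_thread(root, nil, []) : []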
Some other forges might simply give access to the mbox files for the project, which would be great. SF disguises the email addresses, but at least it does so consistently, so basic matching of senders is still possible.
Anonymous
So I've been thinking some more about this. Here's the dream spec:
Forges.each
  Projects.each
    Lists.each (dev, users, all of them)
      Month.each
        store page
    Forums.each (dev, users, all of them)
      Month.each
        store page
    Trackers.each (bugs, rfe, support)
      store page
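Spelled out as Ruby, roughly; all of the helpers here (projects_for, lists_for, months_for, forums_for, trackers_for, fetch_page, store_page) are hypothetical placeholders, not existing code:

  ['sourceforge'].each do |forge|
    projects_for(forge).each do |project|
      lists_for(project).each do |list|            # dev, users, all of them
        months_for(list).each do |month|
          store_page(forge, project, list, month, fetch_page(list, month))
          sleep 5                                  # politeness delay
        end
      end
      forums_for(project).each do |forum|          # dev, users, all of them
        months_for(forum).each do |month|
          store_page(forge, project, forum, month, fetch_page(forum, month))
          sleep 5
        end
      end
      trackers_for(project).each do |tracker|      # bugs, rfe, support
        store_page(forge, project, tracker, nil, fetch_page(tracker, nil))
        sleep 5
      end
    end
  end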
First collection - get everything to date
Regular collection - just get new stuff. We'll need some way to join these onto the original collections (without losing the traceability).
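One way to keep the incremental runs joinable to the first run: key every stored page by (forge, project, list, month) and only fetch keys we don't already hold. A minimal sketch, assuming the stored keys can be loaded from the list_html table:

  require 'set'

  # candidates:     every [forge, project, list, month] page that exists now
  # already_stored: Set of keys loaded from the existing table
  def keys_to_fetch(candidates, already_stored)
    fresh = candidates.reject { |key| already_stored.include?(key) }
    # Also refetch the most recent stored month of each list, since that
    # page keeps growing until the month is over.
    latest = already_stored.group_by { |k| k[0..2] }
                           .map { |_, keys| keys.max_by { |key| key[3] } }
    (fresh + latest).uniq
  end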
Initial priority should be collecting the mailing list HTML. The HTML for the others should be done next. Parsing can always be re-jigged during and/or after the collection. Or maybe we should do this in rounds (with a sample of interesting projects first ...)
I anticipate that, even with sleeps, this spidering run could take a few months. We are talking a page for each month for each list (sometimes over 100 months and 4 or 5 lists per project, maybe an average of 2). Assuming 100,000 projects, we're talking on the order of 20,000,000 pages just for the lists (100,000 projects, 100 months, 2 lists per project).
For a scope comparison: currently project_indexes holds about 2,000,000 pages, and that is 107 GB (uncompressed). If list pages are a similar size or larger, 20,000,000 of them would be on the order of a terabyte uncompressed.
Some of the list pages are going to be larger than the project pages. I think we'll definitely need to compress this data (roughly an order-of-magnitude decrease) by gzipping it before putting it into the tables. Yes, that means we can't do LIKE queries, but that doesn't really matter, since we do all processing using java/perl/ruby, which can uncompress as required for parsing.
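For example, with Ruby's standard Zlib (just a sketch of the round trip; the file and column names are made up):

  require 'zlib'

  html       = File.read('nested_month_page.html')
  compressed = Zlib.gzip(html)        # this goes into a BLOB column
  # ...later, at parse time, the processing code just undoes it:
  original   = Zlib.gunzip(compressed)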
For draft specs of the tables to hold the parsed data, see ossmole_next.projects_lists, ossmole_next.list_html, and ossmole_next.lists_posts, as well as the ossmole_next.tracker* tables.