It would be great to collect and store project mailing lists. Currently we have tables in the database for the very few that James collected (project_lists and list_posts).
Lists are not available anywhere (i.e., not in the dumps that Notre Dame keeps, although Forums and Trackers are).
While the archives are generated from an mbox file, SourceForge, at least, doesn't make that file available, so we'll need to spider. This will be a large project because there is a lot of text for each project. I suggest we do an initial run, which might take 3 months or so, then just do incrementals after that.
The mailing lists are linked from the project home page, e.g. for BibDesk:
http://sourceforge.net/mail/?group_id=61487
There are a number of lists for each project, and each can be viewed as Nested, Flat or Threaded. Only Nested shows the full text and the threading structure, so that's the one to get.
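To make that concrete, here is a rough Ruby sketch of pulling the list-archive links off a project's mail page. The /mailarchive/ link pattern and the sleep interval are guesses, not something checked against the live markup.

  require 'open-uri'

  group_id = 61487  # BibDesk, from the URL above
  index_html = URI.open("http://sourceforge.net/mail/?group_id=#{group_id}").read

  # Guess at the archive link pattern; verify against the real pages.
  archive_links = index_html.scan(/href="([^"]*mailarchive[^"]*)"/i).flatten.uniq

  archive_links.each do |link|
    puts link
    sleep 5  # politeness delay between requests
  end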
There is Perl code that used to work in the OSSmolePerl module in the CVS repository.
We should prioritize collection over parsing, since collection will take ages. The Nested view is a little hard to parse, since the reply structure is shown with nested ordered lists (HTML OL elements).
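If the threading really is expressed as nested ordered lists, the parse could look something like this (Ruby with Nokogiri, purely as a sketch; the selectors are assumptions about markup I haven't checked):

  require 'nokogiri'

  # Walk an <ol>, treating each <li> as a post and any <ol> nested inside
  # that <li> as its replies. Markup assumptions, not verified.
  def walk_thread(ol_node, parent_id, posts)
    ol_node.xpath('./li').each do |li|
      subject = li.at_xpath('.//a')&.text
      post_id = posts.size + 1
      posts << { id: post_id, parent: parent_id, subject: subject }
      li.xpath('./ol').each { |child_ol| walk_thread(child_ol, post_id, posts) }
    end
    posts
  end

  doc  = Nokogiri::HTML(File.read('nested_month_page.html'))
  root = doc.at_xpath('//ol')
  threads = root ? walk_thread(root, nil, []) : []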
Some other forges might simply give access to the mbox files for the project, which would be great. SF disguises the email addresses, but at least it does so consistently, so basic matching of senders is still possible.
Anonymous
So I've been thinking some more about this. Here's the dream spec:
Forges.each
  Projects.each
    Lists.each (dev, users, all of them)
      Month.each
        store page
    Forums.each (dev, users, all of them)
      Month.each
        store page
    Trackers.each (bugs, rfe, support)
      store page
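Spelled out as Ruby, roughly; all of the helpers here (projects_for, lists_for, months_for, forums_for, trackers_for, fetch_page, store_page) are hypothetical placeholders, not existing code:

  ['sourceforge'].each do |forge|
    projects_for(forge).each do |project|
      lists_for(project).each do |list|            # dev, users, all of them
        months_for(list).each do |month|
          store_page(forge, project, list, month, fetch_page(list, month))
          sleep 5                                  # politeness delay
        end
      end
      forums_for(project).each do |forum|          # dev, users, all of them
        months_for(forum).each do |month|
          store_page(forge, project, forum, month, fetch_page(forum, month))
          sleep 5
        end
      end
      trackers_for(project).each do |tracker|      # bugs, rfe, support
        store_page(forge, project, tracker, nil, fetch_page(tracker, nil))
        sleep 5
      end
    end
  end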
First collection - get everything to date
Regular collection - just get new stuff. We'll need some way to join these onto the original collections (without losing the traceability).
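One way to keep the incremental runs joinable to the first run: key every stored page by (forge, project, list, month) and only fetch keys we don't already hold. A minimal sketch, assuming the stored keys can be loaded from the list_html table:

  require 'set'

  # candidates:     every [forge, project, list, month] page that exists now
  # already_stored: Set of keys loaded from the existing table
  def keys_to_fetch(candidates, already_stored)
    fresh = candidates.reject { |key| already_stored.include?(key) }
    # Also refetch the most recent stored month of each list, since that
    # page keeps growing until the month is over.
    latest = already_stored.group_by { |k| k[0..2] }
                           .map { |_, keys| keys.max_by { |key| key[3] } }
    (fresh + latest).uniq
  end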
Initial priority should be collecting the mailing list HTML. The HTML for the others should be done next. Parsing can always be re-jigged during and/or after the collection. Or maybe we should do this in rounds (with a sample of interesting projects first ...)
I anticipate that, even with sleeps, this spidering run could take a few months. We are talking a page for each month for each list (sometimes over 100 months and 4 or 5 lists per project, maybe an average of 2). Assuming 100,000 projects, we're talking on the order of 20,000,000 pages just for the lists (100,000 projects, 100 months, 2 lists per project).
For a scope comparison: currently project_indexes holds about 2,000,000 pages, and that is 107 GB (uncompressed). If list pages are a similar size or larger, 20,000,000 of them would be on the order of a terabyte uncompressed.
Some of the list pages are going to be larger than the project pages. I think we'll definitely need to compress this data (roughly an order-of-magnitude decrease) by gzipping it before putting it into the tables. Yes, that means we can't do LIKE queries, but that doesn't really matter, since we do all processing using java/perl/ruby, which can uncompress as required for parsing.
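For example, with Ruby's standard Zlib (just a sketch of the round trip; the file and column names are made up):

  require 'zlib'

  html       = File.read('nested_month_page.html')
  compressed = Zlib.gzip(html)        # this goes into a BLOB column
  # ...later, at parse time, the processing code just undoes it:
  original   = Zlib.gunzip(compressed)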
For draft specs of the tables to hold the parsed data, see ossmole_next.projects_lists, ossmole_next.list_html, and ossmole_next.lists_posts, as well as the ossmole_next.tracker* tables.