James Howison - 2007-10-25

Logged In: YES
user_id=844653
Originator: YES

So I've been thinking some more about this. Here's the dream spec:

Forges.each
  Projects.each
    Lists.each (dev, users, all of them)
      Month.each
        store page
    Forums.each (dev, users, all of them)
      Month.each
        store page
    Trackers.each (bugs, rfe, support)
      store page
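The nesting above can be sketched as runnable Ruby. Everything here is illustrative: the forge/project data, the page kinds, and the store_page method are stand-ins, not the real spider's API.

```ruby
# Toy data standing in for the real forge metadata.
FORGES = {
  "sourceforge" => {
    "projectA" => {
      lists:    ["devel", "users"],
      forums:   ["help", "open-discussion"],
      trackers: ["bugs", "rfe", "support"],
      months:   ["2007-09", "2007-10"],
    },
  },
}

def store_page(kind, forge, project, name, month = nil)
  # In the real spider this would fetch the page and insert the HTML into a
  # table; here we just record what would be stored.
  [kind, forge, project, name, month].compact.join("/")
end

def crawl(forges)
  pages = []
  forges.each do |forge, projects|
    projects.each do |project, info|
      info[:lists].each do |list|
        info[:months].each { |m| pages << store_page("list", forge, project, list, m) }
      end
      info[:forums].each do |forum|
        info[:months].each { |m| pages << store_page("forum", forge, project, forum, m) }
      end
      info[:trackers].each do |tracker|
        pages << store_page("tracker", forge, project, tracker)
      end
    end
  end
  pages
end
```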

First collection - get everything to date
Regular collection - just get the new stuff. We'll need some way to join these onto the original collections (without losing the traceability)
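One way the join could work is to tag every stored page with a collection id, so later rounds attach to the original rows without losing traceability. A minimal sketch, assuming a hypothetical Page record (not the real schema):

```ruby
# Each stored page remembers which collection run fetched it.
Page = Struct.new(:project, :list, :month, :collection_id)

# First collection: everything to date, tagged as collection 1.
def first_collection(project, list, months)
  months.map { |m| Page.new(project, list, m, 1) }
end

# Regular collection: only months not already stored, tagged with a new
# collection_id so the rows join back onto the originals.
def regular_collection(project, list, available_months, stored, collection_id)
  have = stored.select { |p| p.project == project && p.list == list }.map(&:month)
  (available_months - have).map { |m| Page.new(project, list, m, collection_id) }
end
```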

Initial priority should be collecting the mailing list HTML; the HTML for the others should be done next. Parsing can always be re-jigged during and/or after the collection. Or maybe we should do this in rounds (with a sample of interesting projects first ...)

I anticipate that, even with sleeps, this spidering run could take a few months. We are talking about a page for each month of each list (sometimes over 100 months and 4 or 5 lists per project, and probably a comparable volume for the bug trackers). Assuming 100,000 projects, we're talking on the order of 20,000,000 pages just for the lists (100,000 projects, 100 months, 2 lists per project).

For a scope comparison: Currently project_indexes holds about 2,000,000 pages and that is 107GB (uncompressed).
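That comparison gives a back-of-the-envelope storage projection for the new crawl. The per-page size is derived from project_indexes and assumed to carry over, which it may not:

```ruby
# Per-page size today: 107 GB across ~2,000,000 pages.
bytes_per_page  = 107.0 * 1024**3 / 2_000_000      # roughly 56 KB/page
# Projected page count from the list estimate above.
projected_pages = 100_000 * 100 * 2                # projects x months x lists
projected_gb    = projected_pages * bytes_per_page / 1024**3
# => about 1070 GB, i.e. ~1 TB uncompressed at current page sizes
```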

Some of the list pages are going to be larger than the project pages. I think we'll definitely need to compress this data by gzipping it before putting it into the tables (that should give order-of-magnitude decreases). Yes, that means we can't do LIKE queries, but that doesn't really matter, since we do all processing in Java/Perl/Ruby, which can uncompress as required for parsing.
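In Ruby the compress-before-insert step is a one-liner with the stdlib Zlib module; the method names here are illustrative. The blob would go into a BLOB column, and the parsers would gunzip it on the way out:

```ruby
require "zlib"

# Gzip the page HTML before it is inserted into the table.
def compress_html(html)
  Zlib.gzip(html)
end

# Inflate it again when a parser needs the raw HTML.
def decompress_html(blob)
  Zlib.gunzip(blob)
end
```

Repetitive list-archive HTML compresses very well, so the round trip costs little CPU relative to the storage saved.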

For draft specs of the tables to hold the parsed data, see ossmole_next.projects_lists, ossmole_next.list_html, and ossmole_next.lists_posts, as well as the ossmole_next.tracker* tables.