While I agree that the idea of community is important, there is one significant issue that has compelled me to collect my own personal dumps of various data sources: the need to combine data from disparate sources.
Unless OssMole is looking to become the one and only repository of Open Source data, I think there will always be a need to allow researchers to import the data into their own systems for combination and further analysis.
I think James's 80/20 suggestion would work well here, since many users would be satisfied with the standard web access, and those who weren't would still be able to use the data locally.
On 10/20/06, James Howison <jhowison@syr.edu> wrote:

On Oct 20, 2006, at 8:51 AM, Kevin Crowston wrote:

> I think we should think carefully about giving the data as dumps--
> it's almost like forking the project, since you end up with lots of
> disconnected piles.

Yes, I know what you mean, it's a concern of mine too.  Let's
brainstorm a little on how to do it.

> Part of the goal of the project is to develop a community around
> the data, which is hard to do if everyone is off on their own. On
> the other hand, what people are asking for, better access, is
> perfectly reasonable. So, can we satisfy those needs, but still
> keep people connecting?

A restricted-access phpMyAdmin instance would be a decent start,
giving people much more flexibility in viewing the structure of the
database, etc.  Of course, it has an interface for exporting the
results of queries, so the tendency would be for people to take
their queries away as text files, thereby creating data silos at a
different level.  (It also has the ability to save queries, so we
could make datasets dynamically available that way.)
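To make the stored-queries idea concrete, here's a minimal sketch of
datasets defined by named queries that the server runs on demand, so
people share the query rather than a frozen dump.  sqlite3 stands in
here for the real MySQL database behind phpMyAdmin, and the table
and query names are invented for illustration:

```python
import sqlite3

# Each "dataset" is a named query, not a frozen export; re-running it
# always reflects the current state of the shared database.
STORED_QUERIES = {
    "active_projects": "SELECT name FROM projects WHERE commits > 0",
}

# In-memory sqlite3 database as a stand-in for the shared server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, commits INTEGER)")
conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [("gnome", 120), ("dormant", 0)])

def run_stored(name):
    """Run a named stored query and return its rows."""
    return conn.execute(STORED_QUERIES[name]).fetchall()

print(run_stored("active_projects"))  # → [('gnome',)]
```

The point is that two researchers citing "active_projects" are
talking about the same thing, instead of two diverging text files.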

Giving direct server access to the data is always a possibility,
but we don't really gain much from that: people still do their
analyses in separate sandboxes, except now we're providing the
bandwidth and their connections are slower :(

What we really want is for people to be able to share, and to
actually share, their analyses, yet there is an abundance of tool
diversity out there.  Just amongst ourselves, for example, we use
Perl, Java, Ruby, R, and some other SNA libraries ...

Perhaps we could follow the microformats 80/20 rule and choose a
toolchain that will satisfy 80% of the audience, while providing data
in other ways to the 20% who need more.  Except that those will
probably be the interesting 20%!

Maybe the way forward is to document the crap out of our own
toolchains, so that it is easier to start with ours as templates.

Maybe even a basic format for describing these toolchains:
1.  Starting query
2.  Cleaning approach
3.  Analysis
       a. Which application
       b. What model used
4.  Graphics generation
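A hypothetical entry in that format, captured as structured metadata
so it could live on the wiki or alongside the analysis scripts; every
value here is invented purely for illustration:

```python
# One researcher's toolchain described in the four-part format above.
# All field values are made-up examples, not a real OssMole analysis.
toolchain = {
    "starting_query": "SELECT project, developer FROM commit_log",
    "cleaning": "drop projects with fewer than 2 developers",
    "analysis": {
        "application": "R",
        "model": "degree centrality on the developer network",
    },
    "graphics": "plot of the centrality distribution",
}

# A shared format makes toolchains comparable at a glance:
for step, detail in toolchain.items():
    print(f"{step}: {detail}")
```

Even a plain-text version of this on the wiki would let someone pick
up another group's pipeline at the step where their question diverges.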

It's hard to anticipate what people want to do in research, but we
should definitely provide more documentation and a wiki (with
userIDs) to encourage a community around that.

I suspect that there will be some useful models in the
collaboratories that Borgman and Olson and Olson have studied, but
this project has some interesting differences from regular FLOSS
projects, in that we aren't working towards a unitary source tree
and/or binary.  Maybe we should be; maybe that unitary source tree
is the crucial coordinating artifact, and we need to enforce doing
things in a particular way, becoming not everything to everyone but
a particular gravitational center.



Ossmole-discuss mailing list