One thought that comes to mind is that the developers column in the stats table may not mean what you think it means. Those tables are calculated tables, created at various intervals by the back-end system and stored in the database, which is then dumped monthly to Notre Dame. There are no firm answers in the SourceForge documentation regarding the database schema. However, my impression from what I have read is that the development statistics are based on CVS and SVN commits. It could be that the developers column in that table tracks the number of developers who committed to the source repository over the corresponding time period. If so, it would make sense for this number to always be less than or equal to the number of developers indicated by the FLOSSMole dataset (and, based on existing research, usually strictly less than the FLOSSMole number).
You may try this query which pulls the data from the UND database that corresponds to the data in the FLOSSMole tables you mention:
select count(distinct group_id) group_count
from (select g.group_id,
             count(distinct ug.user_id) as user_count
      from groups g,
           user_group ug
      where g.is_public = 1
        and g.status = 'A'
        and ug.group_id = g.group_id
      group by g.group_id) a
where user_count between 6 and 10
The subquery counts the number of users per project for all projects that are public and active. The main query counts the number of projects from the subquery that have between 6 and 10 users. The numbers do not match exactly. From this query I get 4335 projects compared to your 4322, but that's a difference I'm comfortable chalking up to different data collection cycles.
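If you want to sanity-check the logic of that query without access to the UND dump, here is a minimal sketch that runs it against a toy in-memory SQLite database. The table and column names are the ones used in the query above; the five projects and their member counts are invented purely for illustration, and SQLite is only a stand-in for whatever database you load the dump into.

```python
import sqlite3

# Toy stand-ins for the UND tables. Column names follow the query above;
# the rows are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table groups (group_id integer primary key, is_public integer, status text);
    create table user_group (user_id integer, group_id integer);
""")

# (group_id, is_public, status, number of members)
projects = [
    (1, 1, "A", 7),   # public, active, 7 members: counted
    (2, 1, "A", 10),  # boundary value: counted, since BETWEEN is inclusive
    (3, 1, "A", 3),   # too few members: not counted
    (4, 0, "A", 8),   # not public: not counted
    (5, 1, "D", 6),   # not active: not counted
]
for group_id, is_public, status, members in projects:
    conn.execute("insert into groups values (?, ?, ?)", (group_id, is_public, status))
    conn.executemany(
        "insert into user_group values (?, ?)",
        [(group_id * 100 + i, group_id) for i in range(members)],
    )

# Same query as above, unchanged.
QUERY = """
select count(distinct group_id) group_count
from (select g.group_id,
             count(distinct ug.user_id) as user_count
      from groups g,
           user_group ug
      where g.is_public = 1
        and g.status = 'A'
        and ug.group_id = g.group_id
      group by g.group_id) a
where user_count between 6 and 10
"""
group_count = conn.execute(QUERY).fetchone()[0]
print(group_count)  # prints 2: only projects 1 and 2 qualify
```

Note that the bucket boundaries are inclusive, so a project with exactly 6 or exactly 10 members lands in this bucket. If the FLOSSMole side bins its counts differently, that alone could shift a few projects between buckets.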
Hope that helps.
On 5/21/07, Megan Conklin wrote:
Kevin Crowston wrote:
> The only thing that comes to mind is a different definition of
> developers in the two data sets--it seems that flossmole's dev_count is
> greater than Notre Dame's:

Just to be clear, dev_count in the projects table is not a calculated
field. Rather, it is simply the number of developers as reported on
the main project page.

When our spider "clicks" into the devs page for a project and gathers
the specific devs, those go into the developers and dev_projects
tables, the theory being that if you add up the number of devs for
each project, it would probably match the dev_count value.

To test this, I looked at a few of the projects you sent, such as ac3:

FLOSSMole data:

ac3     10

Notre Dame's data:


the number '10' in our data set is of course the dev_count value which
comes from their project page.

I then ran this query to see exactly who the devs are:

FROM developer_projects
WHERE datasource_id =57
AND proj_unixname = 'ac3'

And indeed there are 10 devs listed for this project:

hib (is_admin)
jansb000 (is_admin)

2 are marked as admins as you see there. But then Kevin says:

> I just looked at the acmemail page and it does say 8 developers, not 6
> as in the ND data. Odd...  I was hypothesizing that the ND data doesn't
> include admins, but acmemail has 3 admins, not 2. So it's a mystery...

That is very strange. Here are the devs listed for acmemail:

acme (is_admin)
peterw (is_admin)
wim (is_admin)

So yeah, 8 devs with 3 admins. These are the ones listed on the web
site (still), so I'm not sure what to tell you about the ND data.

Let us know when you find out!

