Re: [Gusdev-gusdev] [Fwd: Attribution in GUS]

Arnaud-

> > Here a case of what could happen on a given project:
> >
> > * The sequences would come from TIGR,
> > * The gene models would come from SBRI,
> > * The manual annotation of the gene models and the GO curation would
> > be done by TIGR,
> > * The curation would be done by the Sanger,
> > * Some curated comments would be sent by members of the community.
> >
> > Instead of using the evidence table, would it be possible to attribute
> > data by using the user_id attribute ?

We certainly want to be able to support the situation you describe, and 
although we have some working examples that are similar to this, I don't
think that the way they're currently implemented (using a combination of
external_db_id and the ProjectLink table) is necessarily the best answer,
nor do I think it covers all the possibilities.

I agree with you that the Evidence table is not ideal for this purpose, 
although I'm also not convinced that adapting the 'row_user_id' column
will be sufficient either.  Here are the problems/issues I see with doing
attribution solely with the user_id:

  * Currently a user_id represents a single individual, not an organization.
    We will be adding a pointer in the UserInfo table to Sres::Contact 
    and I believe that the Contact table *can* be used to represent whole 
    organizations.  So by giving a particular user_id to a row in the 
    database we will also be associating it (indirectly) with whatever 
    organization the person in question works for (e.g., TIGR, for sequence
    data generated there.)  There are two problems with this.  First, the 
    choice of user_id will likely be arbitrary.  That is, should it be the
    person in charge of the sequencing project at TIGR, or the person who
    e-mailed the data to us?  So far we've tended to use the user_id to 
    reflect the identity of the person who actually loaded the data into 
    the database (typically someone in our lab, or an annotator or
    collaborator editing the database through a web interface.)  
    Second, if the person in question goes on to work for a different 
    organization/company, we can't easily reflect that change without 
    losing the association between the original data and the organization 
    that should get credit for generating them (short of "cloning" the person
    in the UserInfo table!)
    Anyway, my point here is that I think we'll want to be able to attribute 
    datasets both to individuals and also (directly) to organizations.  Does
    this sound like a reasonable requirement based on your use-case?  If so
    then I think it implies that Sres::Contact might be a better table to use
    than Sres::UserInfo.

> > e.g. if the gene models are coming from SBRI, the user_id would
> > acknowledge the gene features as owned by SBRI. Any update would keep
> > the ownership and would acknowledge who's done the update.

  * This brings up my second point/question, which is whether we need to 
    support multiple attributions; I think your example suggests that we do. 
    For example, suppose that, as in your example, SBRI generates an initial 
    set of gene models.  Then suppose that--as part of the manual annotation 
    process--an annotator at TIGR determines that one of the exons in one of 
    the SBRI gene models is incorrect (though not by much.)  He/she adjusts 
    the 5' boundary of one exon accordingly.  Who should now be cited as the 
    source of this gene model?  I would say that it should be *both* SBRI and 
    TIGR, but this is not supported by the single 'row_user_id' in the 
    GeneFeature table.  In general this is a problem for any kind of derived 
    data, or any data that is likely to be refined over time (like
    gene models.)  Does this agree with what you mean by "manual annotation"?
    Even if not, I think that we will want to support having multiple 
    attributions, because many of the "curated comments sent by members of
    the community" that you mention are likely to be corrections to the gene
    models based on various kinds of evidence, much of it experimental.

Now, if you agree that "attribution" is something that should apply to either 
individuals or organizations, and that can be shared among one or more such
entities, then the question is how to represent this in the database.  What 
I've argued for so far is to have a many-to-1 relationship between entries in 
the Sres::Contact table and any row in the database (meaning that a new linking
table would have to be created.)  I would also leave the 'row_user_id' alone, 
using it--as we do now--to represent which user *owns* that row in the database, 
where the users are by definition those who have the ability to alter the 
database directly (by which I mean to include annotators working through an 
interface like Artemis or Apollo.)  Does this sound reasonable?
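
A minimal sketch of the linking-table idea described above, using an in-memory
SQLite database.  The Attribution table, its columns (table_name, row_id), and
all sample rows are assumptions made for illustration; they are not part of
the existing schema.  The point is only that one GeneFeature row can be
credited to both SBRI and TIGR while 'row_user_id' keeps its current meaning.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE GeneFeature (
    gene_feature_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    row_user_id     INTEGER NOT NULL   -- still the owner of the row
);
CREATE TABLE Attribution (             -- assumed new linking table
    attribution_id INTEGER PRIMARY KEY,
    contact_id     INTEGER NOT NULL REFERENCES Contact(contact_id),
    table_name     TEXT NOT NULL,      -- which table the credited row is in
    row_id         INTEGER NOT NULL,   -- primary key of the credited row
    UNIQUE (contact_id, table_name, row_id)
);
""")

# Made-up sample data: one gene model credited to two organizations.
conn.executemany("INSERT INTO Contact VALUES (?, ?)", [(1, "SBRI"), (2, "TIGR")])
conn.execute("INSERT INTO GeneFeature VALUES (100, 'gene_model_1', 7)")
conn.executemany(
    "INSERT INTO Attribution (contact_id, table_name, row_id) VALUES (?, ?, ?)",
    [(1, "GeneFeature", 100), (2, "GeneFeature", 100)],
)

def attributed_contacts(conn, table_name, row_id):
    """Return the names of all contacts credited with the given row."""
    rows = conn.execute(
        """SELECT c.name
           FROM Attribution a
           JOIN Contact c ON c.contact_id = a.contact_id
           WHERE a.table_name = ? AND a.row_id = ?
           ORDER BY c.name""",
        (table_name, row_id),
    ).fetchall()
    return [r[0] for r in rows]

print(attributed_contacts(conn, "GeneFeature", 100))  # ['SBRI', 'TIGR']
```

Keying the link on (table_name, row_id) is just one way to let it point at
"any row in the database"; a per-table linking table would be an equally
plausible design.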

One question that I have not yet considered in detail is how this affects what
we're currently doing with the ExternalDatabase table.  In effect we've used 
this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also 
the more general notion of "externally-generated datasets."  For example, the
published Plasmodium falciparum genomic sequences from TIGR have their own 
ExternalDatabase entry that we use for the purposes of attribution, and the
sequences from Sanger and Stanford have similar entries.  I don't think we 
necessarily want to combine these two different parts of the schema, but there
are clearly significant overlaps between them, if only because the institution
that generates the data (Contact) is often also the source of that data (which
is what the ExternalDatabase is supposed to represent.)  But one could imagine 
cases in which we'd want to attribute a particular dataset to one organization 
(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded 
from another (e.g., GenBank or EMBL.)  This the case right now for the draft 
human and mouse genome sequence assemblies we're using, which were generated by 
NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
files had been subjected to some reformatting.  In terms of data provenance this
is all information that it is crucial to track, and I think what I'm suggesting 
is that we use the Contact table (along with a new Attribution table of some
sort) to record where the data *originally* came from and that we use the 
ExternalDatabase/ExternalDatabaseRelease tables to track where the data most 
recently resided before being entered into GUS.
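
To make the split between "who generated it" and "where we got it" concrete,
here is a rough sketch under stated assumptions: the Dataset and Attribution
tables, every column name, and the release version strings are hypothetical
placeholders, and only the NCBI/MGSC/UCSC roles come from the example above.

```python
import sqlite3

# Hypothetical tables and columns for illustration; not the actual GUS schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (contact_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ExternalDatabase (
    external_database_id INTEGER PRIMARY KEY,
    name                 TEXT NOT NULL
);
CREATE TABLE ExternalDatabaseRelease (
    external_database_release_id INTEGER PRIMARY KEY,
    external_database_id INTEGER NOT NULL
        REFERENCES ExternalDatabase(external_database_id),
    version TEXT NOT NULL
);
CREATE TABLE Dataset (                 -- assumed stand-in for any loaded data
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    external_database_release_id INTEGER NOT NULL
        REFERENCES ExternalDatabaseRelease(external_database_release_id)
);
CREATE TABLE Attribution (             -- assumed: who originally generated it
    contact_id INTEGER NOT NULL REFERENCES Contact(contact_id),
    dataset_id INTEGER NOT NULL REFERENCES Dataset(dataset_id),
    PRIMARY KEY (contact_id, dataset_id)
);
""")

conn.executemany("INSERT INTO Contact VALUES (?, ?)", [(1, "NCBI"), (2, "MGSC")])
conn.execute("INSERT INTO ExternalDatabase VALUES (1, 'UCSC')")
# Version strings are placeholders, not real release identifiers.
conn.executemany(
    "INSERT INTO ExternalDatabaseRelease VALUES (?, ?, ?)",
    [(1, 1, "placeholder-human"), (2, 1, "placeholder-mouse")],
)
conn.executemany(
    "INSERT INTO Dataset VALUES (?, ?, ?)",
    [(1, "draft human assembly", 1), (2, "draft mouse assembly", 2)],
)
conn.executemany("INSERT INTO Attribution VALUES (?, ?)", [(1, 1), (2, 2)])

def provenance(conn, dataset_name):
    """Return (originating contacts, database the data was obtained from)."""
    row = conn.execute(
        """SELECT d.dataset_id, db.name
           FROM Dataset d
           JOIN ExternalDatabaseRelease r
             ON r.external_database_release_id = d.external_database_release_id
           JOIN ExternalDatabase db
             ON db.external_database_id = r.external_database_id
           WHERE d.name = ?""",
        (dataset_name,),
    ).fetchone()
    if row is None:
        return None
    dataset_id, obtained_from = row
    origins = [
        r[0]
        for r in conn.execute(
            """SELECT c.name
               FROM Attribution a
               JOIN Contact c ON c.contact_id = a.contact_id
               WHERE a.dataset_id = ?
               ORDER BY c.name""",
            (dataset_id,),
        )
    ]
    return origins, obtained_from

print(provenance(conn, "draft human assembly"))  # (['NCBI'], 'UCSC')
```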

> > The other point was the attribution of data coming from publication or
> > personal communication. I had a look at flybase. Flybase considers
> > personal communication as references. To differentiate them, they have
> > an extra attribute in the reference table to allow the classification
> > of the different references.

Yes, we should definitely extend Sres::BibliographicReference to make use of
a controlled vocabulary for reference types, including personal communications
(which should make use of an optional contact_id to specify the individual in
question.)  I'll make this change in the schema as well as adding the new 
Phenotype tables that you sent along.
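
One possible shape for that extension, sketched with SQLite.  The BibRefType
table, the column names, and the sample rows are assumptions for illustration
only; the actual schema change may look different.  The contact_id column is
nullable, so it is only filled in for references such as personal
communications.

```python
import sqlite3

# Hypothetical schema: names and sample data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (contact_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE BibRefType (              -- assumed controlled vocabulary
    bib_ref_type_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL UNIQUE
);
CREATE TABLE BibliographicReference (
    bibliographic_reference_id INTEGER PRIMARY KEY,
    bib_ref_type_id INTEGER NOT NULL REFERENCES BibRefType(bib_ref_type_id),
    contact_id      INTEGER REFERENCES Contact(contact_id),  -- optional
    title           TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO BibRefType VALUES (?, ?)",
    [(1, "journal article"), (2, "personal communication")],
)
conn.execute("INSERT INTO Contact VALUES (1, 'Example Annotator')")
conn.executemany(
    "INSERT INTO BibliographicReference VALUES (?, ?, ?, ?)",
    [
        (1, 1, None, "Example published paper"),
        (2, 2, 1, "Comment on an exon boundary"),
    ],
)

def references_of_type(conn, type_name):
    """Return (title, contact name or None) for references of one type."""
    return conn.execute(
        """SELECT b.title, c.name
           FROM BibliographicReference b
           JOIN BibRefType t ON t.bib_ref_type_id = b.bib_ref_type_id
           LEFT JOIN Contact c ON c.contact_id = b.contact_id
           WHERE t.name = ?
           ORDER BY b.bibliographic_reference_id""",
        (type_name,),
    ).fetchall()

print(references_of_type(conn, "personal communication"))
```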

Jonathan

-- 
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
215-573-3115