Re: [Gusdev-gusdev] [Fwd: Attribution in GUS]


Hi Jonathan

cra...@pc... wrote:

>Arnaud-
>
>  
>
>>>Here is a case of what could happen on a given project:
>>>
>>>* The sequences would come from TIGR,
>>>* The gene models would come from SBRI,
>>>* The manual annotation of the gene models and the GO curation would
>>>be done by TIGR,
>>>* The curation would be done by the Sanger,
>>>* Some curated comments would be sent by members of the community.
>>>
>>>Instead of using the evidence table, would it be possible to attribute
>>>data by using the user_id attribute?
>>>      
>>>
>
>We certainly want to be able to support the situation you describe, and 
>although we have some working examples that are similar to this, I don't
>think that the way they're currently implemented (using a combination of
>external_db_id and the ProjectLink table) is necessarily the best answer,
>nor do I think it covers all the possibilities.
>
>I agree with you that the Evidence table is not ideal for this purpose, 
>although I'm also not convinced that adapting the 'row_user_id' column
>will be sufficient either.  Here are the problems/issues I see with doing
>attribution solely with the user_id:
>
>  * Currently a user_id represents a single individual, not an organization.
>    We will be adding a pointer in the UserInfo table to Sres::Contact 
>    and I believe that the Contact table *can* be used to represent whole 
>    organizations.  So by giving a particular user_id to a row in the 
>    database we will also be associating it (indirectly) with whatever 
>    organization the person in question works for (e.g., TIGR, for sequence
>    data generated there.)  There are two problems with this.  First, the 
>    choice of user_id will likely be arbitrary.  That is, should it be the
>    person in charge of the sequencing project at TIGR, or the person who
>    e-mailed the data to us?  So far we've tended to use the user_id to 
>    reflect the identity of the person who actually loaded the data into 
>    the database (typically someone in our lab, or the user_id of an 
>    annotator or collaborator editing the database through a web interface.)  
>    Second, if the person in question goes on to work for a different 
>    organization/company, we can't easily reflect that change without 
>    losing the association between the original data and the organization 
>    that should get credit for generating them (short of "cloning" the person
>    in the UserInfo table!)
>    Anyway, my point here is that I think we'll want to be able to attribute 
>    datasets both to individuals and also (directly) to organizations.  Does
>    this sound like a reasonable requirement based on your use-case? 
>
Yes, it sounds like it is.

> If so
>    then I think it implies that Sres::Contact might be a better table to use
>    than Sres::UserInfo.
>
>  
>
>>>e.g. if the gene models are coming from SBRI, the user_id would
>>>acknowledge the gene features as owned by SBRI. Any update would keep
>>>the ownership and would acknowledge who's done the update.
>>>      
>>>
>
>  * This brings up my second point/question, which is whether we need to 
>    support multiple attributions; I think your example suggests that we do. 
>    For example, suppose that, as in your example, SBRI generates an initial 
>    set of gene models.  Then suppose that--as part of the manual annotation 
>    process--an annotator at TIGR determines that one of the exons in one of 
>    the SBRI gene models is incorrect (though not by much.)  He/she adjusts 
>    the 5' boundary of one exon accordingly.  Who should now be cited as the 
>    source of this gene model?  I would say that it should be *both* SBRI and 
>    TIGR, but this is not supported by the single 'row_user_id' in the 
>    GeneFeature table.  In general this is a problem for any kind of derived 
>    data, or any data that is likely to be refined over time (like
>    gene models.)  Does this agree with what you mean by "manual annotation"?
>    Even if not, I think that we will want to support having multiple 
>    attributions, because many of the "curated comments sent by members of
>    the community" that you mention are likely to be corrections to the gene
>    models based on various kinds of evidence, much of it experimental.
>
>Now, if you agree that "attribution" is something that should apply to either 
>individuals or organizations, and that can be shared among one or more such
>entities,
>
We agree on this point too.

> then the question is how to represent this in the database.  What 
>I've argued for so far is to have a many-to-1 relationship between entries in 
>the Sres::Contact table and any row in the database (meaning that a new linking
>table would have to be generated.)  I would also leave the 'row_user_id' alone, 
>using it--as we do now--to represent which user *owns* that row in the database, 
>where the users are by definition those who have the ability to alter the 
>database directly (by which I mean to include annotators working through an 
>interface like Artemis or Apollo.)  Does this sound reasonable?
>
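A minimal sketch of the linking table proposed above, assuming a
hypothetical Attribution table that points at rows by (table_name, row_id).
The table and column names are illustrative, not the actual GUS schema, and
SQLite stands in for the real database.  row_user_id is left untouched as
the owner of the row, while any number of contacts can share the credit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (                -- stands in for Sres::Contact
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL          -- an individual or an organization
);
CREATE TABLE GeneFeature (            -- heavily simplified
    gene_feature_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    row_user_id     INTEGER NOT NULL  -- owner of the row, unchanged
);
CREATE TABLE Attribution (            -- hypothetical linking table
    attribution_id INTEGER PRIMARY KEY,
    table_name     TEXT NOT NULL,     -- table holding the attributed row
    row_id         INTEGER NOT NULL,  -- primary key of the attributed row
    contact_id     INTEGER NOT NULL REFERENCES Contact(contact_id),
    UNIQUE (table_name, row_id, contact_id)
);
""")

conn.executemany("INSERT INTO Contact VALUES (?, ?)",
                 [(1, "SBRI"), (2, "TIGR")])
# user 42 is an arbitrary example owner
conn.execute("INSERT INTO GeneFeature VALUES (1, 'gene_model_1', 42)")
# SBRI produced the model and TIGR refined it, so both are credited
conn.executemany(
    "INSERT INTO Attribution (table_name, row_id, contact_id) VALUES (?, ?, ?)",
    [("GeneFeature", 1, 1), ("GeneFeature", 1, 2)],
)


def attributed_to(conn, table_name, row_id):
    """Return the names of all contacts credited for one row."""
    rows = conn.execute(
        "SELECT c.name FROM Attribution a "
        "JOIN Contact c ON c.contact_id = a.contact_id "
        "WHERE a.table_name = ? AND a.row_id = ? ORDER BY c.name",
        (table_name, row_id),
    )
    return [name for (name,) in rows]


print(attributed_to(conn, "GeneFeature", 1))
```

The UNIQUE constraint only stops the same contact from being credited twice
for the same row; it does not limit how many contacts a row can have.
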
>One question that I have not yet considered in detail is how this affects what
>we're currently doing with the ExternalDatabase table.  In effect we've used 
>this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also 
>the more general notion of "externally-generated datasets."  For example, the
>published Plasmodium falciparum genomic sequences from TIGR have their own 
>ExternalDatabase entry that we use for the purposes of attribution, and the
>sequences from Sanger and Stanford have similar entries.  I don't think we 
>necessarily want to combine these two different parts of the schema, but there
>are clearly significant overlaps between them, if only because the institution
>that generates the data (Contact) is often also the source of that data (which
>is what the ExternalDatabase is supposed to represent.)  But one could imagine 
>cases in which we'd want to attribute a particular dataset to one organization 
>(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded 
>from another (e.g., GenBank or EMBL.)  This is the case right now for the draft 
>human and mouse genome sequence assemblies we're using, which were generated by 
>NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
>files had been subjected to some reformatting.  In terms of data provenance this
>is all information that it is crucial to track, and I think what I'm suggesting 
>is that we use the Contact table (along with a new Attribution table of some
>sort) to record where the data *originally* came from and that we use the 
>ExternalDatabase/ExternalDatabaseRelease tables to track where the data most 
>recently resided before being entered into GUS.
>  
>
I think it sounds reasonable. I don't have anything else to add.
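
A rough sketch of the split described above, with Contact plus an
Attribution table recording where the data originally came from and
ExternalDatabase/ExternalDatabaseRelease recording where it was obtained.
The Dataset table, the column names and the release version string are
assumptions made for the example, not the actual GUS schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE ExternalDatabase (
    external_database_id INTEGER PRIMARY KEY,
    name                 TEXT NOT NULL
);
CREATE TABLE ExternalDatabaseRelease (
    external_database_release_id INTEGER PRIMARY KEY,
    external_database_id INTEGER NOT NULL
        REFERENCES ExternalDatabase(external_database_id),
    version TEXT NOT NULL
);
CREATE TABLE Dataset (                -- hypothetical stand-in for any row
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    external_database_release_id INTEGER NOT NULL
        REFERENCES ExternalDatabaseRelease(external_database_release_id)
);
CREATE TABLE Attribution (            -- hypothetical: who generated the data
    dataset_id INTEGER NOT NULL REFERENCES Dataset(dataset_id),
    contact_id INTEGER NOT NULL REFERENCES Contact(contact_id),
    PRIMARY KEY (dataset_id, contact_id)
);
""")

db.executemany("INSERT INTO Contact VALUES (?, ?)", [(1, "NCBI"), (2, "MGSC")])
db.execute("INSERT INTO ExternalDatabase VALUES (1, 'UCSC')")
# the version string is a placeholder, not a real release name
db.execute("INSERT INTO ExternalDatabaseRelease VALUES (1, 1, 'example-release')")
db.executemany("INSERT INTO Dataset VALUES (?, ?, ?)", [
    (1, "draft human genome assembly", 1),
    (2, "draft mouse genome assembly", 1),
])
db.executemany("INSERT INTO Attribution VALUES (?, ?)", [(1, 1), (2, 2)])


def provenance(db, dataset_id):
    """Return (originators, obtained_from) for one dataset."""
    originators = [name for (name,) in db.execute(
        "SELECT c.name FROM Attribution a "
        "JOIN Contact c ON c.contact_id = a.contact_id "
        "WHERE a.dataset_id = ? ORDER BY c.name", (dataset_id,))]
    (obtained_from,) = db.execute(
        "SELECT e.name FROM Dataset d "
        "JOIN ExternalDatabaseRelease r "
        "  ON r.external_database_release_id = d.external_database_release_id "
        "JOIN ExternalDatabase e "
        "  ON e.external_database_id = r.external_database_id "
        "WHERE d.dataset_id = ?", (dataset_id,)).fetchone()
    return originators, obtained_from


print(provenance(db, 1))
print(provenance(db, 2))
```

With this layout the two questions stay separate: the Attribution rows
answer "who generated this?" and the release pointer answers "where did we
get the files?".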

>  
>
>>>The other point was the attribution of data coming from publications or
>>>personal communications. I had a look at FlyBase. FlyBase treats
>>>personal communications as references. To differentiate them, it has
>>>an extra attribute in the reference table that classifies the
>>>different kinds of reference.
>>>      
>>>
>
>Yes, we should definitely extend Sres::BibliographicReference to make use of
>a controlled vocabulary for reference types, including personal communications
>(which should make use of an optional contact_id to specify the individual in
>question.)  I'll make this change in the schema as well as adding the new 
>Phenotype tables that you sent along.
>  
>
Fine. How long do you reckon it will take to commit these schema 
modifications?
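
For illustration only, one way the extended reference table could look,
assuming a hypothetical ReferenceType vocabulary table and a nullable
contact_id on the reference.  All names, titles and the contact are made up
for the example, and the real change to Sres::BibliographicReference may
well differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE ReferenceType (          -- hypothetical controlled vocabulary
    reference_type_id INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE
);
CREATE TABLE BibliographicReference ( -- heavily simplified
    bibliographic_reference_id INTEGER PRIMARY KEY,
    title             TEXT NOT NULL,
    reference_type_id INTEGER NOT NULL
        REFERENCES ReferenceType(reference_type_id),
    -- optional: only filled in for personal communications
    contact_id        INTEGER REFERENCES Contact(contact_id)
);
""")

db.executemany("INSERT INTO ReferenceType VALUES (?, ?)", [
    (1, "journal article"),
    (2, "personal communication"),
])
db.execute("INSERT INTO Contact VALUES (1, 'A. Curator')")  # made-up contact
db.executemany("INSERT INTO BibliographicReference VALUES (?, ?, ?, ?)", [
    (1, "Example paper title", 1, None),
    (2, "Comment on an exon boundary", 2, 1),
])


def personal_communications(db):
    """Return (title, contact name) for each personal communication."""
    return db.execute(
        "SELECT r.title, c.name FROM BibliographicReference r "
        "JOIN ReferenceType t ON t.reference_type_id = r.reference_type_id "
        "LEFT JOIN Contact c ON c.contact_id = r.contact_id "
        "WHERE t.name = 'personal communication' "
        "ORDER BY r.bibliographic_reference_id"
    ).fetchall()


print(personal_communications(db))
```

The LEFT JOIN keeps a personal communication in the result even when no
contact has been recorded for it, which matches contact_id being optional.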

>Jonathan
>
>  
>
Cheers
Arnaud