From: Arnaud K. <ax...@sa...> - 2002-11-15 14:04:47
Hi Jonathan

cra...@pc... wrote:

>Arnaud-
>
>>>Here is a case of what could happen on a given project:
>>>
>>>* The sequences would come from TIGR,
>>>* The gene models would come from SBRI,
>>>* The manual annotation of the gene models and the GO curation would
>>>be done by TIGR,
>>>* The curation would be done by the Sanger,
>>>* Some curated comments would be sent by members of the community.
>>>
>>>Instead of using the evidence table, would it be possible to attribute
>>>data by using the user_id attribute?
>>>
>
>We certainly want to be able to support the situation you describe, and
>although we have some working examples that are similar to this, I don't
>think that the way they're currently implemented (using a combination of
>external_db_id and the ProjectLink table) is necessarily the best answer,
>nor do I think it covers all the possibilities.
>
>I agree with you that the Evidence table is not ideal for this purpose,
>although I'm also not convinced that adapting the 'row_user_id' column
>will be sufficient either. Here are the problems/issues I see with doing
>attribution solely with the user_id:
>
> * Currently a user_id represents a single individual, not an organization.
>   We will be adding a pointer in the UserInfo table to Sres::Contact
>   and I believe that the Contact table *can* be used to represent whole
>   organizations. So by giving a particular user_id to a row in the
>   database we will also be associating it (indirectly) with whatever
>   organization the person in question works for (e.g., TIGR, for sequence
>   data generated there.) There are two problems with this. First, the
>   choice of user_id will likely be arbitrary. That is, should it be the
>   person in charge of the sequencing project at TIGR, or the person who
>   e-mailed the data to us?
>   So far we've tended to use the user_id to
>   reflect the identity of the person who actually loaded the data into
>   the database (typically someone in our lab., or the user_id of an
>   annotator or collaborator editing the database through a web interface.)
>   Second, if the person in question goes on to work for a different
>   organization/company, we can't easily reflect that change without
>   losing the association between the original data and the organization
>   that should get credit for generating them (short of "cloning" the person
>   in the UserInfo table!)
>   Anyway, my point here is that I think we'll want to be able to attribute
>   datasets both to individuals and also (directly) to organizations. Does
>   this sound like a reasonable requirement based on your use-case?
>

Yes, it sounds like it is

>   If so
>   then I think it implies that Sres::Contact might be a better table to use
>   than Sres::UserInfo.
>
>>>e.g. if the gene models are coming from SBRI, the user_id would
>>>acknowledge the gene features as owned by SBRI. Any update would keep
>>>the ownership and would acknowledge who's done the update.
>>>
>
> * This brings up my second point/question, which is whether we need to
>   support multiple attributions; I think your example suggests that we do.
>   For example, suppose that, as in your example, SBRI generates an initial
>   set of gene models. Then suppose that--as part of the manual annotation
>   process--an annotator at TIGR determines that one of the exons in one of
>   the SBRI gene models is incorrect (though not by much.) He/she adjusts
>   the 5' boundary of one exon accordingly. Who should now be cited as the
>   source of this gene model? I would say that it should be *both* SBRI and
>   TIGR, but this is not supported by the single 'row_user_id' in the
>   GeneFeature table. In general this is a problem for any kind of derived
>   data, or any data that is likely to be refined over time (like
>   gene models.)
>   Does this agree with what you mean by "manual annotation"?
>   Even if not, I think that we will want to support having multiple
>   attributions, because many of the "curated comments sent by members of
>   the community" that you mention are likely to be corrections to the gene
>   models based on various kinds of evidence, much of it experimental.
>
>Now, if you agree that "attribution" is something that should apply to either
>individuals or organizations, and that can be shared among one or more such
>entities,
>

We agree on this point too.

> then the question is how to represent this in the database. What
>I've argued for so far is to have a many-to-1 relationship between entries in
>the Sres::Contact table and any row in the database (meaning that a new linking
>table would have to be generated.) I would also leave the 'row_user_id' alone,
>using it--as we do now--to represent which user *owns* that row in the database,
>where the users are by definition those who have the ability to alter the
>database directly (by which I mean to include annotators working through an
>interface like Artemis or Apollo.) Does this sound reasonable?
>
>One question that I have not yet considered in detail is how this affects what
>we're currently doing with the ExternalDatabase table. In effect we've used
>this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also
>the more general notion of "externally-generated datasets." For example, the
>published Plasmodium falciparum genomic sequences from TIGR have their own
>ExternalDatabase entry that we use for the purposes of attribution, and the
>sequences from Sanger and Stanford have similar entries. I don't think we
>necessarily want to combine these two different parts of the schema, but there
>are clearly significant overlaps between them, if only because the institution
>that generates the data (Contact) is often also the source of that data (which
>is what the ExternalDatabase is supposed to represent.)
>But one could imagine
>cases in which we'd want to attribute a particular dataset to one organization
>(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded
>from another (e.g., GenBank or EMBL.) This is the case right now for the draft
>human and mouse genome sequence assemblies we're using, which were generated by
>NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
>files had been subjected to some reformatting. In terms of data provenance this
>is all information that it is crucial to track, and I think what I'm suggesting
>is that we use the Contact table (along with a new Attribution table of some
>sort) to record where the data *originally* came from and that we use the
>ExternalDatabase/ExternalDatabaseRelease tables to track where the data most
>recently resided before being entered into GUS.
>

I think it sounds reasonable. I don't think I have anything else to add.

>
>>>The other point was the attribution of data coming from publication or
>>>personal communication. I had a look at flybase. Flybase considers
>>>personal communication as references. To differentiate them, they have
>>>an extra attribute in the reference table to allow the classification
>>>of the different references.
>>>
>
>Yes, we should definitely extend Sres::BibliographicReference to make use of
>a controlled vocabulary for reference types, including personal communications
>(which should make use of an optional contact_id to specify the individual in
>question.) I'll make this change in the schema as well as adding the new
>Phenotype tables that you sent along.
>

Fine. How long do you reckon it's going to take to commit these schema
modifications?

>Jonathan
>

Cheers
Arnaud
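
For anyone reading this thread later, the linking-table idea above (several Contact entries attributed to one row, with 'row_user_id' left alone as the owner) could look roughly like the sketch below. It is only an illustration using Python's sqlite3 module: the Attribution table's columns (table_name, row_id, role), the is_organization flag, the sample row ids and the helper function names are all assumptions made up for the example, not the actual GUS schema.

```python
import sqlite3

# Hypothetical, simplified schema. Column names and layout are assumptions
# for illustration only; they are not the real GUS table definitions.
SCHEMA = """
CREATE TABLE Contact (
    contact_id      INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    is_organization INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE GeneFeature (
    gene_feature_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    row_user_id     INTEGER NOT NULL  -- owner of the row, left unchanged
);
CREATE TABLE Attribution (
    attribution_id INTEGER PRIMARY KEY,
    table_name     TEXT NOT NULL,     -- table holding the attributed row
    row_id         INTEGER NOT NULL,  -- primary key of the attributed row
    contact_id     INTEGER NOT NULL REFERENCES Contact(contact_id),
    role           TEXT,              -- e.g. 'gene model'
    UNIQUE (table_name, row_id, contact_id, role)
);
"""


def build_example():
    """Create an in-memory database with one gene model credited twice."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Contact VALUES (?, ?, ?)",
        [(1, "SBRI", 1), (2, "TIGR", 1)],
    )
    # row_user_id 42 stands for whoever loaded/owns the row (made-up id).
    conn.execute("INSERT INTO GeneFeature VALUES (10, 'example_gene', 42)")
    conn.executemany(
        "INSERT INTO Attribution (table_name, row_id, contact_id, role) "
        "VALUES (?, ?, ?, ?)",
        [
            ("GeneFeature", 10, 1, "gene model"),
            ("GeneFeature", 10, 2, "manual annotation"),
        ],
    )
    return conn


def attributions(conn, table_name, row_id):
    """Return (contact name, role) pairs credited for one row."""
    return conn.execute(
        "SELECT c.name, a.role FROM Attribution a "
        "JOIN Contact c ON c.contact_id = a.contact_id "
        "WHERE a.table_name = ? AND a.row_id = ? "
        "ORDER BY a.attribution_id",
        (table_name, row_id),
    ).fetchall()


if __name__ == "__main__":
    conn = build_example()
    print(attributions(conn, "GeneFeature", 10))
```

In this sketch both SBRI and TIGR are credited for the same GeneFeature row, while 'row_user_id' still records only who owns the row. Tracking where the data most recently resided would stay with ExternalDatabase/ExternalDatabaseRelease and is not modelled here.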