From: <cra...@pc...> - 2002-11-13 21:03:59
Arnaud-

> > Here a case of what could happen on a given project:
> >
> > * The sequences would come from TIGR,
> > * The gene models would come from SBRI,
> > * The manual annotation of the gene models and the GO curation would
> > be done by TIGR,
> > * The curation would be done by the Sanger,
> > * Some curated comments would be sent by members of the community.
> >
> > Instead of using the evidence table, would it be possible to attribute
> > data by using the user_id attribute ?

We certainly want to be able to support the situation you describe, and although we have some working examples that are similar to this, I don't think that the way they're currently implemented (using a combination of external_db_id and the ProjectLink table) is necessarily the best answer, nor do I think it covers all the possibilities. I agree with you that the Evidence table is not ideal for this purpose, although I'm also not convinced that adapting the 'row_user_id' column will be sufficient either. Here are the problems/issues I see with doing attribution solely with the user_id:

* Currently a user_id represents a single individual, not an organization. We will be adding a pointer in the UserInfo table to Sres::Contact and I believe that the Contact table *can* be used to represent whole organizations. So by giving a particular user_id to a row in the database we will also be associating it (indirectly) with whatever organization the person in question works for (e.g., TIGR, for sequence data generated there.)

There are two problems with this. First, the choice of user_id will likely be arbitrary. That is, should it be the person in charge of the sequencing project at TIGR, or the person who e-mailed the data to us? So far we've tended to use the user_id to reflect the identity of the person who actually loaded the data into the database (typically someone in our lab., or the user_id of an annotator or collaborator editing the database through a web interface.)
Second, if the person in question goes on to work for a different organization/company, we can't easily reflect that change without losing the association between the original data and the organization that should get credit for generating them (short of "cloning" the person in the UserInfo table!)

Anyway, my point here is that I think we'll want to be able to attribute datasets both to individuals and also (directly) to organizations. Does this sound like a reasonable requirement based on your use-case? If so then I think it implies that Sres::Contact might be a better table to use than Sres::UserInfo.

> > e.g. if the gene models are coming from SBRI, the user_id would
> > acknowledge the gene features as owned by SBRI. Any update would keep
> > the ownership and would acknowledge who's done the update.

* This brings up my second point/question, which is whether we need to support multiple attributions; I think your example suggests that we do. For example, suppose that, as in your example, SBRI generates an initial set of gene models. Then suppose that--as part of the manual annotation process--an annotator at TIGR determines that one of the exons in one of the SBRI gene models is incorrect (though not by much.) He/she adjusts the 5' boundary of one exon accordingly. Who should now be cited as the source of this gene model? I would say that it should be *both* SBRI and TIGR, but this is not supported by the single 'row_user_id' in the GeneFeature table. In general this is a problem for any kind of derived data, or any data that is likely to be refined over time (like gene models.)

Does this agree with what you mean by "manual annotation"? Even if not, I think that we will want to support having multiple attributions, because many of the "curated comments sent by members of the community" that you mention are likely to be corrections to the gene models based on various kinds of evidence, much of it experimental.

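To make the multiple-attribution case concrete, here is a minimal sketch using Python's sqlite3 module. The table and column names (contact, gene_feature, attribution) are hypothetical simplifications for illustration and are not the actual GUS schema; the only point is that a single 'row_user_id' column can hold one value, whereas a separate linking table can credit both SBRI and TIGR for the same gene model.

```python
import sqlite3

# Hypothetical, simplified tables for illustration only; these are NOT the
# real GUS tables, and all names and ids below are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL          -- an individual or an organization
);
CREATE TABLE gene_feature (
    gene_feature_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    row_user_id     INTEGER NOT NULL  -- single owner of the row
);
CREATE TABLE attribution (
    gene_feature_id INTEGER NOT NULL REFERENCES gene_feature(gene_feature_id),
    contact_id      INTEGER NOT NULL REFERENCES contact(contact_id),
    PRIMARY KEY (gene_feature_id, contact_id)
);
""")

conn.executemany("INSERT INTO contact VALUES (?, ?)", [(1, "SBRI"), (2, "TIGR")])
# The row is owned by whichever user loaded it (user id 42 is made up).
conn.execute("INSERT INTO gene_feature VALUES (1, 'gene_model_1', 42)")
# SBRI generated the model; a TIGR annotator later adjusted one exon boundary,
# so both contacts are credited for the same row.
conn.executemany("INSERT INTO attribution VALUES (?, ?)", [(1, 1), (1, 2)])


def attributed_to(conn, gene_feature_id):
    """Return the names of all contacts credited for one gene feature."""
    rows = conn.execute(
        "SELECT c.name FROM attribution a "
        "JOIN contact c ON c.contact_id = a.contact_id "
        "WHERE a.gene_feature_id = ? ORDER BY c.name",
        (gene_feature_id,),
    ).fetchall()
    return [r[0] for r in rows]


print(attributed_to(conn, 1))  # ['SBRI', 'TIGR']
```

In this sketch the owner of the row (row_user_id) and the parties credited for the data are recorded independently, so changing one does not disturb the other.
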
Now, if you agree that "attribution" is something that should apply to either individuals or organizations, and that can be shared among one or more such entities, then the question is how to represent this in the database. What I've argued for so far is to have a many-to-1 relationship between entries in the Sres::Contact table and any row in the database (meaning that a new linking table would have to be generated.) I would also leave the 'row_user_id' alone, using it--as we do now--to represent which user *owns* that row in the database, where the users are by definition those who have the ability to alter the database directly (by which I mean to include annotators working through an interface like Artemis or Apollo.) Does this sound reasonable?

One question that I have not yet considered in detail is how this affects what we're currently doing with the ExternalDatabase table. In effect we've used this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also the more general notion of "externally-generated datasets." For example, the published Plasmodium falciparum genomic sequences from TIGR have their own ExternalDatabase entry that we use for the purposes of attribution, and the sequences from Sanger and Stanford have similar entries. I don't think we necessarily want to combine these two different parts of the schema, but there are clearly significant overlaps between them, if only because the institution that generates the data (Contact) is often also the source of that data (which is what the ExternalDatabase is supposed to represent.) But one could imagine cases in which we'd want to attribute a particular dataset to one organization (e.g., RIKEN) but record the fact that the data was actually obtained/downloaded from another (e.g., GenBank or EMBL.)
This is the case right now for the draft human and mouse genome sequence assemblies we're using, which were generated by NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the files had been subjected to some reformatting. In terms of data provenance this is all information that it is crucial to track, and I think what I'm suggesting is that we use the Contact table (along with a new Attribution table of some sort) to record where the data *originally* came from and that we use the ExternalDatabase/ExternalDatabaseRelease tables to track where the data most recently resided before being entered into GUS.

> > The other point was the attribution of data coming from publication or
> > personal communication. I had a look at flybase. Flybase considers
> > personal communication as references. To differentiate them, they have
> > an extra attribute in the reference table to allow the classification
> > of the different references.

Yes, we should definitely extend Sres::BibliographicReference to make use of a controlled vocabulary for reference types, including personal communications (which should make use of an optional contact_id to specify the individual in question.) I'll make this change in the schema as well as adding the new Phenotype tables that you sent along.

Jonathan

--
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive
Philadelphia, PA 19104-6021
215-573-3115