From: <cra...@pc...> - 2002-11-13 21:03:59
Arnaud-

> > Here a case of what could happen on a given project:
> >
> > * The sequences would come from TIGR,
> > * The gene models would come from SBRI,
> > * The manual annotation of the gene models and the GO curation would
> > be done by TIGR,
> > * The curation would be done by the Sanger,
> > * Some curated comments would be sent by members of the community.
> >
> > Instead of using the evidence table, would it be possible to attribute
> > data by using the user_id attribute ?
We certainly want to be able to support the situation you describe, and
although we have some working examples that are similar to this, I don't
think that the way they're currently implemented (using a combination of
external_db_id and the ProjectLink table) is necessarily the best answer,
nor do I think it covers all the possibilities.

I agree with you that the Evidence table is not ideal for this purpose,
although I'm also not convinced that adapting the 'row_user_id' column
will be sufficient either. Here are the problems/issues I see with doing
attribution solely with the user_id:
* Currently a user_id represents a single individual, not an organization.
We will be adding a pointer in the UserInfo table to Sres::Contact
and I believe that the Contact table *can* be used to represent whole
organizations. So by giving a particular user_id to a row in the
database we will also be associating it (indirectly) with whatever
organization the person in question works for (e.g., TIGR, for sequence
data generated there.) There are two problems with this. First, the
choice of user_id will likely be arbitrary. That is, should it be the
person in charge of the sequencing project at TIGR, or the person who
e-mailed the data to us? So far we've tended to use the user_id to
reflect the identity of the person who actually loaded the data into
the database (typically someone in our lab, or the user_id of an
annotator or collaborator editing the database through a web interface.)
Second, if the person in question goes on to work for a different
organization/company, we can't easily reflect that change without
losing the association between the original data and the organization
that should get credit for generating them (short of "cloning" the person
in the UserInfo table!)

Anyway, my point here is that I think we'll want to be able to attribute
datasets both to individuals and also (directly) to organizations. Does
this sound like a reasonable requirement based on your use-case? If so
then I think it implies that Sres::Contact might be a better table to use
than Sres::UserInfo.

> > e.g. if the gene models are coming from SBRI, the user_id would
> > acknowledge the gene features as owned by SBRI. Any update would keep
> > the ownership and would acknowledge who's done the update.
* This brings up my second point/question, which is whether we need to
support multiple attributions; I think your example suggests that we do.
For example, suppose that, as in your example, SBRI generates an initial
set of gene models. Then suppose that--as part of the manual annotation
process--an annotator at TIGR determines that one of the exons in one of
the SBRI gene models is incorrect (though not by much.) He/she adjusts
the 5' boundary of one exon accordingly. Who should now be cited as the
source of this gene model? I would say that it should be *both* SBRI and
TIGR, but this is not supported by the single 'row_user_id' in the
GeneFeature table. In general this is a problem for any kind of derived
data, or any data that is likely to be refined over time (like
gene models.) Does this agree with what you mean by "manual annotation"?
Even if not, I think that we will want to support having multiple
attributions, because many of the "curated comments sent by members of
the community" that you mention are likely to be corrections to the gene
models based on various kinds of evidence, much of it experimental.

Now, if you agree that "attribution" is something that should apply to either
individuals or organizations, and that can be shared among one or more such
entities, then the question is how to represent this in the database. What
I've argued for so far is to have a many-to-1 relationship between entries in
the Sres::Contact table and any row in the database (meaning that a new linking
table would have to be generated.) I would also leave the 'row_user_id' alone,
using it--as we do now--to represent which user *owns* that row in the database,
where the users are by definition those who have the ability to alter the
database directly (by which I mean to include annotators working through an
interface like Artemis or Apollo.) Does this sound reasonable?
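
To make the linking-table idea concrete, here is a minimal sketch using
Python's sqlite3 module. It is only an illustration under stated assumptions,
not the GUS schema: the Attribution table, its columns, and the simplified
Contact and GeneFeature columns (other than row_user_id) are hypothetical
names chosen for the example.

```python
import sqlite3

# Hypothetical, simplified stand-ins for Sres::Contact and GeneFeature.
# Every column except row_user_id is an assumption made for this sketch.
SCHEMA = """
CREATE TABLE Contact (
    contact_id      INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    is_organization INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE GeneFeature (
    gene_feature_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    row_user_id     INTEGER NOT NULL   -- owner of the row, left as it is today
);
CREATE TABLE Attribution (             -- hypothetical linking table
    attribution_id INTEGER PRIMARY KEY,
    table_name     TEXT NOT NULL,      -- table holding the attributed row
    row_id         INTEGER NOT NULL,   -- primary key of the attributed row
    contact_id     INTEGER NOT NULL REFERENCES Contact(contact_id),
    UNIQUE (table_name, row_id, contact_id)
);
"""


def build_example():
    """Create an in-memory database with one gene model credited to two contacts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Contact VALUES (?, ?, ?)",
                     [(1, "SBRI", 1), (2, "TIGR", 1)])
    # row_user_id 7 stands for whoever loaded or edited the row.
    conn.execute("INSERT INTO GeneFeature VALUES (100, 'gene model A', 7)")
    conn.executemany(
        "INSERT INTO Attribution (table_name, row_id, contact_id) VALUES (?, ?, ?)",
        [("GeneFeature", 100, 1), ("GeneFeature", 100, 2)])
    conn.commit()
    return conn


def attributions(conn, table_name, row_id):
    """Return the names of all contacts credited for one row, sorted by name."""
    rows = conn.execute(
        "SELECT c.name FROM Attribution a "
        "JOIN Contact c ON c.contact_id = a.contact_id "
        "WHERE a.table_name = ? AND a.row_id = ? ORDER BY c.name",
        (table_name, row_id))
    return [r[0] for r in rows]
```

In this sketch the gene model keeps a single row_user_id for ownership, while
attributions(conn, "GeneFeature", 100) lists both SBRI and TIGR as sources.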
One question that I have not yet considered in detail is how this affects what
we're currently doing with the ExternalDatabase table. In effect we've used
this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also
the more general notion of "externally-generated datasets." For example, the
published Plasmodium falciparum genomic sequences from TIGR have their own
ExternalDatabase entry that we use for the purposes of attribution, and the
sequences from Sanger and Stanford have similar entries. I don't think we
necessarily want to combine these two different parts of the schema, but there
are clearly significant overlaps between them, if only because the institution
that generates the data (Contact) is often also the source of that data (which
is what the ExternalDatabase is supposed to represent.) But one could imagine
cases in which we'd want to attribute a particular dataset to one organization
(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded
from another (e.g., GenBank or EMBL.) This is the case right now for the draft
human and mouse genome sequence assemblies we're using, which were generated by
NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
files had been subjected to some reformatting. In terms of data provenance, all
of this information is crucial to track, and I think what I'm suggesting
is that we use the Contact table (along with a new Attribution table of some
sort) to record where the data *originally* came from and that we use the
ExternalDatabase/ExternalDatabaseRelease tables to track where the data most
recently resided before being entered into GUS.
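
The split between "who generated the data" and "where we got it" could be
sketched roughly as follows, again with Python's sqlite3 module. Only the
table names Contact, ExternalDatabase and ExternalDatabaseRelease come from
the discussion above; the Dataset and Attribution tables and all column names
are assumptions made for the example.

```python
import sqlite3

# Simplified, hypothetical schema. Dataset, Attribution and every column
# name are assumptions, not the actual GUS definitions.
SCHEMA = """
CREATE TABLE Contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE ExternalDatabase (
    external_db_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL
);
CREATE TABLE ExternalDatabaseRelease (
    external_db_release_id INTEGER PRIMARY KEY,
    external_db_id INTEGER NOT NULL REFERENCES ExternalDatabase(external_db_id),
    description    TEXT
);
CREATE TABLE Dataset (
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    -- where the data most recently resided before being loaded
    external_db_release_id INTEGER
        REFERENCES ExternalDatabaseRelease(external_db_release_id)
);
CREATE TABLE Attribution (
    -- who originally generated the data
    dataset_id INTEGER NOT NULL REFERENCES Dataset(dataset_id),
    contact_id INTEGER NOT NULL REFERENCES Contact(contact_id),
    PRIMARY KEY (dataset_id, contact_id)
);
"""


def build_example():
    """Load the human/mouse assembly case: generated by NCBI/MGSC, fetched from UCSC."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Contact VALUES (?, ?)",
                     [(1, "NCBI"), (2, "MGSC")])
    conn.execute("INSERT INTO ExternalDatabase VALUES (1, 'UCSC')")
    conn.execute(
        "INSERT INTO ExternalDatabaseRelease VALUES (1, 1, 'reformatted assembly files')")
    conn.executemany("INSERT INTO Dataset VALUES (?, ?, ?)",
                     [(1, "draft human genome assembly", 1),
                      (2, "draft mouse genome assembly", 1)])
    conn.executemany("INSERT INTO Attribution VALUES (?, ?)", [(1, 1), (2, 2)])
    conn.commit()
    return conn


def provenance(conn, dataset_id):
    """Return (originating contacts, database the data was obtained from)."""
    originators = [r[0] for r in conn.execute(
        "SELECT c.name FROM Attribution a "
        "JOIN Contact c ON c.contact_id = a.contact_id "
        "WHERE a.dataset_id = ? ORDER BY c.name", (dataset_id,))]
    obtained_from = conn.execute(
        "SELECT d.name FROM Dataset s "
        "JOIN ExternalDatabaseRelease r "
        "  ON r.external_db_release_id = s.external_db_release_id "
        "JOIN ExternalDatabase d ON d.external_db_id = r.external_db_id "
        "WHERE s.dataset_id = ?", (dataset_id,)).fetchone()[0]
    return originators, obtained_from
```

Here the two questions are answered by separate joins, so a dataset can be
credited to one organization while its download source is recorded
independently.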
> > The other point was the attribution of data coming from publication or
> > personal communication. I had a look at flybase. Flybase considers
> > personal communication as references. To differentiate them, they have
> > an extra attribute in the reference table to allow the classification
> > of the different references.
Yes, we should definitely extend Sres::BibliographicReference to make use of
a controlled vocabulary for reference types, including personal communications
(which should make use of an optional contact_id to specify the individual in
question.) I'll make this change in the schema as well as adding the new
Phenotype tables that you sent along.
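
As a rough sketch of what such an extension might look like (not the actual
schema change), the following uses Python's sqlite3 module. The BibRefType
table, the column names, and the sample rows are all hypothetical; only the
ideas of a controlled vocabulary of reference types and an optional contact_id
come from the text above.

```python
import sqlite3

# Hypothetical sketch: BibRefType and all column names are assumptions.
SCHEMA = """
CREATE TABLE Contact (
    contact_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE BibRefType (              -- controlled vocabulary of reference types
    bib_ref_type_id INTEGER PRIMARY KEY,
    name            TEXT NOT NULL UNIQUE
);
CREATE TABLE BibliographicReference (
    bibliographic_reference_id INTEGER PRIMARY KEY,
    title           TEXT NOT NULL,
    bib_ref_type_id INTEGER NOT NULL REFERENCES BibRefType(bib_ref_type_id),
    contact_id      INTEGER REFERENCES Contact(contact_id)  -- optional
);
"""


def build_example():
    """Create an in-memory database with one paper and one personal communication."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO BibRefType VALUES (?, ?)",
                     [(1, "journal article"), (2, "personal communication")])
    conn.execute("INSERT INTO Contact VALUES (1, 'A. Researcher')")  # made-up name
    conn.executemany(
        "INSERT INTO BibliographicReference (title, bib_ref_type_id, contact_id) "
        "VALUES (?, ?, ?)",
        [("A published paper", 1, None),
         ("Note on an exon boundary", 2, 1)])
    conn.commit()
    return conn


def personal_communications(conn):
    """Return (title, contact name) for every personal communication."""
    return conn.execute(
        "SELECT b.title, c.name FROM BibliographicReference b "
        "JOIN BibRefType t ON t.bib_ref_type_id = b.bib_ref_type_id "
        "LEFT JOIN Contact c ON c.contact_id = b.contact_id "
        "WHERE t.name = 'personal communication'").fetchall()
```

Because the type is a foreign key into the vocabulary table, a reference with
an unknown type id is rejected, and contact_id stays NULL for ordinary
publications.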
Jonathan
--
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
215-573-3115