[Gmod-schema] Re: feature_cvtermprop and analogues

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi All,

Having resolved the "Tables with text fields in unique key" thread, we 
need to push through this cvterm/dbxref thread.

I'll try to be brief, as Chris is probably holding his breath.

I've chatted with Aubrey about this, and while he has some fairly 
compelling arguments as to why the cv and dbxref modules could be merged 
into one, he agrees with me that at this point in the game, the cost of 
making such a fundamental change in the schema would be much greater 
than any benefit we might realize.

The cvterm/dbxref thread got started because FlyBase needs to be able to 
do these two things with dbxrefs which are not presently supported:

1) link properties (eg, comments) to the relationship between a feature 
(eg, "gene"), and a dbxref (eg, GB-acc#).

2) maintain relationships between dbxrefs, for example, the relationship 
between a GenBank nucleotide record accession number and the protein_ids 
for proteins coded in that GenBank record.

I propose that we add a feature_dbxrefprop table to the sequence module 
to accomodate the former, and a dbxref_relationship table to the dbxref 
module to accomodate the latter.

Chris suggested that we should consider using feature & 
feature_relationship for managing the relations between these acc#.  
Under some circumstances, this might be a proper approach, but for what 
FlyBase needs to do, I think it would be extremely heavy.  All we care 
about are the acc#; in order to implement these as features, we'd have 
to do quite a bit more work than just know the acc#.  Also, we'd be in a 
position of having dbxref relationships happening via two routes, 
feature_dbxref and feature_relationship, which I'm not too keen on.

This is getting long, and Chris is probably turning blue, so I'll stop 
here.  Unless there are objections, I propose to go forward adding these 
two tables to sourceforge early next week.

Breathe!

-Dave

On 02/07/05 at 07:22 Aubrey de Grey FlyBase account wrote:
>
>  Hi Chris,
>
>  > Is interpro really central to this?
>
>  Yes, of course it is!  It's the only way that I've succeeded in
> showing
>  you that the distinction between what is an ontology and what isn't is
>  "perhaps fuzzier than I originally made out it was".  Now all I
> need to
>  do is go a teeny bit further and show that that distinction is in fact
>  a carbuncle on the face of chado -- a valueless, arbitrary matter of
>  pure taste that has no business being enshrined in the structure of a
>  database with the ambitions of chado.  More to the point I need to
>  demonstrate that enshrining it (by having cvterm and dbxref as
> separate
>  tables) has a bunch of big disadvantages in actual practice that
> we can
>  already see as we implement Cambridge data in chado, and no remotely
>  comparable advantages.
>
>  > For chado normalization and accurate modeling comes first. In
> general
>  > you'll have to do a lot more joins to get the same data from
> chado than
>  > most other dbs. This doesn't mean chado isn't all-purpose, but
> it does
>  > sacrifice speed for normalization.
>
>  Makes fine sense (except that normalisation can perhaps be taken
> too far,
>  as we're seeing in the other thread).  But then if we can get
> fewer joins
>  WITHOUT sacrificing normalisation, that's still a big plus, isn't it?
>  And this proposal has no effect on normalisation whatever (see below).
>
>  > > Please can you also elaborate on the disadvantages you
> mention above.
>  > > What are they?
>  >
>  > non-uniformity (all other dbxref/IDs are stored in a dbxref
> table, apart
>  > from GO/OBO dbxrefs/IDs)
>
>  I have no objection to the merged table being called dbxref
> rather than
>  cvterm if that makes people happier, but maybe that's not what
> you mean.
>  In the proposal I'm making, all dbxref/IDs are stored in one
> table -- so
>  please elaborate.
>
>  Also, hang on, GO dbxref/IDs **are** stored in dbxref in the
> chadoXML I'm
>  getting from Pinglei -- what do you mean here?
>
>  > denormalisation (the GO/OBO dbxref/ID is collapsed into a
> single string,
>  > making it necessary to do string match queries to ask
> questions like "how
>  > many terms in GO vs SO" (though in practice the same result can be
>  > obtained by using the equivalent of cv_id, although as noted
> before this
>  > isn't always 1:1).
>
>  um, you seem to be answering yourself here and pointing out that
> we don't
>  have to do string matches.
>
>  > it also means any kind of automated ID granting has to
>  > be mediated via code, which isn't a problem since this is all
> handled
>  > ex-database for GO anyway)
>
>  So you're saying this is not a disadvantage elther?  So what
> **are** these
>  alleged disadvantages (apart from "non-uniformity", which I need
> you to
>  explain more thoroughly)?
>
>  > > > First of all, your example sql query will return X01803
> because you said
>  > > > that this has the name ATPG418
>  > >
>  > > No, not unless we populate FlyBase with GenBank LOCUS
> fields, which we can
>  > > choose whether or not to do and have historically not done.
>  >
>  > But a change in policy would break existing queries that are purely
>  > cv-oriented and have nothing to do with genbank locus fields
> per se. Sign
>  > of a problem
>
>  Not at all! -- changes in policy always break things that rely
> on the old
>  policy.  If a query wants to exclude GenBank it can exclude it
> by saying
>  so explicitly, rather than by saying "only use rows that have a
> non-null
>  name".  GenBank is no different from Interpro here -- we might
> start off
>  not populating cvterm for Interpro and then change the policy,
> and results
>  of "cv-oriented" queries would change.
>
>  > > No - cv_id and accession would both be non-nullable.  name
> would be
>  > > nullable but that's fine, we don't need it in the UC.
>  >
>  > on the contrary, this is a significantly loss if we can't
> guarantee that
>  > name and cv_id can uniquely identify a (non-obsolete) term
>  >
>  > I think this is enough to kill the proposal
>
>  Hang on -- you were saying a few rounds ago that names should
> never be used
>  for such things, only IDs should:
>
>  > You should only supply the dbxref, and the dbxref only. If you
> want to
>  > provide cvterm/name, it could be problemmatic. What if you're
> supplying
>  > GO annotations and your version of GO is lagging behind the
> actual GO?
>  > An out-of-date name from you will clobber the up-to-date name in the
>  > database.
>
>  and I agreed with this (and still do), even though it means CVs
> that are
>  not currently in OBO being put there.  Please give us a concrete
> example
>  where one would want to use name+cv_id as a unique constraint.
>
>  > I'm still in the dark as to what you're proposing
>  > is it
>  > table cvterm
>  > 	cvterm_id PK
>  > 	name	  ??NOT NULL
>  > 	cv_id	  NOT NULL
>  > 	db_id	  NOT NULL
>  > 	accession NOT NULL
>  >
>  > 	??UNIQUE (name, cv_id)
>  > 	UNIQUE (db_id, accession)
>
>  No, it's this:
>
>  table cvterm
>  	cvterm_id PK
>  	name	
>  	cv_id	  NOT NULL
>  	accession NOT NULL
>
>  	UNIQUE (cv_id, accession)
>
>  No db_id here, because that's in the cv table.  So not this:
>
>  > db(1,GO)
>  > db(2,GenBank)
>  > cv(1,cellular component)
>  > cv(2,nucleotide)
>  > cvterm(1,synaptic blah,1,1,0008021)
>  > cvterm(2,aptg418,2,2,X01803)
>
>  but rather:
>
>  db(1,GO)
>  db(2,GenBank)
>  cv(1,cellular component,db_id=1)
>  cv(2,nucleotide,db_id=2)
>  cvterm(1,synaptic blah,1,0008021)
>  cvterm(2,<null>,2,X01803)  <--- "aptg418" can be added later if
> we choose
>
>  > OK, a lot of the time with interpro, you just know the dbxref,
> and maybe
>  > that's all you want to populate. You have the option of
> loading the whole
>  > interpro ontology at some later time, as cvterms, and then
> filling in all
>  > your extra info.
>
>  OK, so just to be totally explicit, your answer to my question:
>
>  >    Where in the current chado schema would you put Interpro names?
>
>  is "in cvterm.name", right?
>
>  > This is analagous to features - often you'll only know the
> dbxref, and
>  > that's all you'll populate. Then you may want to load the equivalent
>  > genbank records, giving you fully populated feature rows.
>  >
>  > I've tried to explain that the main difference is in the
> desired meaning
>  > of the table; this part can't be explained in SQL. The cvterm
> table is
>  > more modeling concepts. The dbxref table is for modeling
> addresses. A
>  > cvterm has a dbxref. Other things can have dbxrefs too. A
> feature can have
>  > a dbxref. Where the dbxref_id is the UC for a table, the
> dbxref (and thus
>  > its stringified UC) can stand in for a particular piece of
> data, whether
>  > that piece of data is a feature or a concept.
>  >
>  > GenBank dbxrefs stand for features, GO dbxrefs stand for
> concepts/cvterms.
>
>  And Interpro dbxrefs stand for, hm, something in between.  This
> is the core
>  of the problem.  It was only in this thread that you first
> suggested the
>  idea of putting GenBank records in feature, and you immediately
> pointed out
>  yourself that that is actually rather fraught:
>
>  > Note that this will make things trickier for the
> apollo/annotation side
>  > of chado, because these links will a little like alternate
> spliceforms.
>
>  and it's actually worse than that, because (eg) a single GenBank
> nucleotide
>  record can represent more than one sequence, via the "variation"
> feature.
>
>  Consider this generally (since chado is supposed to be
> all-purpose).  What
>  the current schema requires is that any controlled data type
> which has ANY
>  of the following features:
>
>  - an inheritance hierarchy (relationships between its own items)
>  - relationships to items of other controlled data types
>  - names for its items that one would like to store in addition
> to the IDs
>  - links to features that need attribution
>  - properties of its items or of the links of its items to features
>
>  has to be instantiated either as a CV (using cv and cvterm) or
> as a type
>  of feature (using the feature table) as well as being a database with
>  db and dbxref.  This is necessary because cvterm and feature have
>  *_relationship, feature_cvterm has a pub_id column, and feature and
>  feature_cvterm both have *prop, whereas dbxref has none of those
> things.
>  But, feature can only store mappable objects.  So, a controlled data
>  type consisting of things that are not mappable has to be
> treated as an
>  ontology unless we want to ordain that it shall never have ANY of the
>  five features listed above.  Whereas, as far as I can tell (though you
>  still haven't really said), you think that something should not be in
>  cvterm unless it either has an inheritance hierarchy or is stored in
>  OBO.  Conversely, even GenBank *has* an inheritance hierarchy -- the
>  CON records subsume smaller ones -- so it qualifies to go into cvterm
>  even by your definition, doesn't it?
>
>  --------------------
>
>  So what we're left with at this point is that the proposal to
> merge the
>  cvterm and dbxref tables has the following substantial advantages:
>
>  - fewer joins
>  - simpler XML, simpler code everywhere
>  - no need to force GenBank into feature
>  - no need to decide in advance whether something is going to be
> treated
>    as an ontology or not
>
>  and no disadvantages at all except that all CVs have to be in a
> database
>  somewhere, default being OBO, which we agree is actually a
> constraint we
>  should be enforcing anyway so that queries/updates are done on IDs and
>  not on names.
>
>  If you've got other disadvantages to raise, bring them on; eg I
> hope now
>  you can give examples of this:
>
>  > it makes all kinds of basic ontology queries harder and weirder.
>
>  Cheers A