Re: [Gmod-schema] Re: Tables with text fields in unique key

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi All,

First, I apologize for the shudder of horror which seeing this thread 
resurrected in your inbox has probably caused some of you.

Unfortunately, we need to settle this issue before FlyBase can go 
forward with implementing our genetic data in chado, so: deep breath, 
center, and exhale...

Aubrey and I chatted today, and we had agreed that FlyBase would go 
forward with the "denormalization" option of adding a pub_id fkey to 
featureprop, and making the unique key be (feature_id, type_id, 
pub_id).  I was about to write a note announcing the Big News, but did 
some exploration of the data we already have in chado_dmel, and 
discussed this whole thing with Pinglei, and its clear now that this is 
not a workable solution for three reasons:

1) Because we already have (many, many) cases in chado_dmel where we 
have multiple featureprops for a given feature of the same feature.type 
and from a single publication.  Some of these originated from apollo, 
but we have them from other sources as well.

2) Because apollo itself allows multiple featureprops (specifically, of 
type "comment"), which are not differentiated by anything like a 
publication.  So that in order to implement this plan, we would have to 
have corresponding changes in apollo, which is, I think, unreasonable to 
expect.

3) Because we would prefer the solution we adopt to work also for other 
property tables like pubprop, and this solution certainly would not.  
That is to say, we can't make a pubprop.pub_id column.

The other options we discussed were: 1) to adopt the use of functions as 
below, and 2) to add a "value_checksum" column, which would contain, for 
example, the md5checksum for whatever string is in the value column, and 
could be included in the unique index.

It seemed that the general consensus was that we shouldn't go the 
value.checksum route, as the column would have to be not null in order 
to be useful as part of any unique key, and we don't want to burden 
everyone who would use chado with calculating checksums for every insert 
into a -prop table.

So we've decided that our best option is to go forward with the plan 
below, using rank in the unique key, and following Chris's advice to put 
functions on the server side.  This is not beautiful, but it will work.

I also think this is a fairly flexible option, since it will allow chado 
users who can maintain rank to use chado & XORT in the usual 
straight-forward manner, and also let groups like FB, who can't or don't 
store property ranks, to use XORT, albeit with the overhead cost of 
using those functions.

I'd love to hear what anybody has to say about this.  If there are any 
other options we've missed, please don't be shy about bringing them on!

Best,

-Dave

On 02/11/05 at 17:35 Chris Mungall wrote:
>
>
>  On Fri, 11 Feb 2005, Pinglei Zhou wrote:
>
>  > Hi Aubrey, Chris,
>  >
>  >
>  >
>  > >Do you mean file formats that don't support attribution?  I don't 
> see
>  > >that as a problem, as such file formats won't expect to support the 
> full
>  > >richness of FlyBase data.
>  > >
>  > >> Change XORT
>  > >
>  > >Right.  I think all that remains in this thread is to determine 
> whether
>  > >there are XORT fixes for this that are sufficiently easy that it's 
> OK to
>  > >ask Pinglei to implement them.  Pinglei is the one to speak to 
> that.
>  >
>  > So what you expect is XORT can deal with some arbitary SQL, right ? 
> I
>  > just have quick code, I think its doable, see attached sample file 
> for
>  > reference. here I add tag '_sql', which will be one select stm(if
>  > needed, could be multiple stm, eg. a few insertion, delete, than 
> LAST
>  > one be select stm, separate by ';' ?). so now we have a few ways to
>  > specify data:data self, foreign key reference, and SQL result.
>  >
>  > See the attached sample, try to insert new featureprop which never
>  > overwrite old one. One problem I can see here, what happen it return
>  > nothing ?
>  >
>  > Anyway, does this help to solve some of your problem ? If not, what 
> we
>  > need to improve ?
>
>  Hi Pinglei
>
>  This looks like a useful addition to xort. Your attached example xml 
> deals
>  with the use case where we wish to append to a list. I don't think 
> your
>  sql is adequately constrained - you need to also constrain feature by
>  organism_id and the cvterm by cv_id to be safe.
>
>  We also have to deal with Aubrey's use case which was providing an
>  update/delete for a fp that is additonally constrained by 
> featureprop_pub
>
>  This can be accommodated by extending your sql:
>
>  [i'm using pseudo-sql with * to indicate the appropriate join syntax]
>
>  <rank>
>  <_sql>
>  select max(fp.rank)+1
>  from
>    feature f * organism o * cvterm ft * cv ftcv
>    featureprop fp * cvterm fptype * cv fptypecv *
>    featureprop_pub fpp * pub * cvterm pubtype * cv pubtypecv
>  where o.genus='Drosophila' and o.species='melanogaster'
>   and  f.uniquename='CGxxxx' and fp.name='gene' and 
> ftcv.name='sequence'
>   and  fptype.name='...' and fptypecv.name='props'
>   and  pub.uniquename='FBrf...' and pubtype.name='..'
>   and  pubtypecv.name='pub'
>  </_sql>
>  </_rank>
>
>  Wow!
>
>  This may seem over the top, but none of the above joins can be safely
>  omitted
>
>  It seems to go against the whole rationale for chadoxml and xort to 
> start
>  embedding this in the xml. If Aubrey (or whoever the aubrey equivalent 
> is
>  in any MOD :) is going to write sql, he may as well write the whole
>  transaction set in sql rather than an awkward mix of xml and sql.
>
>  What about my solution of calling a function? That way the complex sql 
> is
>  hidden in the implementation, which removes the need to embed it in 
> the
>  data producer code and the xml
>
>  The spec may look something like this:
>
>   -- create a new set of values, or append to existing set
>   add_fp(feature_id,type_id,value,pub_id)
>
>   -- ditto, but with no provenance data
>   add_fp(feature_id,type_id,value)
>
>   -- remove a fp
>   del_fp(feature_id,type_id,value,pub_id)
>
>   -- remove all fps of a specific type from a certain source
>   del_fp(feature_id,type_id,pub_id)
>
>  chadoxml function calls would then retain the same structure as 
> existing
>  chadoxml, which seems nice and simple and elegant. Function arguments
>  could be macros or expanded nodes.
>
>  Of course, someone would still have to write the pgplsql for the
>  functions, but this is fairly simple.
>
>  I strongly believe that the complex sql belongs on the server side, 
> either
>  in the DBMS or hidden in a middleware layer. In fact, I'd rather see 
> the
>  schema denormalized than this (although denormalization doesn't solve 
> the
>  rank problem, unless we get rid of rank altogether and forget about 
> having
>  a UC for fp).
>
>  There are other possible solutions. For example, an additional 
> transform
>  between the xml produced by the data producer and the xml that is 
> finally
>  loaded; xort could allow 'hooks' into server-specific perl code that 
> would
>  be evaluated on a node. This is essentially the same as the dbms 
> functions
>  solution, the difference being the code sits in front of the dbms 
> rather
>  than behind it.
>
>  I don't want to seem intransigent or anything - FB should adopt 
> whatever
>  solution is best for them, other MODs can do it their own way. 
> However,
>  like Aubrey says, we should focus on getting this right, even if it is 
> a
>  painful process...
>
>  c
>
>  > Cheers,
>  >
>  >
>  > Pinglei
>  >