Re: [Gusdev-gusdev] GUS 3.0 schema changes

Arnaud -

> > Which DNA/RNA features do you mean (other than those mentioned above)?
>
> The file I sent you should include views on top of the NAFeatureImp
> table. Here is the list:

Yes, you're absolutely right; there was a period when I wasn't paying very
close attention to the schema mailing list, and I'm afraid I misplaced a
couple of the files you sent, at least temporarily.  I believe I've
now added all the views and tables that you originally proposed, with
some minor modifications to take into account discussions we've had since
then.  See the attached text file for a complete list of the changes I've
made this time around.

> Yes we had! So regarding chromosome regions, shall we keep
> TelomereFeature and CentromereFeature ?

No, I think we should use ChromosomeElementFeature instead; I've created
this view based on the ChromosomeElement view you suggested, but with a
couple of additional columns to handle the data currently in
gusdev.TelomereFeature and gusdev.CentromereFeature.

> > At
> > the other extreme, we could continue what we're doing now, i.e. using
> > an ad-hoc classification of features based on the data we actually have
> > available, and just make sure that every feature is tagged with the
> > correct sequence ontology term.  Any thoughts?
>
> It makes sense as SO may undergo revisions this year.

OK, as noted in the attachment, I've added sequence_ontology_id to *all*
views of NAFeatureImp and AAFeatureImp.

> >> A controlled vocabulary table with the four attributes you've
> >> mentioned is fine.

Done; it's called ProteinPropertyType, and the schema/contents are
described in the attached list of changes.

> >> As you're going to add an extra attribute sequence_ontology_id to the
> >> NA Features, could you do the same for any AA Features?

OK, done.

> The way SignalPeptideFeature is designed makes it difficult to annotate
> localization signal features. We can leave SignalPeptideFeature as it
> is, since it fits SignalP software predictions, and in the future
> create a new feature, LocalizationSignalFeature.

OK, based on our discussion today the only change I've made to
SignalPeptideFeature is to add the sequence_ontology_id, which can be
used to reference the different localization ontology terms that you
mentioned.  A column has been added to SequenceOntology to let us store
multiple ontologies (and versions thereof) in the same table.
Experimental evidence, references, and annotator's comments can be linked
to SignalPeptideFeature (or a future LocalizationSignalFeature view) using
DoTS.Evidence.
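
As a rough illustration of how one SequenceOntology table could hold terms
from several ontologies and versions, here is a minimal sketch using
Python's built-in sqlite3 module. It is not the actual GUS DDL: the column
names ontology_name and ontology_version, the term names, and the version
strings are all assumptions made up for the example, and
SignalPeptideFeature is modelled as a plain table rather than as a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout: ontology_name and ontology_version stand in for the
-- (unnamed) column that distinguishes ontologies and their versions.
CREATE TABLE SequenceOntology (
    sequence_ontology_id INTEGER PRIMARY KEY,
    ontology_name        TEXT NOT NULL,
    ontology_version     TEXT NOT NULL,
    term_name            TEXT NOT NULL
);
-- Simplified stand-in for the SignalPeptideFeature view.
CREATE TABLE SignalPeptideFeature (
    aa_feature_id        INTEGER PRIMARY KEY,
    sequence_ontology_id INTEGER REFERENCES SequenceOntology(sequence_ontology_id)
);
-- Made-up example rows.
INSERT INTO SequenceOntology VALUES (1, 'SO', 'v1', 'signal_peptide');
INSERT INTO SequenceOntology VALUES (2, 'localization', 'v1', 'example_localization_term');
INSERT INTO SignalPeptideFeature VALUES (1, 1);
INSERT INTO SignalPeptideFeature VALUES (2, 2);
""")


def term_for_feature(conn, aa_feature_id):
    """Return (ontology_name, ontology_version, term_name) for one feature."""
    sql = """
        SELECT so.ontology_name, so.ontology_version, so.term_name
        FROM SignalPeptideFeature f
        JOIN SequenceOntology so
          ON so.sequence_ontology_id = f.sequence_ontology_id
        WHERE f.aa_feature_id = ?
    """
    return conn.execute(sql, (aa_feature_id,)).fetchone()


so_term = term_for_feature(conn, 1)
localization_term = term_for_feature(conn, 2)
```

With this layout the same sequence_ontology_id column on a feature can
point at a term from either ontology, and the join recovers which ontology
and version the term came from.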

> >> I reckon they could be merged.

(This comment was in reference to incorporating TM domain features into
the DomainFeature view.)  I've added a "number_of_domains" column to
DomainFeature to permit this.  We will *not* have a separate view
specifically for TM domain features.

> > I also realized belatedly that I could have left the Interaction table
> > unchanged, rather than introducing specific references to RowSet.  This
> > would have allowed us to represent either singleton effectors/targets or
> > set-valued effectors/targets, without having to always join through
> > RowSet in the singleton case.  On the other hand, if we do associate some
> > additional information with the RowSets, then the current representation
> > is correct.
>
> It depends on whether we want to represent a many-to-many relationship
> between an interaction and its members. Without the RowSet table, we
> can't assign a set of several effectors/targets, right? Unless we
> consider that this set of effectors is part of a complex and acts as
> a whole.

It's true that without the RowSet table we can't assign a set of several
effectors or targets.  What I was trying to say was that I replaced the
following columns in DoTS.Interaction:
 effector_table_id
 effector_row_id (or something to that effect)

with a single column that references a RowSet:
 effector_row_set_id

However, I could have left the Interaction table unchanged, and used the
effector_table_id and effector_row_id to reference entries in the RowSet
table (in the case where there are multiple effectors.)  With this
approach one would have the choice of either using or not using the RowSet
table on a case-by-case basis.  I don't think it's too important which way
we do this; on the one hand you save a join when you only need to reference
a single effector/target (using the table_id/row_id approach) but on the
other hand with the row_set_id approach you can write uniform code and
also have an enforceable referential integrity constraint.  So barring any
strong objection, I'll leave the table as it is now (i.e., with explicit
references to RowSet, meaning that you always have to have a RowSet even
when the effector or target is a single object.)
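
The row_set_id approach described above can be sketched as follows, using
Python's built-in sqlite3 module. This is only an illustration under
stated assumptions, not the real GUS schema: effector_row_set_id is the
only name taken from this message, while RowSetMember, target_row_set_id,
and all ids are hypothetical. It shows the two advantages mentioned: one
uniform query for singleton and set-valued cases, and a foreign key that
the database can enforce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
-- Hypothetical member table: each row points at one object via table_id/row_id.
CREATE TABLE RowSetMember (
    row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    table_id   INTEGER NOT NULL,
    row_id     INTEGER NOT NULL
);
CREATE TABLE Interaction (
    interaction_id      INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    target_row_set_id   INTEGER NOT NULL REFERENCES RowSet(row_set_id)
);
-- RowSet 1 holds a single effector; RowSet 2 holds two targets.
INSERT INTO RowSet VALUES (1), (2);
INSERT INTO RowSetMember VALUES (1, 10, 100);
INSERT INTO RowSetMember VALUES (2, 10, 200), (2, 10, 201);
INSERT INTO Interaction VALUES (1, 1, 2);
""")


def members(conn, interaction_id, role):
    """Return (table_id, row_id) pairs for the effectors or targets."""
    column = {"effector": "effector_row_set_id",
              "target": "target_row_set_id"}[role]
    sql = f"""
        SELECT m.table_id, m.row_id
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.{column}
        WHERE i.interaction_id = ?
        ORDER BY m.row_id
    """
    return conn.execute(sql, (interaction_id,)).fetchall()


effectors = members(conn, 1, "effector")  # singleton case, same query
targets = members(conn, 1, "target")      # set-valued case, same query

# The foreign key rejects an Interaction that points at a missing RowSet.
try:
    conn.execute("INSERT INTO Interaction VALUES (2, 99, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The cost is visible in the query: even the single effector is reached
through a join on the member table, which is the extra join the
table_id/row_id approach would avoid.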

> A case we came across here for Tbrucei is nested repeat regions (at the
> DNA level). Each repeat region has coordinates and is annotated with a
> unique repeat unit type. This repeat region can be within a bigger
> repeat region annotated with a different repeat unit type.
> ... which is in other words your suggestion with parent_id as an extra
> attribute ...

I haven't added the parent_id yet, but I'll do so.
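
For reference, the nested repeat region case with a parent_id column could
look roughly like the sketch below (Python's built-in sqlite3 module). The
table is a simplified stand-in, not the real view: na_feature_id,
start_min, end_max, the repeat type names, and the coordinates are all
assumptions for illustration, and the recursive query is SQLite syntax
that may need adapting for other database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified table; parent_id points at the enclosing region.
CREATE TABLE RepeatRegionFeature (
    na_feature_id INTEGER PRIMARY KEY,
    parent_id     INTEGER REFERENCES RepeatRegionFeature(na_feature_id),
    repeat_type   TEXT NOT NULL,
    start_min     INTEGER NOT NULL,
    end_max       INTEGER NOT NULL
);
-- Made-up data: region 2 (unit_B) is nested inside region 1 (unit_A).
INSERT INTO RepeatRegionFeature VALUES (1, NULL, 'unit_A', 1000, 5000);
INSERT INTO RepeatRegionFeature VALUES (2, 1,    'unit_B', 2000, 2500);
""")


def enclosing_chain(conn, na_feature_id):
    """Return repeat types from the given region outward to the top level."""
    sql = """
        WITH RECURSIVE chain(na_feature_id, parent_id, repeat_type, depth) AS (
            SELECT na_feature_id, parent_id, repeat_type, 0
            FROM RepeatRegionFeature
            WHERE na_feature_id = ?
            UNION ALL
            SELECT r.na_feature_id, r.parent_id, r.repeat_type, c.depth + 1
            FROM RepeatRegionFeature r
            JOIN chain c ON r.na_feature_id = c.parent_id
        )
        SELECT repeat_type FROM chain ORDER BY depth
    """
    return [row[0] for row in conn.execute(sql, (na_feature_id,))]


chain_for_inner = enclosing_chain(conn, 2)
chain_for_outer = enclosing_chain(conn, 1)
```

Each region keeps its own coordinates and a single repeat unit type, and
the parent_id link is enough to recover the nesting to any depth.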

> Regarding transposon repeat types, if we have a TransposableElement
> feature and its type is given as an attribute, a repeat feature will
> just be useful to locate the LTRs within a given transposable element.
> Can we keep this functionality? Then the feature will be simple, with
> just repeat_type and parent_id attributes.

Are you saying that we still need the two tables/features, one for
RepeatFeature, the other for RepeatRegionFeature?  Could you give me a
specific example of how you would envision using these tables (and also
these tables in conjunction with the TransposableElement view, under the
assumption that they're all equipped with parent_ids)?

> Let's leave the design as it is for now. Curators are not going to
> curate interactions data in the short term. We shall come back later
> with more precise ideas/use cases about them.

Sounds good.  Let me know if there's anything I've missed.  I'll try to
generate updated SQL scripts tomorrow, and also update the schema browser
so that everyone can review the changes one last time.  Cheers,

Jonathan