Re: [Gusdev-gusdev] GUS 3.0 schema changes

Hi Jonathan

Jonathan Crabtree wrote:

>Arnaud -
>
>  
>
>>>Which DNA/RNA features do you mean (other than those mentioned above)?
>>>      
>>>
>>The file I sent you should include views on top of the NAFeatureImp
>>table. Here is the list:
>>    
>>
>
>Yes, you're absolutely right; there was a period when I wasn't paying very
>close attention to the schema mailing list, and I'm afraid I misplaced a
>couple of the files you sent, at least temporarily.  I believe I've
>now added all the views and tables that you originally proposed, with
>some minor modifications to take into account discussions we've had since
>then.  See the attached text file for a complete list of the changes I've
>made this time around.
>
>  
>
>>Yes we had! So regarding chromosome regions, shall we keep
>>TelomereFeature and CentromereFeature ?
>>    
>>
>
>No, I think we should use ChromosomeElementFeature instead; I've created
>this view based on the ChromosomeElement view you suggested, but with a
>couple of additional columns to handle the data currently in
>gusdev.TelomereFeature and gusdev.CentromereFeature.
>
>  
>
>>>At
>>>the other extreme, we could continue what we're doing now, i.e. using
>>>an ad-hoc classification of features based on the data we actually have
>>>available, and just make sure that every feature is tagged with the
>>>correct sequence ontology term.  Any thoughts?
>>>      
>>>
>>It makes sense as SO may undergo revisions this year.
>>    
>>
>
>OK, as noted in the attachment, I've added sequence_ontology_id to *all*
>views of NAFeatureImp and AAFeatureImp.
>
>  
>
>>>>A controlled vocabulary table with the four attributes you've
>>>>mentioned is fine.
>>>>        
>>>>
>
>Done; it's called ProteinPropertyType, and the schema/contents are
>described in the attached list of changes.
>
>  
>
>>>>As you're going to add an extra attribute, sequence_ontology_id, to the
>>>>NA Features, could you do the same for any AA Features?
>>>>        
>>>>
>
>OK, done.
>
>  
>
>>The way the SignalPeptideFeature is designed makes the annotation of
>>localization signal features difficult. We can leave
>>SignalPeptideFeature as it is, since it fits SignalP software
>>predictions, and in the future create a new feature, LocalizationSignalFeature.
>>    
>>
>
>OK, based on our discussion today the only change I've made to
>SignalPeptideFeature is to add the sequence_ontology_id, which can be
>used to reference the different localization ontology terms that you
>mentioned.  A column has been added to SequenceOntology to let us store
>multiple ontologies (and versions thereof) in the same table.
>Experimental evidence, references, and annotator's comments can be linked
>to SignalPeptideFeature (or a future LocalizationSignalFeature view) using
>DoTS.Evidence.
>  
>
A quick question regarding evidence: you mention that the
Evidence table will connect features to experimental evidence. Where
will the latter be stored?

>  
>
>>>>I reckon they could be merged.
>>>>        
>>>>
>
>(This comment was in reference to incorporating TM domain features into
>the DomainFeature view.)  I've added a "number_of_domains" column to
>DomainFeature to permit this.  We will *not* have a separate view
>specifically for TM domain features.
>
>  
>
>>>I also realized belatedly that I could have left the Interaction table
>>>unchanged, rather than introducing specific references to RowSet.  This
>>>would have allowed us to represent either singleton effectors/targets or
>>>set-valued effectors/targets, without having to always join through
>>>RowSet
>>>in the singleton case.  On the other hand, if we do associate some
>>>additional information with the RowSets, then the current representation
>>>is correct.
>>>      
>>>
>>It depends on whether we want to represent a many-to-many relationship
>>between an interaction and its members. Without the RowSet table,
>>we can't assign a set of several effectors/targets, right? Unless we
>>consider that the set of effectors is part of a complex and acts
>>as a whole.
>>    
>>
>
>It's true that without the RowSet table we can't assign a set of several
>effectors or targets.  What I was trying to say was that I replaced the
>following columns in DoTS.Interaction:
> effector_table_id
> effector_row_id (or something to that effect)
>
>using instead a single column that references a RowSet:
> effector_row_set_id
>
>However, I could have left the Interaction table unchanged, and used the
>effector_table_id and effector_row_id to reference entries in the RowSet
>table (in the case where there are multiple effectors.)  With this
>approach one would have the choice of either using or not using the RowSet
>table on a case-by-case basis.  I don't think it's too important which way
>we do this; on the one hand you save a join when you only need to reference
>a single effector/target (using the table_id/row_id approach) but on the
>other hand with the row_set_id approach you can write uniform code and
>also have an enforceable referential integrity constraint.  So barring any
>strong objection, I'll leave the table as it is now (i.e., with explicit
>references to RowSet, meaning that you always have to have a RowSet even
>when the effector or target is a single object.)
>  
>
Fine. I think this way is more consistent, since storing one effector and
storing several effectors will be done the same way.
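
To make the uniform-code and referential-integrity points concrete, here is a minimal sketch using Python's sqlite3 module rather than the real GUS/Oracle schema. The tables are simplified stand-ins: the RowSetMember table, the column types, and the helper function names are all assumptions made for illustration, not the actual DoTS definitions.

```python
import sqlite3

# Hypothetical, simplified stand-ins for DoTS.RowSet and DoTS.Interaction.
# RowSetMember and all column types are assumptions made for illustration.
SCHEMA = """
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
CREATE TABLE RowSetMember (
    row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    table_id   INTEGER NOT NULL,
    row_id     INTEGER NOT NULL
);
CREATE TABLE Interaction (
    interaction_id      INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    target_row_set_id   INTEGER NOT NULL REFERENCES RowSet(row_set_id)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def new_row_set(conn, members):
    """Create a RowSet holding one or more (table_id, row_id) members."""
    row_set_id = conn.execute("INSERT INTO RowSet DEFAULT VALUES").lastrowid
    conn.executemany(
        "INSERT INTO RowSetMember VALUES (?, ?, ?)",
        [(row_set_id, table_id, row_id) for table_id, row_id in members],
    )
    return row_set_id


def add_interaction(conn, effectors, targets):
    """Singleton and set-valued effectors/targets take the same code path."""
    cur = conn.execute(
        "INSERT INTO Interaction (effector_row_set_id, target_row_set_id) "
        "VALUES (?, ?)",
        (new_row_set(conn, effectors), new_row_set(conn, targets)),
    )
    return cur.lastrowid


def effectors_of(conn, interaction_id):
    """One query shape, whether the interaction has one effector or many."""
    return conn.execute(
        """SELECT m.table_id, m.row_id
           FROM Interaction i
           JOIN RowSetMember m ON m.row_set_id = i.effector_row_set_id
           WHERE i.interaction_id = ?
           ORDER BY m.table_id, m.row_id""",
        (interaction_id,),
    ).fetchall()
```

The cost is the extra join through the member table even for a single effector; the benefit is that an Interaction row pointing at a nonexistent RowSet is rejected by the database itself.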

>  
>
>>A case we came across here for Tbrucei is nested repeat regions (at the
>>DNA level). Each repeat region has coordinates and is annotated with a
>>unique repeat unit type. This repeat region can be within a bigger
>>repeat region annotated with a different repeat unit type.
>>... which is in other words your suggestion with parent_id as an extra
>>attribute ...
>>    
>>
>
>I haven't added the parent_id yet, but I'll do so.
>
>  
>
>>Regarding transposon repeat types, if we have a TransposableElement
>>feature and its type is given as an attribute, a repeat feature will
>>just be useful to locate the LTRs within a given transposable element.
>>Can we keep this functionality? Then the feature will be simple, with
>>just repeat_type and parent_id attributes.
>>    
>>
>
>Are you saying that we still need the two tables/features, one for
>RepeatFeature, the other for RepeatRegionFeature?  Could you give me a
>specific example of how you would envision using these tables (and also
>these tables in conjunction with the TransposableElement view, under the
>assumption that they're all equipped with parent_ids)?
>  
>
Here are two examples of transposable element annotations: one is from
Tbrucei, the other is a common case in prokaryote genomes.

The first one is the inclusion of an INGI transposon within an ORF, the
RHS gene. The transposon includes two flanking RIME repeats and another ORF.
So in GUS, the INGI transposon could be stored as a transposable element
feature, attached to an RHS gene feature. The transposable element
feature will have three sub-features: a gene feature, tagged as a
pseudo-gene, and two repeat features, whose repeat_type is RIME and which
each have a given location.

The second example is nested transposable elements in prokaryote
genomes, i.e. the insertion of a transposable element within another one.
Each transposable element can have a similar structure, including the
following sub-features: two flanking Inverted Repeats, a gene and its
promoter, and/or a promoter that is functional on the other strand!

So if there is no repeat feature, the flanking repeats will have to be
annotated as part of the transposable element feature.
Let me know what you think about these.
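
As a rough illustration of the first (INGI/RHS) example, the sketch below stores all features in one simplified table with a parent_id and walks the hierarchy with a recursive query. It uses Python's sqlite3 module, not the GUS schema: the single Feature table, its column names, the is_pseudo flag, and the coordinates are all made up for the purpose of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical single table standing in for the NAFeatureImp views.
CREATE TABLE Feature (
    feature_id   INTEGER PRIMARY KEY,
    parent_id    INTEGER REFERENCES Feature(feature_id),
    feature_type TEXT NOT NULL,
    repeat_type  TEXT,               -- only set for repeat features
    is_pseudo    INTEGER DEFAULT 0,  -- only meaningful for gene features
    start_pos    INTEGER NOT NULL,
    end_pos      INTEGER NOT NULL
);
""")

# Coordinates are invented; only the nesting mirrors the INGI/RHS example.
conn.executemany(
    "INSERT INTO Feature VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, None, "GeneFeature",         None,   0, 1000, 9000),  # RHS gene
        (2, 1,    "TransposableElement", None,   0, 3000, 8000),  # INGI
        (3, 2,    "GeneFeature",         None,   1, 3300, 7700),  # pseudo-gene
        (4, 2,    "RepeatFeature",       "RIME", 0, 3000, 3250),  # 5' flank
        (5, 2,    "RepeatFeature",       "RIME", 0, 7750, 8000),  # 3' flank
    ],
)


def descendants(conn, root_id):
    """Return (feature_id, feature_type, repeat_type, depth) for every
    feature nested below root_id, following parent_id at any depth."""
    return conn.execute(
        """
        WITH RECURSIVE sub(feature_id, feature_type, repeat_type, depth) AS (
            SELECT feature_id, feature_type, repeat_type, 1
            FROM Feature WHERE parent_id = ?
            UNION ALL
            SELECT f.feature_id, f.feature_type, f.repeat_type, sub.depth + 1
            FROM Feature f JOIN sub ON f.parent_id = sub.feature_id
        )
        SELECT feature_id, feature_type, repeat_type, depth
        FROM sub ORDER BY depth, feature_id
        """,
        (root_id,),
    ).fetchall()
```

Because the query follows parent_id to any depth, the same function would also cover the second example, where one transposable element is inserted inside another.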

>  
>
>>Let's leave the design as it is for now. Curators are not going to
>>curate interactions data in the short term. We shall come back later
>>with more precise ideas/use cases about them.
>>    
>>
>
>Sounds good.  Let me know if there's anything I've missed.  I'll try to
>generate updated SQL scripts tomorrow, and also update the schema browser
>so that everyone can review the changes one last time.  Cheers,
>
>Jonathan
>
>  
>
>------------------------------------------------------------------------
>
>
>-Added nullable 'is_obsolete' column to DoTS.GeneSynonym
>-Added and populated DoTS.ProteinPropertyType table (please correct/improve my 
> protein property descriptions, shown below.)  I did not include a source_id column, 
> because that usually implies a reference to an external database (in conjunction 
> with an external_database_release_id to specify which database).
>
>  1 isoelectric point       The pH at which the net charge of the entire polypeptide is zero.
>  2 molecular mass          The mass of the entire polypeptide.
>  3 charge                  The net charge of the entire polypeptide.
>  4 average residue mass    The average mass of a single residue in the polypeptide chain.
>
>-Modified DoTS.ProteinProperty table to reference ProteinPropertyType
> One question I have regarding these tables is how will the units be specified?
> Should I make the "property_value" column a varchar2 column?  It may have had 
> this type originally, and I might have changed it without considering the 
> consequences.  One option would be to specify in the ProteinPropertyType table
> what units are to be used, though this is clumsy if there is more than one
> choice of units for a given property.
>
Whatever units they are in, the values should all be numbers (some would be
integers), so we can go for the "number" data type, though float or varchar
would also be fine.
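
A small sketch of the "units in ProteinPropertyType" option with a numeric value column, again using Python's sqlite3 module as a stand-in for the real schema. The units column, the particular unit strings, the aa_sequence_id key, the helper functions, and the sample values are all assumptions for illustration only; the four property names are the ones listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProteinPropertyType (
    protein_property_type_id INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    units TEXT NOT NULL   -- hypothetical column: one fixed unit per property
);
CREATE TABLE ProteinProperty (
    aa_sequence_id           INTEGER NOT NULL,  -- hypothetical key
    protein_property_type_id INTEGER NOT NULL
        REFERENCES ProteinPropertyType(protein_property_type_id),
    property_value           REAL NOT NULL      -- numeric rather than varchar
);
""")

# Unit strings are assumed, not taken from the schema.
conn.executemany(
    "INSERT INTO ProteinPropertyType VALUES (?, ?, ?)",
    [
        (1, "isoelectric point", "pH"),
        (2, "molecular mass", "Da"),
        (3, "charge", "e"),
        (4, "average residue mass", "Da"),
    ],
)

# Made-up sample values for two hypothetical sequences.
conn.executemany(
    "INSERT INTO ProteinProperty VALUES (?, ?, ?)",
    [(100, 1, 6.5), (100, 2, 52000), (101, 1, 9.1), (101, 2, 18000.5)],
)


def sequences_with_property_above(conn, property_name, threshold):
    """Numeric storage lets the database compare values as numbers."""
    rows = conn.execute(
        """SELECT p.aa_sequence_id
           FROM ProteinProperty p
           JOIN ProteinPropertyType t USING (protein_property_type_id)
           WHERE t.name = ? AND p.property_value > ?
           ORDER BY p.aa_sequence_id""",
        (property_name, threshold),
    ).fetchall()
    return [row[0] for row in rows]


def describe(conn, aa_sequence_id):
    """Return (name, value, units) for each property of one sequence."""
    return conn.execute(
        """SELECT t.name, p.property_value, t.units
           FROM ProteinProperty p
           JOIN ProteinPropertyType t USING (protein_property_type_id)
           WHERE p.aa_sequence_id = ?
           ORDER BY t.protein_property_type_id""",
        (aa_sequence_id,),
    ).fetchall()
```

One argument for a numeric column over varchar is that range queries behave as expected; compared as strings, "9.1" would sort after "18000.5". The limitation Jonathan raises still applies: a single units column only works if each property is always stored in one agreed unit.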

>-Created DoTS.SecondaryStructureAAFeature (instead of AASecondaryStructure)
>-Created DoTS.TertiaryStructureAAFeature (instead of AATertiaryStructure)
>-Created DoTS.ChromosomeElementFeature (instead of ChromosomeElement), with 
> a few additional columns to handle the data currently in gusdev.TelomereFeature
> and gusdev.CentromereFeature
>-Added "probability" column to DoTS.DomainFeature.
>-Added "number_of_domains" column to DoTS.DomainFeature, so that it can be used 
> instead of the proposed TransmembraneDomainFeature to represent TM domains.
>-Added DoTS.GenomicSequence view, with sequencing_center_contact_id instead of
> the proposed free text column, "sequencing_center".
>-Added sequencing_center_contact_id to DoTS.NASequenceImp to support this.
>-Created DoTS.InflectionPointFeature
>-Added columns to ProteinProperty to more closely reflect the original proposal
> (e.g. prediction_algorithm_id, is_predicted, review_status_id, source_id)
>-Modified DoTS.PostTranslationalModFeature as per Arnaud's original proposal
>-Created DoTS.ReplicationFeature (should this be ReplicationOriginFeature?)
>
I reckon ReplicationOriginFeature would make more sense.

>-Added "type_of_cut" column to DoTS.RestrictionFragmentFeature
>-Created DoTS.RNARegulatoryFeature (instead of RNARegulatory), but omitted the
> "evidence" column; shouldn't the Evidence table be used for this purpose?
>-Created DoTS.RNASecondaryStructureFeature (instead of RNASecondaryStructure)
>-Created DoTS.SpliceSiteFeature
>-Created DoTS.TransposableElement
>-Added external_database_release_id to any view that has a source_id; these two 
> fields should always appear together, since by convention they are used to 
> specify a reference to an external database.  (Admittedly this is somewhat 
> obscure, and we should probably think about using something more obvious.)
>-Added sequence_ontology_id to AAFeatureImp and all of its views
>-Added "ontology_name" column to SequenceOntology to allow us to store multiple
> ontologies (na sequence + aa sequence) in the table.  We *could* have used
> the existing so_version column for this purpose, but I think adding an extra
> column is a slightly better idea.  Alternatively we could switch to using an
> external_database_release_id, which I think we might have done for the GO
> terms already.
>
>  
>
cheers
Arnaud