Re: [Gusdev-gusdev] GUS 3.0 schema changes

Hi Jonathan

Jonathan Crabtree wrote:

>
> Arnaud-
>
> Thanks for the feedback; I think we're getting close to agreement here.

I think so too!

>> I have noticed that your changes don't cover the DNA/RNA features. Is
>> there any reason for this? I know there are quite a lot of them, and
>> there might be another way of storing some of the information, such as
>> telomere or centromere regions, origins of replication, inflection
>> points, etc. All these features are covered by the Sequence Ontology, so
>> a new ChromosomeElement or ChromosomeRegion feature could be generic
>> enough to cover most of them.
>> Let me know what you think.
>
>
> Which DNA/RNA features do you mean (other than those mentioned above)?

The file I sent you should include views on top of the NAFeatureImp
table. Here is the list:

* ChromosomeElement or we can keep CentromereFeature and TelomereFeature 
as they are in gusdev - IMPORTANT
* InflectionPointFeature
* ReplicationFeature, for annotated origins of replication
* RNARegulatory, for regulatory elements at the RNA level (there is
already a DNARegulatory feature)
* RNASecondaryStructure
* SpliceSiteFeature
* TransposableElement

+ an extra attribute in RestrictionFragmentFeature, "type_of_cut" 
(Sticky or blunt)
+ an extra attribute in GeneSynonym, "is_obsolete"

+ a new view on top of NASequenceImp, "GenomicSequence", instead of
the existing one, ExternalNASequence.

I can send the files to you if you want.

>
> It's possible that I misplaced the e-mail or notes where we discussed
> these.  Or are you just saying that we will eventually have a view for
> each type of DNA/RNA feature in the Sequence Ontology?  I think that
> this is true, although I hadn't planned to make the change immediately,
> since I believe we had agreed on a "transitional" period in which the
> various NAFeature views would first be given a nullable 
> sequence_ontology_id

Yes, we had! So regarding chromosome regions, shall we keep
TelomereFeature and CentromereFeature?

> and we would then decide how to best rearrange the views to more closely
> match the ontology terms.  I haven't added the sequence_ontology_id
> column to the NAFeature views, but I will do so right away.  We do
> currently have some relevant NAFeature views in gusdev that have not
> been migrated into 3.0:
>
>  CentromereFeature
>  LowComplexityNAFeature
>  ScaffoldGapFeature
>  TelomereFeature
>
> I have no objection to merging the telomere and centromere features into
> a single view--along with any other chromosomal regions covered by the
> ontology--although it would mean that we wouldn't have a 1-1 mapping
> between sequence ontology terms and views on NAFeature.  I think that
> at one point this was proposed as the eventual goal of the rearrangement.
> Anyway, given that I'm not certain of the plan here, I'm going to add
> the sequence_ontology_id column but leave the views unchanged for now.
> They can easily be changed without interfering with our data migration,
> so their fate doesn't have to be settled immediately.  We have yet to
> establish a consistent set of rules for deciding when different types
> of features get grouped into a single view and when they get their own
> views, so this is probably a good opportunity to settle the question
> once and for all.  The Sequence Ontology is big enough that we probably
> *don't* want a view for each and every term in the ontology; it would
> make maintenance quite difficult.  But we could, for example, create a
> view for every top-level (or second-level) sequence ontology term.
> However, even a relatively high-level feature like "chromosomal region"
> (SO:0000711) looks like it's already a 4th or 5th level feature...  At
> the other extreme, we could continue what we're doing now, i.e. using
> an ad-hoc classification of features based on the data we actually have
> available, and just make sure that every feature is tagged with the
> correct sequence ontology term.  Any thoughts?

Keeping the ad-hoc classification and tagging every feature with its SO
term makes sense, as SO may undergo revisions this year.
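
To make the transitional approach concrete, here is a rough sketch using SQLite through Python rather than Oracle. It assumes a trimmed-down NAFeatureImp with a subclass_view discriminator column and a SequenceOntology table holding only a term name. The column lists and sample rows are invented for illustration and are not the real GUS definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the real tables (column lists are assumptions).
CREATE TABLE SequenceOntology (
    sequence_ontology_id INTEGER PRIMARY KEY,
    term_name            TEXT NOT NULL UNIQUE
);
CREATE TABLE NAFeatureImp (
    na_feature_id        INTEGER PRIMARY KEY,
    subclass_view        TEXT NOT NULL,
    name                 TEXT,
    -- nullable during the transitional period
    sequence_ontology_id INTEGER
        REFERENCES SequenceOntology(sequence_ontology_id)
);
-- The existing ad-hoc views stay exactly as they are.
CREATE VIEW CentromereFeature AS
    SELECT na_feature_id, name, sequence_ontology_id
    FROM NAFeatureImp WHERE subclass_view = 'CentromereFeature';
CREATE VIEW TelomereFeature AS
    SELECT na_feature_id, name, sequence_ontology_id
    FROM NAFeatureImp WHERE subclass_view = 'TelomereFeature';
""")

conn.executemany("INSERT INTO SequenceOntology VALUES (?, ?)",
                 [(1, "centromere"), (2, "telomere")])
conn.executemany("INSERT INTO NAFeatureImp VALUES (?, ?, ?, ?)", [
    (10, "CentromereFeature", "cen1", 1),
    (11, "TelomereFeature", "tel1L", 2),
    (12, "TelomereFeature", "tel1R", None),   # not tagged yet
])

# Features can be grouped by SO term regardless of which view they live in.
features_by_term = conn.execute("""
    SELECT so.term_name, COUNT(*)
    FROM NAFeatureImp f
    JOIN SequenceOntology so
      ON so.sequence_ontology_id = f.sequence_ontology_id
    GROUP BY so.term_name
    ORDER BY so.term_name
""").fetchall()

# Rows still waiting for a term are easy to list while the column is nullable.
untagged = [row[0] for row in conn.execute(
    "SELECT name FROM NAFeatureImp WHERE sequence_ontology_id IS NULL")]

telomere_names = [row[0] for row in conn.execute(
    "SELECT name FROM TelomereFeature ORDER BY na_feature_id")]
```

The point of the sketch is that regrouping the views later only means changing the view definitions. The tagged rows in NAFeatureImp do not have to move.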

>
>>>
>>> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 
>>> check (property_name in ('isoelectric point', 'molecular mass', 
>>> 'charge', 'average residue mass'));
>>>
>>> The table allows multiple protein properties of the same type to be
>>> associated with entries in DoTS.AASequenceImp.  Arnaud had suggested
>>> originally that the last property, average residue mass, could
>>> actually be an attribute of the table that stores the protein
>>> sequence itself.  However, it seemed that if the molecular mass
>>> attribute could have multiple values (e.g., from different
>>> experiments) then the same should be true of the average residue
>>> mass, which is essentially a derived property.  Let me know if you
>>> disagree with this, or think we should create an explicit controlled
>>> vocab. for these 4 properties.
>>>  
>>>
>> A controlled vocabulary table with the four attributes you've 
>> mentioned is fine.
>
>
> OK, I'll make this change.
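
A minimal sketch of the controlled-vocabulary variant, again in SQLite through Python rather than Oracle. The table name ProteinPropertyType, the column names and the numeric values are assumptions made up for the example. Only the four property names come from the check constraint quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
CREATE TABLE ProteinPropertyType (          -- hypothetical table name
    protein_property_type_id INTEGER PRIMARY KEY,
    name                     TEXT NOT NULL UNIQUE
);
CREATE TABLE ProteinProperty (
    protein_property_id      INTEGER PRIMARY KEY,
    aa_sequence_id           INTEGER NOT NULL,
    protein_property_type_id INTEGER NOT NULL
        REFERENCES ProteinPropertyType(protein_property_type_id),
    property_value           REAL NOT NULL
);
""")

# The vocabulary replaces the list hard-coded in the check constraint.
conn.executemany(
    "INSERT INTO ProteinPropertyType (name) VALUES (?)",
    [("isoelectric point",), ("molecular mass",),
     ("charge",), ("average residue mass",)],
)

mass_type_id = conn.execute(
    "SELECT protein_property_type_id FROM ProteinPropertyType "
    "WHERE name = 'molecular mass'").fetchone()[0]

# Several values of the same property for one sequence (made-up numbers,
# standing in for results from different experiments).
conn.executemany(
    "INSERT INTO ProteinProperty "
    "(aa_sequence_id, protein_property_type_id, property_value) "
    "VALUES (?, ?, ?)",
    [(1, mass_type_id, 51234.0), (1, mass_type_id, 51310.5)],
)
mass_values = [row[0] for row in conn.execute(
    "SELECT property_value FROM ProteinProperty "
    "WHERE aa_sequence_id = 1 ORDER BY property_value")]

# A property type outside the vocabulary is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO ProteinProperty "
        "(aa_sequence_id, protein_property_type_id, property_value) "
        "VALUES (1, 999, 0.0)")
    unknown_type_rejected = False
except sqlite3.IntegrityError:
    unknown_type_rejected = True
```

Compared with the check constraint, adding a fifth property becomes an insert into the vocabulary table instead of a schema change.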
>
>>> -Protein features
>>> *Signal peptide features (stored in DoTS.SignalPeptideFeature)
>>>  This view exists already, as DoTS.SignalPeptideFeature, but we need
>>>  to add the ability to store curated data, such as targeting
>>>  information.  It should be straightforward to modify the view to
>>>  accommodate this, but I'm not sure exactly what needs to be stored.
>>>  Currently we use the view exclusively for SignalP predictions, and
>>>  from what I understand SignalP is only concerned with predicting
>>>  secreted proteins, meaning that we don't currently have any explicit
>>>  targeting information.  Is this something we could represent using
>>>  the GO ontology for cellular localization?  Do we also need some
>>>  free text columns?  Let me know and I'll make the changes.  All the
>>>  SignalP-specific columns appear to be nullable, so we don't
>>>  necessarily have to do anything except add the new columns for the
>>>  manually curated information.
>>>  
>>>
>> After talking to the curators, it appears that the GO component
>> ontology supplements targeting information at the feature level but
>> will not be enough. The component ontology represents targeting in
>> one context only, i.e. mitochondrial, nuclear, or membrane
>> localization, but not in the context of the actual residues involved.
>> The actual residues involved in the targeting (or any other
>> phenomenon) need to be represented by a protein feature ontology that
>> can be mapped onto specific amino acids of a protein.
>> This ontology is the equivalent of the Sequence Ontology (SO), which
>> is meant for DNA features. It is being prepared by Val Wood with input
>> from Swiss-Prot.
>
>
> OK, so the idea is that the various signal peptides have been classified
> into named classes that should be represented by some kind of ontology?
>
>> As you're going to add an extra attribute, sequence_ontology_id, to the
>> NA features, could you do the same for any AA features?
>
>
> This will only work if the new ontology is actually part of the Sequence
> Ontology (or if we use the SequenceOntology table to store both
> ontologies.)  Do you know if this is the case?  It's quite possible,
> since the SO does already cover amino acid features.  Otherwise we'll
> have to create a separate AASequenceOntology (or whatever the new
> ontology is called).

At the moment it is a separate project, but it would make sense for the
two to merge in the future. To give you an idea of the localization
signals, here is a snapshot:

   %localization signal      
      %N-terminal signal sequence
      %nuclear localization signal
         %bipartite nuclear localization signal
         %etc
      %mitochondrial localization sequence
         %thylakoid localization signal
      %ER retention signal

The way SignalPeptideFeature is designed makes it difficult to annotate
localization signal features. We can leave SignalPeptideFeature as it
is, since it fits the SignalP software predictions, and create a new
feature, LocalizationSignalFeature, in the future.

>
>>> *Transmembrane domain features (stored in DoTS.PredictedAAFeature)
>>>  "PlasmoDB web site shows hydrophobicity graphics, where is it
>>>  stored in GUS?"
>>>  The hydrophobicity plots are computed dynamically based on the
>>>  amino acid sequence.  Transmembrane domains are currently stored in
>>>  the PredictedAAFeature view, although I will probably create a new
>>>  view for them when I get around to eliminating PredictedAAFeature.
>>>  Another possibility would be to treat TM domains as another type of
>>>  domain, and store them in DomainFeature.  What do you think about
>>>  this?
>>>  
>>>
>> I reckon they could be merged.
>
>
> OK, sounds good.
>
>>> *Post-translational modification features (new view:
>>>  DoTS.PostTranslationalModFeature)
>>>  Has a "type" column to represent the type of modification.  It was
>>>  also suggested that we have a column called "modified_by", which
>>>  would be a reference to the Interaction table.  However, isn't it
>>>  possible that the same post-translational modification (e.g.,
>>>  phosphorylation of a specific amino acid) could be the result of one
>>>  of several Interactions?
>>
>> Yes, you're right, the effector could be different. In that case the
>> attribute "modified_by" is not useful.
>>
>>> This argues for an additional relationship between Interaction and
>>> PostTranslationalModFeature, unless we're OK creating multiple
>>> PostTranslationalModFeatures, identical except for their modified_by
>>> attribute.  Comments on this?
>>>  
>>>
>> I don't think they should be duplicated, as they correspond to a
>> unique site. This unique feature would be associated with different
>> interaction entries. We might not need an extra table between
>> Interaction and PostTranslationalModFeature, though. We can still run
>> the following query: "give me all the interaction entries whose
>> target is a PostTranslationalModFeature whose id is ...".
>> How does that sound?
>
>
> We could do this, although one question is whether, semantically
> speaking, the "target" of an Interaction should be "the thing to be
> modified" (e.g. an unphosphorylated sequence or residue) or "the
> resulting modification" (e.g. the feature that represents a
> phosphorylated residue at the appropriate location.)  The answer is
> probably that we just shouldn't worry about it and should just do
> whatever is most convenient on a case-by-case basis.  To do it
> "correctly" would be problematic either way.  For example, if we say
> that the target is the thing to be modified, then we have to create a
> feature that represents a region of sequence that *could* be modified
> in some way and then create another feature to represent the actual
> modification.  But if we say that the target is the result of the
> modification then we may have to create equally unusual tables/views.
> For example, if the result of a given interaction is to degrade a
> protein, then do we have to create a table/object that represents a
> degraded protein (or "nothing", or whatever it is that's left after
> the modification)?  For now I have no problem with interpreting the
> "target" based on context, but in the longer term we may want to
> consider separating the notions of "target prior to modification" and
> either "target after modification" or "effect of modification".
>
> I also realized belatedly that I could have left the Interaction table
> unchanged, rather than introducing specific references to RowSet.  This
> would have allowed us to represent either singleton effectors/targets or
> set-valued effectors/targets, without having to always join through
> RowSet in the singleton case.  On the other hand, if we do associate
> some additional information with the RowSets, then the current
> representation is correct.

It depends on whether we want to represent a many-to-many relationship
between an interaction and its members. Without the RowSet table, we
can't assign a set of several effectors/targets, right? Unless we
consider that the set of effectors forms a complex and acts as a whole.

>
>>> *AA repeats (new view: RepeatRegionAAFeature)
>>>  I called this view RepeatRegionAAFeature in case we want to have a
>>>  similar view for NASequences.  I also created only a single view,
>>>  instead of following Arnaud's original suggestion, which was for
>>>  both:
>>>
>>>      * RepeatRegionFeature as a set of RepeatUnitFeatures,
>>>      * RepeatUnitFeature, with the consensus sequence, name and size
>>>
>>>  I based the design of this view on that of TandemRepeatFeature,
>>>  which we have for NASequences already.  Instead of splitting the
>>>  consensus sequence, name, and size into a separate table, they
>>>  occupy columns in RepeatRegionAAFeature.  This works quite well for
>>>  the tandem repeats we already have (for DNA sequences.)  However, if
>>>  there is a known set of named amino acid sequence repeats, then it
>>>  would probably make sense to do what Arnaud suggested, and store
>>>  these in a separate table (likely named RepeatUnit, not
>>>  RepeatUnitFeature, since they would have no unique locations.)  Does
>>>  this sound reasonable?  That is, put the consensus seqs in the
>>>  repeat region table itself if they're anonymous, but if they've been
>>>  named, then store them in a separate table.  Also note that this
>>>  view has a reference to RepeatType, although the current contents of
>>>  this table are probably applicable only to DNA sequence repeats
>>>  (LINEs, SINEs, ALUs, etc.), since I believe that I parsed them out
>>>  of RepBase.
>>>
>>>
>> I proposed a separate repeat feature because one may want to annotate 
>> a repeat outside a repeat region, for example LTR repeats attached to 
>> a given transposable element. These RepeatFeatures or 
>> RepeatUnitFeatures can then have a location.
>> The other case is when a repeat region is made of a set of different 
>> repeat units.
>
>
> OK, I didn't realize that this was what you were trying to represent.  As
> currently conceived, RepeatRegionAAFeature is meant to represent a region
> that contains one or more immediately adjacent copies of the same type
> of (amino acid sequence) repeat.  The assumption is also that these
> regions will typically be maximal (with respect to the chosen repeat
> type, consensus, and max. mismatch, the last of which is not represented
> directly in the table.)  We can still represent more complex repeat
> structures using this single table, but the representation is implicit,
> not explicit (i.e. you have to do a query to find out what other repeats
> lie within the bounds of the transposon, meaning that there's no easy
> way to query for all transposable elements with a particular flanking
> LTR structure.)  Do you want to come up with a 2-table version of what
> I've done?  The use cases aren't clear enough in my mind yet for me to be
> able to do it.  It seems that the bare minimum we need is just another
> column in the RepeatRegionAAFeature, parent_id, which would let us
> represent explicitly that a particular repeat is a *necessary* (versus
> incidental) component of another NA/AAFeature.  Both AAFeatureImp and
> NAFeatureImp already have a parent_id, so this would be a
> straightforward change.  The queries still might not be terribly
> efficient, but I don't know what exactly you wanted to support in terms
> of queries, versus just making sure that the representation is
> sufficiently rich to capture the structure.

A case we came across here for T. brucei is nested repeat regions (at
the DNA level). Each repeat region has coordinates and is annotated with
a unique repeat unit type. Such a repeat region can lie within a bigger
repeat region annotated with a different repeat unit type. In other
words, this is your suggestion of parent_id as an extra attribute.

Regarding transposon repeat types, if we have a TransposableElement
feature and its type is given as an attribute, a repeat feature will
just be useful to locate the LTRs within a given transposable element.
Can we keep this functionality? The feature will then be simple, with
just repeat_type and parent_id attributes.

>
>> In any case, NA repeats and AA repeats should have the same design. 
>> Just the controlled vocabulary representing the types of repeats will 
>> be different.
>
>
> Absolutely, yes, although one question is whether AA repeats can have the
> same kind of nested structure that you mention as a possibility for NA
> repeats (the transposon with LTRs).  I don't know the answer to this.
>
>>> -DoTS.Interaction (table modified, dependent tables added)
>>> *Added "has_direction" column, as discussed previously.  The idea
>>>  here is that not all interactions (particularly physical ones,
>>>  e.g., dimerization) have a direction.  If has_direction == 0, then
>>>  the value of direction_is_known can be ignored.
>>> *Added non-nullable "effector_action_type_id" column, referencing
>>>  DoTS.EffectorActionType (a new table.)  This column/table
>>>  represents the possible things that an effector can do to a target.
>>>  For example, the InteractionType associated with the Interaction
>>>  could be "binds to" (e.g., a promoter region), and the
>>>  EffectorActionType for that Interaction could be to either
>>>  "inhibit" or "enhance" expression of the corresponding gene.
>>> *Replaced effector_table_id and effector_row_id with
>>>  effector_row_set_id, and similarly for the target_table_id and
>>>  target_row_id.  This allows us to represent the interaction of a
>>>  set of objects (the effector) with another set of objects (the
>>>  target.)  Previously the Interaction table could only represent the
>>>  interaction between a single pair of entities (OK if they happened
>>>  to be Complexes, for example, but a potential problem in other
>>>  situations.)  Now both effector and target are represented as
>>>  references to DoTS.RowSet, which in turn references
>>>  DoTS.RowSetMember, which...in turn...references the individual
>>>  database rows that comprise the effector or target.  These tables
>>>  (RowSet and RowSetMember) are essentially the same as Complex and
>>>  ComplexComponent, except that they are totally generic; they can be
>>>  used to group any set of rows in the database and they store no
>>>  additional information.  However, if there are any additional
>>>  columns that we can think of (that are specific to Interactions)
>>>  then these tables should be replaced by less generic ones (e.g.
>>>  InteractingEntitySet or InteractionSet, or something along those
>>>  lines.)
>>>  
>>>
>> Sounds fine. My only comment concerns EffectorActionType. If each
>> effector in a RowSet has a different action type, the
>> effector_action_type_id attribute should go in the RowSetMember
>> table. I don't have an example, though.
>
>
> OK, I think I'd be inclined to wait until we have some use cases for
> this.  Although the current schema lets us group effectors together, it
> doesn't let us say (for example) that E1 interacts *directly* with T1 to
> phosphorylate it, but that E1's active site is only exposed when E1 is
> bound to E2.  In other words, E1's role in the activity can be viewed as
> "primary", and E2's role is secondary (in some sense) but all we can say
> in the schema is that the Complex consisting of E1 and E2 interacts with
> T1 to phosphorylate it.  I think that the solution we have now is OK,
> but it only lets us represent the overall action of the entire set of
> effectors.

Let's leave the design as it is for now. Curators are not going to
curate interaction data in the short term. We shall come back later
with more precise ideas and use cases.

>
> Jonathan
>

Arnaud