Re: [Gusdev-gusdev] GUS 3.0 schema changes

Arnaud-

Thanks for the feedback; I think we're getting close to agreement here.

> I have noticed that your changes don't cover the DNA/RNA features. Is 
> there any reason for this ? I know there are quite a lot of them and 
> there might be another way of storing some information such as 
> telomere or centromere regions, origin of replication, inflection point 
> etc. All these features are covered by Sequence Ontology, so a new 
> ChromosomeElement or ChromosomeRegion feature could be generic enough to 
> cover most of them.
> Let me know what you think.

Which DNA/RNA features do you mean (other than those mentioned above)?
It's possible that I misplaced the e-mail or notes where we discussed
these.  Or are you just saying that we will eventually have a view for
each type of DNA/RNA feature in the Sequence Ontology?  I think that
this is true, although I hadn't planned to make the change immediately,
since I believe we had agreed on a "transitional" period in which the
various NAFeature views would first be given a nullable sequence_ontology_id
and we would then decide how to best rearrange the views to more closely
match the ontology terms.  I haven't added the sequence_ontology_id
column to the NAFeature views, but I will do so right away.  We do
currently have some relevant NAFeature views in gusdev that have not
been migrated into 3.0:

  CentromereFeature
  LowComplexityNAFeature
  ScaffoldGapFeature
  TelomereFeature

I have no objection to merging the telomere and centromere features into
a single view--along with any other chromosomal regions covered by the
ontology--although it would mean that we wouldn't have a 1-1 mapping
between sequence ontology terms and views on NAFeature.  I think that
at one point this was proposed as the eventual goal of the rearrangement.
Anyway, given that I'm not certain of the plan here, I'm going to add
the sequence_ontology_id column but leave the views unchanged for now.
They can easily be changed without interfering with our data migration,
so their fate doesn't have to be settled immediately.  We have yet to
establish a consistent set of rules for deciding when different types
of features get grouped into a single view and when they get their own
views, so this is probably a good opportunity to settle the question
once and for all.  The Sequence Ontology is big enough that we probably
*don't* want a view for each and every term in the ontology; it would
make maintenance quite difficult.  But we could, for example, create a
view for every top-level (or second-level) sequence ontology term.
However, even a relatively high-level feature like "chromosomal region"
(SO:0000711) looks like it's already a 4th or 5th level feature...  At
the other extreme, we could continue what we're doing now, i.e. using
an ad-hoc classification of features based on the data we actually have
available, and just make sure that every feature is tagged with the
correct sequence ontology term.  Any thoughts?
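
To make the transitional option concrete, here is a minimal sketch using an in-memory SQLite database rather than our Oracle instance. The column layout of NAFeatureImp and SequenceOntology is heavily simplified, and the subclass_view column, the merged ChromosomeRegionFeature view, and the sample rows are assumptions made up for illustration. It shows the existing per-type views left untouched, a nullable sequence_ontology_id, and one possible merged view defined purely in terms of the ontology tag.

```python
import sqlite3

def build_transitional_schema():
    """Build a toy NAFeatureImp with a nullable sequence_ontology_id (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SequenceOntology (
            sequence_ontology_id INTEGER PRIMARY KEY,
            term_name            TEXT NOT NULL UNIQUE
        );
        CREATE TABLE NAFeatureImp (
            na_feature_id        INTEGER PRIMARY KEY,
            subclass_view        TEXT NOT NULL,
            name                 TEXT,
            -- nullable during the transitional period
            sequence_ontology_id INTEGER
                REFERENCES SequenceOntology (sequence_ontology_id)
        );
        -- the existing per-type views stay exactly as they are
        CREATE VIEW TelomereFeature AS
            SELECT * FROM NAFeatureImp WHERE subclass_view = 'TelomereFeature';
        CREATE VIEW CentromereFeature AS
            SELECT * FROM NAFeatureImp WHERE subclass_view = 'CentromereFeature';
        -- one possible merged view, driven only by the ontology tag
        CREATE VIEW ChromosomeRegionFeature AS
            SELECT f.na_feature_id, f.name, so.term_name
            FROM NAFeatureImp f
            JOIN SequenceOntology so
              ON so.sequence_ontology_id = f.sequence_ontology_id
            WHERE so.term_name IN ('telomere', 'centromere');
    """)
    conn.executemany("INSERT INTO SequenceOntology VALUES (?, ?)",
                     [(1, "telomere"), (2, "centromere")])
    # made-up sample rows; the scaffold gap has not been tagged yet
    conn.executemany("INSERT INTO NAFeatureImp VALUES (?, ?, ?, ?)", [
        (10, "TelomereFeature", "chr1 left telomere", 1),
        (11, "CentromereFeature", "chr1 centromere", 2),
        (12, "ScaffoldGapFeature", "gap 1", None),
    ])
    return conn

def merged_regions(conn):
    """Rows of the merged view as (name, term_name), ordered by name."""
    return conn.execute(
        "SELECT name, term_name FROM ChromosomeRegionFeature ORDER BY name"
    ).fetchall()

def untagged_features(conn):
    """Ids of features that still lack a sequence ontology term."""
    rows = conn.execute(
        "SELECT na_feature_id FROM NAFeatureImp "
        "WHERE sequence_ontology_id IS NULL ORDER BY na_feature_id"
    ).fetchall()
    return [r[0] for r in rows]
```

The point of the sketch is only that the grouping into views and the ontology tagging are independent: the views can be rearranged later without touching the tagged rows.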

>>
>> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 check 
>> (property_name in ('isoelectric point', 'molecular mass', 'charge', 'average residue mass'));
>>
>> The table allows multiple protein properties of the same type to be associated with
>> entries in DoTS.AASequenceImp.  Arnaud had suggested originally that the last 
>> property, average residue mass, could actually be an attribute of the table that 
>> stores the protein sequence itself.  However, it seemed that if the molecular 
>> mass attribute could have multiple values (e.g., from different experiments) then
>> the same should be true of the average residue mass, which is essentially a 
>> derived property.  Let me know if you disagree with this, or think we should 
>> create an explicit controlled vocab. for these 4 properties.
>>  
>>
> A controlled vocabulary table with the four attributes you've mentioned 
> is fine.

OK, I'll make this change.
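
A rough sketch of what I have in mind, again in SQLite rather than Oracle. The table name ProteinPropertyType and all of the column names are placeholders, not the final 3.0 names. The check constraint is replaced by a foreign key into a four-row vocabulary table, and the same sequence can still carry several values of the same property.

```python
import sqlite3

# the four property names from the original check constraint
PROPERTY_NAMES = ("isoelectric point", "molecular mass",
                  "charge", "average residue mass")

def make_property_db():
    """Create a toy controlled-vocabulary table plus ProteinProperty (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript("""
        CREATE TABLE ProteinPropertyType (
            protein_property_type_id INTEGER PRIMARY KEY,
            name                     TEXT NOT NULL UNIQUE
        );
        CREATE TABLE ProteinProperty (
            protein_property_id      INTEGER PRIMARY KEY,
            aa_sequence_id           INTEGER NOT NULL,
            protein_property_type_id INTEGER NOT NULL
                REFERENCES ProteinPropertyType (protein_property_type_id),
            value                    REAL NOT NULL
        );
    """)
    conn.executemany("INSERT INTO ProteinPropertyType (name) VALUES (?)",
                     [(n,) for n in PROPERTY_NAMES])
    return conn

def add_property(conn, aa_sequence_id, property_name, value):
    """Attach one property value to a sequence; the name must be in the vocabulary."""
    row = conn.execute(
        "SELECT protein_property_type_id FROM ProteinPropertyType WHERE name = ?",
        (property_name,),
    ).fetchone()
    if row is None:
        raise ValueError("unknown property name: %r" % property_name)
    conn.execute(
        "INSERT INTO ProteinProperty "
        "(aa_sequence_id, protein_property_type_id, value) VALUES (?, ?, ?)",
        (aa_sequence_id, row[0], value),
    )

def values_for(conn, aa_sequence_id, property_name):
    """All stored values of one property for one sequence, in ascending order."""
    rows = conn.execute(
        "SELECT p.value FROM ProteinProperty p "
        "JOIN ProteinPropertyType t "
        "  ON t.protein_property_type_id = p.protein_property_type_id "
        "WHERE p.aa_sequence_id = ? AND t.name = ? ORDER BY p.value",
        (aa_sequence_id, property_name),
    ).fetchall()
    return [r[0] for r in rows]
```

Adding a fifth property later then becomes an insert into the vocabulary table instead of a constraint change.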

>>-Protein features
>> *Signal peptide features (stored in DoTS.SignalPeptideFeature)
>>  This view exists already, as DoTS.SignalPeptideFeature, but we need to add the
>>  ability to store curated data, such as targetting information.  It should be 
>>  straightforward to modify the view to accomodate this, but I'm not sure exactly
>>  what needs to be stored.  Currently we use the view exclusively for SignalP
>>  predictions, and from what I understand SignalP is only concerned with predicting
>>  secreted proteins, meaning that we don't currently have any explicit targeting 
>>  information.  Is this something we could represent using the GO ontology for cellular 
>>  localization?  Do we also need some free text columns?  Let me know and I'll make
>>  the changes.  All the SignalP-specific columns appear to be nullable, so we don't
>>  necessarily have to do anything except add the new columns for the manually curated
>>  information.
>>  
>>
> After talking to the curators it appears that GO component supplements 
> targeting information at the feature level but will not be enough.
> The targeting information is represented by the component ontology in 
> one context i.e. mitochondrial, nuclear, membrane localization but not 
> in the context of the actual residues involved.
> The actual residues involved in the targeting (or any other phenomena) 
> need to be represented by a protein feature ontology that can be mapped 
> onto specific amino acids of a protein.
> This ontology is the equivalent of Sequence Ontology (SO) which is meant 
> for DNA features. It is being prepared by Val Wood with input from 
> Swiss-prot.

OK, so the idea is that the various signal peptides have been classified
into named classes that should be represented by some kind of ontology?

> As you're going to add an extra attribute sequence_ontology_id to the NA 
> Features, could you do the same to any AA Features ?

This will only work if the new ontology is actually part of the Sequence
Ontology (or if we use the SequenceOntology table to store both ontologies.)
Do you know if this is the case?  It's quite possible, since the SO does
already cover amino acid features.  Otherwise we'll have to create a
separate AASequenceOntology (or whatever the new ontology is called).

>> *Transmembrane domain features (stored in DoTS.PredictedAAFeature)
>>  "PlasmoDB web site shows hydrophobicity graphics, where is it stored in GUS?"
>>  The hydrophobicity plots are computed dynamically based on the amino acid sequence.
>>  Transmembrane domains are currently stored in the PredictedAAFeature view, although
>>  I will probably create a new view for them when I get around to eliminating 
>>  PredictedAAFeature.  Another possibility would be to treat TM domains as another
>>  type of domain, and store them in DomainFeature.  What do you think about this?
>>  
>>
> I reckon they could be merged.

OK, sounds good.

>> *Post-translational modification features (new view: DoTS.PostTranslationalModFeature)
>>  Has a "type" column to represent the type of modification.  It was also suggested
>>  that we have a column called "modified_by", which would be a reference to the 
>>  Interaction table.  However, isn't it possible that the same post-translational
>>  modification (e.g., phosphorylation of a specific amino acid) could be the result
>>  of one of several Interactions? 
>>
> yes you're right, the effector could be different. In that case the 
> attribute
> "modified_by" is not useful.
> 
>> This argues for an additional relationship 
>>  between Interaction and PostTranslationalModFeature, unless we're OK creating 
>>  multiple PostTranslationalModFeatures, identical except for their modified_by 
>>  attribute.  Comments on this?
>>  
>>
> I don't think they should be duplicated as they correspond to a unique 
> site. This unique feature would
> be associated with different interaction entries. We might not need an 
> extra table between Interaction and PostTranslationalModFeature though. 
> We still can do the following query : "give me all the interaction 
> entries whose target is a PostTranslationalModFeature whose id is ...".
> How does it sound ?

We could do this, although one question is whether, semantically speaking,
the "target" of an Interaction should be "the thing to be modified" (e.g. an
unphosphorylated sequence or residue) or "the resulting modification" (e.g.
the feature that represents a phosphorylated residue at the appropriate
location.)  The answer is probably that we just shouldn't worry about it
and should just do whatever is most convenient on a case-by-case basis.
To do it "correctly" would be problematic either way.  For example, if we
say that the target is the thing to be modified, then we have to create a
feature that represents a region of sequence that *could* be modified in
some way and then create another feature to represent the actual modification.
But if we say that the target is the result of the modification then we may
have to create equally unusual tables/views.  For example, if the result of
a given interaction is to degrade a protein, then do we have to create a
table/object that represents a degraded protein (or "nothing", or whatever
it is that's left after the modification)?  For now I have no problem with
interpreting the "target" based on context, but in the longer term we may
want to consider separating the notions of "target prior to modification"
and either "target after modification" or "effect of modification".

I also realized belatedly that I could have left the Interaction table
unchanged, rather than introducing specific references to RowSet.  This
would have allowed us to represent either singleton effectors/targets or
set-valued effectors/targets, without having to always join through RowSet
in the singleton case.  On the other hand, if we do associate some
additional information with the RowSets, then the current representation
is correct.
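
For what it's worth, here is roughly what Arnaud's query looks like against the RowSet-based representation, sketched in SQLite. I've replaced the usual table_id/row_id pair with a table_name/row_id pair to keep it short, and the "Protein" table, the ids, and the two kinases are invented for the example. One PostTranslationalModFeature row is the target of two different Interactions, with no duplication of the feature and no extra linking table.

```python
import sqlite3

def build_interaction_db():
    """Toy Interaction/RowSet/RowSetMember tables (simplified, hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
        CREATE TABLE RowSetMember (
            row_set_member_id INTEGER PRIMARY KEY,
            row_set_id        INTEGER NOT NULL REFERENCES RowSet (row_set_id),
            table_name        TEXT NOT NULL,   -- stands in for table_id
            row_id            INTEGER NOT NULL
        );
        CREATE TABLE Interaction (
            interaction_id      INTEGER PRIMARY KEY,
            effector_row_set_id INTEGER NOT NULL REFERENCES RowSet (row_set_id),
            target_row_set_id   INTEGER NOT NULL REFERENCES RowSet (row_set_id)
        );
    """)
    # row sets: 1 = {kinase A}, 2 = {kinase B}, 3 = {PTM feature 500}
    conn.executemany("INSERT INTO RowSet VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO RowSetMember (row_set_id, table_name, row_id) VALUES (?, ?, ?)",
        [(1, "Protein", 100),
         (2, "Protein", 101),
         (3, "PostTranslationalModFeature", 500)],
    )
    # two interactions share the same singleton target set
    conn.executemany("INSERT INTO Interaction VALUES (?, ?, ?)",
                     [(1, 1, 3), (2, 2, 3)])
    return conn

def interactions_targeting(conn, table_name, row_id):
    """Ids of all Interactions whose target row set contains the given row."""
    rows = conn.execute("""
        SELECT DISTINCT i.interaction_id
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.target_row_set_id
        WHERE m.table_name = ? AND m.row_id = ?
        ORDER BY i.interaction_id
    """, (table_name, row_id)).fetchall()
    return [r[0] for r in rows]
```

Note that the join through RowSetMember is needed even though the target here is a singleton, which is the cost I mentioned above.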

>> *AA repeats (new view: RepeatRegionAAFeature)
>>  I called this view RepeatRegionAAFeature in case we want to have a similar view
>>  for NASequences.  I also created only a single view, instead of following Arnaud's
>>  original suggestion, which was for both:
>>
>>      * RepeatRegionFeature as a set of RepeatUnitFeatures,
>>      * RepeatUnitFeature, with the consensus sequence, name and size
>>
>>  I based the design of this view on that of TandemRepeatFeature, which we have for
>>  NASequences already.  Instead of splitting the consensus sequence, name, and size
>>  into a separate table, they occupy columns in RepeatRegionAAFeature.  This works
>>  quite well for the tandem repeats we already have (for DNA sequences.)  However, if
>>  there is a known set of named amino acid sequence repeats, then it would probably
>>  make sense to do what Arnaud suggested, and store these in a separate table 
>>  (likely named RepeatUnit, not RepeatUnitFeature, since they would have no unique 
>>  locations.)  Does this sound reasonable?  That is, put the consensus seqs in the
>>  repeat region table itself if they're anonymous, but if they've been named, then 
>>  store them  in a separate table.  Also note that this view has a reference to 
>>  RepeatType, although the current contents of this table are probably applicable 
>>  only to DNA sequence repeats (LINEs, SINEs, ALUs, etc.), since I believe that I 
>>  parsed them out of RepBase.
>> 
>>
> I proposed a separate repeat feature because one may want to annotate a 
> repeat outside a repeat region, for example LTR repeats attached to a 
> given transposable element. These RepeatFeatures or RepeatUnitFeatures 
> can then have a location.
> The other case is when a repeat region is made of a set of different 
> repeat units.

OK, I didn't realize that this was what you were trying to represent.  As
currently conceived, RepeatRegionAAFeature is meant to represent a region
that contains one or more immediately adjacent copies of the same type
of (amino acid sequence) repeat.  The assumption is also that these regions
will typically be maximal (with respect to the chosen repeat type, consensus,
and max. mismatch, the last of which is not represented directly in the
table.)  We can still represent more complex repeat structures using this
single table, but the representation is implicit, not explicit (i.e. you
have to do a query to find out what other repeats lie within the bounds of
the transposon, meaning that there's no easy way to query for all transposable
elements with a particular flanking LTR structure.)  Do you want to come up
with a 2-table version of what I've done?  The use cases aren't clear enough
in my mind yet for me to be able to do it.  It seems that the bare minimum we
need is just another column in the RepeatRegionAAFeature, parent_id, which
would let us represent explicitly that a particular repeat is a *necessary*
(versus incidental) component of another NA/AAFeature.  Both AAFeatureImp
and NAFeatureImp already have a parent_id, so this would be a straightforward
change.  The queries still might not be terribly efficient, but I don't know
what exactly you wanted to support in terms of queries, versus just making
sure that the representation is sufficiently rich to capture the structure.
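
Here is a small sketch of the difference between the implicit and explicit representations, using SQLite and a cut-down NAFeatureImp. The feature_type, start_pos and end_pos columns and all of the rows are made up for the example; only parent_id is meant to correspond to the real column. The transposon has two LTRs recorded as its children, plus an unrelated repeat that merely happens to fall inside its bounds.

```python
import sqlite3

def build_repeat_db():
    """Toy feature table with a self-referencing parent_id (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE NAFeatureImp (
            na_feature_id INTEGER PRIMARY KEY,
            parent_id     INTEGER REFERENCES NAFeatureImp (na_feature_id),
            feature_type  TEXT NOT NULL,
            name          TEXT,
            start_pos     INTEGER NOT NULL,
            end_pos       INTEGER NOT NULL
        );
    """)
    conn.executemany("INSERT INTO NAFeatureImp VALUES (?, ?, ?, ?, ?, ?)", [
        (1, None, "transposable_element", "TE1", 1000, 6000),
        (2, 1,    "repeat", "LTR", 1000, 1300),            # necessary component
        (3, 1,    "repeat", "LTR", 5700, 6000),            # necessary component
        (4, None, "repeat", "simple repeat", 3000, 3050),  # incidental overlap
    ])
    return conn

def component_repeats(conn, parent_id):
    """Explicit: repeats recorded as children of the given feature."""
    rows = conn.execute(
        "SELECT name FROM NAFeatureImp "
        "WHERE parent_id = ? AND feature_type = 'repeat' ORDER BY start_pos",
        (parent_id,),
    ).fetchall()
    return [r[0] for r in rows]

def contained_repeats(conn, feature_id):
    """Implicit: any repeat whose coordinates lie within the feature's bounds."""
    rows = conn.execute("""
        SELECT c.name
        FROM NAFeatureImp p
        JOIN NAFeatureImp c
          ON c.start_pos >= p.start_pos AND c.end_pos <= p.end_pos
        WHERE p.na_feature_id = ? AND c.feature_type = 'repeat'
        ORDER BY c.start_pos
    """, (feature_id,)).fetchall()
    return [r[0] for r in rows]

def elements_with_component(conn, repeat_name, min_copies):
    """Names of parent features having at least min_copies child repeats of that name."""
    rows = conn.execute("""
        SELECT p.name
        FROM NAFeatureImp p
        JOIN NAFeatureImp c ON c.parent_id = p.na_feature_id
        WHERE c.feature_type = 'repeat' AND c.name = ?
        GROUP BY p.na_feature_id, p.name
        HAVING COUNT(*) >= ?
        ORDER BY p.name
    """, (repeat_name, min_copies)).fetchall()
    return [r[0] for r in rows]
```

With parent_id in place, "all transposable elements with two LTRs" is a plain join and group-by, whereas the coordinate-based query cannot tell the LTRs apart from the incidental repeat.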

> In any case, NA repeats and AA repeats should have the same design. Just 
> the controlled vocabulary representing the types of repeats will be 
> different.

Absolutely, yes, although one question is whether AA repeats can have the
same kind of nested structure that you mention as a possibility for NA
repeats (the transposon with LTRs).  I don't know the answer to this.

>>-DoTS.Interaction (table modified, dependent tables added)
>> *Added "has_direction" column, as discussed previously.  The idea here is that
>>  not all interactions (particularly physical ones, e.g., dimerization) have a
>>  direction.  If has_direction == 0, then the value of direction_is_known can
>>  be ignored.
>> *Added non-nullable "effector_action_type_id" column, referencing 
>>  DoTS.EffectorActionType (a new table.)  This column/table represents the possible
>>  things that an effector can do to a target.  For example, the InteractionType
>>  associated with the Interaction could be "binds to" (e.g., a promoter region), and
>>  the EffectorActionType for that Interaction could be to either "inhibit" or "enhance"
>>  expression of the corresponding gene.
>> *Replaced effector_table_id and effector_row_id with effector_row_set_id, and
>>  similarly for the target_table_id and target_row_id.  This allows us to represent
>>  the interaction of a set of objects (the effector) with another set of objects
>>  (the target.)  Previously the Interaction table could only represent the interaction
>>  between a single pair of entities (OK if they happened to be Complexes, for example,
>>  but a potential problem in other situations.)  Now both effector and target are 
>>  represented as references to DoTS.RowSet, which in turn references DoTS.RowSetMember,
>>  which...in turn...references the individual database rows that comprise the effector
>>  or target.  These tables (RowSet and RowSetMember) are essentially the same as 
>>  Complex and ComplexComponent, except that they are totally generic; they can be 
>>  used to group any set of rows in the database and they store no additional information.  
>>  However, if there are any additional columns that we can think of (that are specific 
>>  to Interactions) then these tables should be replaced by less generic ones (e.g. 
>>  InteractingEntitySet or InteractionSet, or something along those lines.)
>>  
>>
> Sounds fine. The only thing I can see is regarding the 
> EffectorActionType. If each effector, member of a RowSet, has a 
> different action type, the attribute, effector_action_type_id, should go 
> in the RowSetMember table. I don't have any example though.

OK, I think I'd be inclined to wait until we have some use cases for this.
Although the current schema lets us group effectors together, it doesn't let
us say (for example) that E1 interacts *directly* with T1 to phosphorylate
it, but that E1's active site is only exposed when E1 is bound to E2.  In
other words, E1's role in the activity can be viewed as "primary", and E2's
role is secondary (in some sense) but all we can say in the schema is that
the Complex consisting of E1 and E2 interacts with T1 to phosphorylate it.
I think that the solution we have now is OK, but it only lets us represent
the overall action of the entire set of effectors.
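
To illustrate the limitation, here is a sketch in SQLite with the tables cut down to the relevant columns. The member_name column stands in for the real table_id/row_id reference, and "phosphorylate" as an EffectorActionType is just an assumed example value. Because effector_action_type_id lives on Interaction, every member of the effector set comes back with the same action, and nothing distinguishes E1's role from E2's.

```python
import sqlite3

def build_effector_db():
    """Toy schema with the action type stored per Interaction (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE EffectorActionType (
            effector_action_type_id INTEGER PRIMARY KEY,
            name                    TEXT NOT NULL
        );
        CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
        CREATE TABLE RowSetMember (
            row_set_id  INTEGER NOT NULL REFERENCES RowSet (row_set_id),
            member_name TEXT NOT NULL   -- stands in for table_id/row_id
        );
        CREATE TABLE Interaction (
            interaction_id          INTEGER PRIMARY KEY,
            effector_row_set_id     INTEGER NOT NULL REFERENCES RowSet (row_set_id),
            target_row_set_id       INTEGER NOT NULL REFERENCES RowSet (row_set_id),
            effector_action_type_id INTEGER NOT NULL
                REFERENCES EffectorActionType (effector_action_type_id),
            has_direction           INTEGER NOT NULL
        );
    """)
    conn.execute("INSERT INTO EffectorActionType VALUES (1, 'phosphorylate')")
    conn.executemany("INSERT INTO RowSet VALUES (?)", [(1,), (2,)])
    # effector set 1 = {E1, E2}; target set 2 = {T1}
    conn.executemany("INSERT INTO RowSetMember VALUES (?, ?)",
                     [(1, "E1"), (1, "E2"), (2, "T1")])
    conn.execute("INSERT INTO Interaction VALUES (1, 1, 2, 1, 1)")
    return conn

def effector_actions(conn, interaction_id):
    """(member_name, action) for each effector; the action is per interaction, not per member."""
    return conn.execute("""
        SELECT m.member_name, a.name
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.effector_row_set_id
        JOIN EffectorActionType a
          ON a.effector_action_type_id = i.effector_action_type_id
        WHERE i.interaction_id = ?
        ORDER BY m.member_name
    """, (interaction_id,)).fetchall()
```

Moving effector_action_type_id down to RowSetMember, as Arnaud suggests, would let the two rows differ, but I'd still rather wait for a real use case before doing so.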

Jonathan

-- 
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
215-573-3115