Re: [Gusdev-gusdev] GUS 3.0 schema changes

  Hi Jonathan

Thanks for doing this. Please find below some comments I've inserted.

I have noticed that your changes don't cover the DNA/RNA features. Is 
there any reason for this? I know there are quite a lot of them, and 
there might be another way of storing some information, such as 
telomere or centromere regions, origins of replication, inflection points, 
etc. All these features are covered by the Sequence Ontology, so a new 
ChromosomeElement or ChromosomeRegion feature could be generic enough to 
cover most of them.
Let me know what you think.

cheers
Arnaud

Jonathan Crabtree wrote:

>Hi all-
>
>The attached text file describes the schema changes that I just finished
>implementing.  It's attached as a separate file to avoid problems with the
>mail clients changing the line wrapping.  Sorry if there are any typos,
>but it's getting late and I want to get this out there for everyone to
>look at in the morning.
>
>Jonathan
>
>  
>
>------------------------------------------------------------------------
>
>
>Hi all-
>
>Here are the schema changes that I've just finished implementing:
>
>-Protein properties (new table: DoTS.ProteinProperty)
> A new table that Arnaud requested back in July, but was overlooked in the earlier
> schema changes.  There are four possible protein properties as represented by the
> following constraint (we could instead have a ProteinPropertyType table and treat
> this as a controlled vocabulary):
>
> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 check 
> (property_name in ('isoelectric point', 'molecular mass', 'charge', 'average residue mass'));
>
> The table allows multiple protein properties of the same type to be associated with
> entries in DoTS.AASequenceImp.  Arnaud had suggested originally that the last 
> property, average residue mass, could actually be an attribute of the table that 
> stores the protein sequence itself.  However, it seemed that if the molecular 
> mass attribute could have multiple values (e.g., from different experiments) then
> the same should be true of the average residue mass, which is essentially a 
> derived property.  Let me know if you disagree with this, or think we should 
> create an explicit controlled vocab. for these 4 properties.
>  
>
A controlled vocabulary table with the four attributes you've mentioned 
is fine.
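
A minimal sketch of what such a controlled vocabulary could look like, 
replacing the CHECK constraint with a lookup table. It uses Python's 
built-in sqlite3 module only so that it runs anywhere. The table name 
ProteinPropertyType comes from your message; the column names and the 
add_property helper are assumptions of mine, not the GUS 3.0 schema.

```python
import sqlite3

# Hypothetical sketch: every column name below is an assumption,
# only the four property names come from the proposed constraint.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE ProteinPropertyType (
    protein_property_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE ProteinProperty (
    protein_property_id INTEGER PRIMARY KEY,
    aa_sequence_id INTEGER NOT NULL,
    protein_property_type_id INTEGER NOT NULL
        REFERENCES ProteinPropertyType (protein_property_type_id),
    value REAL NOT NULL
);
""")

PROPERTY_NAMES = [
    "isoelectric point",
    "molecular mass",
    "charge",
    "average residue mass",
]
conn.executemany(
    "INSERT INTO ProteinPropertyType (name) VALUES (?)",
    [(name,) for name in PROPERTY_NAMES],
)


def add_property(aa_sequence_id, name, value):
    """Attach one property value to a sequence; reject unknown property names."""
    row = conn.execute(
        "SELECT protein_property_type_id FROM ProteinPropertyType WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown protein property: {name}")
    conn.execute(
        "INSERT INTO ProteinProperty "
        "(aa_sequence_id, protein_property_type_id, value) VALUES (?, ?, ?)",
        (aa_sequence_id, row[0], value),
    )


# Several values of the same type (e.g. from different experiments)
# can be attached to the same sequence.
add_property(1, "molecular mass", 52340.0)
add_property(1, "molecular mass", 52410.5)
```

With this layout, adding a fifth property type is a new row in 
ProteinPropertyType rather than an ALTER TABLE on the constraint.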

>-Protein features
> *Signal peptide features (stored in DoTS.SignalPeptideFeature)
>  This view exists already, as DoTS.SignalPeptideFeature, but we need to add the
>  ability to store curated data, such as targeting information.  It should be 
>  straightforward to modify the view to accommodate this, but I'm not sure exactly
>  what needs to be stored.  Currently we use the view exclusively for SignalP
>  predictions, and from what I understand SignalP is only concerned with predicting
>  secreted proteins, meaning that we don't currently have any explicit targeting 
>  information.  Is this something we could represent using the GO ontology for cellular 
>  localization?  Do we also need some free text columns?  Let me know and I'll make
>  the changes.  All the SignalP-specific columns appear to be nullable, so we don't
>  necessarily have to do anything except add the new columns for the manually curated
>  information.
>  
>
After talking to the curators, it appears that the GO component ontology 
supplements targeting information at the feature level but will not be enough.
The targeting information is represented by the component ontology in 
one context only, i.e. mitochondrial, nuclear, or membrane localization, but 
not in the context of the actual residues involved.
The actual residues involved in the targeting (or any other phenomenon) 
need to be represented by a protein feature ontology that can be mapped onto 
specific amino acids of a protein.
This ontology is the equivalent of the Sequence Ontology (SO), which is meant 
for DNA features. It is being prepared by Val Wood with input from 
Swiss-Prot.

As you're going to add an extra attribute, sequence_ontology_id, to the NA 
Features, could you do the same for the AA Features?

> *Domain/motif features (new view: DoTS.DomainFeature)
>  I've created this as a view on AAFeatureImp.  You can either use the NAME column to
>  specify the type of domain (e.g., "leucine zipper" or "coiled coil"), or include
>  an explicit reference to a domain/motif database (SMART, ProSite) using the 
>  external_database_release_id and source_id columns.  PFam is handled as a special
>  case, with a specific pfam_entry_id column that references the PfamEntry table.  
>  This was originally done because the entries in the PFam database are HMMs, so 
>  they don't fit too well in the sequence-related tables.  Most other motif databases
>  have consensus sequences for their motifs that we can store in MotifAASequence.
>
>  Note that motif/domain features are currently stored in GUS in the PredictedAAFeature
>  table, which is also a view on AAFeatureImp.  After the migration I plan to eliminate
>  the PredictedAAFeature view and move its contents into feature-specific tables (like
>  DomainFeature) instead.
>
> *Transmembrane domain features (stored in DoTS.PredictedAAFeature)
>  "PlasmoDB web site shows hydrophobicity graphics, where is it stored in GUS?"
>  The hydrophobicity plots are computed dynamically based on the amino acid sequence.
>  Transmembrane domains are currently stored in the PredictedAAFeature view, although
>  I will probably create a new view for them when I get around to eliminating 
>  PredictedAAFeature.  Another possibility would be to treat TM domains as another
>  type of domain, and store them in DomainFeature.  What do you think about this?
>  
>
I reckon they could be merged.

> *Post-translational modification features (new view: DoTS.PostTranslationalModFeature)
>  Has a "type" column to represent the type of modification.  It was also suggested
>  that we have a column called "modified_by", which would be a reference to the 
>  Interaction table.  However, isn't it possible that the same post-translational
>  modification (e.g., phosphorylation of a specific amino acid) could be the result
>  of one of several Interactions? 
>
Yes, you're right: the effector could be different. In that case the 
"modified_by" attribute is not useful.

> This argues for an additional relationship 
>  between Interaction and PostTranslationalModFeature, unless we're OK creating 
>  multiple PostTranslationalModFeatures, identical except for their modified_by 
>  attribute.  Comments on this?
>  
>
I don't think they should be duplicated, as they correspond to a unique 
site. This unique feature would be associated with several Interaction 
entries. We might not need an extra table between Interaction and 
PostTranslationalModFeature, though. We can still run the following query: 
"give me all the Interaction entries whose target is the 
PostTranslationalModFeature with id ...".
How does that sound?
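
A rough sketch of that query, assuming the RowSet/RowSetMember design 
described further down in your message. The column names on RowSetMember 
(in particular table_name, where GUS would more likely reference 
Core.TableInfo by id), the sample ids, and the interactions_targeting 
helper are all hypothetical; sqlite3 is used only to make the example 
runnable.

```python
import sqlite3

# Hypothetical, simplified tables: column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
CREATE TABLE RowSetMember (
    row_set_member_id INTEGER PRIMARY KEY,
    row_set_id INTEGER NOT NULL REFERENCES RowSet (row_set_id),
    table_name TEXT NOT NULL,
    row_id INTEGER NOT NULL
);
CREATE TABLE Interaction (
    interaction_id INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet (row_set_id),
    target_row_set_id INTEGER NOT NULL REFERENCES RowSet (row_set_id)
);

-- Two different effectors (proteins 11 and 12) acting on the same
-- modification site (feature 501), plus one unrelated site (502).
INSERT INTO RowSet (row_set_id) VALUES (1), (2), (3), (4);
INSERT INTO RowSetMember (row_set_id, table_name, row_id) VALUES
    (1, 'Protein', 11),
    (2, 'Protein', 12),
    (3, 'PostTranslationalModFeature', 501),
    (4, 'PostTranslationalModFeature', 502);
INSERT INTO Interaction VALUES (100, 1, 3), (101, 2, 3), (102, 1, 4);
""")


def interactions_targeting(table_name, row_id):
    """Return ids of all Interactions whose target set contains the given row."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.interaction_id
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.target_row_set_id
        WHERE m.table_name = ? AND m.row_id = ?
        ORDER BY i.interaction_id
        """,
        (table_name, row_id),
    )
    return [r[0] for r in rows]


print(interactions_targeting("PostTranslationalModFeature", 501))
```

The single feature row 501 is stored once and both interactions point at 
it through their target row sets, so no duplicate 
PostTranslationalModFeatures are needed.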

> *AA repeats (new view: RepeatRegionAAFeature)
>  I called this view RepeatRegionAAFeature in case we want to have a similar view
>  for NASequences.  I also created only a single view, instead of following Arnaud's
>  original suggestion, which was for both:
>
>      * RepeatRegionFeature as a set of RepeatUnitFeatures,
>      * RepeatUnitFeature, with the consensus sequence, name and size
>
>  I based the design of this view on that of TandemRepeatFeature, which we have for
>  NASequences already.  Instead of splitting the consensus sequence, name, and size
>  into a separate table, they occupy columns in RepeatRegionAAFeature.  This works
>  quite well for the tandem repeats we already have (for DNA sequences.)  However, if
>  there is a known set of named amino acid sequence repeats, then it would probably
>  make sense to do what Arnaud suggested, and store these in a separate table 
>  (likely named RepeatUnit, not RepeatUnitFeature, since they would have no unique 
>  locations.)  Does this sound reasonable?  That is, put the consensus seqs in the
>  repeat region table itself if they're anonymous, but if they've been named, then 
>  store them  in a separate table.  Also note that this view has a reference to 
>  RepeatType, although the current contents of this table are probably applicable 
>  only to DNA sequence repeats (LINEs, SINEs, ALUs, etc.), since I believe that I 
>  parsed them out of RepBase.
> 
>
I proposed a separate repeat feature because one may want to annotate a 
repeat outside a repeat region, for example LTR repeats attached to a 
given transposable element. These RepeatFeatures or RepeatUnitFeatures 
can then have a location.
The other case is when a repeat region is made of a set of different 
repeat units.

In any case, NA repeats and AA repeats should have the same design. Only 
the controlled vocabulary representing the types of repeats will be 
different.

> *2D structures (not currently represented)  
>  "Another question : What about 2D structures (beta-sheet and alpha-helix) in GUS?"
>  I don't *believe* we have any of these.  They should be easy to add as either a
>  single feature view, or a set of views.
>  
>
fine.

>-DoTS.Interaction (table modified, dependent tables added)
> *Added "has_direction" column, as discussed previously.  The idea here is that
>  not all interactions (particularly physical ones, e.g., dimerization) have a
>  direction.  If has_direction == 0, then the value of direction_is_known can
>  be ignored.
> *Added non-nullable "effector_action_type_id" column, referencing 
>  DoTS.EffectorActionType (a new table.)  This column/table represents the possible
>  things that an effector can do to a target.  For example, the InteractionType
>  associated with the Interaction could be "binds to" (e.g., a promoter region), and
>  the EffectorActionType for that Interaction could be to either "inhibit" or "enhance"
>  expression of the corresponding gene.
> *Replaced effector_table_id and effector_row_id with effector_row_set_id, and
>  similarly for the target_table_id and target_row_id.  This allows us to represent
>  the interaction of a set of objects (the effector) with another set of objects
>  (the target.)  Previously the Interaction table could only represent the interaction
>  between a single pair of entities (OK if they happened to be Complexes, for example,
>  but a potential problem in other situations.)  Now both effector and target are 
>  represented as references to DoTS.RowSet, which in turn references DoTS.RowSetMember,
>  which...in turn...references the individual database rows that comprise the effector
>  or target.  These tables (RowSet and RowSetMember) are essentially the same as 
>  Complex and ComplexComponent, except that they are totally generic; they can be 
>  used to group any set of rows in the database and they store no additional information.  
>  However, if there are any additional columns that we can think of (that are specific 
>  to Interactions) then these tables should be replaced by less generic ones (e.g. 
>  InteractingEntitySet or InteractionSet, or something along those lines.)
>  
>
Sounds fine. My only concern is the EffectorActionType. If the effectors 
that are members of the same RowSet can each have a different action 
type, then the effector_action_type_id attribute should go in the 
RowSetMember table instead. I don't have an example of this, though.
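
To make the alternative concrete, here is a small sketch with the action 
type attached to each member of the effector set rather than to the 
Interaction. It is purely illustrative: the column names, the sample 
rows, and the effector_actions helper are assumptions, and sqlite3 is 
used only so that it can be run as is.

```python
import sqlite3

# Hypothetical layout: effector_action_type_id lives on each set member.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EffectorActionType (
    effector_action_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE RowSetMember (
    row_set_member_id INTEGER PRIMARY KEY,
    row_set_id INTEGER NOT NULL,
    table_name TEXT NOT NULL,
    row_id INTEGER NOT NULL,
    effector_action_type_id INTEGER
        REFERENCES EffectorActionType (effector_action_type_id)
);
INSERT INTO EffectorActionType VALUES (1, 'inhibit'), (2, 'enhance');
-- One effector set (row_set_id 10) whose two members act differently.
INSERT INTO RowSetMember VALUES
    (1, 10, 'Protein', 11, 1),
    (2, 10, 'Protein', 12, 2);
""")


def effector_actions(row_set_id):
    """List (row_id, action name) for every member of an effector set."""
    rows = conn.execute(
        """
        SELECT m.row_id, t.name
        FROM RowSetMember m
        JOIN EffectorActionType t
          ON t.effector_action_type_id = m.effector_action_type_id
        WHERE m.row_set_id = ?
        ORDER BY m.row_id
        """,
        (row_set_id,),
    )
    return rows.fetchall()


print(effector_actions(10))
```

Note that this puts an Interaction-specific column on an otherwise 
generic table, which is the situation where you suggested switching to a 
less generic table such as InteractingEntitySet.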

>-DoTS.Attribution (new table)
> A table intended to allow us to attribute data to people and/or organizations, using
> the Contact table.  It is a many-to-1 relationship between SRes.Contact and any row in 
> the DoTS schema.
>  
>
Fine. We already agreed on this implementation.

>-SRes.BibRefType (new table), SRes.BibliographicReference (modified table)
> Added new table, BibRefType, to represent the different types of references/publications
> that one might encounter.  I've populated this table based on a combination of the terms
> used in MEDLINE 2003 and those from FlyBase, as well as one or two rows of my own
> devising.  Correspondingly, a non-nullable column has been added to BibliographicReference
> to allow one to specify the BibRefType of the reference.  I've also added a contact_id
> column to BibliographicReference, to be used in the case where the BibRefType == 
> "personal communication".  You can find the rows that I've placed in BibRefType in the
> 3.0 db creation scripts mentioned below (in the file gus30-sres-BibRefType-rows.sql).
>
>I think those are the main changes.  I also wrote a couple of scripts to help check
>and maintain the version tables (those ending in "Ver") and to check that the actual
>database schema and the information stored in Core.TableInfo actually agree.  In the
>process I fixed a number of problems, although there are still some things to be done,
>such as:
>
> -Create SEQUENCE objects for all the tables (or at least modify the database dump
>  script to generate CREATE SEQUENCE statements for all the tables)
> -Check that all the foreign key constraints have been defined correctly
> -Check that all the foreign key columns are indexed correctly (I have a script 
>  that will do this)
> -Add sequence_ontology_id to all the views on NAFeatureImp.
>
>I've updated the schema browser, so you should be able to see all the new and 
>modified tables online:
>
>  <http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30>
>
>There's also a preliminary dump of the create database scripts, which should be consistent
>with what's shown in the schema browser:
>
> <http://www.cbil.upenn.edu/downloads/GUS/releases/3.0-beta/schema/>
>
>Jonathan
>
>
>  
>