From: Jonathan C. <cra...@sn...> - 2003-01-17 05:41:27
Arnaud -

> > Which DNA/RNA features do you mean (other than those mentioned above)?
>
> The file I sent you should include views on the top of NAFeatureImp
> table. Here the list :

Yes, you're absolutely right; there was a period when I wasn't paying very
close attention to the schema mailing list, and I'm afraid I misplaced a
couple of the files you sent, at least temporarily. I believe I've now added
all the views and tables that you originally proposed, with some minor
modifications to take into account discussions we've had since then. See the
attached text file for a complete list of the changes I've made this time
around.

> Yes we had! So regarding chromosome regions, shall we keep
> TelomereFeature and CentromereFeature ?

No, I think we should use ChromosomeElementFeature instead; I've created this
view based on the ChromosomeElement view you suggested, but with a couple of
additional columns to handle the data currently in gusdev.TelomereFeature and
gusdev.CentromereFeature.

> > At the other extreme, we could continue what we're doing now, i.e. using
> > an ad-hoc classification of features based on the data we actually have
> > available, and just make sure that every feature is tagged with the
> > correct sequence ontology term. Any thoughts?
>
> It makes sense as SO may undergo revisions this year.

OK, as noted in the attachment, I've added sequence_ontology_id to *all*
views of NAFeatureImp and AAFeatureImp.

> >> A controlled vocabulary table with the four attributes you've
> >> mentioned is fine.

Done; it's called ProteinPropertyType, and the schema/contents are described
in the attached list of changes.

> >> As you're going to add a extra attribute sequence_ontology_id to the
> >> NA Features, could you do the same to any AA Features ?

OK, done.

> The way the SignalPeptideFeature is designed make difficult the
> annotation of localization signal features.
> We can leave
> SignalPeptideFeature as it is as it fits with SignalP software
> prediction and in the future create a new feature LocalizationSignalFeature.

OK, based on our discussion today the only change I've made to
SignalPeptideFeature is to add the sequence_ontology_id, which can be used to
reference the different localization ontology terms that you mentioned. A
column has been added to SequenceOntology to let us store multiple ontologies
(and versions thereof) in the same table. Experimental evidence, references,
and annotator's comments can be linked to SignalPeptideFeature (or a future
LocalizationSignalFeature view) using DoTS.Evidence.

> >> I reckon they could be merged.

(This comment was in reference to incorporating TM domain features into the
DomainFeature view.) I've added a "number_of_domains" column to DomainFeature
to permit this. We will *not* have a separate view specifically for TM domain
features.

> > I also realized belatedly that I could have left the Interaction table
> > unchanged, rather than introducing specific references to RowSet. This
> > would have allowed us to represent either singleton effectors/targets or
> > set-valued effectors/targets, without having to always join through
> > RowSet in the singleton case. On the other hand, if we do associate some
> > additional information with the RowSets, then the current representation
> > is correct.
>
> It depends if we want to represent many-to-many relationship between
> interaction and members of this interaction. Without the RowSet table,
> we can't assign a set of several effectors/targets, right ? Unless we
> consider that this set of effectors are being part of a complex and act
> as the whole.

It's true that without the RowSet table we can't assign a set of several
effectors or targets.
What I was trying to say was that I replaced the following columns in
DoTS.Interaction:

    effector_table_id
    effector_row_id

(or something to that effect) using instead a single column that references
a RowSet:

    effector_row_set_id

However, I could have left the Interaction table unchanged, and used the
effector_table_id and effector_row_id to reference entries in the RowSet
table (in the case where there are multiple effectors.) With this approach
one would have the choice of either using or not using the RowSet table on a
case-by-case basis. I don't think it's too important which way we do this; on
the one hand you save a join when you only need to reference a single
effector/target (using the table_id/row_id approach) but on the other hand
with the row_set_id approach you can write uniform code and also have an
enforceable referential integrity constraint. So barring any strong
objection, I'll leave the table as it is now (i.e., with explicit references
to RowSet, meaning that you always have to have a RowSet even when the
effector or target is a single object.)

> A case we came across here for Tbrucei is nested repeat regions (at the
> DNA level). Each repeat region has coordinates and is annotated with a
> unique repeat unit type. This repeat region can be within a bigger
> repeat region annotated with a different repeat unit type.
> ... which is in other words your suggestion with parent_id as an extra
> attribute ...

I haven't added the parent_id yet, but I'll do so.

> Regarding transposon repeat types, if we have a TransposableElement
> feature and its type is given as an attribute, a repeat feature will
> just be useful to locate the LTRs within a given transposable element.
> Can we keep this functionality ? Then the feature will be simple, just
> repeat_type and parent_id attributes.

Are you saying that we still need the two tables/features, one for
RepeatFeature, the other for RepeatRegionFeature? Could you give me a
specific example of how you would envision using these tables (and also these
tables in conjunction with the TransposableElement view, under the assumption
that they're all equipped with parent_ids)?

> Let's leave the design as it is for now. Curators are not going to
> curate interactions data in the short term. We shall come back later
> with more precise ideas/use cases about them.

Sounds good.

Let me know if there's anything I've missed. I'll try to generate updated SQL
scripts tomorrow, and also update the schema browser so that everyone can
review the changes one last time.

Cheers,
Jonathan
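
The row_set_id approach that Jonathan settles on above can be sketched roughly as follows. This is only an illustration, assuming a heavily simplified layout for Interaction, RowSet and RowSetMember (the real GUS 3.0 DDL is not shown in this thread), and using Python's built-in sqlite3 module in place of Oracle. The table_id and row_id values and the members() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
CREATE TABLE RowSetMember (
    row_set_member_id INTEGER PRIMARY KEY,
    row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    table_id   INTEGER NOT NULL,  -- which table the member row lives in
    row_id     INTEGER NOT NULL   -- primary key of the member row
);
CREATE TABLE Interaction (
    interaction_id      INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    target_row_set_id   INTEGER NOT NULL REFERENCES RowSet(row_set_id)
);
""")

# Hypothetical data: an effector set of two rows and a singleton target.
# Even the single target needs its own RowSet in this design.
conn.executemany("INSERT INTO RowSet VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO RowSetMember (row_set_id, table_id, row_id) VALUES (?, ?, ?)",
    [(1, 10, 101), (1, 10, 102), (2, 10, 201)],
)
conn.execute("INSERT INTO Interaction VALUES (1, 1, 2)")


def members(conn, interaction_id, role):
    """Return (table_id, row_id) pairs for the effector or target set."""
    col = {"effector": "effector_row_set_id",
           "target": "target_row_set_id"}[role]
    sql = f"""SELECT m.table_id, m.row_id
              FROM Interaction i
              JOIN RowSetMember m ON m.row_set_id = i.{col}
              WHERE i.interaction_id = ?
              ORDER BY m.row_id"""
    return conn.execute(sql, (interaction_id,)).fetchall()


effectors = members(conn, 1, "effector")
targets = members(conn, 1, "target")

# The enforceable constraint: an Interaction cannot point at a missing RowSet.
try:
    conn.execute("INSERT INTO Interaction VALUES (2, 99, 2)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True
```

The same join serves singleton and set-valued effectors/targets, which is the "uniform code" benefit mentioned above. The cost is the extra join through RowSetMember in the singleton case.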
From: Arnaud K. <ax...@sa...> - 2003-01-16 15:16:21
Hi Joan

I'll get the new controlled vocabularies ready for population. If you're
planning to use the UpdateFromXML.pm plugin for populating GUS I should have
examples.
Regarding ComplexType it should be covered by GO component. Regarding
InteractionType, we need to find a controlled vocabulary which I'm not aware
of yet !

cheers
Arnaud

mazz wrote:

>Hi Jonathan,
>
>Perhaps we can ask Matt to revisit his documentation plugin. There are probably
>additional changes he will have to make for its use with GUS30 now.
>Also, I can send Arnaud an example of the XML for a table. We can use the XML to
>populate the rows of the controlled vocabulary tables (ids, terms (names), and
>definitions (descriptions)).
>
>Joan
>
>Jonathan Crabtree wrote:
>
>>Hi Joan-
>>
>>Arnaud did supply us with documentation (attached) for the new Phenotype tables,
>>but I just haven't loaded it into the database yet (I've also been quite busy :))
>>I started working on updating the documentation a couple of days ago, but in the
>>process discovered that there are some invalid rows in core.DatabaseDocumentation
>>that should be corrected first. A query shows that there are 73 rows in this
>>table that reference nonexistent columns in GUS 3.0. For the most part I think
>>that these are relatively minor problems stemming from the fact that the schema
>>has been updated more recently than the documentation. However, there are also
>>a few rows that suggest we need to improve the plugin and/or procedure used to
>>populate this table. For example, the following rows have spaces in the column
>>name (attribute_name), probably because the input files were invalid and the plugin
>>has no restrictions on the format of the attribute_name:
>>
>>DATABASE_DOCUMENTATION_ID
>>-------------------------
>>ATTRIBUTE_NAME
>>--------------------------------------------------------------------------------
>>                     1419
>>bio_material_id fk to LabelledExtract view of BioMaterial
>>
>>                     1103
>>bio_source_characteristic_id primary key
>>
>>                     1120
>>treatment_id fk to Treatment
>>
>>DATABASE_DOCUMENTATION_ID
>>-------------------------
>>ATTRIBUTE_NAME
>>--------------------------------------------------------------------------------
>>                     1374
>>review_status_id The identifer of the review status
>>
>>                     1418
>>assay_id fk to Assay
>>
>>                     1373
>>synonym_name The gene symbol
>>
>>6 rows selected.
>>
>>Also, as an aside (and not a comment to you in particular), it strikes me that
>>column "documentation" of the form "fk to Table X" and "Primary key" could be
>>generated automatically from the schema. However, comments on foreign keys
>>are useful if they identify the specific subclass (i.e. view) to which the
>>reference is expected to link, or if they explain what the referenced value is
>>used for (if not obvious). Anyway, since there are still some minor schema
>>changes taking place, I think that next week might be a good time to worry
>>about updating all the documentation, since the database will be locked down
>>for the migration at that point anyway. As for the controlled vocabularies,
>>I think you're right, and we should try to populate these as soon as we can,
>>even if it will be an iterative process in some cases.
>>
>>Jonathan
>>
>>--
>>Jonathan Crabtree
>>Center for Bioinformatics, University of Pennsylvania
>>1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021
>>215-573-3115
>>
>> ------------------------------------------------------------------------
>> Name: gus.phenotype_draft.doc
>> Type: Winword File (application/msword)
>> Encoding: base64
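
Jonathan's aside in the quoted text, that documentation of the form "fk to Table X" and "primary key" could be generated from the schema, might look something like the sketch below. It is only an illustration: it reads SQLite's catalog through PRAGMA table_info and PRAGMA foreign_key_list, whereas GUS runs on Oracle and would need the equivalent data-dictionary queries. The two example tables and the generate_column_docs() helper are hypothetical; only the column names treatment_id and bio_material_id come from the thread.

```python
import sqlite3


def generate_column_docs(conn, table):
    """Return {column_name: text} for primary-key and foreign-key columns.

    `table` is interpolated into the PRAGMA, so it must be a trusted name.
    """
    docs = {}
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    for _cid, name, _type, _notnull, _default, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        if pk:
            docs[name] = "primary key"
    # foreign_key_list rows: (id, seq, table, from, to, on_update, ...)
    for row in conn.execute(f"PRAGMA foreign_key_list({table})"):
        referenced_table, from_column = row[2], row[3]
        docs[from_column] = f"fk to {referenced_table}"
    return docs


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BioMaterial (
    bio_material_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE Treatment (
    treatment_id    INTEGER PRIMARY KEY,
    bio_material_id INTEGER NOT NULL REFERENCES BioMaterial(bio_material_id),
    name TEXT
);
""")

treatment_docs = generate_column_docs(conn, "Treatment")
```

As the quoted message points out, generated text of this kind cannot say which view of a table a foreign key is expected to reference, so hand-written comments would still be needed for those cases.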
From: Arnaud K. <ax...@sa...> - 2003-01-16 14:02:56
Hi Jonathan

Jonathan Crabtree wrote:

> Arnaud-
>
> Thanks for the feedback; I think we're getting close to agreement here.

I think so too !

>> I have noticed that your changes don't cover the DNA/RNA features. Is
>> there any reason for this ? I know there are quite a lot of them and
>> there might be another way of storing data some information such as
>> telomere or centromere regions, origin of replication, inflection
>> point etc. All these features are covered by Sequence Ontology, so a
>> new ChromosomeElement or ChromosomeRegion feature could be generic
>> enough to cover most of them.
>> Let me know what you think.
>
> Which DNA/RNA features do you mean (other than those mentioned above)?

The file I sent you should include views on the top of NAFeatureImp table.
Here the list :

* ChromosomeElement or we can keep CentromereFeature and TelomereFeature as
  they are in gusdev - IMPORTANT
* InfectionPointFeature
* ReplicationFeature, for annotated origins of replication
* RNARegulatory - as there is a DNARegulatory feature => regulatory element
  at the RNA level
* RNASecondaryStructure
* SpliceSiteFeature
* TransposableElement

+ an extra attribute in RestrictionFragmentFeature, "type_of_cut" (Sticky or
  blunt)
+ an extra attribute in GeneSynonym, "is_obsolete"
+ a new view on the top of NASequenceImp, "GenomicSequence" instead of the
  existing one, ExternalNASequence.

I can send the files to you if you want.

> It's possible that I misplaced the e-mail or notes where we discussed
> these. Or are you just saying that we will eventually have a view for
> each type of DNA/RNA feature in the Sequence Ontology? I think that
> this is true, although I hadn't planned to make the change immediately,
> since I believe we had agreed on a "transitional" period in which the
> various NAFeature views would first be given a nullable
> sequence_ontology_id

Yes we had! So regarding chromosome regions, shall we keep TelomereFeature
and CentromereFeature ?
> and we would then decide how to best rearrange the views to more closely
> match the ontology terms. I haven't added the sequence_ontology_id
> column to the NAFeature views, but I will do so right away. We do
> currently have some relevant NAFeature views in gusdev that have not
> been migrated into 3.0:
>
> CentromereFeature
> LowComplexityNAFeature
> ScaffoldGapFeature
> TelomereFeature
>
> I have no objection to merging the telomere and centromere features into
> a single view--along with any other chromosomal regions covered by the
> ontology--although it would mean that we wouldn't have a 1-1 mapping
> between sequence ontology terms and views on NAFeature. I think that
> at one point this was proposed as the eventual goal of the rearrangement.
> Anyway, given that I'm not certain of the plan here, I'm going to add
> the sequence_ontology_id column but leave the views unchanged for now.
> They can easily be changed without interfering with our data migration,
> so their fate doesn't have to be settled immediately. We have yet to
> establish a consistent set of rules for deciding when different types
> of features get grouped into a single view and when they get their own
> views, so this is probably a good opportunity to settle the question
> once and for all. The Sequence Ontology is big enough that we probably
> *don't* want a view for each and every term in the ontology; it would
> make maintenance quite difficult. But we could, for example, create a
> view for every top-level (or second-level) sequence ontology term.
> However, even a relatively high-level feature like "chromosomal region"
> (SO:0000711) looks like it's already a 4th or 5th level feature...
> At the other extreme, we could continue what we're doing now, i.e. using
> an ad-hoc classification of features based on the data we actually have
> available, and just make sure that every feature is tagged with the
> correct sequence ontology term. Any thoughts?
It makes sense as SO may undergo revisions this year.

>>> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01
>>> check (property_name in ('isoelectric point', 'molecular mass',
>>> 'charge', 'average residue mass'));
>>>
>>> The table allows multiple protein properties of the same type to be
>>> associated with entries in DoTS.AASequenceImp. Arnaud had suggested
>>> originally that the last property, average residue mass, could
>>> actually be an attribute of the table that stores the protein
>>> sequence itself. However, it seemed that if the molecular mass
>>> attribute could have multiple values (e.g., from different
>>> experiments) then the same should be true of the average residue
>>> mass, which is essentially a derived property. Let me know if you
>>> disagree with this, or think we should create an explicit controlled
>>> vocab. for these 4 properties.
>>
>> A controlled vocabulary table with the four attributes you've
>> mentioned is fine.
>
> OK, I'll make this change.
>
>>> -Protein features
>>> *Signal peptide features (stored in DoTS.SignalPeptideFeature)
>>> This view exists already, as DoTS.SignalPeptideFeature, but we need
>>> to add the ability to store curated data, such as targeting
>>> information. It should be straightforward to modify the view to
>>> accommodate this, but I'm not sure exactly what needs to be stored.
>>> Currently we use the view exclusively for SignalP predictions, and
>>> from what I understand SignalP is only concerned with predicting
>>> secreted proteins, meaning that we don't currently have any
>>> explicit targeting information. Is this something we could
>>> represent using the GO ontology for cellular localization? Do we
>>> also need some free text columns? Let me know and I'll make
>>> the changes. All the SignalP-specific columns appear to be
>>> nullable, so we don't necessarily have to do anything except add
>>> the new columns for the manually curated information.
>>
>> After talking to the curators it appears that GO component supplements
>> targeting information at the feature level but will not be enough.
>> The targeting information is represented by the component ontology in
>> one context i.e. mitochondrial, nuclear, membrane localization but
>> not in the context of the actual residues involved.
>> The actual residues involved in the targeting (or any other
>> phenomena) need to be represented by a protein feature ontology that
>> can be mapped onto specific amino acids of a protein.
>> This ontology is the equivalent of Sequence Ontology (SO) which is
>> meant for DNA features. It is being prepared by Val Wood with input
>> from Swiss-prot.
>
> OK, so the idea is that the various signal peptides have been classified
> into named classes that should be represented by some kind of ontology?
>
>> As you're going to add an extra attribute sequence_ontology_id to the
>> NA Features, could you do the same to any AA Features ?
>
> This will only work if the new ontology is actually part of the Sequence
> Ontology (or if we use the SequenceOntology table to store both
> ontologies.)
> Do you know if this is the case? It's quite possible, since the SO does
> already cover amino acid features. Otherwise we'll have to create a
> separate AASequenceOntology (or whatever the new ontology is called).

It is at the moment a different project but it would make sense they merge
in the future.
Just to give you an idea about Localization Signals, here is a snapshot:

%localization signal
%N-terminal signal sequence
%nuclear localization signal
%bipartite nuclear localization signal
%etc
%mitochondrial localization sequence
%thylakoid localization signal
%ER retention signal

The way the SignalPeptideFeature is designed makes the annotation of
localization signal features difficult. We can leave SignalPeptideFeature as
it is as it fits with SignalP software prediction and in the future create a
new feature LocalizationSignalFeature.

>>> *Transmembrane domain features (stored in DoTS.PredictedAAFeature)
>>> "PlasmoDB web site shows hydrophobicity graphics, where is it
>>> stored in GUS?"
>>> The hydrophobicity plots are computed dynamically based on the
>>> amino acid sequence. Transmembrane domains are currently stored in
>>> the PredictedAAFeature view, although I will probably create a new
>>> view for them when I get around to eliminating PredictedAAFeature.
>>> Another possibility would be to treat TM domains as another
>>> type of domain, and store them in DomainFeature. What do you think
>>> about this?
>>
>> I reckon they could be merged.
>
> OK, sounds good.
>
>>> *Post-translational modification features (new view:
>>> DoTS:PostTranslationalModFeature)
>>> Has a "type" column to represent the type of modification. It was
>>> also suggested that we have a column called "modified_by", which
>>> would be a reference to the Interaction table. However, isn't it
>>> possible that the same post-translational modification (e.g.,
>>> phosphorylation of a specific amino acid) could be the result
>>> of one of several Interactions?
>>
>> yes you're right, the effector could be different. In that case the
>> attribute "modified_by" is not useful.
>>
>>> This argues for an additional relationship between Interaction and
>>> PostTranslationalModFeature, unless we're OK creating multiple
>>> PostTranslationalModFeatures, identical except for their modified_by
>>> attribute. Comments on this?
>>
>> I don't think they should be duplicated as they corresponds to a
>> unique site. This unique feature would be associated with different
>> interaction entries. We might not need an extra table between
>> Interaction and PostTranslationalModFeature though. We still can do
>> the following query : "give me all the interaction entries which
>> target is a PostTranslationalModFeature which id is ...".
>> How does it sound ?
>
> We could do this, although one question is whether, semantically
> speaking, the "target" of an Interaction should be "the thing to be
> modified" (e.g. an unphosphorylated sequence or residue) or "the
> resulting modification" (e.g. the feature that represents a
> phosphorylated residue at the appropriate location.) The answer is
> probably that we just shouldn't worry about it and should just do
> whatever is most convenient on a case-by-case basis.
> To do it "correctly" would be problematic either way. For example, if we
> say that the target is the thing to be modified, then we have to create a
> feature that represents a region of sequence that *could* be modified in
> some way and then create another feature to represent the actual
> modification.
> But if we say that the target is the result of the modification then
> we may have to create equally unusual tables/views. For example, if the
> result of a given interaction is to degrade a protein, then do we have
> to create a table/object that represents a degraded protein (or
> "nothing", or whatever it is that's left after the modification)?
> For now I have no problem with interpreting the "target" based on
> context, but in the longer term we may want to consider separating the
> notions of "target prior to modification" and either "target after
> modification" or "effect of modification".
>
> I also realized belatedly that I could have left the Interaction table
> unchanged, rather than introducing specific references to RowSet. This
> would have allowed us to represent either singleton effectors/targets or
> set-valued effectors/targets, without having to always join through
> RowSet in the singleton case. On the other hand, if we do associate some
> additional information with the RowSets, then the current representation
> is correct.

It depends if we want to represent many-to-many relationship between
interaction and members of this interaction. Without the RowSet table, we
can't assign a set of several effectors/targets, right ? Unless we consider
that this set of effectors are being part of a complex and act as the whole.

>>> *AA repeats (new view: RepeatRegionAAFeature)
>>> I called this view RepeatRegionAAFeature in case we want to have a
>>> similar view for NASequences. I also created only a single view,
>>> instead of following Arnaud's original suggestion, which was for
>>> both:
>>>
>>> * RepeatRegionFeature as a set of RepeatUnitFeatures,
>>> * RepeatUnitFeature, with the consensus sequence, name and size
>>>
>>> I based the design of this view on that of TandemRepeatFeature,
>>> which we have for NASequences already. Instead of splitting the
>>> consensus sequence, name, and size into a separate table, they
>>> occupy columns in RepeatRegionAAFeature. This works quite well for
>>> the tandem repeats we already have (for DNA sequences.) However, if
>>> there is a known set of named amino acid sequence repeats, then it
>>> would probably make sense to do what Arnaud suggested, and store
>>> these in a separate table (likely named RepeatUnit, not
>>> RepeatUnitFeature, since they would have no unique locations.) Does
>>> this sound reasonable? That is, put the consensus seqs in the
>>> repeat region table itself if they're anonymous, but if they've
>>> been named, then store them in a separate table. Also note that
>>> this view has a reference to RepeatType, although the current
>>> contents of this table are probably applicable only to DNA sequence
>>> repeats (LINEs, SINEs, ALUs, etc.), since I believe that I parsed
>>> them out of RepBase.
>>
>> I proposed a separate repeat feature because one may want to annotate
>> a repeat outside a repeat region, for example LTR repeats attached to
>> a given transposable element. These RepeatFeatures or
>> RepeatUnitFeatures can then have a location.
>> The other case is when a repeat region is made of a set of different
>> repeat units.
>
> OK, I didn't realize that this was what you were trying to represent. As
> currently conceived, RepeatRegionAAFeature is meant to represent a region
> that contains one or more immediately adjacent copies of the same type
> of (amino acid sequence) repeat. The assumption is also that these
> regions will typically be maximal (with respect to the chosen repeat
> type, consensus, and max. mismatch, the last of which is not represented
> directly in the table.) We can still represent more complex repeat
> structures using this single table, but the representation is implicit,
> not explicit (i.e. you have to do a query to find out what other repeats
> lie within the bounds of the transposon, meaning that there's no easy
> way to query for all transposable elements with a particular flanking
> LTR structure.) Do you want to come up with a 2-table version of what
> I've done? The use cases aren't clear enough in my mind yet for me to be
> able to do it. It seems that the bare minimum we need is just another
> column in the RepeatRegionAAFeature, parent_id; which would let us
> represent explicitly that a particular repeat is a *necessary* (versus
> incidental) component of another NA/AAFeature. Both AAFeatureImp and
> NAFeatureImp already have a parent_id, so this would be a
> straightforward change. The queries still might not be terribly
> efficient, but I don't know what exactly you wanted to support in terms
> of queries, versus just making sure that the representation is
> sufficiently rich to capture the structure.

A case we came across here for Tbrucei is nested repeat regions (at the DNA
level). Each repeat region has coordinates and is annotated with a unique
repeat unit type. This repeat region can be within a bigger repeat region
annotated with a different repeat unit type.
... which is in other words your suggestion with parent_id as an extra
attribute ...

Regarding transposon repeat types, if we have a TransposableElement feature
and its type is given as an attribute, a repeat feature will just be useful
to locate the LTRs within a given transposable element. Can we keep this
functionality ? Then the feature will be simple, just repeat_type and
parent_id attributes.

>> In any case, NA repeats and AA repeats should have the same design.
>> Just the controlled vocabulary representing the types of repeats will
>> be different.
>
> Absolutely, yes, although one question is whether AA repeats can have the
> same kind of nested structure that you mention as a possibility for NA
> repeats (the transposon with LTRs). I don't know the answer to this.
>
>>> -DoTS.Interaction (table modified, dependent tables added)
>>> *Added "has_direction" column, as discussed previously. The idea
>>> here is that not all interactions (particularly physical ones,
>>> e.g., dimerization) have a direction.
>>> If has_direction == 0, then the value of direction_is_known can
>>> be ignored.
>>> *Added non-nullable "effector_action_type_id" column, referencing
>>> DoTS.EffectorActionType (a new table.) This column/table
>>> represents the possible things that an effector can do to a target.
>>> For example, the InteractionType associated with the Interaction
>>> could be "binds to" (e.g., a promoter region), and the
>>> EffectorActionType for that Interaction could be to either
>>> "inhibit" or "enhance" expression of the corresponding gene.
>>> *Replaced effector_table_id and effector_row_id with
>>> effector_row_set_id, and similarly for the target_table_id and
>>> target_row_id. This allows us to represent the interaction of a set
>>> of objects (the effector) with another set of objects (the target.)
>>> Previously the Interaction table could only represent the
>>> interaction between a single pair of entities (OK if they happened
>>> to be Complexes, for example, but a potential problem in other
>>> situations.) Now both effector and target are represented as
>>> references to DoTS.RowSet, which in turn references
>>> DoTS.RowSetMember, which...in turn...references the individual
>>> database rows that comprise the effector or target. These tables
>>> (RowSet and RowSetMember) are essentially the same as Complex and
>>> ComplexComponent, except that they are totally generic; they can be
>>> used to group any set of rows in the database and they store no
>>> additional information. However, if there are any additional
>>> columns that we can think of (that are specific to Interactions)
>>> then these tables should be replaced by less generic ones (e.g.
>>> InteractingEntitySet or InteractionSet, or something along those
>>> lines.)
>>
>> Sounds fine. The only thing I can see is regarding the
>> EffectorActionType. If each effector, member of a RowSet, has a
>> different action type, the attribute, effector_action_type_id, should
>> go in the RowSetMember table. I don't have any example though.
>
> OK, I think I'd be inclined to wait until we have some use cases for
> this.
> Although the current schema lets us group effectors together, it
> doesn't let us say (for example) that E1 interacts *directly* with T1 to
> phosphorylate it, but that E1's active site is only exposed when E1 is
> bound to E2. In other words, E1's role in the activity can be viewed as
> "primary", and E2's role is secondary (in some sense) but all we can say
> in the schema is that the Complex consisting of E1 and E2 interacts with
> T1 to phosphorylate it.
> I think that the solution we have now is OK, but it only lets us
> represent the overall action of the entire set of effectors.

Let's leave the design as it is for now. Curators are not going to curate
interactions data in the short term. We shall come back later with more
precise ideas/use cases about them.

> Jonathan

Arnaud
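
The change agreed in this message, replacing the PROTEINPROPERTY_CK01 check constraint with a controlled vocabulary table, can be sketched as follows. This is an illustration only: the column names, the add_property() helper and the numeric values are assumptions, and SQLite (through Python's sqlite3 module) stands in for Oracle. Only the four property names and the table name ProteinPropertyType come from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Controlled vocabulary: one row per allowed property name.
CREATE TABLE ProteinPropertyType (
    protein_property_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Several values of the same type may be attached to one sequence.
CREATE TABLE ProteinProperty (
    protein_property_id INTEGER PRIMARY KEY,
    aa_sequence_id INTEGER NOT NULL,
    protein_property_type_id INTEGER NOT NULL
        REFERENCES ProteinPropertyType(protein_property_type_id),
    value REAL NOT NULL
);
""")

# The four terms from the original check constraint.
TERMS = ["isoelectric point", "molecular mass", "charge",
         "average residue mass"]
conn.executemany("INSERT INTO ProteinPropertyType (name) VALUES (?)",
                 [(t,) for t in TERMS])


def add_property(conn, aa_sequence_id, type_name, value):
    """Attach a property value, rejecting names outside the vocabulary."""
    row = conn.execute(
        "SELECT protein_property_type_id FROM ProteinPropertyType"
        " WHERE name = ?", (type_name,)).fetchone()
    if row is None:
        raise ValueError(f"unknown protein property type: {type_name!r}")
    conn.execute(
        "INSERT INTO ProteinProperty"
        " (aa_sequence_id, protein_property_type_id, value)"
        " VALUES (?, ?, ?)", (aa_sequence_id, row[0], value))


# Hypothetical values: two mass measurements for the same sequence.
add_property(conn, 1, "molecular mass", 52310.0)
add_property(conn, 1, "molecular mass", 52290.5)
add_property(conn, 1, "isoelectric point", 6.4)

masses = [v for (v,) in conn.execute("""
    SELECT p.value
    FROM ProteinProperty p
    JOIN ProteinPropertyType t USING (protein_property_type_id)
    WHERE p.aa_sequence_id = 1 AND t.name = 'molecular mass'
    ORDER BY p.protein_property_id""")]
```

Compared with the check constraint, adding a fifth property type becomes an INSERT into ProteinPropertyType rather than a schema change, and each term can carry a definition if a description column is added.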
From: mazz <ma...@sn...> - 2003-01-15 23:55:18
Hi Jonathan, Perhaps we can ask Matt to revisit his documentation plugin. There are probably additional changes he will have to make for its use with GUS30 now. Also, I can send Arnaud an example of the XML for a table. We can use the XML to populate the rows of the controlled vocabulary tables (ids, terms (names) and definitions (descriptions). Joan Jonathan Crabtree wrote: > Hi Joan- > > Arnaud did supply us with documentation (attached) for the new Phenotype tables, > but I just haven't loaded it into the database yet (I've also been quite busy :)) > I started working on updating the documentation a couple of days ago, but in the > process discovered that there are some invalid rows in core.DatabaseDocumentation > that should be corrected first. A query shows that there are 73 rows in this > table that reference nonexistent columns in GUS 3.0. For the most part I think > that these are relatively minor problems stemming from the fact that the schema > has been updated more recently than the documentation. However, there are also > a few rows that suggest we need to improve the plugin and/or procedure used to > populate this table. For example, the following rows have spaces in the column > name (attribute_name), probably because the input files were invalid and the plugin > has no restrictions on the format of the attribute_name: > > DATABASE_DOCUMENTATION_ID > ------------------------- > ATTRIBUTE_NAME > -------------------------------------------------------------------------------- > 1419 > bio_material_id fk to LabelledExtract view of BioMaterial > > 1103 > bio_source_characteristic_id primary key > > 1120 > treatment_id fk to Treatment > > DATABASE_DOCUMENTATION_ID > ------------------------- > ATTRIBUTE_NAME > -------------------------------------------------------------------------------- > 1374 > review_status_id The identifer of the review status > > 1418 > assay_id fk to Assay > > 1373 > synonym_name The gene symbol > > 6 rows selected. 
> > Also, as an aside (and not a comment to you in particular), it strikes me that > column "documentation" of the form "fk to Table X" and "Primary key" could be > generated automatically from the schema. However, comments on foreign keys > are useful if they identify the specific subclass (i.e. view) to which the > reference is expected to link, or if they explain what the referenced value is > used for (if not obvious). Anyway, since there are still some minor schema > changes taking place, I think that next week might be a good time to worry > about updating all the documentation, since the database will be locked down > for the migration at that point anyway. As for the controlled vocabularies, > I think you're right, and we should try to populate these as soon as we can, > even if it will be an iterative process in some cases. > > Jonathan > > -- > Jonathan Crabtree > Center for Bioinformatics, University of Pennsylvania > 1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021 > 215-573-3115 > > ------------------------------------------------------------------------ > Name: gus.phenotype_draft.doc > gus.phenotype_draft.doc Type: Winword File (application/msword) > Encoding: base64 |
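The idea of populating controlled vocabulary tables from XML can be sketched as follows. This is only a minimal illustration: the thread does not show the actual GUS XML format, so the element names (`table`, `row`, `id`, `name`, `description`) and the sample definitions are assumptions. Only the table name DoTS::InteractionType and the term "binds to" come from the thread itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real GUS table XML is not shown in this thread.
SAMPLE_XML = """
<table name="DoTS::InteractionType">
  <row><id>1</id><name>binds to</name><description>placeholder definition</description></row>
  <row><id>2</id><name>example term</name><description>placeholder definition</description></row>
</table>
"""

def parse_vocabulary(xml_text):
    """Return (table_name, rows); each row is an (id, name, description) tuple."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for row in root.findall("row"):
        rows.append((
            int(row.findtext("id")),
            row.findtext("name").strip(),
            row.findtext("description").strip(),
        ))
    return root.get("name"), rows

table_name, vocabulary_rows = parse_vocabulary(SAMPLE_XML)
```

A loading plugin could then insert each tuple into the named table; keeping the parse step separate makes it easy to check a file before anything is written to the database.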
From: Jonathan C. <cra...@pc...> - 2003-01-15 19:18:43
|
Hi Joan-

Arnaud did supply us with documentation (attached) for the new Phenotype tables, but I just haven't loaded it into the database yet (I've also been quite busy :)) I started working on updating the documentation a couple of days ago, but in the process discovered that there are some invalid rows in core.DatabaseDocumentation that should be corrected first. A query shows that there are 73 rows in this table that reference nonexistent columns in GUS 3.0. For the most part I think that these are relatively minor problems stemming from the fact that the schema has been updated more recently than the documentation. However, there are also a few rows that suggest we need to improve the plugin and/or procedure used to populate this table. For example, the following rows have spaces in the column name (attribute_name), probably because the input files were invalid and the plugin has no restrictions on the format of the attribute_name:

DATABASE_DOCUMENTATION_ID  ATTRIBUTE_NAME
-------------------------  ---------------------------------------------------------
                     1419  bio_material_id fk to LabelledExtract view of BioMaterial
                     1103  bio_source_characteristic_id primary key
                     1120  treatment_id fk to Treatment
                     1374  review_status_id The identifer of the review status
                     1418  assay_id fk to Assay
                     1373  synonym_name The gene symbol

6 rows selected.

Also, as an aside (and not a comment to you in particular), it strikes me that column "documentation" of the form "fk to Table X" and "Primary key" could be generated automatically from the schema. However, comments on foreign keys are useful if they identify the specific subclass (i.e. view) to which the reference is expected to link, or if they explain what the referenced value is used for (if not obvious).
Anyway, since there are still some minor schema changes taking place, I think that next week might be a good time to worry about updating all the documentation, since the database will be locked down for the migration at that point anyway. As for the controlled vocabularies, I think you're right, and we should try to populate these as soon as we can, even if it will be an iterative process in some cases. Jonathan -- Jonathan Crabtree Center for Bioinformatics, University of Pennsylvania 1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021 215-573-3115 |
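The two kinds of invalid documentation rows described in this message (attribute names that contain spaces, and attribute names that no longer exist in the schema) could be caught with a check along these lines. This is a sketch under assumptions, not the documentation plugin's actual logic: the function name, the row layout, and the table names in the sample data are hypothetical, and only the two attribute_name strings with spaces are taken from the query output above.

```python
import re

# A valid column name is a single identifier, with no spaces or trailing text.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def find_invalid_rows(doc_rows, schema_columns):
    """Return (doc_id, reason) for each documentation row with a bad attribute_name.

    doc_rows: iterable of (doc_id, table_name, attribute_name)
    schema_columns: dict mapping table_name -> set of lowercase column names
    """
    problems = []
    for doc_id, table_name, attribute_name in doc_rows:
        if not IDENTIFIER.match(attribute_name):
            problems.append((doc_id, "malformed attribute_name"))
        elif attribute_name.lower() not in schema_columns.get(table_name, set()):
            problems.append((doc_id, "nonexistent column"))
    return problems

# Sample data: table names and the last two rows are made up for illustration.
SAMPLE_ROWS = [
    (1120, "Treatment", "treatment_id fk to Treatment"),
    (1418, "Assay", "assay_id fk to Assay"),
    (1, "Assay", "assay_id"),
    (2, "Assay", "old_column"),
]
SAMPLE_SCHEMA = {"Assay": {"assay_id"}, "Treatment": {"treatment_id"}}
```

Running the same identifier check inside the plugin before each insert would reject malformed input files up front instead of leaving them to be found by a later query.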
From: Joan M. <ma...@pc...> - 2003-01-15 18:33:15
|
Hi Arnaud, I have been very busy and have not had time to follow this thread completely, but I have a request for GUS30. Many controlled vocabulary tables have been created, some of which are probably not covered by the existing ontologies (e.g., DoTS::InteractionType, DoTS::EffectorActionType, DoTS::ComplexType, and any others mentioned previously; see below). Could you provide the terms and definitions that will go into these tables as the controlled vocabularies? This would best be in an XML format representing the table, so that the table can be populated by a plugin. Could you also document these tables using the format required by the documentation plugin (I believe we mentioned this plugin when you were here)? In addition, if Crabtree has created other tables for you, please provide documentation for those tables as well. If you had already planned to do this, then sorry for the push. Thanks, Joan Jonathan Crabtree wrote: > Arnaud- > > Thanks for the feedback; I think we're getting close to agreement here. > > > I have noticed that your changes don't cover the DNA/RNA features. Is > > there any reason for this ? I know there are quite a lot of them and > > there might be another way of storing data some information such as > > telomere or centromere regions, origin of replication, inflection point > > etc. All these features are covered by Sequence Ontology, so a new > > ChromosomeElement or ChromosomeRegion feature could be generic enough to > > cover most of them. > > Let me know what you think. > > Which DNA/RNA features do you mean (other than those mentioned above)? > It's possible that I misplaced the e-mail or notes where we discussed > these. Or are you just saying that we will eventually have a view for > each type of DNA/RNA feature in the Sequence Ontology? 
I think that > this is true, although I hadn't planned to make the change immediately, > since I believe we had agreed on a "transitional" period in which the > various NAFeature views would first be given a nullable sequence_ontology_id > and we would then decide how to best rearrange the views to more closely > match the ontology terms. I haven't added the sequence_ontology_id > column to the NAFeature views, but I will do so right away. We do > currently have some relevant NAFeature views in gusdev that have not > been migrated into 3.0: > > CentromereFeature > LowComplexityNAFeature > ScaffoldGapFeature > TelomereFeature > > I have no objection to merging the telomere and centromere features into > a single view--along with any other chromosomal regions covered by the > ontology--although it would mean that we wouldn't have a 1-1 mapping > between sequence ontology terms and views on NAFeature. I think that > at one point this was proposed as the eventual goal of the rearrangement. > Anyway, given that I'm not certain of the plan here, I'm going to add > the sequence_ontology_id column but leave the views unchanged for now. > They can easily be changed without interfering with our data migration, > so their fate doesn't have to be settled immediately. We have yet to > establish a consistent set of rules for deciding when different types > of features get grouped into a single view and when they get their own > views, so this is probably a good opportunity to settle the question > once and for all. The Sequence Ontology is big enough that we probably > *don't* want a view for each and every term in the ontology; it would > make maintenance quite difficult. But we could, for example, create a > view for every top-level (or second-level) sequence ontology term. > However, even a relatively high-level feature like "chromosomal region" > (SO:0000711) looks like it's already a 4th or 5th level feature... 
At > the other extreme, we could continue what we're doing now, i.e. using > an ad-hoc classification of features based on the data we actually have > available, and just make sure that every feature is tagged with the > correct sequence ontology term. Any thoughts? > > >> > >> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 check > >> (property_name in ('isoelectric point', 'molecular mass', 'charge', 'average residue mass')); > >> > >> The table allows multiple protein properties of the same type to be associated with > >> entries in DoTS.AASequenceImp. Arnaud had suggested originally that the last > >> property, average residue mass, could actually be an attribute of the table that > >> stores the protein sequence itself. However, it seemed that if the molecular > >> mass attribute could have multiple values (e.g., from different experiments) then > >> the same should be true of the average residue mass, which is essentially a > >> derived property. Let me know if you disagree with this, or think we should > >> create an explicit controlled vocab. for these 4 properties. > >> > >> > > A controlled vocabulary table with the four attributes you've mentioned > > is fine. > > OK, I'll make this change. > > >>-Protein features > >> *Signal peptide features (stored in DoTS.SignalPeptideFeature) > >> This view exists already, as DoTS.SignalPeptideFeature, but we need to add the > >> ability to store curated data, such as targetting information. It should be > >> straightforward to modify the view to accomodate this, but I'm not sure exactly > >> what needs to be stored. Currently we use the view exclusively for SignalP > >> predictions, and from what I understand SignalP is only concerned with predicting > >> secreted proteins, meaning that we don't currently have any explicit targetting > >> information. Is this something we could represent using the GO ontology for cellular > >> localization? Do we also need some free text columns? 
Let me know and I'll make > >> the changes. All the SignalP-specific columns appear to be nullable, so we don't > >> necessarily have to do anything except add the new columns for the manually curated > >> information. > >> > >> > > After talking to the curators it appears that GO component suplements > > targetting information at the feature level but will not be enough. > > The targeting information is represented by the component ontology in > > one context i.e. mitochondrial, nuclear, membrane localization but not > > in the context of the actual residues involved. > > The actual residues involved in the targeting (or any other phenomena) > > need to be represented by a protein feature ontology can be mapped onto > > specific amino acids of a protein. > > This ontology is the equivalent of Sequence Ontology (SO) which is meant > > for DNA features. It is being prepared by Val Wood with input from > > Swiss-prot. > > OK, so the idea is that the various signal peptides have been classified > into named classes that should be represented by some kind of ontology? > > > As you're going to add a extra attribute sequence_ontology_id to the NA > > Features, could you do the same to any AA Features ? > > This will only work if the new ontology is actually part of the Sequence > Ontology (or if we use the SequenceOntology table to store both ontologies.) > Do you know if this is the case? It's quite possible, since the SO does > already cover amino acid features. Otherwise we'll have to create a > separate AASequenceOntology (or whatever the new ontology is called). > > >> *Transmembrane domain features (stored in DoTS.PredictedAAFeature) > >> "PlasmoDB web site shows hydrophobicity graphics, where is it stored in GUS?" > >> The hydrophobicity plots are computed dynamically based on the amino acid sequence. 
> >> Transmembrane domains are currently stored in the PredictedAAFeature view, although > >> I will probably create a new view for them when I get around to eliminating > >> PredictedAAFeature. Another possibility would be to treat TM domains as another > >> type of domain, and store them in DomainFeature. What do you think about this? > >> > >> > > I reckon they could be merged. > > OK, sounds good. > > >> *Post-translational modification features (new view: DoTS:PostTranslationalModFeature) > >> Has a "type" column to represent the type of modification. It was also suggested > >> that we have a column called "modified_by", which would be a reference to the > >> Interaction table. However, isn't it possible that the same post-translational > >> modification (e.g., phosphorylation of a specific amino acid) could be the result > >> of one of several Interactions? > >> > > yes you're right, the effector could be different. In that case the > > attribute > > "modified_by" is not useful. > > > >> This argues for an additional relationship > >> between Interaction and PostTranslationalModFeature, unless we're OK creating > >> multiple PostTranslationalModFeatures, identical except for their modified_by > >> attribute. Comments on this? > >> > >> > > I don't think they should be duplicated as they corresponds to a unique > > site. This unique feature would > > be associated with different interaction entries. We might not need an > > extra table between Interaction and PostTranslationalModFeature though. > > We still can do the following query : "give me all the interaction > > entries which target is a PostTranslationalModFeature which id is ...". > > How does it sound ? > > We could do this, although one question is whether, semantically speaking, > the "target" of an Interaction should be "the thing to be modified" (e.g. an > unphosphorylated sequence or residue) or "the resulting modification" (e.g. 
> the feature that represents a phosphorylated residue at the appropriate > location.) The answer is probably that we just shouldn't worry about it > and should just do whatever is most convenient on a case-by-case basis. > To do it "correctly" would be problematic either way. For example, if we > say that the target is the thing to be modified, then we have to create a > feature that represents a region of sequence that *could* be modified in > some way and then create another feature to represent the actual modification. > But if we say that the target is the result of the modification then we may > have to create equally unusual tables/views. For example, if the result of > a given interaction is to degrade a protein, then do we have to create a > table/object that represents a degraded protein (or "nothing", or whatever > it is that's left after the modification)? For now I have no problem with > interpreting the "target" based on context, but in the longer term we may > want to consider separating the notions of "target prior to modification" > and either "target after modification" or "effect of modification". > > I also realized belatedly that I could have left the Interaction table > unchanged, rather than introducing specific references to RowSet. This > would have allowed us to represent either singleton effectors/targets or > set-valued effectors/targets, without having to always join through RowSet > in the singleton case. On the other hand, if we do associate some > additional information with the RowSets, then the current representation > is correct. > > >> *AA repeats (new view: RepeatRegionAAFeature) > >> I called this view RepeatRegionAAFeature in case we want to have a similar view > >> for NASequences. 
I also created only a single view, instead of following Arnaud's > >> original suggestion, which was for both: > >> > >> * RepeatRegionFeature as a set of RepeatUnitFeatures, > >> * RepeatUnitFeature, with the consensus sequence, name and size > >> > >> I based the design of this view on that of TandemRepeatFeature, which we have for > >> NASequences already. Instead of splitting the consensus sequence, name, and size > >> into a separate table, they occupy columns in RepeatRegionAAFeature. This works > >> quite well for the tandem repeats we already have (for DNA sequences.) However, if > >> there is a known set of named amino acid sequence repeats, then it would probably > >> make sense to do what Arnaud suggested, and store these in a separate table > >> (likely named RepeatUnit, not RepeatUnitFeature, since they would have no unique > >> locations.) Does this sound reasonable? That is, put the consensus seqs in the > >> repeat region table itself if they're anonymous, but if they've been named, then > >> store them in a separate table. Also note that this view has a reference to > >> RepeatType, although the current contents of this table are probably applicable > >> only to DNA sequence repeats (LINEs, SINEs, ALUs, etc.), since I believe that I > >> parsed them out of RepBase. > >> > >> > > I proposed a separate repeat feature because one may want to annotate a > > repeat outside a repeat region, for example LTR repeats attached to a > > given transposable element. These RepeatFeatures or RepeatUnitFeatures > > can then have a location. > > The other case is when a repeat region is made of a set of different > > repeat units. > > OK, I didn't realize that this was what you were trying to represent. As > currently conceived, RepeatRegionAAFeature is meant to represent a region > that contains one or more immediately adjacent copies of the same type > of (amino acid sequence) repeat. 
The assumption is also that these regions > will typically be maximal (with respect to the chosen repeat type, consensus, > and max. mismatch, the last of which is not represented directly in the > table.) We can still represent more complex repeat structures using this > single table, but the representation is implicit, not explicit (i.e. you > have to do a query to find out what other repeats lie within the bounds of > the transposon, meaning that there's no easy way to query for all transposable > elements with a particular flanking LTR structure.) Do you want to come up > with a 2-table version of what I've done? The use cases aren't clear enough > in my mind yet for me to be able to do it. It seems that the bare minimum we > need is just another column in the RepeatRegionAAFeature, parent_id; which > would let us represent explicitly that a particular repeat is a *necessary* > (versus incidental) component of another NA/AAFeature. Both AAFeatureImp > and NAFeatureImp already have a parent_id, so this would be a straightforward > change. The queries still might not be terribly efficient, but I don't know > what exactly you wanted to support in terms of queries, versus just making > sure that the representation is sufficiently rich to capture the structure. > > > In any case, NA repeats and AA repeats should have the same design. Just > > the controlled vocabulary representing the types of repeats will be > > different. > > Absolutely, yes, although one question is whether AA repeats can have the > same kind of nested structure that you mention as a possibility for NA > repeats (the transposon with LTRs). I don't know the answer to this. > > >>-DoTS.Interaction (table modified, dependent tables added) > >> *Added "has_direction" column, as discussed previously. The idea here is that > >> not all interactions (particularly physical ones, e.g., dimerization) have a > >> direction. If has_direction == 0, then the value of direction_is_known can > >> be ignored. 
> >> *Added non-nullable "effector_action_type_id" column, referencing > >> DoTS.EffectorActionType (a new table.) This column/table represents the possible > >> things that an effector can do to a target. For example, the InteractionType > >> associated with the Interaction could be "binds to" (e.g., a promoter region), and > >> the EffectorActionType for that Interaction could be to either "inhibit" or "enhance" > >> expression of the corresponding gene. > >> *Replaced effector_table_id and effector_row_id with effector_row_set_id, and > >> similarly for the target_table_id and target_row_id. This allows us to represent > >> the interaction of a set of objects (the effector) with another set of objects > >> (the target.) Previously the Interaction table could only represent the interaction > >> between a single pair of entities (OK if they happened to be Complexes, for example, > >> but a potential problem in other situations.) Now both effector and target are > >> represented as references to DoTS.RowSet, which in turn references DoTS.RowSetMember, > >> which...in turn...references the individual database rows that comprise the effector > >> or target. These tables (RowSet and RowSetMember) are essentially the same as > >> Complex and ComplexComponent, except that they are totally generic; they can be > >> used to group any set of rows in the database and they store no additional information. > >> However, if there are any additional columns that we can think of (that are specific > >> to Interactions) then these tables should be replaced by less generic ones (e.g. > >> InteractingEntitySet or InteractionSet, or something along those lines.) > >> > >> > > Sounds fine. The only thing I can see is regarding the > > EffectorActionType. If each effector, member of a RowSet, has a > > different action type, the attribute, effector_action_type_id, should go > > in the RowSetMember table. I don't have any example though. 
> > OK, I think I'd be inclined to wait until we have some use cases for this. > Although the current schema lets us group effectors together, it doesn't let > us say (for example) that E1 interacts *directly* with T1 to phosphorylate > it, but that E1's active site is only exposed when E1 is bound to E2. In > other words, E1's role in the activity can be viewed as "primary", and E2's > role is secondary (in some sense) but all we can say in the schema is that > the Complex consisting of E1 and E2 interacts with T1 to phosphorylate it. > I think that the solution we have now is OK, but it only lets us represent > the overall action of the entire set of effectors. > > Jonathan > > -- > Jonathan Crabtree > Center for Bioinformatics, University of Pennsylvania > 1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021 > 215-573-3115 -- Joan Mazzarelli Computational Biology and Informatics Laboratory Center for Bioinformatics 1429 Blockley Hall University of Pennsylvania Philadelphia, PA 19104 |
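The query Arnaud describes in the quoted text ("give me all the interaction entries which target is a PostTranslationalModFeature which id is ...") can be sketched against the RowSet design as follows. This is a toy model in SQLite via Python, not the GUS 3.0 Oracle schema. The column effector_row_set_id comes from the thread; target_row_set_id is implied by it; the RowSetMember columns, the use of a table_name string in place of a table id, the 'Protein' table, and all row ids are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
CREATE TABLE RowSetMember (
    row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    table_name TEXT NOT NULL,   -- stand-in for a reference to the table's id
    row_id     INTEGER NOT NULL
);
CREATE TABLE Interaction (
    interaction_id      INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    target_row_set_id   INTEGER NOT NULL REFERENCES RowSet(row_set_id)
);
-- Two different effectors (rows 10 and 11) act on the same modification site (500).
INSERT INTO RowSet VALUES (1), (2), (3);
INSERT INTO RowSetMember VALUES
    (1, 'Protein', 10),
    (2, 'Protein', 11),
    (3, 'PostTranslationalModFeature', 500);
INSERT INTO Interaction VALUES (100, 1, 3), (101, 2, 3);
""")

def interactions_targeting(conn, table_name, row_id):
    """Return ids of Interactions whose target RowSet contains the given row."""
    cur = conn.execute(
        """
        SELECT i.interaction_id
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.target_row_set_id
        WHERE m.table_name = ? AND m.row_id = ?
        ORDER BY i.interaction_id
        """,
        (table_name, row_id),
    )
    return [r[0] for r in cur]
```

In this model a single PostTranslationalModFeature row is shared by both interactions, so the feature is not duplicated per effector. The cost, as noted in the quoted discussion, is that even a singleton target requires the join through RowSetMember.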
From: Jonathan C. <cra...@pc...> - 2003-01-15 16:18:32
|
Arnaud- Thanks for the feedback; I think we're getting close to agreement here. > I have noticed that your changes don't cover the DNA/RNA features. Is > there any reason for this ? I know there are quite a lot of them and > there might be another way of storing data some information such as > telomere or centromere regions, origin of replication, inflection point > etc. All these features are covered by Sequence Ontology, so a new > ChromosomeElement or ChromosomeRegion feature could be generic enough to > cover most of them. > Let me know what you think. Which DNA/RNA features do you mean (other than those mentioned above)? It's possible that I misplaced the e-mail or notes where we discussed these. Or are you just saying that we will eventually have a view for each type of DNA/RNA feature in the Sequence Ontology? I think that this is true, although I hadn't planned to make the change immediately, since I believe we had agreed on a "transitional" period in which the various NAFeature views would first be given a nullable sequence_ontology_id and we would then decide how to best rearrange the views to more closely match the ontology terms. I haven't added the sequence_ontology_id column to the NAFeature views, but I will do so right away. We do currently have some relevant NAFeature views in gusdev that have not been migrated into 3.0: CentromereFeature LowComplexityNAFeature ScaffoldGapFeature TelomereFeature I have no objection to merging the telomere and centromere features into a single view--along with any other chromosomal regions covered by the ontology--although it would mean that we wouldn't have a 1-1 mapping between sequence ontology terms and views on NAFeature. I think that at one point this was proposed as the eventual goal of the rearrangement. Anyway, given that I'm not certain of the plan here, I'm going to add the sequence_ontology_id column but leave the views unchanged for now. 
They can easily be changed without interfering with our data migration, so their fate doesn't have to be settled immediately. We have yet to establish a consistent set of rules for deciding when different types of features get grouped into a single view and when they get their own views, so this is probably a good opportunity to settle the question once and for all. The Sequence Ontology is big enough that we probably *don't* want a view for each and every term in the ontology; it would make maintenance quite difficult. But we could, for example, create a view for every top-level (or second-level) sequence ontology term. However, even a relatively high-level feature like "chromosomal region" (SO:0000711) looks like it's already a 4th or 5th level feature... At the other extreme, we could continue what we're doing now, i.e. using an ad-hoc classification of features based on the data we actually have available, and just make sure that every feature is tagged with the correct sequence ontology term. Any thoughts? >> >> alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 check >> (property_name in ('isoelectric point', 'molecular mass', 'charge', 'average residue mass')); >> >> The table allows multiple protein properties of the same type to be associated with >> entries in DoTS.AASequenceImp. Arnaud had suggested originally that the last >> property, average residue mass, could actually be an attribute of the table that >> stores the protein sequence itself. However, it seemed that if the molecular >> mass attribute could have multiple values (e.g., from different experiments) then >> the same should be true of the average residue mass, which is essentially a >> derived property. Let me know if you disagree with this, or think we should >> create an explicit controlled vocab. for these 4 properties. >> >> > A controlled vocabulary table with the four attributes you've mentioned > is fine. OK, I'll make this change. 
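The agreed change, replacing the CHECK constraint on property_name with a controlled vocabulary table, might look roughly like the sketch below. It uses SQLite through Python rather than Oracle, and every column name is an assumption; the four property names come from the constraint quoted above, and the table name ProteinPropertyType is the one reported in the follow-up message of 2003-01-17. The numeric values in any usage are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE ProteinPropertyType (
    protein_property_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE ProteinProperty (
    protein_property_id INTEGER PRIMARY KEY,
    aa_sequence_id INTEGER NOT NULL,
    protein_property_type_id INTEGER NOT NULL
        REFERENCES ProteinPropertyType(protein_property_type_id),
    value REAL NOT NULL
);
INSERT INTO ProteinPropertyType (name) VALUES
    ('isoelectric point'), ('molecular mass'), ('charge'), ('average residue mass');
""")

def add_property(conn, aa_sequence_id, type_name, value):
    """Attach one property value to a sequence; the type must be in the vocabulary."""
    row = conn.execute(
        "SELECT protein_property_type_id FROM ProteinPropertyType WHERE name = ?",
        (type_name,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown property type: {type_name}")
    conn.execute(
        "INSERT INTO ProteinProperty (aa_sequence_id, protein_property_type_id, value) "
        "VALUES (?, ?, ?)",
        (aa_sequence_id, row[0], value),
    )

def count_properties(conn, aa_sequence_id, type_name):
    """Count the values of one property type recorded for a sequence."""
    return conn.execute(
        "SELECT COUNT(*) FROM ProteinProperty p "
        "JOIN ProteinPropertyType t USING (protein_property_type_id) "
        "WHERE p.aa_sequence_id = ? AND t.name = ?",
        (aa_sequence_id, type_name),
    ).fetchone()[0]
```

Compared with the CHECK constraint, adding a fifth property type becomes a data change (one new row) instead of a schema change, while the table still allows several values of the same type per sequence, for example from different experiments.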
>>-Protein features >> *Signal peptide features (stored in DoTS.SignalPeptideFeature) >> This view exists already, as DoTS.SignalPeptideFeature, but we need to add the >> ability to store curated data, such as targeting information. It should be >> straightforward to modify the view to accommodate this, but I'm not sure exactly >> what needs to be stored. Currently we use the view exclusively for SignalP >> predictions, and from what I understand SignalP is only concerned with predicting >> secreted proteins, meaning that we don't currently have any explicit targeting >> information. Is this something we could represent using the GO ontology for cellular >> localization? Do we also need some free text columns? Let me know and I'll make >> the changes. All the SignalP-specific columns appear to be nullable, so we don't >> necessarily have to do anything except add the new columns for the manually curated >> information. >> >> > After talking to the curators it appears that GO component supplements > targeting information at the feature level but will not be enough. > The targeting information is represented by the component ontology in > one context i.e. mitochondrial, nuclear, membrane localization but not > in the context of the actual residues involved. > The actual residues involved in the targeting (or any other phenomena) > need to be represented by a protein feature ontology that can be mapped onto > specific amino acids of a protein. > This ontology is the equivalent of Sequence Ontology (SO) which is meant > for DNA features. It is being prepared by Val Wood with input from > Swiss-prot. OK, so the idea is that the various signal peptides have been classified into named classes that should be represented by some kind of ontology? > As you're going to add an extra attribute sequence_ontology_id to the NA > Features, could you do the same to any AA Features ? 
This will only work if the new ontology is actually part of the Sequence Ontology (or if we use the SequenceOntology table to store both ontologies.) Do you know if this is the case? It's quite possible, since the SO does already cover amino acid features. Otherwise we'll have to create a separate AASequenceOntology (or whatever the new ontology is called). >> *Transmembrane domain features (stored in DoTS.PredictedAAFeature) >> "PlasmoDB web site shows hydrophobicity graphics, where is it stored in GUS?" >> The hydrophobicity plots are computed dynamically based on the amino acid sequence. >> Transmembrane domains are currently stored in the PredictedAAFeature view, although >> I will probably create a new view for them when I get around to eliminating >> PredictedAAFeature. Another possibility would be to treat TM domains as another >> type of domain, and store them in DomainFeature. What do you think about this? >> >> > I reckon they could be merged. OK, sounds good. >> *Post-translational modification features (new view: DoTS:PostTranslationalModFeature) >> Has a "type" column to represent the type of modification. It was also suggested >> that we have a column called "modified_by", which would be a reference to the >> Interaction table. However, isn't it possible that the same post-translational >> modification (e.g., phosphorylation of a specific amino acid) could be the result >> of one of several Interactions? >> > yes you're right, the effector could be different. In that case the > attribute > "modified_by" is not useful. > >> This argues for an additional relationship >> between Interaction and PostTranslationalModFeature, unless we're OK creating >> multiple PostTranslationalModFeatures, identical except for their modified_by >> attribute. Comments on this? >> >> > I don't think they should be duplicated as they corresponds to a unique > site. This unique feature would > be associated with different interaction entries. 
We might not need an > extra table between Interaction and PostTranslationalModFeature though. > We still can do the following query : "give me all the interaction > entries which target is a PostTranslationalModFeature which id is ...". > How does it sound ? We could do this, although one question is whether, semantically speaking, the "target" of an Interaction should be "the thing to be modified" (e.g. an unphosphorylated sequence or residue) or "the resulting modification" (e.g. the feature that represents a phosphorylated residue at the appropriate location.) The answer is probably that we just shouldn't worry about it and should just do whatever is most convenient on a case-by-case basis. To do it "correctly" would be problematic either way. For example, if we say that the target is the thing to be modified, then we have to create a feature that represents a region of sequence that *could* be modified in some way and then create another feature to represent the actual modification. But if we say that the target is the result of the modification then we may have to create equally unusual tables/views. For example, if the result of a given interaction is to degrade a protein, then do we have to create a table/object that represents a degraded protein (or "nothing", or whatever it is that's left after the modification)? For now I have no problem with interpreting the "target" based on context, but in the longer term we may want to consider separating the notions of "target prior to modification" and either "target after modification" or "effect of modification". I also realized belatedly that I could have left the Interaction table unchanged, rather than introducing specific references to RowSet. This would have allowed us to represent either singleton effectors/targets or set-valued effectors/targets, without having to always join through RowSet in the singleton case. 
On the other hand, if we do associate some additional information with the RowSets, then the current representation is correct. >> *AA repeats (new view: RepeatRegionAAFeature) >> I called this view RepeatRegionAAFeature in case we want to have a similar view >> for NASequences. I also created only a single view, instead of following Arnaud's >> original suggestion, which was for both: >> >> * RepeatRegionFeature as a set of RepeatUnitFeatures, >> * RepeatUnitFeature, with the consensus sequence, name and size >> >> I based the design of this view on that of TandemRepeatFeature, which we have for >> NASequences already. Instead of splitting the consensus sequence, name, and size >> into a separate table, they occupy columns in RepeatRegionAAFeature. This works >> quite well for the tandem repeats we already have (for DNA sequences.) However, if >> there is a known set of named amino acid sequence repeats, then it would probably >> make sense to do what Arnaud suggested, and store these in a separate table >> (likely named RepeatUnit, not RepeatUnitFeature, since they would have no unique >> locations.) Does this sound reasonable? That is, put the consensus seqs in the >> repeat region table itself if they're anonymous, but if they've been named, then >> store them in a separate table. Also note that this view has a reference to >> RepeatType, although the current contents of this table are probably applicable >> only to DNA sequence repeats (LINEs, SINEs, ALUs, etc.), since I believe that I >> parsed them out of RepBase. >> >> > I proposed a separate repeat feature because one may want to annotate a > repeat outside a repeat region, for example LTR repeats attached to a > given transposable element. These RepeatFeatures or RepeatUnitFeatures > can then have a location. > The other case is when a repeat region is made of a set of different > repeat units. OK, I didn't realize that this was what you were trying to represent. 
As currently conceived, RepeatRegionAAFeature is meant to represent a region that contains one or more immediately adjacent copies of the same type of (amino acid sequence) repeat. The assumption is also that these regions will typically be maximal (with respect to the chosen repeat type, consensus, and max. mismatch, the last of which is not represented directly in the table.) We can still represent more complex repeat structures using this single table, but the representation is implicit, not explicit (i.e. you have to do a query to find out what other repeats lie within the bounds of the transposon, meaning that there's no easy way to query for all transposable elements with a particular flanking LTR structure.) Do you want to come up with a 2-table version of what I've done? The use cases aren't clear enough in my mind yet for me to be able to do it. It seems that the bare minimum we need is just another column in the RepeatRegionAAFeature, parent_id; which would let us represent explicitly that a particular repeat is a *necessary* (versus incidental) component of another NA/AAFeature. Both AAFeatureImp and NAFeatureImp already have a parent_id, so this would be a straightforward change. The queries still might not be terribly efficient, but I don't know what exactly you wanted to support in terms of queries, versus just making sure that the representation is sufficiently rich to capture the structure. > In any case, NA repeats and AA repeats should have the same design. Just > the controlled vocabulary representing the types of repeats will be > different. Absolutely, yes, although one question is whether AA repeats can have the same kind of nested structure that you mention as a possibility for NA repeats (the transposon with LTRs). I don't know the answer to this. >>-DoTS.Interaction (table modified, dependent tables added) >> *Added "has_direction" column, as discussed previously. 
The idea here is that >> not all interactions (particularly physical ones, e.g., dimerization) have a >> direction. If has_direction == 0, then the value of direction_is_known can >> be ignored. >> *Added non-nullable "effector_action_type_id" column, referencing >> DoTS.EffectorActionType (a new table.) This column/table represents the possible >> things that an effector can do to a target. For example, the InteractionType >> associated with the Interaction could be "binds to" (e.g., a promoter region), and >> the EffectorActionType for that Interaction could be to either "inhibit" or "enhance" >> expression of the corresponding gene. >> *Replaced effector_table_id and effector_row_id with effector_row_set_id, and >> similarly for the target_table_id and target_row_id. This allows us to represent >> the interaction of a set of objects (the effector) with another set of objects >> (the target.) Previously the Interaction table could only represent the interaction >> between a single pair of entities (OK if they happened to be Complexes, for example, >> but a potential problem in other situations.) Now both effector and target are >> represented as references to DoTS.RowSet, which in turn references DoTS.RowSetMember, >> which...in turn...references the individual database rows that comprise the effector >> or target. These tables (RowSet and RowSetMember) are essentially the same as >> Complex and ComplexComponent, except that they are totally generic; they can be >> used to group any set of rows in the database and they store no additional information. >> However, if there are any additional columns that we can think of (that are specific >> to Interactions) then these tables should be replaced by less generic ones (e.g. >> InteractingEntitySet or InteractionSet, or something along those lines.) >> >> > Sounds fine. The only thing I can see is regarding the > EffectorActionType.
If each effector, member of a RowSet, has a > different action type, the attribute, effector_action_type_id, should go > in the RowSetMember table. I don't have any example though. OK, I think I'd be inclined to wait until we have some use cases for this. Although the current schema lets us group effectors together, it doesn't let us say (for example) that E1 interacts *directly* with T1 to phosphorylate it, but that E1's active site is only exposed when E1 is bound to E2. In other words, E1's role in the activity can be viewed as "primary", and E2's role is secondary (in some sense) but all we can say in the schema is that the Complex consisting of E1 and E2 interacts with T1 to phosphorylate it. I think that the solution we have now is OK, but it only lets us represent the overall action of the entire set of effectors. Jonathan -- Jonathan Crabtree Center for Bioinformatics, University of Pennsylvania 1406 Blockley Hall, 423 Guardian Drive Philadelphia, PA 19104-6021 215-573-3115 |
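The RowSet-based Interaction design discussed in this message can be sketched roughly as follows. This is only an illustration in SQLite (through Python's sqlite3 module), not the Oracle GUS schema: the table and column names RowSet, RowSetMember, Interaction, EffectorActionType, effector_row_set_id, target_row_set_id, effector_action_type_id and has_direction come from the thread, while the column types, the table_name column on RowSetMember, the "Protein" table name, the sample ids and the helper functions are assumptions. It shows that Arnaud's query (all interaction entries whose target is a given PostTranslationalModFeature) can be answered by joining through RowSetMember, without an extra linking table and without duplicating the feature.

```python
import sqlite3

SCHEMA = """
CREATE TABLE RowSet (row_set_id INTEGER PRIMARY KEY);
-- Assumption: a member is identified by a table name plus a row id.
CREATE TABLE RowSetMember (
    row_set_member_id INTEGER PRIMARY KEY,
    row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    table_name TEXT NOT NULL,
    row_id INTEGER NOT NULL
);
CREATE TABLE EffectorActionType (
    effector_action_type_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE Interaction (
    interaction_id INTEGER PRIMARY KEY,
    effector_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    target_row_set_id INTEGER NOT NULL REFERENCES RowSet(row_set_id),
    effector_action_type_id INTEGER NOT NULL
        REFERENCES EffectorActionType(effector_action_type_id),
    has_direction INTEGER NOT NULL CHECK (has_direction IN (0, 1))
);
"""


def build_demo_db():
    """Create an in-memory database with two effectors acting on one site."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO EffectorActionType VALUES (1, 'phosphorylates')")
    # Row sets 1 and 2 hold two different (hypothetical) effector proteins;
    # row set 3 holds the single (hypothetical) modification feature 900.
    conn.executemany("INSERT INTO RowSet VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO RowSetMember (row_set_id, table_name, row_id) VALUES (?, ?, ?)",
        [
            (1, "Protein", 101),
            (2, "Protein", 102),
            (3, "PostTranslationalModFeature", 900),
        ],
    )
    # Two interactions share the same target row set, so the feature is stored once.
    conn.executemany(
        "INSERT INTO Interaction VALUES (?, ?, ?, ?, ?)",
        [(10, 1, 3, 1, 1), (11, 2, 3, 1, 1)],
    )
    conn.commit()
    return conn


def interactions_targeting(conn, table_name, row_id):
    """Return ids of all interactions whose target set contains the given row."""
    rows = conn.execute(
        """
        SELECT i.interaction_id
        FROM Interaction i
        JOIN RowSetMember m ON m.row_set_id = i.target_row_set_id
        WHERE m.table_name = ? AND m.row_id = ?
        ORDER BY i.interaction_id
        """,
        (table_name, row_id),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(interactions_targeting(conn, "PostTranslationalModFeature", 900))
```

Even in the singleton case the query has to join through RowSetMember, which is the cost Jonathan mentions above; the benefit is that set-valued effectors and targets need no schema change.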
From: Arnaud K. <ax...@sa...> - 2003-01-14 11:27:06
|
Hi Jonathan Thanks for doing this. Please find below some comments I've inserted. I have noticed that your changes don't cover the DNA/RNA features. Is there any reason for this ? I know there are quite a lot of them and there might be another way of storing some information such as telomere or centromere regions, origin of replication, inflection point, etc. All these features are covered by the Sequence Ontology, so a new ChromosomeElement or ChromosomeRegion feature could be generic enough to cover most of them. Let me know what you think. cheers Arnaud Jonathan Crabtree wrote: >Hi all- > >The attached text file describes the schema changes that I just finished >implementing. It's attached as a separate file to avoid problems with the >mail clients changing the line wrapping. Sorry if there are any typos, >but it's getting late and I want to get this out there for everyone to >look at in the morning. > >Jonathan > > > >------------------------------------------------------------------------ > > >Hi all- > >Here are the schema changes that I've just finished implementing: > >-Protein properties (new table: DoTS.ProteinProperty) > A new table that Arnaud requested back in July, but was overlooked in the earlier > schema changes. There are four possible protein properties as represented by the > following constraint (we could instead have a ProteinPropertyType table and treat > this as a controlled vocabulary): > > alter table DOTS.PROTEINPROPERTY add constraint PROTEINPROPERTY_CK01 check > (property_name in ('isoelectric point', 'molecular mass', 'charge', 'average residue mass')); > > The table allows multiple protein properties of the same type to be associated with > entries in DoTS.AASequenceImp. Arnaud had suggested originally that the last > property, average residue mass, could actually be an attribute of the table that > stores the protein sequence itself.
However, it seemed that if the molecular > mass attribute could have multiple values (e.g., from different experiments) then > the same should be true of the average residue mass, which is essentially a > derived property. Let me know if you disagree with this, or think we should > create an explicit controlled vocab. for these 4 properties. > > A controlled vocabulary table with the four attributes you've mentioned is fine. >-Protein features > *Signal peptide features (stored in DoTS.SignalPeptideFeature) > This view exists already, as DoTS.SignalPeptideFeature, but we need to add the > ability to store curated data, such as targeting information. It should be > straightforward to modify the view to accommodate this, but I'm not sure exactly > what needs to be stored. Currently we use the view exclusively for SignalP > predictions, and from what I understand SignalP is only concerned with predicting > secreted proteins, meaning that we don't currently have any explicit targeting > information. Is this something we could represent using the GO ontology for cellular > localization? Do we also need some free text columns? Let me know and I'll make > the changes. All the SignalP-specific columns appear to be nullable, so we don't > necessarily have to do anything except add the new columns for the manually curated > information. > > After talking to the curators it appears that the GO component ontology supplements targeting information at the feature level but will not be enough. The targeting information is represented by the component ontology in one context, i.e. mitochondrial, nuclear, or membrane localization, but not in the context of the actual residues involved. The actual residues involved in the targeting (or any other phenomenon) need to be represented by a protein feature ontology that can be mapped onto specific amino acids of a protein. This ontology is the equivalent of the Sequence Ontology (SO), which is meant for DNA features.
It is being prepared by Val Wood with input from Swiss-Prot. As you're going to add an extra attribute sequence_ontology_id to the NA Features, could you do the same to any AA Features ? > *Domain/motif features (new view: DoTS.DomainFeature) > I've created this as a view on AAFeatureImp. You can either use the NAME column to > specify the type of domain (e.g., "leucine zipper" or "coiled coil"), or include > an explicit reference to a domain/motif database (SMART, ProSite) using the > external_database_release_id and source_id columns. Pfam is handled as a special > case, with a specific pfam_entry_id column that references the PfamEntry table. > This was originally done because the entries in the Pfam database are HMMs, so > they don't fit too well in the sequence-related tables. Most other motif databases > have consensus sequences for their motifs that we can store in MotifAASequence. > > Note that motif/domain features are currently stored in GUS in the PredictedAAFeature > table, which is also a view on AAFeatureImp. After the migration I plan to eliminate > the PredictedAAFeature view and move its contents into feature-specific tables (like > DomainFeature) instead. > > *Transmembrane domain features (stored in DoTS.PredictedAAFeature) > "PlasmoDB web site shows hydrophobicity graphics, where is it stored in GUS?" > The hydrophobicity plots are computed dynamically based on the amino acid sequence. > Transmembrane domains are currently stored in the PredictedAAFeature view, although > I will probably create a new view for them when I get around to eliminating > PredictedAAFeature. Another possibility would be to treat TM domains as another > type of domain, and store them in DomainFeature. What do you think about this? > > I reckon they could be merged. > *Post-translational modification features (new view: DoTS.PostTranslationalModFeature) > Has a "type" column to represent the type of modification.
It was also suggested > that we have a column called "modified_by", which would be a reference to the > Interaction table. However, isn't it possible that the same post-translational > modification (e.g., phosphorylation of a specific amino acid) could be the result > of one of several Interactions? > Yes, you're right, the effector could be different. In that case the attribute "modified_by" is not useful. > This argues for an additional relationship > between Interaction and PostTranslationalModFeature, unless we're OK creating > multiple PostTranslationalModFeatures, identical except for their modified_by > attribute. Comments on this? > > I don't think they should be duplicated as they correspond to a unique site. This unique feature would be associated with different interaction entries. We might not need an extra table between Interaction and PostTranslationalModFeature though. We still can do the following query : "give me all the interaction entries whose target is a PostTranslationalModFeature whose id is ...". How does it sound ? > *AA repeats (new view: RepeatRegionAAFeature) > I called this view RepeatRegionAAFeature in case we want to have a similar view > for NASequences. I also created only a single view, instead of following Arnaud's > original suggestion, which was for both: > > * RepeatRegionFeature as a set of RepeatUnitFeatures, > * RepeatUnitFeature, with the consensus sequence, name and size > > I based the design of this view on that of TandemRepeatFeature, which we have for > NASequences already. Instead of splitting the consensus sequence, name, and size > into a separate table, they occupy columns in RepeatRegionAAFeature. This works > quite well for the tandem repeats we already have (for DNA sequences.)
However, if > there is a known set of named amino acid sequence repeats, then it would probably > make sense to do what Arnaud suggested, and store these in a separate table > (likely named RepeatUnit, not RepeatUnitFeature, since they would have no unique > locations.) Does this sound reasonable? That is, put the consensus seqs in the > repeat region table itself if they're anonymous, but if they've been named, then > store them in a separate table. Also note that this view has a reference to > RepeatType, although the current contents of this table are probably applicable > only to DNA sequence repeats (LINEs, SINEs, ALUs, etc.), since I believe that I > parsed them out of RepBase. > > I proposed a separate repeat feature because one may want to annotate a repeat outside a repeat region, for example LTR repeats attached to a given transposable element. These RepeatFeatures or RepeatUnitFeatures can then have a location. The other case is when a repeat region is made of a set of different repeat units. In any case, NA repeats and AA repeats should have the same design. Just the controlled vocabulary representing the types of repeats will be different. > *2D structures (not currently represented) > "Another question : What about 2D structures (beta-sheet and alpha-helix) in GUS?" > I don't *believe* we have any of these. They should be easy to add as either a > single feature view, or a set of views. > > Fine. >-DoTS.Interaction (table modified, dependent tables added) > *Added "has_direction" column, as discussed previously. The idea here is that > not all interactions (particularly physical ones, e.g., dimerization) have a > direction. If has_direction == 0, then the value of direction_is_known can > be ignored. > *Added non-nullable "effector_action_type_id" column, referencing > DoTS.EffectorActionType (a new table.) This column/table represents the possible > things that an effector can do to a target.
For example, the InteractionType > associated with the Interaction could be "binds to" (e.g., a promoter region), and > the EffectorActionType for that Interaction could be to either "inhibit" or "enhance" > expression of the corresponding gene. > *Replaced effector_table_id and effector_row_id with effector_row_set_id, and > similarly for the target_table_id and target_row_id. This allows us to represent > the interaction of a set of objects (the effector) with another set of objects > (the target.) Previously the Interaction table could only represent the interaction > between a single pair of entities (OK if they happened to be Complexes, for example, > but a potential problem in other situations.) Now both effector and target are > represented as references to DoTS.RowSet, which in turn references DoTS.RowSetMember, > which...in turn...references the individual database rows that comprise the effector > or target. These tables (RowSet and RowSetMember) are essentially the same as > Complex and ComplexComponent, except that they are totally generic; they can be > used to group any set of rows in the database and they store no additional information. > However, if there are any additional columns that we can think of (that are specific > to Interactions) then these tables should be replaced by less generic ones (e.g. > InteractingEntitySet or InteractionSet, or something along those lines.) > > Sounds fine. The only thing I can see is regarding the EffectorActionType. If each effector, member of a RowSet, has a different action type, the attribute, effector_action_type_id, should go in the RowSetMember table. I don't have any example though. >-DoTS.Attribution (new table) > A table intended to allow us to attribute data to people and/or organizations, using > the Contact table. It is a many-to-1 relationship between SRes.Contact and any row in > the DoTS schema. > > Fine. We already agreed on this implementation.
>-SRes.BibRefType (new table), SRes.BibliographicReference (modified table) > Added new table, BibRefType, to represent the different types of references/publications > that one might encounter. I've populated this table based on a combination of the terms > used in MEDLINE 2003 and those from FlyBase, as well as one or two rows of my own > devising. Correspondingly, a non-nullable column has been added to BibliographicReference > to allow one to specify the BibRefType of the reference. I've also added a contact_id > column to BibliographicReference, to be used in the case where the BibRefType == > "personal communication". You can find the rows that I've placed in BibRefType in the > 3.0 db creation scripts mentioned below (in the file gus30-sres-BibRefType-rows.sql). > >I think those are the main changes. I also wrote a couple of scripts to help check >and maintain the version tables (those ending in "Ver") and to check that the actual >database schema and the information stored in Core.TableInfo actually agree. In the >process I fixed a number of problems, although there are still some things to be done, >such as: > > -Create SEQUENCE objects for all the tables (or at least modify the database dump > script to generate CREATE SEQUENCE statements for all the tables) > -Check that all the foreign key constraints have been defined correctly > -Check that all the foreign key columns are indexed correctly (I have a script > that will do this) > -Add sequence_ontology_id to all the views on NAFeatureImp. > >I've updated the schema browser, so you should be able to see all the new and >modified tables online: > > <http://www.cbil.upenn.edu/cgi-bin/GUS30/schemaBrowser.pl?db=GUS30> > >There's also a preliminary dump of the create database scripts, which should be consistent >with what's shown in the schema browser: > > <http://www.cbil.upenn.edu/downloads/GUS/releases/3.0-beta/schema/> > >Jonathan > > > > |
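The choice raised in this message, between the CHECK constraint on property_name and a ProteinPropertyType controlled-vocabulary table, can be sketched as below. This is a hedged illustration in SQLite via Python's sqlite3 module, not the actual Oracle DDL: the four property names and the table names ProteinProperty and ProteinPropertyType come from the thread, but every column other than property_name (aa_sequence_id, value, protein_property_type_id, name), the helper functions and the sample values are assumptions invented for the example.

```python
import sqlite3

# The four property names are taken from the CHECK constraint quoted in the message.
PROPERTY_NAMES = ("isoelectric point", "molecular mass", "charge", "average residue mass")


def make_check_constraint_db():
    """Variant 1: the allowed names are hard-coded in a CHECK constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE ProteinProperty (
            protein_property_id INTEGER PRIMARY KEY,
            aa_sequence_id INTEGER NOT NULL,   -- assumed link to AASequenceImp
            property_name TEXT NOT NULL CHECK (property_name IN (
                'isoelectric point', 'molecular mass', 'charge', 'average residue mass')),
            value REAL NOT NULL                -- assumed numeric value column
        )
        """
    )
    return conn


def make_vocabulary_db():
    """Variant 2: the allowed names are rows in a ProteinPropertyType table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled
    conn.executescript(
        """
        CREATE TABLE ProteinPropertyType (
            protein_property_type_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE ProteinProperty (
            protein_property_id INTEGER PRIMARY KEY,
            aa_sequence_id INTEGER NOT NULL,
            protein_property_type_id INTEGER NOT NULL
                REFERENCES ProteinPropertyType(protein_property_type_id),
            value REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO ProteinPropertyType (name) VALUES (?)",
        [(name,) for name in PROPERTY_NAMES],
    )
    conn.commit()
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


CHECK_SQL = "INSERT INTO ProteinProperty (aa_sequence_id, property_name, value) VALUES (?, ?, ?)"
VOCAB_SQL = (
    "INSERT INTO ProteinProperty (aa_sequence_id, protein_property_type_id, value) "
    "VALUES (?, ?, ?)"
)

check_db = make_check_constraint_db()
# Two molecular mass values for the same sequence (made-up numbers), as the table allows.
print(try_insert(check_db, CHECK_SQL, (1, "molecular mass", 50321.0)))
print(try_insert(check_db, CHECK_SQL, (1, "molecular mass", 50340.5)))
# A name outside the four allowed values is rejected by the CHECK constraint.
print(try_insert(check_db, CHECK_SQL, (1, "half-life", 3.0)))

vocab_db = make_vocabulary_db()
print(try_insert(vocab_db, VOCAB_SQL, (1, 2, 50321.0)))  # type id 2 exists
print(try_insert(vocab_db, VOCAB_SQL, (1, 99, 3.0)))     # unknown type id is rejected
```

With the vocabulary table, adding a fifth property type is an INSERT rather than a change to the constraint, which is one practical argument for the controlled-vocabulary variant that both sides agree on in the thread.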
From: Arnaud K. <ax...@sa...> - 2003-01-10 10:05:30
|
Thanks Jonathan, we'll have a look at it. cheers and good night ! Arnaud On Fri, 10 Jan 2003, Jonathan Crabtree wrote: > > Hi all- > > The attached text file describes the schema changes that I just finished > implementing. It's attached as a separate file to avoid problems with the > mail clients changing the line wrapping. Sorry if there are any typos, > but it's getting late and I want to get this out there for everyone to > look at in the morning. > > Jonathan > > Arnaud Kerhornou The Wellcome Trust Sanger Institute The Pathogen Sequencing Unit Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK Work: +44 (0) 1223 494955 Fax: +44 (0) 1223 494919 |
From: Jonathan C. <cra...@sn...> - 2003-01-10 07:47:39
|
Hi all- The attached text file describes the schema changes that I just finished implementing. It's attached as a separate file to avoid problems with the mail clients changing the line wrapping. Sorry if there are any typos, but it's getting late and I want to get this out there for everyone to look at in the morning. Jonathan |
From: Jonathan C. <cra...@pc...> - 2003-01-09 17:42:22
|
Hi Arnaud- Yes, I did have a good break, thanks, although the only work I did was to carry around a 10 pound laptop, which ended up not getting a lot of use. I have one more schema change to make (the new table we discussed for handling data attribution) before I can proceed to moving the data. I'll send out another e-mail later today covering all the schema changes that I've made. We have talked about most of them already, although we haven't necessarily discussed exactly how each should be implemented. So there are a few places where I've implemented something, but there's probably room for improvement, and your feedback would be helpful. I'll let you know as soon as the changes are complete (and the schema browser updated) so you can take a look at what I've done. Jonathan Arnaud Kerhornou wrote: > Hi Jonathan > > Hope you had a good Christmas break. > How is it going regarding GUS3 migration ? > > Thanks for giving me an update > cheers > Arnaud > |
From: Arnaud K. <ax...@sa...> - 2003-01-09 17:16:39
|
Hi Jonathan Hope you had a good Christmas break. How is it going regarding GUS3 migration ? Thanks for giving me an update cheers Arnaud |
From: Jonathan C. <cra...@pc...> - 2002-12-12 21:20:50
|
Arnaud Kerhornou wrote: > Hi Jonathan > > I've noticed that the GUS3 schema is browsable and that you've > integrated phenotypes. It looks fine. I can't find the new NA/AA > feature views we submitted though (e.g. PeptideProperty). Are you > working on it ? > > cheers > Arnaud > Arnaud- Yes, that's right, I've added the Phenotype tables. I was sidetracked for a couple of days but I'm finishing up the rest of the changes now, including the PeptideProperty table and some adjustments to the GO-related tables. Jonathan |
From: Arnaud K. <ax...@sa...> - 2002-12-12 13:59:09
|
Hi Jonathan I've noticed that the GUS3 schema is browsable and that you've integrated phenotypes. It looks fine. I can't find the new NA/AA feature views we submitted though (e.g. PeptideProperty). Are you working on it ? cheers Arnaud |
From: Chris S. <sto...@pc...> - 2002-12-04 12:46:42
|
Matt, Thanks! Looks like there are a number of direct mappings which is great. I've been working on the MGED ontology this week while Helen Parkinson has been visiting but will get back to this tomorrow and try to fill in gaps and address questions. Cheers, Chris On Tuesday, December 3, 2002, at 02:21 PM, Matthew Berriman wrote: > Chris > Here is my attempt to map GUS to SO (I've also attached it as because > I'm not sure how well my browser preserves the tabs) There are some big > gaps, such as "source", protein features and miscellaneous features. > > Cheers > Matt > ... > > > SUBCLASS_VIEW > <BindingSiteFeature: names of binding sites> > CentromereFeature: > predicted centromere ???? SO:0000577 centromere > DNARegulatory: > -10_signal SO:0000175 -10 signal > -35_signal SO:0000176 -35 signal > CAAT_signal SO:0000172 CAAT signal > GC_signal SO:0000173 GC-rich region > TATA_signal SO:0000174 TATA box > attenuator SO:0000140 attenuator > enhancer SO:0000165 enhancer * > promoter SO:0000167 promoter * > terminator SO:0000141 terminator * > DNAStructure: > D-loop SO:0000297 D-loop > primer_bind ???? SO:0005850 primer binding site > (sub-class of LTR-retrotransposon) > rep_origin SO:0000296 origin of replication > ExonFeature > ExonFeature SO:0000147 exon * > GeneFeature > CDS ???? SO:0001254 coding sequence predicted > GeneFeature SO:0000704 gene element > gene SO:0000008 gene (sensu > yourfavoriteorganism) > snRNA SO:0000274 small nuclear RNA > tRNA SO:0000253 transfer RNA > HexamerFeature > hexamer ???? > Immunoglobulin SO:0000460 vertebrate immunoglobulin/T-cell > receptor gene > C-region ???? SO:0000478 C-gene > D_segment ???? SO:0000458 D-gene > J_segment ???? SO:0000470 J-gene > N_region ???? > S_region ???? > V_region ???? SO:0000466 V-gene > V_segment ???? > iDNA ???? > LowComplexityNAFeature > low complexity sequence ???? > Miscellaneous > misc_binding ???? > misc_feature ???? > misc_signal ???? > protein_bind ???? 
> stem_loop SO:0000313 stem loop > unsure ???? > PolyAFeature > PolyAFeature ???? SO:0000610 polyA sequence > ???? SO:0000553 polyA site > PromoterFeature > PromoterFeature ???? How does this differ from > DNARegulatory>Promoter ? differ from > DNARegulatory>Promoter ? > various promoter sets ???? > ProteinFeature > mat_peptide ???? N/A > sig_peptide ???? N/A > transit_peptide ???? N/A > RNAFeature > PREDICTED REFSEQ ???? > PROVISIONAL REFSEQ ???? > REFSEQ ???? > REVIEWED REFSEQ ???? > RNAFeature ???? SO:0000184 transcript region > assembly ???? > mRNA SO:0000234 messenger RNA > RNAStructure > RBS SO:0000552 5'-ribosome binding site > misc_RNA ???? > RNAType > rRNA ???? is this different to the GeneFeature>rRNA? > scRNA ???? > snRNA ???? is this different to the GeneFeature>rRNA? > tRNA ???? is this different to the GeneFeature>rRNA? > <Repeats: names of repeats> > RestrictionFragmentFeature > RestrictionFragmentFeature ???? Is this to describe the DNA > source? > <SAGETagFeature: locations on chromosomes> > STS > STS ???? > ScaffoldGapFeature > assembly gap ???? > SeqVariation > SNP SO:0000694 single nucleotide polymorphism > allele ???? > conflict ???? > misc_difference ???? > misc_recomb ???? > modified_base SO:0000199 modified DNA base > feature > mutation ???? SO:1000170 uncharacterised > chromosomal mutation > old_sequence ???? > variation SO:1000128 sequence variation > feature > Source > source ???? > TandemRepeatFeature > tandem repeat SO:0000705 tandem repeat > TelomereFeature > predicted telomere ???? SO:0000624 telomere > Transcript SO:0000673 transcript > 3'UTR SO:0000205 3'-untranslated region > 3'clip SO:0000557 3'-clip > 5'UTR SO:0000204 5'-untranslated region > 5'clip SO:0000555 5'-clip > CDS ???? SO:0001254 coding sequence predicted > exon SO:0000147 exon > intron SO:0000188 intron > mRNA SO:0000234 messenger RNA > polyA_signal SO:0000551 polyA signal sequence > precursor_RNA ???? 
> prim_transcript SO:0000185 primary > transcript > >> -----Original Message----- >> From: Chris Stoeckert [mailto:sto...@pc...] >> Sent: 26 November 2002 19:37 >> To: Matthew Berryman >> Cc: gusdev-gusdev >> Subject: Re: GUS features >> >> >> Hi Matt, >> I've generated a list of the 26 subclass views of NAFeatureImp and >> looked at the distinct names associated with. These are given below. >> Will now start looking at the SO and SO lite for these. >> Chris >> >> On Friday, November 22, 2002, at 12:59 PM, Matt Berriman wrote: >> >>> Hi Chris >>> Do you have a list of GUS sequence features? The SO group are >>> preparing a "SO Lite" -- it would be good to check that >> everything is >>> represented. >>> >>> cheers >>> Matt >> >> SUBCLASS_VIEW >> <BindingSiteFeature: names of binding sites> >> CentromereFeature: >> predicted centromere >> DNARegulatory: >> -10_signal >> -35_signal >> CAAT_signal >> GC_signal >> TATA_signal >> attenuator >> enhancer >> promoter >> terminator >> DNAStructure: >> D-loop >> primer_bind >> rep_origin >> ExonFeature >> ExonFeature >> GeneFeature >> CDS >> GeneFeature >> gene >> snRNA >> tRNA >> HexamerFeature >> hexamer >> Immunoglobulin >> C-region >> D_segment >> J_segment >> N_region >> S_region >> V_region >> V_segment >> iDNA >> LowComplexityNAFeature >> low complexity sequence >> Miscellaneous >> misc_binding >> misc_feature >> misc_signal >> protein_bind >> stem_loop >> unsure >> PolyAFeature >> PolyAFeature >> PromoterFeature >> PromoterFeature >> various promoter sets >> ProteinFeature >> mat_peptide >> sig_peptide >> transit_peptide >> RNAFeature >> PREDICTED REFSEQ >> PROVISIONAL REFSEQ >> REFSEQ >> REVIEWED REFSEQ >> RNAFeature >> assembly >> mRNA >> RNAStructure >> RBS >> misc_RNA >> RNAType >> rRNA >> scRNA >> snRNA >> tRNA >> <Repeats: names of repeats> >> RestrictionFragmentFeature >> RestrictionFragmentFeature >> <SAGETagFeature: locations on chromosomes> >> STS >> STS >> ScaffoldGapFeature >> assembly gap >> SeqVariation 
>> SNP >> allele >> conflict >> misc_difference >> misc_recomb >> modified_base >> mutation >> old_sequence >> variation >> Source >> source >> TandemRepeatFeature >> tandem repeat >> TelomereFeature >> predicted telomere >> Transcript >> 3'UTR >> 3'clip >> 5'UTR >> 5'clip >> CDS >> exon >> intron >> mRNA >> polyA_signal >> precursor_RNA >> prim_transcript >> >> > <GUS2SO.txt> |
From: Matthew B. <mb...@sa...> - 2002-12-03 19:21:16
|
Chris

Here is my attempt to map GUS to SO (I've also attached it as because I'm
not sure how well my browser preserves the tabs)

There are some big gaps, such as "source", protein features and
miscellaneous features.

Cheers
Matt
...

SUBCLASS_VIEW
<BindingSiteFeature: names of binding sites>
CentromereFeature:
    predicted centromere        ????  SO:0000577  centromere
DNARegulatory:
    -10_signal                        SO:0000175  -10 signal
    -35_signal                        SO:0000176  -35 signal
    CAAT_signal                       SO:0000172  CAAT signal
    GC_signal                         SO:0000173  GC-rich region
    TATA_signal                       SO:0000174  TATA box
    attenuator                        SO:0000140  attenuator
    enhancer                          SO:0000165  enhancer *
    promoter                          SO:0000167  promoter *
    terminator                        SO:0000141  terminator *
DNAStructure:
    D-loop                            SO:0000297  D-loop
    primer_bind                 ????  SO:0005850  primer binding site (sub-class of LTR-retrotransposon)
    rep_origin                        SO:0000296  origin of replication
ExonFeature
    ExonFeature                       SO:0000147  exon *
GeneFeature
    CDS                         ????  SO:0001254  coding sequence predicted
    GeneFeature                       SO:0000704  gene element
    gene                              SO:0000008  gene (sensu yourfavoriteorganism)
    snRNA                             SO:0000274  small nuclear RNA
    tRNA                              SO:0000253  transfer RNA
HexamerFeature
    hexamer                     ????
Immunoglobulin                        SO:0000460  vertebrate immunoglobulin/T-cell receptor gene
    C-region                    ????  SO:0000478  C-gene
    D_segment                   ????  SO:0000458  D-gene
    J_segment                   ????  SO:0000470  J-gene
    N_region                    ????
    S_region                    ????
    V_region                    ????  SO:0000466  V-gene
    V_segment                   ????
    iDNA                        ????
LowComplexityNAFeature
    low complexity sequence     ????
Miscellaneous
    misc_binding                ????
    misc_feature                ????
    misc_signal                 ????
    protein_bind                ????
    stem_loop                         SO:0000313  stem loop
    unsure                      ????
PolyAFeature
    PolyAFeature                ????  SO:0000610  polyA sequence
                                ????  SO:0000553  polyA site
PromoterFeature
    PromoterFeature             ????  How does this differ from DNARegulatory>Promoter ?
    various promoter sets       ????
ProteinFeature
    mat_peptide                 ????  N/A
    sig_peptide                 ????  N/A
    transit_peptide             ????  N/A
RNAFeature
    PREDICTED REFSEQ            ????
    PROVISIONAL REFSEQ          ????
    REFSEQ                      ????
    REVIEWED REFSEQ             ????
    RNAFeature                  ????  SO:0000184  transcript region
    assembly                    ????
    mRNA                              SO:0000234  messenger RNA
RNAStructure
    RBS                               SO:0000552  5'-ribosome binding site
    misc_RNA                    ????
RNAType
    rRNA                        ????  is this different to the GeneFeature>rRNA?
    scRNA                       ????
    snRNA                       ????  is this different to the GeneFeature>rRNA?
    tRNA                        ????  is this different to the GeneFeature>rRNA?
<Repeats: names of repeats>
RestrictionFragmentFeature
    RestrictionFragmentFeature  ????  Is this to describe the DNA source?
<SAGETagFeature: locations on chromosomes>
STS
    STS                         ????
ScaffoldGapFeature
    assembly gap                ????
SeqVariation
    SNP                               SO:0000694  single nucleotide polymorphism
    allele                      ????
    conflict                    ????
    misc_difference             ????
    misc_recomb                 ????
    modified_base                     SO:0000199  modified DNA base feature
    mutation                    ????  SO:1000170  uncharacterised chromosomal mutation
    old_sequence                ????
    variation                         SO:1000128  sequence variation feature
Source
    source                      ????
TandemRepeatFeature
    tandem repeat                     SO:0000705  tandem repeat
TelomereFeature
    predicted telomere          ????  SO:0000624  telomere
Transcript                            SO:0000673  transcript
    3'UTR                             SO:0000205  3'-untranslated region
    3'clip                            SO:0000557  3'-clip
    5'UTR                             SO:0000204  5'-untranslated region
    5'clip                            SO:0000555  5'-clip
    CDS                         ????  SO:0001254  coding sequence predicted
    exon                              SO:0000147  exon
    intron                            SO:0000188  intron
    mRNA                              SO:0000234  messenger RNA
    polyA_signal                      SO:0000551  polyA signal sequence
    precursor_RNA               ????
    prim_transcript                   SO:0000185  primary transcript

> -----Original Message-----
> From: Chris Stoeckert [mailto:sto...@pc...]
> Sent: 26 November 2002 19:37
> To: Matthew Berryman
> Cc: gusdev-gusdev
> Subject: Re: GUS features
>
> Hi Matt,
> I've generated a list of the 26 subclass views of NAFeatureImp and
> looked at the distinct names associated with. These are given below.
> Will now start looking at the SO and SO lite for these.
> Chris
>
> On Friday, November 22, 2002, at 12:59 PM, Matt Berriman wrote:
>
> > Hi Chris
> > Do you have a list of GUS sequence features? The SO group are
> > preparing a "SO Lite" -- it would be good to check that everything is
> > represented.
> >
> > cheers
> > Matt
|
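A mapping like the one in the table above can be kept as a small lookup so that the gaps (the "????" rows with no SO identifier) are easy to list. The following is only a sketch: the dictionary holds a handful of rows copied from the table, and the names GUS_TO_SO, unmapped and so_id_for are hypothetical, not part of GUS.

```python
# Illustrative subset of the GUS-to-SO mapping from the table above.
# Keys are (subclass view, feature name); None marks a name for which
# no SO term has been proposed yet. All identifiers here are hypothetical.
GUS_TO_SO = {
    ("DNARegulatory", "-10_signal"): "SO:0000175",
    ("DNARegulatory", "TATA_signal"): "SO:0000174",
    ("GeneFeature", "tRNA"): "SO:0000253",
    ("Miscellaneous", "misc_feature"): None,
    ("Miscellaneous", "stem_loop"): "SO:0000313",
    ("Source", "source"): None,
    ("Transcript", "intron"): "SO:0000188",
}


def unmapped(mapping):
    """Return the (view, name) pairs that still lack an SO identifier, sorted."""
    return sorted(key for key, so_id in mapping.items() if so_id is None)


def so_id_for(mapping, view, name):
    """Look up the SO identifier for a feature name; None if absent or unmapped."""
    return mapping.get((view, name))


if __name__ == "__main__":
    for view, name in unmapped(GUS_TO_SO):
        print(f"{view}\t{name}\tno SO term yet")
```

Keying on the (view, name) pair rather than the name alone matters because some names, such as mRNA and CDS, appear under more than one view.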
From: Jonathan C. <cra...@pc...> - 2002-12-02 19:12:09
|
Arnaud Kerhornou wrote:

> Hi Jonathan
>
> How is it going ? Can you give us an update of the GUS3 migration ?
>
> cheers
> Arnaud

Hi Arnaud-

I'm hoping to be able to get it done this week, although I've run into
some technical problems that have to be addressed first. At the very
least I want to get all the schema changes made today and then give
people a couple of days to look them over (and check for errors) while
I continue working on the migration scripts.

Jonathan
|
From: Arnaud K. <ax...@sa...> - 2002-12-02 16:24:00
|
Hi Jonathan

How is it going ? Can you give us an update of the GUS3 migration ?

cheers
Arnaud
|
From: Matthew B. <mb...@sa...> - 2002-11-27 09:59:16
|
Hi Chris

I'll do the same over here

Matt

-----Original Message-----
From: Chris Stoeckert [mailto:sto...@pc...]
Sent: 26 November 2002 19:37
To: Matthew Berryman
Cc: gusdev-gusdev
Subject: Re: GUS features

Hi Matt,
I've generated a list of the 26 subclass views of NAFeatureImp and
looked at the distinct names associated with. These are given below.
Will now start looking at the SO and SO lite for these.
Chris

On Friday, November 22, 2002, at 12:59 PM, Matt Berriman wrote:

> Hi Chris
> Do you have a list of GUS sequence features? The SO group are
> preparing a "SO Lite" -- it would be good to check that everything is
> represented.
>
> cheers
> Matt
|
From: Chris S. <sto...@pc...> - 2002-11-26 19:36:10
|
Hi Matt,
I've generated a list of the 26 subclass views of NAFeatureImp and
looked at the distinct names associated with. These are given below.
Will now start looking at the SO and SO lite for these.
Chris

On Friday, November 22, 2002, at 12:59 PM, Matt Berriman wrote:

> Hi Chris
> Do you have a list of GUS sequence features? The SO group are
> preparing a "SO Lite" -- it would be good to check that everything is
> represented.
>
> cheers
> Matt

SUBCLASS_VIEW
<BindingSiteFeature: names of binding sites>
CentromereFeature:
    predicted centromere
DNARegulatory:
    -10_signal
    -35_signal
    CAAT_signal
    GC_signal
    TATA_signal
    attenuator
    enhancer
    promoter
    terminator
DNAStructure:
    D-loop
    primer_bind
    rep_origin
ExonFeature
    ExonFeature
GeneFeature
    CDS
    GeneFeature
    gene
    snRNA
    tRNA
HexamerFeature
    hexamer
Immunoglobulin
    C-region
    D_segment
    J_segment
    N_region
    S_region
    V_region
    V_segment
    iDNA
LowComplexityNAFeature
    low complexity sequence
Miscellaneous
    misc_binding
    misc_feature
    misc_signal
    protein_bind
    stem_loop
    unsure
PolyAFeature
    PolyAFeature
PromoterFeature
    PromoterFeature
    various promoter sets
ProteinFeature
    mat_peptide
    sig_peptide
    transit_peptide
RNAFeature
    PREDICTED REFSEQ
    PROVISIONAL REFSEQ
    REFSEQ
    REVIEWED REFSEQ
    RNAFeature
    assembly
    mRNA
RNAStructure
    RBS
    misc_RNA
RNAType
    rRNA
    scRNA
    snRNA
    tRNA
<Repeats: names of repeats>
RestrictionFragmentFeature
    RestrictionFragmentFeature
<SAGETagFeature: locations on chromosomes>
STS
    STS
ScaffoldGapFeature
    assembly gap
SeqVariation
    SNP
    allele
    conflict
    misc_difference
    misc_recomb
    modified_base
    mutation
    old_sequence
    variation
Source
    source
TandemRepeatFeature
    tandem repeat
TelomereFeature
    predicted telomere
Transcript
    3'UTR
    3'clip
    5'UTR
    5'clip
    CDS
    exon
    intron
    mRNA
    polyA_signal
    precursor_RNA
    prim_transcript
|
From: Arnaud K. <ax...@sa...> - 2002-11-15 14:04:47
|
Hi Jonathan

cra...@pc... wrote:

>Arnaud-
>
>>>Here a case of what could happen on a given project:
>>>
>>>* The sequences would come from TIGR,
>>>* The gene models would come from SBRI,
>>>* The manual annotation of the gene models and the GO curation would
>>>be done by TIGR,
>>>* The curation would be done by the Sanger,
>>>* Some curated comments would be sent by members of the community.
>>>
>>>Instead of using the evidence table, would it be possible to attribute
>>>data by using the user_id attribute ?
>
>We certainly want to be able to support the situation you describe, and
>although we have some working examples that are similar to this, I don't
>think that the way they're currently implemented (using a combination of
>external_db_id and the ProjectLink table) is necessarily the best answer,
>nor do I think it covers all the possibilities.
>
>I agree with you that the Evidence table is not ideal for this purpose,
>although I'm also not convinced that adapting the 'row_user_id' column
>will be sufficient either. Here are the problems/issues I see with doing
>attribution solely with the user_id:
>
> * Currently a user_id represents a single individual, not an organization.
>   We will be adding a pointer in the UserInfo table to Sres::Contact
>   and I believe that the Contact table *can* be used to represent whole
>   organizations. So by giving a particular user_id to a row in the
>   database we will also be associating it (indirectly) with whatever
>   organization the person in question works for (e.g., TIGR, for sequence
>   data generated there.) There are two problems with this. First, the
>   choice of user_id will likely be arbitrary. That is, should it be the
>   person in charge of the sequencing project at TIGR, or the person who
>   e-mailed the data to us? So far we've tended to use the user_id to
>   reflect the identity of the person who actually loaded the data into
>   the database (typically someone in our lab., or the user_id of an
>   annotator or collaborator editing the database through a web interface.)
>   Second, if the person in question goes on to work for a different
>   organization/company, we can't easily reflect that change without
>   losing the association between the original data and the organization
>   that should get credit for generating them (short of "cloning" the person
>   in the UserInfo table!)
>   Anyway, my point here is that I think we'll want to be able to attribute
>   datasets both to individuals and also (directly) to organizations. Does
>   this sound like a reasonable requirement based on your use-case?
>

Yes, it sounds like it is

>   If so
>   then I think it implies that Sres::Contact might be a better table to use
>   than Sres::UserInfo.
>
>>>e.g. if the gene models are coming from SBRI, the user_id would
>>>acknowledge the gene features as owned by SBRI. Any update would keep
>>>the ownership and would acknowledge who's done the update.
>
> * This brings up my second point/question, which is whether we need to
>   support multiple attributions; I think your example suggests that we do.
>   For example, suppose that, as in your example, SBRI generates an initial
>   set of gene models. Then suppose that--as part of the manual annotation
>   process--an annotator at TIGR determines that one of the exons in one of
>   the SBRI gene models is incorrect (though not by much.) He/she adjusts
>   the 5' boundary of one exon accordingly. Who should now be cited as the
>   source of this gene model? I would say that it should be *both* SBRI and
>   TIGR, but this is not supported by the single 'row_user_id' in the
>   GeneFeature table. In general this is a problem for any kind of derived
>   data, or any data that is likely to be refined over time (like
>   gene models.) Does this agree with what you mean by "manual annotation"?
>   Even if not, I think that we will want to support having multiple
>   attributions, because many of the "curated comments sent by members of
>   the community" that you mention are likely to be corrections to the gene
>   models based on various kinds of evidence, much of it experimental.
>
>Now, if you agree that "attribution" is something that should apply to either
>individuals or organizations, and that can be shared among one or more such
>entities,
>

We agree on this point too.

> then the question is how to represent this in the database. What
>I've argued for so far is to have a many-to-1 relationship between entries in
>the Sres::Contact table and any row in the database (meaning that a new linking
>table would have to be generated.) I would also leave the 'row_user_id' alone,
>using it--as we do now--to represent which user *owns* that row in the database,
>where the users are by definition those who have the ability to alter the
>database directly (by which I mean to include annotators working through an
>interface like Artemis or Apollo.) Does this sound reasonable?
>
>One question that I have not yet considered in detail is how this affects what
>we're currently doing with the ExternalDatabase table. In effect we've used
>this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also
>the more general notion of "externally-generated datasets." For example, the
>published Plasmodium falciparum genomic sequences from TIGR have their own
>ExternalDatabase entry that we use for the purposes of attribution, and the
>sequences from Sanger and Stanford have similar entries. I don't think we
>necessarily want to combine these two different parts of the schema, but there
>are clearly significant overlaps between them, if only because the institution
>that generates the data (Contact) is often also the source of that data (which
>is what the ExternalDatabase is supposed to represent.) But one could imagine
>cases in which we'd want to attribute a particular dataset to one organization
>(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded
>from another (e.g., GenBank or EMBL.) This is the case right now for the draft
>human and mouse genome sequence assemblies we're using, which were generated by
>NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
>files had been subjected to some reformatting. In terms of data provenance this
>is all information that it is crucial to track, and I think what I'm suggesting
>is that we use the Contact table (along with a new Attribution table of some
>sort) to record where the data *originally* came from and that we use the
>ExternalDatabase/ExternalDatabaseRelease tables to track where the data most
>recently resided before being entered into GUS.
>

I think it sounds reasonable. I don't think I have something else to add.

>>>The other point was the attribution of data coming from publication or
>>>personal communication. I had a look at flybase. Flybase considers
>>>personal communication as references. To differentiate them, they have
>>>an extra attribute in the reference table to allow the classification
>>>of the different references.
>
>Yes, we should definitely extend Sres::BibliographicReference to make use of
>a controlled vocabulary for reference types, including personal communications
>(which should make use of an optional contact_id to specify the individual in
>question.) I'll make this change in the schema as well as adding the new
>Phenotype tables that you sent along.
>

Fine. How long do you reckon it's going to take to commit these schema
modifications ?

>Jonathan

Cheers
Arnaud
|
From: Arnaud K. <ax...@sa...> - 2002-11-14 16:36:44
|
cra...@pc... wrote:

>Arnaud-
>
>>I didn't include anything about genetic interactions even if in the
>>future we will want to store them. I've reviewed the interaction table
>>and I've got some thoughts about this table.
>>
>>Genetic interactions may involve more than one effector/target. If we
>>want to make the Interaction table generic, we need to store more than
>>one effector and more than one target. I don't have any use cases yet,
>>but I can ask around one if needed.
>
>This makes sense to me, since not all groups of effectors or targets will
>necessarily be Complexes.
>
>>An extra controlled vocabulary is needed. This controlled vocabulary
>>will be used to classify the behaviour of the effector for a given
>>interaction.
>>e.g. Allele1 "inhibits" the expression of Allele2.
>
>This also sounds like a good idea; I don't see anywhere that we represent
>this currently.
>
>>Regarding physical interactions, there are two cases in which it will be
>>useful to annotate them:
>> * Transient interactions associated with a function, e.g. a protein
>>regulating the transcription will be interacting with DNA.
>> * Structural interactions involved in the formation of a complex. In
>>that case, we can associate component interactions with the complex they
>>are involved in. Some of these interactions are experimentally
>>characterized, others are hypothetical.
>
>I don't think we have to make any changes to the Interaction table in order
>to use it for representing physical interactions, right?
>

right!

> Presumably the
>physical interactions will simply be a subset of the interactions enumerated
>in the controlled vocabulary that we're going to create.
>

Do you mean an extra controlled vocabulary to specify the type of
interactions ?

> The only thing I
>noticed when looking at the DoTS::Interaction table is that there's no way
>of representing interactions for which "direction" is not a meaningful
>concept. (I assume there will be such interactions, particularly when we
>consider physical interactions.) So maybe we could add a "has_direction"
>column, which would only be non-NULL in the case that direction_is_known > 0
>(or which should indicate that the direction_is_known column should be
>ignored, if set to 1.)
>

I agree, I don't think the direction information is useful for some
interactions such as structural ones.

>>Currently in GUS, a complex is a set of components. Would it be possible
>>to associate a complex with a set of interactions as well ?
>
>It should be, since the DoTS::Interaction table can reference any other table
>in the database (including Complex.) Or are you asking about an explicit
>representation for an "InteractionSet" (whatever that would mean)? I'm not
>sure I completely understand this question.
>

Well, a use case could be:

Complex A is made of 4 proteins components:
Component1 interacts with component 2, component3 and component4 and
these interactions are experimentally characterized.
Component2 may interact with component3.

A query would be : "give me all the interactions between components of
Complex A"

I was thinking of adding a complex_id attribute to the interaction table
to associate interactions between components with a given complex, but
actually, as components are already associated with this given complex,
an extra attribute may not be needed. So it should be fine like it is now.

>>The other point I didn't mention in my previous email was the review of
>>the phenotype table. Would it be possible to associate phenotypic data
>>with GO terms ?
>
>I believe we should be able to use the DoTS::GOTermAssociation table to
>associate GO terms with the appropriate rows in the Phenotype table. In
>gusdev right now we're using a special-purpose table called ECGOFunctionMap
>to represent the mapping between the enzyme classification numbers and the
>GO Function terms, but I believe that the only reason we did this is because
>we didn't have the GOTermAssociation table to work with in GUSdev.
>

sounds fine

>Jonathan

cheers
Arnaud
|
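The Complex A use case above, and the conclusion that no complex_id column is needed on the interaction table, can be sketched as a join through the component membership. This is a minimal illustration using Python's sqlite3 module; the table and column names (ComplexComponent, Interaction, effector_id, target_id, is_experimental) are simplified assumptions and not the actual DoTS schema, in which Interaction can reference rows of any table.

```python
import sqlite3


def build_example():
    """Create an in-memory database holding the Complex A use case.

    The two tables are hypothetical simplifications of the GUS schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ComplexComponent (
            complex_id   INTEGER NOT NULL,
            component_id INTEGER NOT NULL
        );
        CREATE TABLE Interaction (
            interaction_id  INTEGER PRIMARY KEY,
            effector_id     INTEGER NOT NULL,
            target_id       INTEGER NOT NULL,
            is_experimental INTEGER NOT NULL   -- 1 = characterized, 0 = hypothetical
        );
    """)
    # Complex 1 ("Complex A") has components 1..4.
    conn.executemany("INSERT INTO ComplexComponent VALUES (?, ?)",
                     [(1, 1), (1, 2), (1, 3), (1, 4)])
    # The last interaction involves component 9, which is outside the complex.
    conn.executemany(
        "INSERT INTO Interaction (effector_id, target_id, is_experimental) "
        "VALUES (?, ?, ?)",
        [(1, 2, 1), (1, 3, 1), (1, 4, 1), (2, 3, 0), (4, 9, 1)])
    return conn


def interactions_within(conn, complex_id):
    """Return interactions whose effector and target both belong to the complex."""
    rows = conn.execute("""
        SELECT i.effector_id, i.target_id, i.is_experimental
        FROM Interaction i
        JOIN ComplexComponent a ON a.component_id = i.effector_id
        JOIN ComplexComponent b ON b.component_id = i.target_id
        WHERE a.complex_id = ? AND b.complex_id = ?
        ORDER BY i.interaction_id
    """, (complex_id, complex_id))
    return rows.fetchall()


if __name__ == "__main__":
    for row in interactions_within(build_example(), 1):
        print(row)
```

Because membership is resolved through the join, an interaction picks up its complex implicitly, which matches the observation that an extra attribute may not be needed.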
From: <cra...@pc...> - 2002-11-13 21:40:28
|
Arnaud-

> I didn't include anything about genetic interactions even if in the
> future we will want to store them. I've reviewed the interaction table
> and I've got some thoughts about this table.
>
> Genetic interactions may involve more than one effector/target. If we
> want to make the Interaction table generic, we need to store more than
> one effector and more than one target. I don't have any use cases yet,
> but I can ask around one if needed.

This makes sense to me, since not all groups of effectors or targets will
necessarily be Complexes.

> An extra controlled vocabulary is needed. This controlled vocabulary
> will be used to classify the behaviour of the effector for a given
> interaction.
> e.g. Allele1 "inhibits" the expression of Allele2.

This also sounds like a good idea; I don't see anywhere that we represent
this currently.

> Regarding physical interactions, there are two cases in which it will be
> useful to annotate them:
> * Transient interactions associated with a function, e.g. a protein
> regulating the transcription will be interacting with DNA.
> * Structural interactions involved in the formation of a complex. In
> that case, we can associate component interactions with the complex they
> are involved in. Some of these interactions are experimentally
> characterized, others are hypothetical.

I don't think we have to make any changes to the Interaction table in order
to use it for representing physical interactions, right? Presumably the
physical interactions will simply be a subset of the interactions enumerated
in the controlled vocabulary that we're going to create. The only thing I
noticed when looking at the DoTS::Interaction table is that there's no way
of representing interactions for which "direction" is not a meaningful
concept. (I assume there will be such interactions, particularly when we
consider physical interactions.) So maybe we could add a "has_direction"
column, which would only be non-NULL in the case that direction_is_known > 0
(or which should indicate that the direction_is_known column should be
ignored, if set to 1.)

> Currently in GUS, a complex is a set of components. Would it be possible
> to associate a complex with a set of interactions as well ?

It should be, since the DoTS::Interaction table can reference any other table
in the database (including Complex.) Or are you asking about an explicit
representation for an "InteractionSet" (whatever that would mean)? I'm not
sure I completely understand this question.

> The other point I didn't mention in my previous email was the review of
> the phenotype table. Would it be possible to associate phenotypic data
> with GO terms ?

I believe we should be able to use the DoTS::GOTermAssociation table to
associate GO terms with the appropriate rows in the Phenotype table. In
gusdev right now we're using a special-purpose table called ECGOFunctionMap
to represent the mapping between the enzyme classification numbers and the
GO Function terms, but I believe that the only reason we did this is because
we didn't have the GOTermAssociation table to work with in GUSdev.

Jonathan

--
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive
Philadelphia, PA 19104-6021
215-573-3115
|
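Three of the ideas in this message (several effectors or targets per interaction, a controlled vocabulary for the effector's behaviour, and a flag for interactions where direction is not meaningful) could fit together roughly as sketched below. This uses Python's sqlite3 module; every table, column and role name here (InteractionType, InteractionParticipant, the 'member' role) is an assumption made for illustration, and has_direction is simplified to a plain 0/1 flag rather than the NULL-based scheme floated above.

```python
import sqlite3

# Hypothetical schema: one row per interaction, one row per participant.
SCHEMA = """
CREATE TABLE InteractionType (
    interaction_type_id INTEGER PRIMARY KEY,
    name                TEXT NOT NULL UNIQUE     -- controlled vocabulary term
);
CREATE TABLE Interaction (
    interaction_id      INTEGER PRIMARY KEY,
    interaction_type_id INTEGER NOT NULL
        REFERENCES InteractionType (interaction_type_id),
    has_direction       INTEGER NOT NULL CHECK (has_direction IN (0, 1))
);
CREATE TABLE InteractionParticipant (
    interaction_id INTEGER NOT NULL REFERENCES Interaction (interaction_id),
    participant    TEXT NOT NULL,
    role           TEXT NOT NULL CHECK (role IN ('effector', 'target', 'member'))
);
"""


def make_db():
    """Build an in-memory database with one directed and one undirected interaction."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO InteractionType VALUES (?, ?)",
                     [(1, "inhibits"), (2, "binds")])
    conn.executemany("INSERT INTO Interaction VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 2, 0)])
    conn.executemany("INSERT INTO InteractionParticipant VALUES (?, ?, ?)", [
        (1, "Allele1", "effector"),
        (1, "Allele3", "effector"),
        (1, "Allele2", "target"),
        (2, "ProteinA", "member"),
        (2, "ProteinB", "member"),
    ])
    return conn


def describe(conn, interaction_id):
    """Render one interaction as text, honouring the has_direction flag."""
    name, has_direction = conn.execute(
        "SELECT t.name, i.has_direction FROM Interaction i "
        "JOIN InteractionType t USING (interaction_type_id) "
        "WHERE i.interaction_id = ?", (interaction_id,)).fetchone()
    rows = conn.execute(
        "SELECT role, participant FROM InteractionParticipant "
        "WHERE interaction_id = ? ORDER BY participant",
        (interaction_id,)).fetchall()
    if has_direction:
        effectors = [p for role, p in rows if role == "effector"]
        targets = [p for role, p in rows if role == "target"]
        return f"{' + '.join(effectors)} {name} {' + '.join(targets)}"
    # Undirected: effector/target roles carry no meaning, so list all members.
    return f"{name}: {', '.join(p for _, p in rows)}"


if __name__ == "__main__":
    db = make_db()
    print(describe(db, 1))
    print(describe(db, 2))
```

Moving participants into their own table is one way to allow more than one effector and target without requiring them to form a Complex first.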
From: <cra...@pc...> - 2002-11-13 21:03:59
|
Arnaud-

> > Here a case of what could happen on a given project:
> >
> > * The sequences would come from TIGR,
> > * The gene models would come from SBRI,
> > * The manual annotation of the gene models and the GO curation would
> > be done by TIGR,
> > * The curation would be done by the Sanger,
> > * Some curated comments would be sent by members of the community.
> >
> > Instead of using the evidence table, would it be possible to attribute
> > data by using the user_id attribute ?

We certainly want to be able to support the situation you describe, and
although we have some working examples that are similar to this, I don't
think that the way they're currently implemented (using a combination of
external_db_id and the ProjectLink table) is necessarily the best answer,
nor do I think it covers all the possibilities.

I agree with you that the Evidence table is not ideal for this purpose,
although I'm also not convinced that adapting the 'row_user_id' column
will be sufficient either. Here are the problems/issues I see with doing
attribution solely with the user_id:

 * Currently a user_id represents a single individual, not an organization.
   We will be adding a pointer in the UserInfo table to Sres::Contact
   and I believe that the Contact table *can* be used to represent whole
   organizations. So by giving a particular user_id to a row in the
   database we will also be associating it (indirectly) with whatever
   organization the person in question works for (e.g., TIGR, for sequence
   data generated there.) There are two problems with this. First, the
   choice of user_id will likely be arbitrary. That is, should it be the
   person in charge of the sequencing project at TIGR, or the person who
   e-mailed the data to us? So far we've tended to use the user_id to
   reflect the identity of the person who actually loaded the data into
   the database (typically someone in our lab., or the user_id of an
   annotator or collaborator editing the database through a web interface.)
   Second, if the person in question goes on to work for a different
   organization/company, we can't easily reflect that change without
   losing the association between the original data and the organization
   that should get credit for generating them (short of "cloning" the person
   in the UserInfo table!)
   Anyway, my point here is that I think we'll want to be able to attribute
   datasets both to individuals and also (directly) to organizations. Does
   this sound like a reasonable requirement based on your use-case? If so
   then I think it implies that Sres::Contact might be a better table to use
   than Sres::UserInfo.

> > e.g. if the gene models are coming from SBRI, the user_id would
> > acknowledge the gene features as owned by SBRI. Any update would keep
> > the ownership and would acknowledge who's done the update.

 * This brings up my second point/question, which is whether we need to
   support multiple attributions; I think your example suggests that we do.
   For example, suppose that, as in your example, SBRI generates an initial
   set of gene models. Then suppose that--as part of the manual annotation
   process--an annotator at TIGR determines that one of the exons in one of
   the SBRI gene models is incorrect (though not by much.) He/she adjusts
   the 5' boundary of one exon accordingly. Who should now be cited as the
   source of this gene model? I would say that it should be *both* SBRI and
   TIGR, but this is not supported by the single 'row_user_id' in the
   GeneFeature table. In general this is a problem for any kind of derived
   data, or any data that is likely to be refined over time (like
   gene models.) Does this agree with what you mean by "manual annotation"?
   Even if not, I think that we will want to support having multiple
   attributions, because many of the "curated comments sent by members of
   the community" that you mention are likely to be corrections to the gene
   models based on various kinds of evidence, much of it experimental.

Now, if you agree that "attribution" is something that should apply to either
individuals or organizations, and that can be shared among one or more such
entities, then the question is how to represent this in the database. What
I've argued for so far is to have a many-to-1 relationship between entries in
the Sres::Contact table and any row in the database (meaning that a new linking
table would have to be generated.) I would also leave the 'row_user_id' alone,
using it--as we do now--to represent which user *owns* that row in the database,
where the users are by definition those who have the ability to alter the
database directly (by which I mean to include annotators working through an
interface like Artemis or Apollo.) Does this sound reasonable?

One question that I have not yet considered in detail is how this affects what
we're currently doing with the ExternalDatabase table. In effect we've used
this table to represent not just databases (e.g. GenBank, SWISS-PROT) but also
the more general notion of "externally-generated datasets." For example, the
published Plasmodium falciparum genomic sequences from TIGR have their own
ExternalDatabase entry that we use for the purposes of attribution, and the
sequences from Sanger and Stanford have similar entries. I don't think we
necessarily want to combine these two different parts of the schema, but there
are clearly significant overlaps between them, if only because the institution
that generates the data (Contact) is often also the source of that data (which
is what the ExternalDatabase is supposed to represent.) But one could imagine
cases in which we'd want to attribute a particular dataset to one organization
(e.g., RIKEN) but record the fact that the data was actually obtained/downloaded
from another (e.g., GenBank or EMBL.) This is the case right now for the draft
human and mouse genome sequence assemblies we're using, which were generated by
NCBI and the MGSC, respectively, but which we downloaded from UCSC, after the
files had been subjected to some reformatting. In terms of data provenance this
is all information that it is crucial to track, and I think what I'm suggesting
is that we use the Contact table (along with a new Attribution table of some
sort) to record where the data *originally* came from and that we use the
ExternalDatabase/ExternalDatabaseRelease tables to track where the data most
recently resided before being entered into GUS.

> > The other point was the attribution of data coming from publication or
> > personal communication. I had a look at flybase. Flybase considers
> > personal communication as references. To differentiate them, they have
> > an extra attribute in the reference table to allow the classification
> > of the different references.

Yes, we should definitely extend Sres::BibliographicReference to make use of
a controlled vocabulary for reference types, including personal communications
(which should make use of an optional contact_id to specify the individual in
question.) I'll make this change in the schema as well as adding the new
Phenotype tables that you sent along.

Jonathan

--
Jonathan Crabtree
Center for Bioinformatics, University of Pennsylvania
1406 Blockley Hall, 423 Guardian Drive
Philadelphia, PA 19104-6021
215-573-3115
|
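The linking table proposed in this message, which lets one row in any table be credited to several contacts while row_user_id is left alone, might look something like the sketch below. It uses Python's sqlite3 module; the Attribution table layout (contact_id, table_name, row_id) is an assumption, since the thread only says that "a new Attribution table of some sort" would be needed, and the row identifiers are made up.

```python
import sqlite3


def build_example():
    """Create a hypothetical Contact/Attribution pair in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Contact (
            contact_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL          -- a person or an organization
        );
        CREATE TABLE Attribution (
            contact_id INTEGER NOT NULL REFERENCES Contact (contact_id),
            table_name TEXT    NOT NULL,      -- which table the credited row is in
            row_id     INTEGER NOT NULL,      -- primary key of the credited row
            PRIMARY KEY (contact_id, table_name, row_id)
        );
    """)
    conn.executemany("INSERT INTO Contact VALUES (?, ?)",
                     [(1, "SBRI"), (2, "TIGR"), (3, "Sanger")])
    # Gene model 42 was predicted by SBRI and later corrected at TIGR,
    # so both are credited; gene model 43 is credited to TIGR alone.
    conn.executemany("INSERT INTO Attribution VALUES (?, ?, ?)",
                     [(1, "GeneFeature", 42),
                      (2, "GeneFeature", 42),
                      (2, "GeneFeature", 43)])
    return conn


def attributed_to(conn, table_name, row_id):
    """Return the names of all contacts credited for one row, sorted."""
    rows = conn.execute("""
        SELECT c.name
        FROM Attribution a
        JOIN Contact c ON c.contact_id = a.contact_id
        WHERE a.table_name = ? AND a.row_id = ?
        ORDER BY c.name
    """, (table_name, row_id))
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(attributed_to(build_example(), "GeneFeature", 42))
```

Where the data was most recently downloaded from would still be recorded separately through ExternalDatabase/ExternalDatabaseRelease, as suggested above; this sketch covers only the "who originally produced it" side.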
From: Chris S. <sto...@pc...> - 2002-11-13 13:21:27
|
Hi Arnaud,
I think it is a good idea to add a publication type attribute to
bibliographic reference. An alternative list of terms can be found at
http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+FieldInfo+-id+4kuFf1ITQgp+-lib+MEDLINE+-bf+PublicationType
(click on 'list values') but the Flybase list has things like 'personal
communication' not present in the Medline list.
Chris

On Tuesday, November 12, 2002, at 10:29 AM, Arnaud Kerhornou wrote:

> Hi Jonathan
>
> Sorry for the delay to come back to you with some thoughts on
> attribution data.
>
> Here a case of what could happen on a given project:
>
> * The sequences would come from TIGR,
> * The gene models would come from SBRI,
> * The manual annotation of the gene models and the GO curation would
> be done by TIGR,
> * The curation would be done by the Sanger,
> * Some curated comments would be sent by members of the community.
>
> Instead of using the evidence table, would it be possible to attribute
> data by using the user_id attribute ?
> e.g. if the gene models are coming from SBRI, the user_id would
> acknowledge the gene features as owned by SBRI. Any update would keep
> the ownership and would acknowledge who's done the update.
>
> The other point was the attribution of data coming from publication or
> personal communication. I had a look at flybase. Flybase considers
> personal communication as references. To differentiate them, they have
> an extra attribute in the reference table to allow the classification
> of the different references.
> For more information about the reference class controlled vocabulary,
> see
> http://flybase.bio.indiana.edu/.data/docs/refman/refman-B.html#B.13.2.
>
> cheers
> Arnaud
>
> ------------------------------
>
> Item 3: Attribution of data from multiple sources. Three methods are
> available in GUS3.0 to attach information to tables. Evidence which
> allows attributions to be linked to any row. NAComment which allows
> multiple attachment of comments to a sequence. Comment which is
> attached to a review_status_id; each NAFeature has a review_status_id.
> Use cases are needed to determine if any of these mechanisms are
> appropriate.
> see addendum from Jonathan Crabtree below.
>
> Addendum to item 3 from Jonathan.
> I spent a little time looking into this and the number of methods
> differs depending on how you count them (and also because in most
> situations the number of alternatives differs depending on which table
> you're commenting on.) But here are the ways we currently support in
> GUS 3.0 for adding comments to things (external to the tables
> themselves):
>
> 1. DoTS.Comments (not "Comment") + DoTS.Evidence
>    I list these together because the Comments table relies on the
>    Evidence table to link its rows to other objects in the db.
>    This method can be used with any table and supports CLOB comments.
> 2. DoTS.AAComment + DoTS.CommentName
>    Can be used only with AASequence entries and supports
>    VARCHAR2(4000).
> 3. DoTS.NAComment
>    Can be used only with NASequence entries and supports
>    VARCHAR2(4000).
>    (Does *not* have a link to DoTS.CommentName)
> 4. DoTS.Note
>    Can be used only with NAFeature entries and supports VARCHAR2(4000)
>    (Note that this is different from gusdev.Note, which has a
>    VARCHAR(255) AND a CLOB column.)
>
> Note that DoTS.Comments is the only generic option (that I found) for
> associating notes/comments with rows. Note also that AASequence,
> NASequence, and NAFeature all have their own specialized comment
> tables, but AAFeature doesn't appear to (at least not one with
> "comment" in its name!) Conceptually speaking I'm also not sure that I
> agree with the use of the "Evidence" table to link comments to rows in
> general. For example, during the conference call I gave the example of
> a note in PlasmoDB that basically says "the second exon of this
> predicted gene is incorrect"; this would actually be evidence *against*
> the GeneFeature, not *for* it (the typical use of the Evidence table.)
> Likewise, one could merely be commenting on an aspect of a predicted
> feature, without actually providing any further evidence for its
> existence or correctness. In other words, an implicit statement of the
> form "if this thing exists, then it's interesting that such and such
> would be true...".
>
> Another thing to point out is that none of these tables (as far as I
> can remember), has a pointer to SRES.Contact, so they don't really
> address the question of attribution. In PlasmoDB right now we handle
> attribution mainly through creative use of the ExternalDatabase table
> (external_db_id in the current GUSdev). In GUS 3.0 I believe that
> external database releases will be linked to Contacts, so perhaps the
> thing to do is to allow a single entry in the database to be associated
> with multiple external databases? This gets slightly messy if you want
> to be able to attribute something to a personal communication with
> somebody, or to a journal article (neither of which is expressed
> particularly well as an "external database".) Although both might be
> nicely represented as References, perhaps? There are enough
> possibilities that maybe we should just find out exactly what the PSU
> folks have in mind, and tailor a solution that works for them (using
> the existing schema as much as possible.)
|
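The publication type attribute discussed at the top of this message, with 'personal communication' as one of the allowed values, could be modelled along the following lines. This is only a sketch: the REFERENCE_TYPES set is a two-value placeholder rather than the Flybase or MEDLINE vocabulary, the class is a stand-in for a row of the bibliographic reference table, and requiring a contact_id for personal communications is an assumption of this example (the thread describes contact_id as optional).

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder vocabulary; the real list would come from Flybase or MEDLINE.
REFERENCE_TYPES = {"journal article", "personal communication"}


@dataclass
class BibliographicReference:
    """Hypothetical stand-in for one row of the bibliographic reference table."""
    title: str
    reference_type: str
    contact_id: Optional[int] = None   # who the communication came from

    def __post_init__(self):
        if self.reference_type not in REFERENCE_TYPES:
            raise ValueError(f"unknown reference type: {self.reference_type!r}")
        # Assumption made for this sketch: a personal communication is only
        # useful for attribution if it names the individual concerned.
        if self.reference_type == "personal communication" and self.contact_id is None:
            raise ValueError("personal communication needs a contact_id")


if __name__ == "__main__":
    note = BibliographicReference(
        title="Correction to exon 2 boundary",
        reference_type="personal communication",
        contact_id=7,
    )
    print(note)
```

Validating against a fixed set mirrors what a foreign key to a controlled vocabulary table would enforce in the database itself.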