From: Sofia R. <so...@so...> - 2018-09-13 17:05:28
Thanks Ethy! This is a lot to figure out; there are quite a few different routes to take. I really like the idea of using the assembly version name in the identifier for the gene model. I need to figure out which approach makes sense both for loading and retrieving data with Tripal and for the individuals generating the gene models. I think the plan is to add manually curated genes as well. The versioning of the entire set is also a good thought.

Sofia

On Thu, Sep 13, 2018 at 7:18 AM, Cannon, Ethalinda K [COM S] <ekc...@ia...> wrote:

> Sorry to be late to the party; this is something I've worked on at length
> with maize genes.
>
> First: I'll note that the AgBioData consortium <https://www.agbiodata.org>
> is forming a genome and gene model nomenclature group. Anyone working with
> genome and/or gene model nomenclature is welcome to join. There is a
> recording <https://www.youtube.com/watch?v=kNW6YReFP28&feature=youtu.be>
> of our nomenclature discussion last week, and copies of the slides
> <https://www.agbiodata.org/sites/default/files/Genome%20Nomenclature%20meeting%20slides.pdf>
> are available.
>
> After playing with the idea of versioning gene models, the maize
> group decided to instead version the sets. We haven't (yet) been successful
> with hand-curation of gene models and instead improve the gene model sets
> via re-analysis.
>
> Note that the .<digit> suffix indicates alternative isoforms in many
> nomenclature patterns.
>
> An analysis record represents each gene model set, and gene models are
> linked via analysisfeature. It is possible for the same gene model feature
> record to be attached to multiple versions if it hasn't changed. Sequence
> isn't stored in the feature record but retrieved from the appropriate BLAST
> db as needed. This takes care of (rare?) situations in which the name stays
> the same but there are minor changes to the sequence. I have a rather
> clunky way of indicating the current version via analysisprop.
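
A minimal sketch of the set-versioning layout described above, assuming an in-memory SQLite database with heavily simplified stand-ins for the Chado `feature`, `analysis`, `analysisfeature` and `analysisprop` tables. The set names, gene names and the `current_version` property are hypothetical, chosen only for illustration; the real Chado tables carry more columns, and `analysisprop` refers to a cvterm rather than a text label.

```python
import sqlite3

# Simplified stand-ins for the Chado tables named in the message above.
# Column lists are trimmed and all values are illustrative assumptions.
SCHEMA = """
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE analysisfeature (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    feature_id  INTEGER REFERENCES feature(feature_id),
    UNIQUE (analysis_id, feature_id)
);
CREATE TABLE analysisprop (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    type  TEXT,  -- stands in for a cvterm reference (assumption)
    value TEXT
);
"""


def build_demo():
    """Create two gene model sets that share one unchanged gene model."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO analysis VALUES (?, ?)",
                     [(1, "gene_models_v1"), (2, "gene_models_v2")])
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(1, "geneA"), (2, "geneB"), (3, "geneC")])
    # geneA did not change, so the same feature row is linked to both sets.
    conn.executemany("INSERT INTO analysisfeature VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 1), (2, 3)])
    # Hypothetical property that flags the current set.
    conn.execute(
        "INSERT INTO analysisprop VALUES (2, 'current_version', 'true')")
    return conn


def current_gene_models(conn):
    """Return the gene models that belong to the set flagged as current."""
    rows = conn.execute("""
        SELECT f.uniquename
        FROM feature f
        JOIN analysisfeature af ON af.feature_id = f.feature_id
        JOIN analysisprop ap ON ap.analysis_id = af.analysis_id
        WHERE ap.type = 'current_version' AND ap.value = 'true'
        ORDER BY f.uniquename
    """).fetchall()
    return [row[0] for row in rows]


def sets_containing(conn, uniquename):
    """Return the names of every gene model set that includes a feature."""
    rows = conn.execute("""
        SELECT a.name
        FROM analysis a
        JOIN analysisfeature af ON af.analysis_id = a.analysis_id
        JOIN feature f ON f.feature_id = af.feature_id
        WHERE f.uniquename = ?
        ORDER BY a.name
    """, (uniquename,)).fetchall()
    return [row[0] for row in rows]
```

Calling `sets_containing(build_demo(), "geneA")` lists both sets, which is the "same feature record attached to multiple versions" case; a gene model that changed would instead get a new feature row linked only to the newer set.
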
>
> There is a request in for the addition of an analysis.type_id field for
> Chado 1.4 (https://github.com/GMOD/Chado/pull/52).
>
> For maintaining history, I use feature_relationship with a set of cvterms
> indicating, for example, whether a gene model has been split or merged.
> Split and merged gene models get new names.
>
> Because we have gene models from several different maize genome
> assemblies, we run an analysis to find likely orthologs across the multiple
> gene model sets. These are also linked via feature_relationship records.
>
> Hope this helps.
>
> Ethy
> ------------------------------
> *From:* Joe Carlson <jwc...@lb...>
> *Sent:* Wednesday, September 12, 2018 4:36 PM
> *To:* Sofia Robb
> *Cc:* GMOD Schema/Chado List
> *Subject:* Re: [Gmod-schema] gene/mRNA version
>
>
> On Sep 12, 2018, at 1:43 PM, Sofia Robb <so...@so...> wrote:
>
> Good point about merging and splitting genes.
>
> I think this is meant to be a pretty stable assembly and the hope is
> that the annotations are good. But split and merged genes are quite typical
> issues I have seen in many different annotation sets, and I suspect we will
> find some in this as well. My first gut solution to merging or splitting is
> that these would have to have new stable ids, if we go the stable ID route.
>
> When you say using the feature_relationship table for tracking, are you
> thinking that the cvterm_id would be some term like version_of and the
> subject would be the versioned feature and the object would be the stable
> feature (new_version 'version_of' stable_version)? Or are you saying that
> the stable ID route isn't great in your opinion and that the cvterm should
> be something like new_feature 'is_new_version_of' old_feature?
>
>
> I was thinking of having a ‘previous_version_of’ (or some such label) and
> linking annotations through the feature_relationship table. I really don’t
> know which solution is best: it depends on what you want the tracking to
> do, or how fine-grained you need the tracking to be. My one concern with
> merges is that you’ll not be able to have multiple stable ids for one gene
> unless you keep track of the rank field or modify the schema.
>
> joe
>
>
> Thank you for taking your time to discuss this with me.
> Sofia
>
>
> On Wed, Sep 12, 2018 at 2:28 PM, Joe Carlson <jwc...@lb...> wrote:
>
>
> On Sep 12, 2018, at 1:09 PM, Sofia Robb <so...@so...> wrote:
>
> Hi Joe and other Chado users,
>
> Joe, thanks for your response. I would like to know more about your data.
> I have a few questions and will follow them up with a dump of my current
> ideas on how to solve this.
>
>
> I’m managing the backend db for the Phytozome project at JGI
> (phytozome.jgi.doe.gov), a comparative land plant db. We have ~250 plant
> genomes (assemblies, annotation and analysis results) loaded right now. The
> size of the db is ~1.5T.
>
>
> Are you the source of the sequence?
>
>
> We have the land plants sequenced by the JGI, things done by collaborators
> and other model organisms. It’s roughly an equal mixture of each.
>
> Or are you pulling the data from another database?
>
>
> Data import is with fasta files for chromosomes and proteins; gff3 for
> structure.
>
> What do you do if the actual sequence changes? Do you just overwrite the
> previous sequence data?
>
>
> I never overwrite or delete. Once it goes in the database, it stays in the
> database.
>
>
> We are going to be the official repository of this data and have been
> asked to keep track of the history of changes. This is more than I have had
> to keep track of in the past.
>
> I had been thinking of trying to implement some loading of the data which
> gets across the idea that each feature has a stable version which is equal
> to its current version and any number of older versions. Now this is
> just an idea (largely based on the representation of data from Ensembl).
>
> The stable version would have a stable id which lacks the '.\d' suffix.
> And there would be a feature record for each version which includes the
> '.\d' suffix. I would mark older versions obsolete. What I am still working
> on in this idea is what I could add as properties (gff 9th column) to help
> with searches. Perhaps I could add a stableID=xyz in each record? I think
> this would help with queries: I could search on the stableID and the
> obsolete flag when I need to retrieve the history of changes.
>
> feature.uniquename: some_gene.1
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: true
>
>
> feature.uniquename: some_gene.2
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
>
> feature.uniquename: some_gene
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
>
>
> How you do this depends a bit on the nature of the reannotations. If you
> have a fairly stable assembly and annotation then it entirely makes sense
> to count on there being a stable identifier. In what I have, we often have
> dramatically different assemblies from one version to another (many of our
> assemblies do not have pseudo molecules) and we cannot count on stable ids.
>
> Assigning a stable id as a property will work if the changes are not
> too extensive. But think of the case where 2 genes in 1 version are
> modified in such a way that 1 gene is split and half is merged into another
> gene. What rules are you going to use to assign the stable id for the
> merged gene?
>
> An alternative tracking mechanism between versions is to use a
> feature_relationship. You could keep track of things a bit better with this
> table if there are extensive merges and splits. For the most part we are
> not maintaining gene history except in a few of our important genomes.
>
> Joe
>
>
>
> Thank you,
> Sofia
>
>
>
> On Wed, Sep 12, 2018 at 1:34 PM, Joe Carlson <jwc...@lb...> wrote:
>
> For what it’s worth, I’ve been using dbxrefs to track annotation
> versions. I’ve modified the schema to make dbxref_id in the feature table
> not null, and use a record in the dbxref table to label the source -
> and version - of the data.
>
> Appending a numerical identifier to the name means that a query for a
> particular version will require a VERY expensive sql constraint
> "and name like '%.N'" in the queries.
>
> Joe
>
>
> On Sep 12, 2018, at 12:16 PM, Sofia Robb <so...@so...> wrote:
>
> Hello All,
>
> I have a question about how others are handling sequence feature versions.
> I am using Tripal and have posted this question in the Tripal repository
> Issues as well.
>
> I have a group that is developing gene/mRNA models. They are using an
> Ensembl-like system for versioning of gene and transcript IDs, and they
> want to maintain a history of previous versions.
>
> They plan on incrementing a digit after the id when a new version is
> generated.
>
> gene nv2m00005394.1
> mRNA nv2m00005394.1.mRNA.1
>
> Chr11 GFF3Conv gene 3598792 3603486 . - . Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1
> Chr11 GFF3Conv mRNA 3598792 3603486 . - . ID=nv2m00005394.1.mRNA.1;Parent=nv2m00005394.1
>
> How should I handle this? Create a new feature for each version and mark
> the old one obsolete? How do I make it easy for users to find the correct
> ID when they don't know there has been an update? I have some ideas, but it
> would require the geneID and mRNAIDs to have different bases, i.e.
> nv2g00005394 (change g->m) for gene and nv2m00005394 for mRNA.
>
> Any advice would be fantastic!!!
> Thank you!
> Sofia
>
> _______________________________________________
> Gmod-schema mailing list
> Gmo...@li...
> https://lists.sourceforge.net/lists/listinfo/gmod-schema
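
A small sketch of how the versioned identifiers in the original question could be split into a stable part and a version at load time, so that a version lookup can use an ordinary equality match instead of the suffix `like` constraint mentioned earlier in the thread. The ID pattern is an assumption inferred from the two example IDs only, and the function names are hypothetical; this is not part of Tripal or Chado.

```python
import re

# Assumed pattern, inferred from the two example IDs in the thread:
#   gene: <base>.<version>
#   mRNA: <base>.<version>.mRNA.<isoform>
ID_PATTERN = re.compile(
    r"^(?P<stable>[^.]+)\.(?P<version>\d+)(?:\.mRNA\.(?P<isoform>\d+))?$"
)


def parse_versioned_id(identifier):
    """Split a versioned ID into stable ID, version and optional isoform."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    isoform = match.group("isoform")
    return {
        "stable_id": match.group("stable"),
        "version": int(match.group("version")),
        "isoform": int(isoform) if isoform is not None else None,
    }


def parse_gff3_attributes(column9):
    """Turn a GFF3 attribute column (tag=value;tag=value) into a dict."""
    pairs = (item for item in column9.strip().split(";") if item)
    return dict(pair.split("=", 1) for pair in pairs)


if __name__ == "__main__":
    attrs = parse_gff3_attributes(
        "Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1")
    print(parse_versioned_id(attrs["ID"]))
```

The parsed `stable_id` could then be stored as the stableID property proposed in the thread, with the version kept as its own value; whether that belongs in featureprop, dbxref, or elsewhere is left open here, as it is in the discussion.
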