Re: [Gmod-schema] gene/mRNA version

Thanks Ethy!

This is a lot to figure out; there are quite a few different routes to
take. I really like the idea of using the assembly version name in the
identifier for the gene model. I need to figure out which approach makes
sense both for loading and retrieving data with Tripal and for the
individuals generating the gene models. I think the plan is to add manually
curated genes as well. Versioning the entire set is also a good thought.

Sofia

On Thu, Sep 13, 2018 at 7:18 AM, Cannon, Ethalinda K [COM S] <
ekc...@ia...> wrote:

> Sorry to be late to the party; this is something I've worked on at length
> with maize genes.
>
>
> First: I'll note that the AgBioData consortium <https://www.agbiodata.org>
> is forming a genome and gene model nomenclature group. Anyone working with
> genome and/or gene model nomenclature is welcome to join. There is a
> recording <https://www.youtube.com/watch?v=kNW6YReFP28&feature=youtu.be>
> of our nomenclature discussion last week, and copies of the slides
> <https://www.agbiodata.org/sites/default/files/Genome%20Nomenclature%20meeting%20slides.pdf>
> are available.
>
> After playing with the idea of versioning individual gene models, the maize
> group decided to version the sets instead. We haven't (yet) been successful
> with hand-curation of gene models and instead improve the gene model sets
> via re-analysis.
>
> Note that the .<digit> suffix indicates alternative isoforms in many
> nomenclature patterns.
>
> An analysis record represents each gene model set, and gene models are
> linked via analysisfeature. It is possible for the same gene model feature
> record to be attached to multiple versions if it hasn't changed. Sequence
> isn't stored in the feature record but retrieved from the appropriate BLAST
> db as needed. This takes care of (rare?) situations in which the name stays
> the same but there are minor changes to the sequence. I have a rather
> clunky way of indicating the current version via analysisprop.
>
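The set-level versioning described above can be sketched roughly as follows. This is a minimal illustration using Python's sqlite3 with heavily simplified stand-ins for the Chado feature, analysis, analysisfeature, and analysisprop tables. The set names, the 'current_version' property label, and the helper functions are assumptions made for illustration, not the schema or code actually used for maize.

```python
import sqlite3

# Simplified stand-ins for the Chado tables named above. Real Chado tables
# have more columns, and analysisprop uses type_id -> cvterm rather than
# the plain text label used here (an assumption to keep the sketch short).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE analysisfeature (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    feature_id  INTEGER REFERENCES feature(feature_id),
    UNIQUE (analysis_id, feature_id));
CREATE TABLE analysisprop (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    type TEXT, value TEXT);
""")

conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [(1, "geneA"), (2, "geneB"), (3, "geneC")])
# One analysis record per gene model set (hypothetical set names).
conn.executemany("INSERT INTO analysis VALUES (?, ?)",
                 [(1, "gene_models_v1"), (2, "gene_models_v2")])
# geneA is unchanged, so the same feature row is linked to both sets.
conn.executemany("INSERT INTO analysisfeature VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 3)])
# Hypothetical property marking which set is current.
conn.execute("INSERT INTO analysisprop VALUES (2, 'current_version', 'true')")


def models_in_current_set(conn):
    """Return uniquenames of gene models linked to the current set."""
    rows = conn.execute("""
        SELECT f.uniquename
        FROM feature f
        JOIN analysisfeature af ON af.feature_id = f.feature_id
        JOIN analysisprop ap ON ap.analysis_id = af.analysis_id
        WHERE ap.type = 'current_version' AND ap.value = 'true'
        ORDER BY f.uniquename""").fetchall()
    return [r[0] for r in rows]


def sets_containing(conn, uniquename):
    """Return the names of every gene model set a feature belongs to."""
    rows = conn.execute("""
        SELECT a.name
        FROM analysis a
        JOIN analysisfeature af ON af.analysis_id = a.analysis_id
        JOIN feature f ON f.feature_id = af.feature_id
        WHERE f.uniquename = ?
        ORDER BY a.name""", (uniquename,)).fetchall()
    return [r[0] for r in rows]
```

Here geneA appears in both sets through a single feature row, while geneB drops out of the current set without being deleted.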
> There is a request in for the addition of an analysis.type_id field for
> Chado 1.4 (https://github.com/GMOD/Chado/pull/52).
>
> For maintaining history, I use feature_relationship with a set of cvterms
> indicating, for example, whether a gene model has been split or merged.
> Split and merged gene models get new names.
>
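A rough sketch of how split and merge history might be walked through feature_relationship is given here. It assumes simplified tables in sqlite3, a text column in place of Chado's type_id, and hypothetical relationship labels ('split_from', 'merged_from') and gene names; the actual cvterms used for maize are not given in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
-- Real Chado uses type_id -> cvterm; a text label is used here for brevity.
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id  INTEGER REFERENCES feature(feature_id),
    type TEXT);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?)", [
    (1, "gene_old"), (2, "gene_new_a"), (3, "gene_new_b"),
    (4, "gene_other"), (5, "gene_merged")])
# Subject is the newer model, object is the model it was derived from.
# gene_old was split into gene_new_a and gene_new_b; later gene_new_b and
# gene_other were merged into gene_merged. Each result gets a new name.
conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?)", [
    (2, 1, "split_from"), (3, 1, "split_from"),
    (5, 3, "merged_from"), (5, 4, "merged_from")])


def ancestors(conn, uniquename):
    """Return every earlier gene model that a feature descends from."""
    rows = conn.execute("""
        WITH RECURSIVE hist(feature_id, depth) AS (
            SELECT feature_id, 0 FROM feature WHERE uniquename = ?
            UNION
            SELECT fr.object_id, hist.depth + 1
            FROM feature_relationship fr
            JOIN hist ON fr.subject_id = hist.feature_id
        )
        SELECT DISTINCT f.uniquename
        FROM hist JOIN feature f ON f.feature_id = hist.feature_id
        WHERE hist.depth > 0
        ORDER BY f.uniquename""", (uniquename,)).fetchall()
    return [r[0] for r in rows]
```

Because each split or merge is its own row, a model can have several predecessors without any change to the schema.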
> Because we have gene models from several different maize genome
> assemblies, we run an analysis to find likely orthologs across the multiple
> gene model sets. These are also linked via feature_relationship records.
>
> Hope this helps.
>
> Ethy
> ------------------------------
> *From:* Joe Carlson <jwc...@lb...>
> *Sent:* Wednesday, September 12, 2018 4:36 PM
> *To:* Sofia Robb
> *Cc:* GMOD Schema/Chado List
> *Subject:* Re: [Gmod-schema] gene/mRNA version
>
>
> On Sep 12, 2018, at 1:43 PM, Sofia Robb <so...@so...> wrote:
>
> Good point about merging and splitting genes.
>
> I think this is meant to be a pretty stable assembly and the hopes are
> that the annotations are good. But split and merged genes are quite typical
> issues I have seen in many different annotation sets, and I suspect we will
> find some in this as well. My first gut solution to merging or splitting is
> that these would have to have new stable ids, if we go the stable ID route.
>
> When you say using the feature_relationship table for tracking are you
> thinking that the cvterm_id would be some term like version_of and the
> subject would be the versioned feature and the object would be the stable
> feature (new_version 'version_of' stable_version)? Or are you saying that
> the stable ID route isn't great in your opinion and that the cvterm should
> be something like new_feature 'is_new_version_of' old_feature?
>
>
> I was thinking of having a ‘previous_version_of’ (or some such label) and
> linking annotations through the feature_relationship table. I really don’t
> know which solution is best: it depends on what you want the tracking to do
> and how fine-grained you need it to be. My one concern with merges is
> that you’ll not be able to have multiple stable IDs for one gene unless
> you keep track of the rank field or modify the schema.
>
> joe
>
>
> Thank you for taking your time to discuss this with me.
> Sofia
>
>
> On Wed, Sep 12, 2018 at 2:28 PM, Joe Carlson <jwc...@lb...> wrote:
>
>
> On Sep 12, 2018, at 1:09 PM, Sofia Robb <so...@so...> wrote:
>
> Hi Joe and other Chado users,
>
> Joe, Thanks for your response. I would like to know more about your data.
> I have a few questions and will follow them up with a dump of my current
> ideas on how to solve this.
>
>
> I’m managing the backend db for the phytozome project at JGI (
> phytozome.jgi.doe.gov), a comparative land plant db. We have ~ 250 plant
> genomes (assemblies, annotation and analysis results) loaded right now. The
> size of the db is ~ 1.5T.
>
>
> Are you the source of the sequence?
>
>
> We have the land plants sequenced by the JGI, things done by collaborators
> and other model organisms. It’s roughly an equal mixture of each.
>
> Or are you pulling the data from another database?
>
>
> Data import is with fasta files for chromosomes and proteins; gff3 for
> structure.
>
> What do you do if the actual sequence changes? Do you just overwrite the
> previous sequence data?
>
>
> I never overwrite or delete. Once it goes in the database, it stays in the
> database.
>
>
> We are going to be the official repository of this data and have been
> asked to keep track of the history of changes. This is more than I have had
> to keep track of in the past.
>
> I had been thinking of trying to implement some loading of the data which
> gets across the idea that each feature has a stable version which is equal
> to its current version and any number of older versions. Now this is
> just an idea (largely based on the representation of data from Ensembl).
>
> The stable version would have a stable id which lacks the '.\d' suffix.
> And there would be a feature record for each version which includes the
> '.\d' suffix. I would mark older versions obsolete. What I am still working
> on in this idea is what I could add as properties (gff 9th column) to help
> with searches. Perhaps I could add a stableID=xyz in each record? I think
> this would help with queries: I could search on the stableID and the
> obsolete flag when I need to retrieve the history of changes.
>
> feature.uniquename: some_gene.1
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: true
>
> feature.uniquename: some_gene.2
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
>
> feature.uniquename: some_gene
> featureprop.cvterm_id: some term that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
>
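The three records above could be queried for history along these lines. This is only a sketch with simplified sqlite3 tables: the 'stableID' label stands in for whatever cvterm would actually be chosen, and the helper name is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT UNIQUE,
    is_obsolete INTEGER NOT NULL DEFAULT 0);
-- Real Chado featureprop points at a cvterm; a text label is assumed here.
CREATE TABLE featureprop (
    feature_id INTEGER REFERENCES feature(feature_id),
    type TEXT, value TEXT);
""")
# The three example records from the message above.
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "some_gene.1", 1), (2, "some_gene.2", 0), (3, "some_gene", 0)])
conn.executemany("INSERT INTO featureprop VALUES (?, 'stableID', ?)", [
    (1, "some_gene"), (2, "some_gene"), (3, "some_gene")])


def history(conn, stable_id, include_obsolete=True):
    """Return (uniquename, is_obsolete) for every record sharing a stableID."""
    sql = """
        SELECT f.uniquename, f.is_obsolete
        FROM feature f
        JOIN featureprop fp ON fp.feature_id = f.feature_id
        WHERE fp.type = 'stableID' AND fp.value = ?"""
    if not include_obsolete:
        sql += " AND f.is_obsolete = 0"
    sql += " ORDER BY f.uniquename"
    rows = conn.execute(sql, (stable_id,)).fetchall()
    return [(name, bool(obsolete)) for name, obsolete in rows]
```

Note that filtering on is_obsolete alone returns both some_gene.2 and the unversioned some_gene record, so the current version and the stable record would still need to be told apart some other way.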
>
> How you do this depends a bit on the nature of the reannotations. If you
> have a fairly stable assembly and annotation then it entirely makes sense
> to count on there being a stable identifier. In what I have, we often have
> dramatically different assemblies from one version to another (many of our
> assemblies do not have pseudo molecules) and we cannot count on stable ids.
>
> Your assigning a stable id as a property will work if the changes are not
> too extensive. But think of the case where 2 genes in 1 version are
> modified in such a way that 1 gene is split and half is merged into another
> gene. What rules are you going to use to assign the stable id for the
> merged gene?
>
> An alternative tracking mechanism between versions is to use a
> feature_relationship. You could keep track of things a bit better with this
> table if there are extensive merges and splits. For the most part we are
> not maintaining gene history except in a few of our important genomes.
>
> Joe
>
>
>
> Thank you,
> Sofia
>
>
>
> On Wed, Sep 12, 2018 at 1:34 PM, Joe Carlson <jwc...@lb...> wrote:
>
> For what it’s worth, I’ve been using dbxref’s to track annotation
> versions. I’ve modified the schema to make dbxref_id in the feature table
> to be not null, and use a record in the dbxref table to label the source -
> and version - of the data.
>
> Appending a numerical identifier to the name means that a query for a
> particular version will require a VERY expensive SQL constraint
> "and name like '%.N'" in the queries.
>
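A small sketch of the contrast: when the version lives in a dbxref record, selecting one annotation version becomes an equality join rather than a pattern match on the name. The tables are simplified sqlite3 stand-ins, and the accession, uniquename format, and helper name are assumptions for illustration; the actual Phytozome schema changes are not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified: real Chado dbxref also carries db_id and description.
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    accession TEXT, version TEXT);
-- dbxref_id is NOT NULL here, mirroring the modification described above.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name TEXT,
    uniquename TEXT UNIQUE,
    dbxref_id INTEGER NOT NULL REFERENCES dbxref(dbxref_id));
CREATE INDEX feature_dbxref_idx ON feature (dbxref_id);
""")
# One dbxref row per annotation release (hypothetical accession).
conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)", [
    (1, "example_annotation", "1"), (2, "example_annotation", "2")])
# The gene name carries no version suffix; uniquename format is made up.
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)", [
    (1, "gene_x", "v1|gene_x", 1),
    (2, "gene_x", "v2|gene_x", 2),
    (3, "gene_y", "v2|gene_y", 2)])


def features_for_version(conn, accession, version):
    """Return feature names belonging to one annotation version."""
    rows = conn.execute("""
        SELECT f.name
        FROM feature f
        JOIN dbxref d ON d.dbxref_id = f.dbxref_id
        WHERE d.accession = ? AND d.version = ?
        ORDER BY f.name""", (accession, version)).fetchall()
    return [r[0] for r in rows]
```

The equality conditions can use ordinary indexes, whereas a LIKE pattern with a leading wildcard generally cannot.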
> Joe
>
>
> On Sep 12, 2018, at 12:16 PM, Sofia Robb <so...@so...> wrote:
>
> Hello All,
>
> I have a question about how others are handling sequence feature versions.
> I am using Tripal and have posted this question in the Tripal repository
> Issues as well.
>
> I have a group that is developing gene/mRNA models. They are using an
> Ensembl-like system for versioning gene and transcript IDs, and they want
> to maintain a history of previous versions.
>
> They plan on incrementing a digit after the id when a new version is
> generated.
>
> gene nv2m00005394.1
> mRNA nv2m00005394.1.mRNA.1
>
> Chr11	GFF3Conv	gene	3598792	3603486	.	-	.	Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1
> Chr11	GFF3Conv	mRNA	3598792	3603486	.	-	.	ID=nv2m00005394.1.mRNA.1;Parent=nv2m00005394.1
>
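For identifiers of this shape, the base ID and version could be separated at load time with something like the following sketch. It assumes the trailing .<digits> is always the version under this group's convention (in other schemes the same suffix can mean an isoform), and the function names are made up for illustration.

```python
import re
from urllib.parse import unquote

# Greedy base, then a final ".<digits>" taken as the version (an assumption
# about this group's naming convention).
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<version>\d+)$")


def parse_attributes(column9):
    """Parse the GFF3 attributes column (key=value pairs joined by ';')."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


def split_versioned_id(identifier):
    """Split 'base.N' into (base, N); return (identifier, None) otherwise."""
    match = _VERSION_SUFFIX.match(identifier)
    if match is None:
        return identifier, None
    return match.group("base"), int(match.group("version"))


# The gene line from the example above, rebuilt with explicit tabs.
gene_line = "\t".join([
    "Chr11", "GFF3Conv", "gene", "3598792", "3603486", ".", "-", ".",
    "Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1"])
attrs = parse_attributes(gene_line.split("\t")[8])
gene_base, gene_version = split_versioned_id(attrs["ID"])
```

Storing the base and version separately would let lookups by base ID find every version of a gene, including ones the user did not know existed.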
> How should I handle this? Create a new feature for each version and mark
> the old one obsolete? How do I make it easy for users to find the correct
> ID when they don't know there has been an update? I have some ideas, but they
> would require the gene IDs and mRNA IDs to have different bases, i.e.,
> nv2g00005394 (m changed to g) for the gene and nv2m00005394 for the mRNA.
>
> Any advice would be fantastic!!!
> Thank you!
> Sofia
>
> _______________________________________________
> Gmod-schema mailing list
> Gmo...@li...
> https://lists.sourceforge.net/lists/listinfo/gmod-schema
>