Re: [Gmod-schema] gene/mRNA version


Sorry to be late to the party; this is something I've worked on at length with maize genes.

First: I'll note that the AgBioData consortium (https://www.agbiodata.org) is forming a genome and gene model nomenclature group. Anyone working with genome and/or gene model nomenclature is welcome to join. There is a recording (https://www.youtube.com/watch?v=kNW6YReFP28&feature=youtu.be) of our nomenclature discussion last week, and copies of the slides (https://www.agbiodata.org/sites/default/files/Genome%20Nomenclature%20meeting%20slides.pdf) are available.

After playing with the idea of versioning gene models, the maize group decided to version the sets instead. We haven't (yet) been successful with hand-curation of gene models and instead improve the gene model sets via re-analysis.

Note that the .<digit> suffix indicates alternative isoforms in many nomenclature patterns.

An analysis record represents each gene model set, and gene models are linked to it via analysisfeature. The same gene model feature record can be attached to multiple versions if it hasn't changed. Sequence isn't stored in the feature record but is retrieved from the appropriate BLAST db as needed. This takes care of (rare?) situations in which the name stays the same but there are minor changes to the sequence. I have a rather clunky way of indicating the current version via analysisprop.
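
A minimal sketch of the layout described above, using Python's built-in sqlite3 module and heavily simplified stand-ins for the Chado analysis, feature, and analysisfeature tables. The real tables have more columns, and the set and gene model names here are invented for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE analysisfeature (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    feature_id  INTEGER REFERENCES feature(feature_id),
    UNIQUE (analysis_id, feature_id)
);
""")

# One analysis record per gene model set (hypothetical names).
conn.executemany("INSERT INTO analysis VALUES (?, ?)",
                 [(1, "gene_set_v1"), (2, "gene_set_v2")])
conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [(10, "geneA"), (11, "geneB"), (12, "geneC")])
# geneA is unchanged, so the same feature record is attached to both sets.
conn.executemany("INSERT INTO analysisfeature VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 10), (2, 12)])


def gene_models_in_set(conn, set_name):
    """Return the uniquenames of all gene models linked to one set."""
    rows = conn.execute(
        """SELECT f.uniquename
             FROM feature f
             JOIN analysisfeature af ON af.feature_id = f.feature_id
             JOIN analysis a ON a.analysis_id = af.analysis_id
            WHERE a.name = ?
            ORDER BY f.uniquename""",
        (set_name,))
    return [r[0] for r in rows]
```

Calling gene_models_in_set(conn, "gene_set_v2") lists geneA alongside geneC without a second feature record for the unchanged model.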

There is a request in for the addition of an analysis.type_id field for Chado 1.4 (https://github.com/GMOD/Chado/pull/52).

For maintaining history, I use feature_relationship with a set of cvterms indicating, for example, whether a gene model has been split or merged. Split and merged gene models get new names.
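
As a rough illustration of history tracking through feature_relationship, the sketch below uses Python's sqlite3 with cut-down tables. The term names split_from and merged_from, the direction of the relationship (new model as subject, old model as object), and all gene names are assumptions made for the example, not the terms actually used for maize:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id  INTEGER REFERENCES feature(feature_id),
    type_id    INTEGER REFERENCES cvterm(cvterm_id),
    UNIQUE (subject_id, object_id, type_id)
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [(1, "old1"), (2, "old2"), (3, "old3"),
                  (4, "new1a"), (5, "new1b"), (6, "new23")])
conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                 [(1, "split_from"), (2, "merged_from")])
# old1 was split into new1a and new1b; old2 and old3 were merged into new23.
conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?)",
                 [(4, 1, 1), (5, 1, 1), (6, 2, 2), (6, 3, 2)])


def predecessors(conn, uniquename):
    """Return (relationship, older gene model) pairs for one gene model."""
    rows = conn.execute(
        """SELECT t.name, o.uniquename
             FROM feature_relationship r
             JOIN feature s ON s.feature_id = r.subject_id
             JOIN feature o ON o.feature_id = r.object_id
             JOIN cvterm t ON t.cvterm_id = r.type_id
            WHERE s.uniquename = ?
            ORDER BY o.uniquename""",
        (uniquename,))
    return [(r[0], r[1]) for r in rows]
```

Because split and merged models get new names, each new name maps back to one or more older records without any identifier being reused.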

Because we have gene models from several different maize genome assemblies, we run an analysis to find likely orthologs across the multiple gene model sets. These are also linked via feature_relationship records.

Hope this helps.

Ethy
________________________________
From: Joe Carlson <jwc...@lb...>
Sent: Wednesday, September 12, 2018 4:36 PM
To: Sofia Robb
Cc: GMOD Schema/Chado List
Subject: Re: [Gmod-schema] gene/mRNA version

On Sep 12, 2018, at 1:43 PM, Sofia Robb <so...@so...<mailto:so...@so...>> wrote:

Good point about merging and splitting genes.

I think this is meant to be a pretty stable assembly and the hopes are that the annotations are good. But split and merged genes are quite typical issues I have seen in many different annotation sets, and I suspect we will find some in this as well. My first gut solution to merging or splitting is that these would have to have new stable ids, if we go the stable ID route.

When you say using the feature_relationship table for tracking are you thinking that the cvterm_id would be some term like version_of and the subject would be the versioned feature and the object would be the stable feature (new_version 'version_of' stable_version)? Or are you saying that the stable ID route isn't great in your opinion and that the cvterm should be something like new_feature 'is_new_version_of' old_feature?

I was thinking of having a ‘previous_version_of’ (or some such label) and linking annotations through the feature_relationship table. I really don’t know which solution is best: it depends on what you want the tracking to do, and how fine-grained you need it to be. My one concern with merges is that you won’t be able to have multiple stable IDs for one gene unless you keep track of the rank field or modify the schema.
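
The rank concern can be sketched as follows, assuming the stable ID is stored as a featureprop (as proposed earlier in the thread) and that featureprop is unique on (feature_id, type_id, rank), which is how the stock Chado schema appears to define it. The table is cut down, and the term id, feature id, and gene names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE featureprop (
    feature_id INTEGER NOT NULL,
    type_id    INTEGER NOT NULL,
    value      TEXT,
    rank       INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, type_id, rank)
)""")

STABLE_ID_TERM = 1   # hypothetical cvterm_id for the "stableID" concept
MERGED_GENE = 100    # hypothetical feature_id of a merged gene


def add_stable_id(conn, feature_id, value):
    """Store one more stable ID for a feature at the next unused rank."""
    (next_rank,) = conn.execute(
        "SELECT COALESCE(MAX(rank) + 1, 0) FROM featureprop "
        "WHERE feature_id = ? AND type_id = ?",
        (feature_id, STABLE_ID_TERM)).fetchone()
    conn.execute(
        "INSERT INTO featureprop (feature_id, type_id, value, rank) "
        "VALUES (?, ?, ?, ?)",
        (feature_id, STABLE_ID_TERM, value, next_rank))
    return next_rank


# A merged gene inherits the stable IDs of both of its parents.
ranks = [add_stable_id(conn, MERGED_GENE, v) for v in ("gene_x", "gene_y")]


def duplicate_rank_allowed(conn):
    """Try to add a third stable ID at rank 0 without bumping the rank."""
    try:
        conn.execute(
            "INSERT INTO featureprop (feature_id, type_id, value, rank) "
            "VALUES (?, ?, ?, 0)",
            (MERGED_GENE, STABLE_ID_TERM, "gene_z"))
    except sqlite3.IntegrityError:
        return False
    return True
```

The second stable ID only goes in because the loader assigns it rank 1; an insert that leaves the rank at 0 is rejected by the unique constraint.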

joe

Thank you for taking your time to discuss this with me.
Sofia

On Wed, Sep 12, 2018 at 2:28 PM, Joe Carlson <jwc...@lb...<mailto:jwc...@lb...>> wrote:

On Sep 12, 2018, at 1:09 PM, Sofia Robb <so...@so...<mailto:so...@so...>> wrote:

Hi Joe and other Chado users,

Joe, Thanks for your response. I would like to know more about your data. I have a few questions and will follow them up with a dump of my current ideas on how to solve this.

I’m managing the backend db for the phytozome project at JGI (phytozome.jgi.doe.gov<http://phytozome.jgi.doe.gov/>), a comparative land plant db. We have ~ 250 plant genomes (assemblies, annotation and analysis results) loaded right now. The size of the db is ~ 1.5T.

Are you the source of the sequence?

We have the land plants sequenced by the JGI, things done by collaborators and other model organisms. It’s roughly an equal mixture of each.

Or are you pulling the data from another database?

Data import is with fasta files for chromosomes and proteins; gff3 for structure.
What do you do if the actual sequence changes? Do you just overwrite the previous sequence data?

I never overwrite or delete. Once it goes in the database, it stays in the database.

We are going to be the official repository of this data and have been asked to keep track of the history of changes. This is more than I have had to keep track of in the past.

I had been thinking of trying to load the data in a way that gets across the idea that each feature has a stable version, which is equal to its current version, and any number of older versions. Now this is just an idea (largely based on the representation of data from Ensembl).

The stable version would have a stable ID which lacks the '.\d' suffix. And there would be a feature record for each version which includes the '.\d' suffix. I would mark older versions obsolete. What I am still working on in this idea is what I could add as properties (GFF 9th column) to help with searches. Perhaps I could add a stableID=xyz in each record? I think this would help with queries: I could search for the stableID and the obsolete flag when I need to retrieve the history of changes.

feature.uniquename: some_gene.1
featureprop.cvterm_id: some term that indicates the concept stableID
featureprop.value: some_gene
feature.is_obsolete: true

feature.uniquename: some_gene.2
featureprop.cvterm_id: some term that indicates the concept stableID
featureprop.value: some_gene
feature.is_obsolete: false

feature.uniquename: some_gene
featureprop.cvterm_id: some term that indicates the concept stableID
featureprop.value: some_gene
feature.is_obsolete: false
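
The three records above could be queried for history roughly like this. It is only a sketch, using Python's sqlite3 with cut-down feature and featureprop tables; the cvterm id standing for the stableID concept is a placeholder:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    uniquename  TEXT UNIQUE NOT NULL,
    is_obsolete INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE featureprop (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type_id    INTEGER NOT NULL,
    value      TEXT
);
""")

STABLE_ID_TERM = 1  # placeholder cvterm_id for the "stableID" concept

records = [(1, "some_gene.1", 1), (2, "some_gene.2", 0), (3, "some_gene", 0)]
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", records)
# Every version, and the stable record itself, carries stableID=some_gene.
conn.executemany("INSERT INTO featureprop VALUES (?, ?, ?)",
                 [(fid, STABLE_ID_TERM, "some_gene") for fid, _, _ in records])


def history(conn, stable_id, obsolete_only=False):
    """Return (uniquename, is_obsolete) for every record of a stable ID."""
    sql = """SELECT f.uniquename, f.is_obsolete
               FROM feature f
               JOIN featureprop p ON p.feature_id = f.feature_id
              WHERE p.type_id = ? AND p.value = ?"""
    if obsolete_only:
        sql += " AND f.is_obsolete = 1"
    sql += " ORDER BY f.uniquename"
    return [(name, bool(flag))
            for name, flag in conn.execute(sql, (STABLE_ID_TERM, stable_id))]
```

The lookup is an equality match on featureprop.value, so it does not need to pattern-match on the name suffix.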

How you do this depends a bit on the nature of the reannotations. If you have a fairly stable assembly and annotation then it entirely makes sense to count on there being a stable identifier. In what I have, we often have dramatically different assemblies from one version to another (many of our assemblies do not have pseudo molecules) and we cannot count on stable ids.

Your assigning a stable id as a property will work if the changes are not too extensive. But think of the case where 2 genes in 1 version are modified in such a way that 1 gene is split and half is merged into another gene. What rules are you going to use to assign the stable id for the merged gene?

An alternative tracking mechanism between versions is to use a feature_relationship. You could keep track of things a bit better with this table if there are extensive merges and splits. For the most part we are not maintaining gene history except in a few of our important genomes.

Joe

Thank you,
Sofia

On Wed, Sep 12, 2018 at 1:34 PM, Joe Carlson <jwc...@lb...<mailto:jwc...@lb...>> wrote:
For what it’s worth, I’ve been using dbxrefs to track annotation versions. I’ve modified the schema to make dbxref_id in the feature table NOT NULL, and I use a record in the dbxref table to label the source - and version - of the data.

Appending a numerical identifier to the name means that a query for a particular version will require a VERY expensive SQL constraint, "and name like '%.N'", in the queries.
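
A hedged sketch of the dbxref approach, again in Python's sqlite3 with simplified tables: the version lives in the dbxref record, so selecting a version is an equality join rather than a suffix match on the name. The accession and gene names are invented, and the real dbxref table also has db_id and description columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL,
    version   TEXT NOT NULL DEFAULT '',
    UNIQUE (accession, version)
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    dbxref_id  INTEGER NOT NULL REFERENCES dbxref(dbxref_id)
);
CREATE INDEX feature_dbxref_idx ON feature (dbxref_id);
""")
# One dbxref per (source, version) of the annotation; names are made up.
conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)",
                 [(1, "example_annotation", "1"),
                  (2, "example_annotation", "2")])
# The gene name carries no version suffix; the dbxref carries the version.
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                 [(10, "some_gene", 1), (11, "other_gene", 1),
                  (12, "some_gene", 2)])


def features_in_version(conn, accession, version):
    """Return feature names belonging to one annotation version."""
    rows = conn.execute(
        """SELECT f.name
             FROM feature f
             JOIN dbxref x ON x.dbxref_id = f.dbxref_id
            WHERE x.accession = ? AND x.version = ?
            ORDER BY f.name""",
        (accession, version))
    return [r[0] for r in rows]
```

A pattern with a leading wildcard such as '%.N' generally cannot use an ordinary index on name, whereas the join on dbxref_id can.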

Joe

On Sep 12, 2018, at 12:16 PM, Sofia Robb <so...@so...<mailto:so...@so...>> wrote:

Hello All,

I have a question about how others are handling sequence feature versions. I am using Tripal and have posted this question in the Tripal repository Issues as well.

I have a group that is developing gene/mRNA models. They are using an Ensembl-like system for versioning gene and transcript IDs, and they want to maintain a history of previous versions.

They plan on incrementing a digit after the id when a new version is generated.

gene nv2m00005394.1
mRNA nv2m00005394.1.mRNA.1

Chr11   GFF3Conv        gene    3598792 3603486 .       -       .       Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1
Chr11   GFF3Conv        mRNA    3598792 3603486 .       -       .       ID=nv2m00005394.1.mRNA.1;Parent=nv2m00005394.1

How should I handle this? Create a new feature for each version and mark the old one obsolete? How do I make it easy for users to find the correct ID when they don't know there has been an update? I have some ideas, but they would require the gene IDs and mRNA IDs to have different bases, i.e. nv2g00005394 (change g->m) for gene and nv2m00005394 for mRNA.
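
One possible way to separate the base ID from the version digit at load time, sketched in plain Python. The function names are made up for this example. The code splits the line on any whitespace because the tabs did not survive in this archive, which works only because these attribute columns contain no spaces:

```python
import re

# Trailing ".<digits>" is treated as the version; everything before it
# is the base ID.
VERSION_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<version>\d+)$")


def parse_attributes(column9):
    """Turn a GFF3 attribute column into a dict of key -> value."""
    return dict(pair.split("=", 1)
                for pair in column9.strip().split(";") if pair)


def split_version(identifier):
    """Return (base, version), or (identifier, None) if unversioned."""
    match = VERSION_SUFFIX.match(identifier)
    if match is None:
        return identifier, None
    return match.group("base"), int(match.group("version"))


def latest_ids(identifiers):
    """Map each base ID to its highest-versioned full identifier."""
    latest = {}
    for ident in identifiers:
        base, version = split_version(ident)
        if version is None:
            continue
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, ident)
    return {base: ident for base, (_, ident) in latest.items()}


line = ("Chr11 GFF3Conv gene 3598792 3603486 . - . "
        "Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1")
fields = line.split(None, 8)          # at most 9 columns
attrs = parse_attributes(fields[8])   # 9th column holds the attributes
gene_base, gene_version = split_version(attrs["ID"])
```

Note that with the ID pattern above, the mRNA ID nv2m00005394.1.mRNA.1 splits into the base nv2m00005394.1.mRNA and version 1, so the mRNA base itself changes whenever the gene version is incremented.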

Any advice would be fantastic!!!

Thank you!
Sofia

_______________________________________________
Gmod-schema mailing list
Gmo...@li...<mailto:Gmo...@li...>
https://lists.sourceforge.net/lists/listinfo/gmod-schema