Re: [Gmod-schema] gene/mRNA version

> On Sep 12, 2018, at 1:43 PM, Sofia Robb <so...@so...> wrote:
> 
> Good point about merging and splitting genes.
> 
> I think this is meant to be a pretty stable assembly and the hopes are that the annotations are good. But split and merged genes are quite typical issues I have seen in many different annotation sets, and I suspect we will find some in this as well. My first gut solution to merging or splitting is that these would have to have new stable ids, if we go the stable ID route. 
> 
> When you say using the feature_relationship table for tracking are you thinking that the cvterm_id would be some term like version_of and the subject would be the versioned feature and the object would be the stable feature (new_version 'version_of' stable_version)? Or are you saying that the stable ID route isn't great in your opinion and that the cvterm should be something like new_feature 'is_new_version_of' old_feature?

I was thinking of having a ‘previous_version_of’ term (or some such label) and linking annotations through the feature_relationship table. I really don’t know which solution is best: it depends on what you want the tracking to do, and on how fine-grained it needs to be. My one concern with merges is that you won’t be able to have multiple stable IDs for one gene unless you keep track of the rank field or modify the schema.
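
To make the feature_relationship idea concrete, here is a rough sketch using Python's sqlite3 module. The three tables are heavily simplified stand-ins for the Chado feature, cvterm and feature_relationship tables, the gene names are made up, and the helper names (build_demo, link, predecessors) are hypothetical. It assumes the old feature is the subject and the new feature is the object of 'previous_version_of'; the opposite orientation would work just as well.

```python
import sqlite3

# Simplified stand-ins for Chado tables; the real ones carry more columns
# (organism_id, type_id, ...). This is an illustration, not the Chado DDL.
SCHEMA = """
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    uniquename  TEXT NOT NULL UNIQUE,
    is_obsolete INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    rank       INTEGER NOT NULL DEFAULT 0,
    UNIQUE (subject_id, object_id, type_id, rank)
);
"""


def link(conn, old, new):
    """Record that feature `old` is a previous version of feature `new`."""
    conn.execute(
        """INSERT INTO feature_relationship (subject_id, object_id, type_id)
           SELECT o.feature_id, n.feature_id, t.cvterm_id
           FROM feature o, feature n, cvterm t
           WHERE o.uniquename = ? AND n.uniquename = ?
             AND t.name = 'previous_version_of'""",
        (old, new),
    )


def build_demo():
    """Two obsolete genes (made-up names) merged into one current gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO cvterm (name) VALUES ('previous_version_of')")
    conn.executemany(
        "INSERT INTO feature (uniquename, is_obsolete) VALUES (?, ?)",
        [("gene_a.1", 1), ("gene_b.1", 1), ("gene_ab.1", 0)],
    )
    for old in ("gene_a.1", "gene_b.1"):
        link(conn, old, "gene_ab.1")
    return conn


def predecessors(conn, uniquename):
    """Walk 'previous_version_of' links backwards, through any number of steps."""
    rows = conn.execute(
        """WITH RECURSIVE ancestors(feature_id) AS (
               SELECT fr.subject_id
               FROM feature_relationship fr
               JOIN feature f ON f.feature_id = fr.object_id
               JOIN cvterm t ON t.cvterm_id = fr.type_id
               WHERE f.uniquename = ? AND t.name = 'previous_version_of'
               UNION
               SELECT fr.subject_id
               FROM feature_relationship fr
               JOIN ancestors a ON a.feature_id = fr.object_id
               JOIN cvterm t ON t.cvterm_id = fr.type_id
               WHERE t.name = 'previous_version_of'
           )
           SELECT f.uniquename
           FROM ancestors a
           JOIN feature f ON f.feature_id = a.feature_id
           ORDER BY f.uniquename""",
        (uniquename,),
    )
    return [row[0] for row in rows]
```

Calling predecessors(build_demo(), "gene_ab.1") returns both retired genes, so a merge needs no single inherited stable ID. A split would be the mirror image: one old subject linked to two new objects.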

joe
> 
> Thank you for taking your time to discuss this with me.
> Sofia
> 
> 
> On Wed, Sep 12, 2018 at 2:28 PM, Joe Carlson <jwc...@lb... <mailto:jwc...@lb...>> wrote:
> 
>> On Sep 12, 2018, at 1:09 PM, Sofia Robb <so...@so... <mailto:so...@so...>> wrote:
>> 
>> Hi Joe and other Chado users,
>> 
>> Joe, Thanks for your response. I would like to know more about your data. I have a few questions and will follow them up with a dump of my current ideas on how to solve this.
> 
> I’m managing the backend db for the Phytozome project at JGI (http://phytozome.jgi.doe.gov/), a comparative land plant db. We have ~250 plant genomes (assemblies, annotation and analysis results) loaded right now. The size of the db is ~1.5T.
>> 
>> Are you the source of the sequence? 
> 
> We have the land plants sequenced by the JGI, things done by collaborators and other model organisms. It’s roughly an equal mixture of each.
> 
>> Or are you pulling the data from another database?
> 
> Data import is with fasta files for chromosomes and proteins; gff3 for structure.
>> What do you do if the actual sequence changes? Do you just overwrite the previous sequence data?
> 
> I never overwrite or delete. Once it goes in the database, it stays in the database.
>> 
>> We are going to be the official repository of this data and have been asked to keep track of the history of changes. This is more than I have had to keep track of in the past.
>> 
>> I had been thinking of loading the data in a way that gets across the idea that each feature has a stable version, which is equal to its current version, and any number of older versions. Now this is just an idea (largely based on the representation of data from Ensembl).
>> 
>> The stable version would have a stable ID which lacks the '.\d' suffix, and there would be a feature record for each version which includes the '.\d' suffix. I would mark older versions obsolete. What I am still working out is what I could add as properties (GFF3 9th column) to help with searches. Perhaps I could add a stableID=xyz in each record? I think this would help with queries: I could search on the stableID and the obsolete flag when I need to retrieve the history of changes.
>> 
>> feature.uniquename: some_gene.1 
>> featureprop.cvterm_id: some term that indicates the concept stableID
>> featureprop.value: some_gene
>> feature.is_obsolete: true
>> 
>>  
>> feature.uniquename: some_gene.2
>> featureprop.cvterm_id: some term that indicates the concept stableID
>> featureprop.value: some_gene
>> feature.is_obsolete: false
>> 
>> feature.uniquename: some_gene
>> featureprop.cvterm_id: some term that indicates the concept stableID
>> featureprop.value: some_gene
>> feature.is_obsolete: false
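
A minimal sketch of how the three records above could be stored and queried, using Python's sqlite3 module. The tables are simplified stand-ins for the Chado feature, cvterm and featureprop tables, the cvterm name 'stableID' is the placeholder from the listing rather than an existing ontology term, and build_demo and versions_of are hypothetical helper names. is_obsolete is an integer here because SQLite has no boolean type.

```python
import sqlite3


def build_demo():
    """Load the three example records: two versions plus the stable record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE feature (
            feature_id  INTEGER PRIMARY KEY,
            uniquename  TEXT NOT NULL UNIQUE,
            is_obsolete INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE cvterm (
            cvterm_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE
        );
        CREATE TABLE featureprop (
            feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
            type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
            value      TEXT,
            rank       INTEGER NOT NULL DEFAULT 0,
            UNIQUE (feature_id, type_id, rank)
        );
    """)
    conn.execute("INSERT INTO cvterm (name) VALUES ('stableID')")
    records = [("some_gene.1", 1), ("some_gene.2", 0), ("some_gene", 0)]
    for uniquename, obsolete in records:
        cur = conn.execute(
            "INSERT INTO feature (uniquename, is_obsolete) VALUES (?, ?)",
            (uniquename, obsolete),
        )
        # Every record, versioned or not, carries the same stableID property.
        conn.execute(
            """INSERT INTO featureprop (feature_id, type_id, value)
               SELECT ?, cvterm_id, 'some_gene' FROM cvterm
               WHERE name = 'stableID'""",
            (cur.lastrowid,),
        )
    return conn


def versions_of(conn, stable_id, obsolete_only=False):
    """Return (uniquename, is_obsolete) for every feature sharing a stableID."""
    sql = """SELECT f.uniquename, f.is_obsolete
             FROM feature f
             JOIN featureprop p ON p.feature_id = f.feature_id
             JOIN cvterm t ON t.cvterm_id = p.type_id
             WHERE t.name = 'stableID' AND p.value = ?"""
    if obsolete_only:
        sql += " AND f.is_obsolete = 1"
    sql += " ORDER BY f.uniquename"
    return [(name, bool(flag)) for name, flag in conn.execute(sql, (stable_id,))]
```

versions_of(conn, "some_gene") lists all three records, and obsolete_only=True narrows that to the retired versions. The lookup is an equality match on featureprop.value, so it needs no pattern matching on the feature name.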
> 
> How you do this depends a bit on the nature of the reannotations. If you have a fairly stable assembly and annotation then it entirely makes sense to count on there being a stable identifier. In what I have, we often have dramatically different assemblies from one version to another (many of our assemblies do not have pseudo molecules) and we cannot count on stable ids.
> 
> Assigning a stable ID as a property will work if the changes are not too extensive. But think of the case where 2 genes in 1 version are modified in such a way that 1 gene is split and half is merged into another gene. What rules are you going to use to assign the stable ID for the merged gene?
> 
> An alternative tracking mechanism between versions is to use a feature_relationship. You could keep track of things a bit better with this table if there are extensive merges and splits. For the most part we are not maintaining gene history except in a few of our important genomes.
> 
> Joe
>> 
>> 
>> Thank you,
>> Sofia
>> 
>> 
>> 
>> On Wed, Sep 12, 2018 at 1:34 PM, Joe Carlson <jwc...@lb... <mailto:jwc...@lb...>> wrote:
>> For what it’s worth, I’ve been using dbxrefs to track annotation versions. I’ve modified the schema to make dbxref_id in the feature table NOT NULL, and I use a record in the dbxref table to label the source, and version, of the data.
>> 
>> Appending a numerical identifier to the name means that a query for a particular version will require a VERY expensive SQL constraint, "and name like '%.N'", in the queries.
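
For illustration, a sketch contrasting the two query styles with Python's sqlite3 module. The tables are simplified stand-ins for the Chado dbxref and feature tables (the real dbxref also references a db table, omitted here), the accession 'demo_annotation' and the gene names are invented, and the helper names are hypothetical. The dbxref version is an equality join that an index on feature.dbxref_id can serve; a pattern with a leading wildcard generally cannot use an ordinary B-tree index on name, so it tends to scan every row.

```python
import sqlite3


def build_demo():
    """Three features labelled with the annotation version they came from."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dbxref (
            dbxref_id INTEGER PRIMARY KEY,
            accession TEXT NOT NULL,
            version   TEXT NOT NULL DEFAULT '',
            UNIQUE (accession, version)
        );
        CREATE TABLE feature (
            feature_id INTEGER PRIMARY KEY,
            name       TEXT,
            uniquename TEXT NOT NULL UNIQUE,
            dbxref_id  INTEGER NOT NULL REFERENCES dbxref(dbxref_id)
        );
        CREATE INDEX feature_dbxref_idx ON feature (dbxref_id);
    """)
    conn.executemany(
        "INSERT INTO dbxref (dbxref_id, accession, version) VALUES (?, ?, ?)",
        [(1, "demo_annotation", "1"), (2, "demo_annotation", "2")],
    )
    conn.executemany(
        "INSERT INTO feature (name, uniquename, dbxref_id) VALUES (?, ?, ?)",
        [
            ("gene_x.1", "gene_x.1", 1),
            ("gene_y.1", "gene_y.1", 1),
            ("gene_x.2", "gene_x.2", 2),
        ],
    )
    return conn


def names_by_dbxref_version(conn, accession, version):
    """Select a version through the dbxref label: plain equality joins."""
    rows = conn.execute(
        """SELECT f.name
           FROM feature f
           JOIN dbxref x ON x.dbxref_id = f.dbxref_id
           WHERE x.accession = ? AND x.version = ?
           ORDER BY f.name""",
        (accession, version),
    )
    return [row[0] for row in rows]


def names_by_suffix(conn, version):
    """Select a version by name suffix: the leading-wildcard LIKE to avoid."""
    rows = conn.execute(
        "SELECT name FROM feature WHERE name LIKE '%.' || ? ORDER BY name",
        (version,),
    )
    return [row[0] for row in rows]
```

Both functions return the same names on this toy data. The difference only matters at scale, where the suffix match has to examine every feature name.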
>> 
>> Joe
>> 
>> 
>>> On Sep 12, 2018, at 12:16 PM, Sofia Robb <so...@so... <mailto:so...@so...>> wrote:
>>> 
>>> Hello All,
>>> 
>>> I have a question about how others are handling sequence feature versions. I am using Tripal and have posted this question in the Tripal repository Issues as well.
>>> 
>>> I have a group that is developing gene/mRNA models. They are using an Ensembl-like system for versioning gene and transcript IDs, and they want to maintain a history of previous versions.
>>> 
>>> They plan on incrementing a digit after the id when a new version is generated.
>>> 
>>> gene nv2m00005394.1
>>> mRNA nv2m00005394.1.mRNA.1
>>> 
>>> Chr11	GFF3Conv	gene	3598792	3603486	.	-	.	Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1
>>> Chr11	GFF3Conv	mRNA	3598792	3603486	.	-	.	ID=nv2m00005394.1.mRNA.1;Parent=nv2m00005394.1
>>> How should I handle this? Create a new feature for each version and mark the old one obsolete? How do I make it easy for users to find the correct ID when they don't know there has been an update? I have some ideas, but they would require the gene IDs and mRNA IDs to have different bases, i.e. nv2g00005394 (change g->m) for gene and nv2m00005394 for mRNA.
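
As a small illustration of the ID scheme above, the following Python sketch pulls the ID attribute out of a GFF3 line and splits off the trailing version digit. The function names are hypothetical, and it assumes the version is always the final '.N' of the ID, which is what the two example lines show.

```python
import re
from urllib.parse import unquote

# Assumption: the version is the final ".<digits>" of the ID.
VERSION_SUFFIX = re.compile(r"^(?P<stable>.+)\.(?P<version>\d+)$")


def parse_attributes(column9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # GFF3 attribute keys and values may be percent-encoded.
        attrs[unquote(key)] = unquote(value)
    return attrs


def split_version(feature_id):
    """Split 'base.N' into ('base', N); IDs without a suffix give (id, None)."""
    match = VERSION_SUFFIX.match(feature_id)
    if match is None:
        return feature_id, None
    return match.group("stable"), int(match.group("version"))


def stable_and_version(gff3_line):
    """Return (unversioned ID, version) for the ID attribute of one GFF3 line."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    return split_version(parse_attributes(columns[8])["ID"])
```

For the gene line this gives ("nv2m00005394", 1). For the mRNA line it gives ("nv2m00005394.1.mRNA", 1), so the unversioned mRNA ID still embeds the gene's version. That is one reason separate ID bases for genes and mRNAs may be easier to track.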
>>> 
>>> Any advice would be fantastic!!!
>>> 
>>> Thank you!
>>> Sofia
>>> 
>> 
>> 
> 
>