Re: [Gmod-schema] gene/mRNA version


Good point about merging and splitting genes.

I think this is meant to be a pretty stable assembly, and the hope is that
the annotations are good. But split and merged genes are quite typical
issues that I have seen in many different annotation sets, and I suspect we
will find some in this one as well. My first instinct for merging or
splitting is that these would have to get new stable IDs, if we go the
stable ID route.

When you say using the feature_relationship table for tracking, are you
thinking that the type_id would point to some cvterm like version_of, that
the subject would be the versioned feature, and that the object would be the
stable feature (new_version 'version_of' stable_version)? Or are you saying
that the stable ID route isn't great in your opinion and that the cvterm
should be something like new_feature 'is_new_version_of' old_feature?
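
To make the two readings concrete, here is a minimal sketch in Python with an in-memory SQLite database. The tables are heavily simplified stand-ins for Chado's feature and feature_relationship tables (Chado stores the relationship type as a type_id pointing at a cvterm, which is flattened to a plain text column here). The term names version_of and is_new_version_of are only the hypothetical ones from the question above, and the helper names are made up for illustration.

```python
import sqlite3

# Simplified stand-ins for Chado tables (assumption): real Chado has more
# columns and stores the relationship type as type_id -> cvterm.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE feature_relationship (subject_id INTEGER, type TEXT, object_id INTEGER);
INSERT INTO feature VALUES (1, 'some_gene'), (2, 'some_gene.1'), (3, 'some_gene.2');
-- Reading 1: every versioned feature points at the stable feature.
INSERT INTO feature_relationship VALUES (2, 'version_of', 1), (3, 'version_of', 1);
-- Reading 2: each new feature points at the feature it replaces.
INSERT INTO feature_relationship VALUES (3, 'is_new_version_of', 2);
""")

def versions_of(stable_name):
    """Reading 1: list all versioned features attached to a stable feature."""
    rows = conn.execute("""
        SELECT s.uniquename FROM feature_relationship r
        JOIN feature s ON s.feature_id = r.subject_id
        JOIN feature o ON o.feature_id = r.object_id
        WHERE r.type = 'version_of' AND o.uniquename = ?
        ORDER BY s.uniquename""", (stable_name,))
    return [name for (name,) in rows]

def history(name):
    """Reading 2: walk is_new_version_of links back from a feature.

    Assumes each feature has at most one predecessor (no merges).
    """
    chain = [name]
    while True:
        row = conn.execute("""
            SELECT o.uniquename FROM feature_relationship r
            JOIN feature s ON s.feature_id = r.subject_id
            JOIN feature o ON o.feature_id = r.object_id
            WHERE r.type = 'is_new_version_of' AND s.uniquename = ?""",
            (chain[-1],)).fetchone()
        if row is None:
            return chain
        chain.append(row[0])

print(versions_of("some_gene"))
print(history("some_gene.2"))
```

Reading 1 needs a stable feature to hang the versions on. Reading 2 does not, but retrieving the full history means following the links one step at a time.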

Thank you for taking your time to discuss this with me.
Sofia

On Wed, Sep 12, 2018 at 2:28 PM, Joe Carlson <jwc...@lb...> wrote:

>
> On Sep 12, 2018, at 1:09 PM, Sofia Robb <so...@so...> wrote:
>
> Hi Joe and other Chado users,
>
> Joe, Thanks for your response. I would like to know more about your data.
> I have a few questions and will follow them up with a dump of my current
> ideas on how to solve this.
>
>
> I’m managing the backend db for the Phytozome project at JGI (
> phytozome.jgi.doe.gov), a comparative land plant db. We have ~250 plant
> genomes (assemblies, annotations and analysis results) loaded right now. The
> size of the db is ~1.5 TB.
>
>
> Are you the source of the sequence?
>
>
> We have the land plants sequenced by the JGI, things done by collaborators
> and other model organisms. It’s roughly an equal mixture of each.
>
> Or are you pulling the data from another database?
>
>
> Data import is with fasta files for chromosomes and proteins; gff3 for
> structure.
>
> What do you do if the actual sequence changes? Do you just overwrite the
> previous sequence data?
>
>
> I never overwrite or delete. Once it goes in the database, it stays in the
> database.
>
>
> We are going to be the official repository of this data and have been
> asked to keep track of the history of changes. This is more than I have had
> to keep track of in the past.
>
> I had been thinking of trying to implement some loading of the data which
> gets across the idea that each feature has a stable version, which is equal
> to its current version, and any number of older versions. Now this is
> just an idea (largely based on the representation of data in Ensembl).
>
> The stable version would have a stable ID which lacks the '.\d' suffix,
> and there would be a feature record for each version which includes the
> '.\d' suffix. I would mark older versions obsolete. What I am still working
> out is what I could add as properties (GFF3 9th column) to help with
> searches. Perhaps I could add a stableID=xyz to each record? I think this
> would help with queries: I could search on the stableID and the obsolete
> flag when I need to retrieve the history of changes.
>
> feature.uniquename: some_gene.1
> featureprop.type_id: some cvterm that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: true
>
> feature.uniquename: some_gene.2
> featureprop.type_id: some cvterm that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
>
> feature.uniquename: some_gene
> featureprop.type_id: some cvterm that indicates the concept stableID
> featureprop.value: some_gene
> feature.is_obsolete: false
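
A minimal sketch of how that lookup might work, using Python and an in-memory SQLite database. The two tables are cut-down stand-ins for Chado's feature and featureprop (an assumption: the property type is a plain text column here rather than a cvterm reference), and features_for is a hypothetical helper name.

```python
import sqlite3

# Cut-down stand-ins for Chado's feature and featureprop tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT UNIQUE,
    is_obsolete INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE featureprop (feature_id INTEGER, type TEXT, value TEXT);
INSERT INTO feature VALUES
    (1, 'some_gene.1', 1), (2, 'some_gene.2', 0), (3, 'some_gene', 0);
INSERT INTO featureprop VALUES
    (1, 'stableID', 'some_gene'),
    (2, 'stableID', 'some_gene'),
    (3, 'stableID', 'some_gene');
""")

def features_for(stable_id, obsolete=None):
    """Return uniquenames sharing a stableID, optionally filtered by obsolete flag."""
    sql = """SELECT f.uniquename FROM feature f
             JOIN featureprop p ON p.feature_id = f.feature_id
             WHERE p.type = 'stableID' AND p.value = ?"""
    params = [stable_id]
    if obsolete is not None:
        sql += " AND f.is_obsolete = ?"
        params.append(int(obsolete))
    sql += " ORDER BY f.uniquename"
    return [name for (name,) in conn.execute(sql, params)]

print(features_for("some_gene"))                 # every record for the stable ID
print(features_for("some_gene", obsolete=True))  # only the superseded versions
```

Because the lookup is an equality match on the property value, the history query never has to pattern-match on uniquename.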
>
>
> How you do this depends a bit on the nature of the reannotations. If you
> have a fairly stable assembly and annotation then it entirely makes sense
> to count on there being a stable identifier. In what I have, we often have
> dramatically different assemblies from one version to another (many of our
> assemblies do not have pseudo molecules) and we cannot count on stable ids.
>
> Your assigning a stable id as a property will work if the changes are not
> too extensive. But think of the case where 2 genes in 1 version are
> modified in such a way that 1 gene is split and half is merged into another
> gene. What rules are you going to use to assign the stable id for the
> merged gene?
>
> An alternative tracking mechanism between versions is to use a
> feature_relationship. You could keep track of things a bit better with this
> table if there are extensive merges and splits. For the most part we are
> not maintaining gene history except in a few of our important genomes.
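
The split-and-merge case described above could be recorded roughly as follows. This is only a sketch in Python with SQLite: the tables are simplified stand-ins for Chado's, the gene names are invented, and the relationship name is_new_version_of is a hypothetical placeholder rather than an established term.

```python
import sqlite3

# Simplified stand-ins for Chado's feature and feature_relationship (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE feature_relationship (subject_id INTEGER, type TEXT, object_id INTEGER);
INSERT INTO feature VALUES
    (1, 'geneA.v1'), (2, 'geneB.v1'), (3, 'geneA.v2'), (4, 'geneB.v2');
-- geneA.v1 is split: one half stays geneA.v2, the other half is merged
-- with geneB.v1 to form geneB.v2.
INSERT INTO feature_relationship VALUES
    (3, 'is_new_version_of', 1),
    (4, 'is_new_version_of', 1),
    (4, 'is_new_version_of', 2);
""")

def _linked(name, from_col, to_col):
    # Column names come from the two fixed callers below, never from user input.
    rows = conn.execute(f"""
        SELECT t.uniquename FROM feature_relationship r
        JOIN feature f ON f.feature_id = r.{from_col}
        JOIN feature t ON t.feature_id = r.{to_col}
        WHERE r.type = 'is_new_version_of' AND f.uniquename = ?
        ORDER BY t.uniquename""", (name,))
    return [n for (n,) in rows]

def predecessors(name):
    """Older features that the given feature was derived from."""
    return _linked(name, "subject_id", "object_id")

def successors(name):
    """Newer features that inherited something from the given feature."""
    return _linked(name, "object_id", "subject_id")

print(predecessors("geneB.v2"))
print(successors("geneA.v1"))
```

A many-to-many relationship table records both parents of the merged gene without having to decide which stable ID it inherits.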
>
> Joe
>
>
>
> Thank you,
> Sofia
>
>
>
> On Wed, Sep 12, 2018 at 1:34 PM, Joe Carlson <jwc...@lb...> wrote:
>
>> For what it’s worth, I’ve been using dbxrefs to track annotation
>> versions. I’ve modified the schema to make dbxref_id in the feature table
>> NOT NULL, and I use a record in the dbxref table to label the source -
>> and version - of the data.
>>
>> Appending a numerical identifier to the name means that a query for a
>> particular version will require a VERY expensive SQL constraint, "and name
>> like '%.N'", in the queries.
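
One possible reading of this approach, sketched in Python with SQLite. The tables are simplified stand-ins for Chado's dbxref and feature tables, and the accession and version values are invented for illustration; the actual labelling scheme is not described in this thread. The point of the comparison is that a pattern with a leading wildcard generally cannot use an ordinary B-tree index on name, whereas the version lookup is an equality match followed by an indexed join.

```python
import sqlite3

# Simplified stand-ins for Chado's dbxref and feature tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, accession TEXT, version TEXT);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name TEXT,
    dbxref_id INTEGER NOT NULL REFERENCES dbxref (dbxref_id)
);
CREATE INDEX feature_dbxref_idx ON feature (dbxref_id);
-- Hypothetical labels: one dbxref row per annotation release.
INSERT INTO dbxref VALUES (1, 'annotation_release', '1'), (2, 'annotation_release', '2');
INSERT INTO feature VALUES
    (1, 'some_gene.1', 1), (2, 'other_gene.1', 1), (3, 'some_gene.2', 2);
""")

# Version encoded in the name: needs a leading-wildcard pattern match.
by_suffix = [n for (n,) in conn.execute(
    "SELECT name FROM feature WHERE name LIKE '%.2' ORDER BY name")]

# Version stored on the dbxref: equality match plus a join on dbxref_id.
by_dbxref = [n for (n,) in conn.execute("""
    SELECT f.name FROM feature f
    JOIN dbxref d ON d.dbxref_id = f.dbxref_id
    WHERE d.version = '2' ORDER BY f.name""")]

print(by_suffix, by_dbxref)
```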
>>
>> Joe
>>
>>
>> On Sep 12, 2018, at 12:16 PM, Sofia Robb <so...@so...> wrote:
>>
>> Hello All,
>>
>> I have a question about how others are handling sequence feature
>> versions. I am using Tripal and have posted this question in the Tripal
>> repository Issues as well.
>>
>> I have a group that is developing gene/mRNA models. They are using an
>> Ensembl-like system for versioning gene and transcript IDs, and they want
>> to maintain a history of previous versions.
>>
>> They plan on incrementing a digit after the id when a new version is
>> generated.
>>
>> gene nv2m00005394.1
>> mRNA nv2m00005394.1.mRNA.1
>>
>> Chr11	GFF3Conv	gene	3598792	3603486	.	-	.	Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1
>> Chr11	GFF3Conv	mRNA	3598792	3603486	.	-	.	ID=nv2m00005394.1.mRNA.1;Parent=nv2m00005394.1
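
For illustration, the version suffix in these IDs can be separated from the base at load time. The sketch below is plain Python; parse_attributes and split_version are hypothetical helper names, and the attribute parser skips the URL-decoding that a full GFF3 parser would do.

```python
def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict (no URL-decoding, for brevity)."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if pair)

def split_version(feature_id):
    """Split 'base.N' into (base, N); N is None when there is no numeric suffix."""
    base, dot, suffix = feature_id.rpartition(".")
    if dot and suffix.isdecimal():
        return base, int(suffix)
    return feature_id, None

line = ("Chr11\tGFF3Conv\tgene\t3598792\t3603486\t.\t-\t.\t"
        "Alias=Sox9;Name=nv2m00005394.1;ID=nv2m00005394.1")
attrs = parse_attributes(line.split("\t")[8])
stable_id, version = split_version(attrs["ID"])
print(stable_id, version)

# The transcript ID embeds the gene version in its base.
print(split_version("nv2m00005394.1.mRNA.1"))
```

Under this naming scheme the mRNA base (nv2m00005394.1.mRNA) still contains the gene version, so it changes whenever the gene version is incremented.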
>>
>> How should I handle this? Create a new feature for each version and mark
>> the old one obsolete? How do I make it easy for users to find the correct
>> ID when they don't know there has been an update? I have some ideas, but
>> they would require the gene and mRNA IDs to have different bases, i.e.
>> nv2g00005394 (change g->m) for gene and nv2m00005394 for mRNA.
>>
>> Any advice would be fantastic!!!
>> Thank you!
>> Sofia
>>
>> _______________________________________________
>> Gmod-schema mailing list
>> Gmo...@li...
>> https://lists.sourceforge.net/lists/listinfo/gmod-schema
>>
>>
>>
>
>