From: Fields, C. J <cjf...@il...> - 2012-03-27 13:01:06
On Mar 27, 2012, at 3:23 AM, Peter Cock wrote:

> Chris wrote:
>> Peter wrote:
>>> Terence wrote:
>>>> For GFF3, we're treating this as a single discontiguous feature,
>>>> and assigning a single ID to multiple rows of type 'tRNA':
>>>>
>>>> NW_001820338.1 RefSeq tRNA 68579 68616 . + . ID=rna592;Dbxref=GeneID:100682337;anticodon=%28pos:68613..68615%29;gbkey=tRNA;product=tRNA-Ile
>>>> NW_001820338.1 RefSeq tRNA 68635 68670 . + . ID=rna592;Dbxref=GeneID:100682337;anticodon=%28pos:68613..68615%29;gbkey=tRNA;product=tRNA-Ile
>>>>
>>>> This is equivalent to annotation for CDS features, and seems
>>>> to be within the GFF3 spec, but I'm told that Cufflinks gives an error:
>>>> Error: duplicate GFF ID 'rna592' encountered!
>>>> [FAILED]
>>>>
>>>> Do we need to change the GFF3 annotation (and if so, how),
>>>> or should I report this as a bug in Cufflinks?
>>>
>>> It does look like a bug/limitation in cufflinks. That would be my
>>> reading of their "Feature restrictions" text under GFF3 here:
>>> http://cufflinks.cbcb.umd.edu/gff.html
>>>
>>> Peter
>>
>> Just curious, but wouldn't one want to use 'gene', then 'tRNA'
>> (similar to 'mRNA'), and then have the two discontiguous regions
>> as 'exon'? Something like:
>>
>> NW_001820338.1 RefSeq gene 68579 68670 . + . ID=gene592
>> NW_001820338.1 RefSeq tRNA 68579 68670 . + . ID=rna592;Parent=gene592
>> NW_001820338.1 RefSeq exon 68579 68616 . + . Parent=rna592
>> NW_001820338.1 RefSeq exon 68635 68670 . + . Parent=rna592
>>
>> chris
>
> But tRNA genes are often transpliced - mixed strand is common,
> even mixed chromosome like my new favourite pathological
> example, nad1 in NC_016406 (and NC_016402):
>
> http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-genbank-with.html
>
> This has a GenBank location as so for the tRNA,
>
> join(complement(149815..150200),
> complement(295492..295573),complement(293787..293978),
> NC_016402.1:6618..6676,181647..181905)
>
> Here a four line GFF3 feature would work perfectly - and is
> in some ways more elegant than in GenBank/EMBL's feature
> table (handling of the fact it is on two different mitochondrial
> chromosomes).
>
> You could still describe this with four exons, but what would
> you do for their parent tRNA and gene? I find Terence's
> solution more appealing and flexible.
>
> Peter

Terence's example does point out that not accepting multiple
same-ID features is a bug in cufflinks. But I would argue that in
this case, since this is a noncoding transcript, the rows should be
collected together in some way under common parent feature(s),
similar to a typical coding transcript. Otherwise, it sticks out
semantically as an exception to the rule.

Just to point out, there is a trans-spliced example under
'Pathological Cases' in the spec that retains a unique top-level ID,
but I'm not sure how widely it is used. It would be worth
investigating how many of the MODs are doing this, or finding out
what they use in its place. This feasibly allows for your
pathological example of nad1 above, but uncertainty about the spliced
order might also (again) argue in favor of your and Don's suggestion
of having a 'Part=X/Y' attribute.

ChrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
ChrX . gene XXXX YYYY . + . ID=gene02;name=leader_gene
ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
ChrX . primary_transcript XXXX YYYY . + . ID=pt01;Parent=tran01;Derives_from=gene01
ChrX . spliced_leader_RNA XXXX YYYY . + . ID=sl01;Parent=tran01;Derives_from=gene02
ChrX . exon XXXX YYYY . + . Parent=tran01
ChrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01

chris
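
For anyone wanting to check how a parser could treat rows that share an ID, here is a minimal Python sketch of the "single discontiguous feature" reading discussed in this thread. It is not how Cufflinks or any particular library works. The function names (parse_attributes, group_by_id) are made up for illustration, the input is assumed to be tab-separated as in a real GFF3 file, and multi-valued attributes (comma-separated, e.g. Parent=gene01,gene02) are deliberately left unsplit to keep it short.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Split the GFF3 attributes column into a dict, URL-decoding tags and values."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[unquote(tag)] = unquote(value)
    return attrs


def group_by_id(lines):
    """Collect rows that share an ID into one multi-part feature.

    Returns {ID: [(seqid, type, start, end, strand), ...]} in file order.
    Rows without an ID (e.g. exons carrying only Parent=) are skipped.
    """
    features = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError("expected 9 tab-separated columns: %r" % line)
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        feature_id = parse_attributes(attrs).get("ID")
        if feature_id is None:
            continue
        features[feature_id].append((seqid, ftype, int(start), int(end), strand))
    return dict(features)


# Terence's two tRNA rows, rebuilt with tabs between the nine columns.
attrs = ("ID=rna592;Dbxref=GeneID:100682337;"
         "anticodon=%28pos:68613..68615%29;gbkey=tRNA;product=tRNA-Ile")
rows = [
    "\t".join(["NW_001820338.1", "RefSeq", "tRNA", "68579", "68616", ".", "+", ".", attrs]),
    "\t".join(["NW_001820338.1", "RefSeq", "tRNA", "68635", "68670", ".", "+", ".", attrs]),
]
grouped = group_by_id(rows)
for feature_id, parts in grouped.items():
    print(feature_id, parts)
```

Read this way, the second 'rna592' row is simply appended as another segment of the same feature rather than rejected as a duplicate ID.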