From: Murphy, T. (NIH/NLM/N. [C] <mur...@nc...> - 2015-06-09 12:55:37

Hi All,

I have a question about options for representing frameshifts (assembly errors) in GFF3 (and GTF), and any limitations of software that might sway towards one style over another. The basic issue is annotating a CDS that is thought to be functional but has an internal frameshift. INSDC has a couple of options for annotating such a situation:

1) Annotate the true exons, accept that the CDS on the genome is uninterpretable, and submit a second sequence for the correct protein sequence. The CDS is marked with /exception="annotated by transcript or proteomic data". NCBI's eukaryotic annotation pipeline uses a variation of this style, and also includes an alignment of the transcript product (with the frameshift) to the genome that can be used to interpret the size and location of the frameshift. NCBI's internal software takes advantage of these alignments, but I doubt anyone else realizes that they're there or what they're for. I believe FlyBase uses a variation of this style combined with some kind of variation feature to indicate the location of the frameshift, but I haven't been able to track down an example.

2) Annotate the CDS with a split exon and either a small overlap or micro-intron to correct the frame. The CDS is marked with /exception="low-quality sequence region" or /exception="heterogeneous population sequenced". The conceptual translation of the CDS remains in-phase throughout. An overlap restores the frame and implies the protein length, but may not add back the correct base.

The first mechanism can also be used when there is missing sequence, but I'll stick to simple frameshifts for this discussion. I've seen micro-introns (but not overlaps) used by Ensembl, UCSC, WormBase, and others.
I've also seen 3-bp micro-introns used to skip over stop codons, which is considered an error for GenBank submissions (the INSDC mechanism is to use a translation exception to indicate the codon that is thought to be something else, and one of the exceptions from #2 above).

In some of NCBI's recent GFF3 output we've morphed annotation from our eukaryotic annotation pipeline to use small overlaps or deletions to maintain the frame. Plus that style was already used in GFF3 for any annotation in GenBank using mechanism #2 above. There's no prohibition of overlaps in the GFF3 spec, and an overlap is even used for the ribosomal frameshift example, but we're now getting feedback that some software explicitly doesn't allow overlaps.

So I'm looking for the best way to address this. Three options come to mind:

1) Keep the overlaps, and consider it a downstream software problem. This is the only option for INSDC annotations submitted with #2 above: the overlaps are inherent in the data, and the GFF3 writer wouldn't know how to correct them. Those are also rare.

2) Use deletions instead of insertions, dropping bases that are really part of the CDS but maintaining the frame. This seems to be the most commonly used option.

3) Annotate abutting exon and CDS intervals (no overlap, no micro-intron), using the phase column to shift the reading frame. This would have a partial codon around the junction. This could be done for just insertions, or both insertions and deletions.

There's some elegance to #3, but I've never seen it done. So the question is, what software would explicitly honor the phase column, vs. have an expectation that the phase would be constant through the entire CDS?

Thanks for the input!
-Terence
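
To make the three GFF3 styles concrete, here is a minimal sketch of how a reader might tell them apart from the CDS rows alone. It is not taken from any NCBI tool; the function names, the coordinates, and the `max_micro_intron` cutoff are hypothetical. It assumes plus-strand CDS segments sorted by start, with 1-based inclusive coordinates, and uses the GFF3 definition of phase (the number of bases to skip at the start of a segment to reach the next full codon).

```python
def expected_next_phase(start, end, phase):
    """Phase the next CDS segment would carry if the frame simply continued."""
    length = end - start + 1
    leftover = (length - phase) % 3   # bases of a partial codon at the 3' end
    return (3 - leftover) % 3         # bases needed to finish that codon


def classify_junctions(segments, max_micro_intron=10):
    """Classify each junction between consecutive CDS segments.

    segments: list of (start, end, phase), plus strand, sorted by start.
    max_micro_intron: hypothetical cutoff (in bp) separating a frame-fixing
    gap from a real intron; choose a value that suits your data.
    Returns a list of (kind, gap, phase_continues) tuples.
    """
    reports = []
    for (s1, e1, p1), (s2, e2, p2) in zip(segments, segments[1:]):
        gap = s2 - e1 - 1             # negative means the segments overlap
        if gap < 0:
            kind = "overlap"
        elif gap == 0:
            kind = "abutting"
        elif gap <= max_micro_intron:
            kind = "micro-intron"
        else:
            kind = "intron"
        # False means the declared phase itself shifts the reading frame,
        # which is how option 3 would encode the frameshift.
        phase_continues = expected_next_phase(s1, e1, p1) == p2
        reports.append((kind, gap, phase_continues))
    return reports


# Hypothetical coordinates for the same junction in each style.
overlap_style = [(100, 200, 0), (200, 300, 1)]       # option 1: 1-bp overlap
micro_intron_style = [(100, 200, 0), (203, 300, 1)]  # option 2: 2-bp gap
phase_shift_style = [(100, 200, 0), (201, 300, 2)]   # option 3: abutting

for name, segs in [("overlap", overlap_style),
                   ("micro-intron", micro_intron_style),
                   ("phase shift", phase_shift_style)]:
    print(name, classify_junctions(segs))
```

In the first two styles the declared phase matches what the previous segment implies, so software that assumes a continuous frame still translates correctly as long as it tolerates the overlap or the tiny gap. In the third style the coordinates look unremarkable, and only a parser that actually reads the phase column would notice the shift.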