
Strategy for repeated transcript_name and gene_name in the Ensembl GTF file

Note: The previous 'SOAPfusexxSOAPfuse' postfix has been abandoned as of SOAPfuse_v1.27, and a new, more meaningful postfix is now applied: the ENS_id. Because the ENS_id is unique in the GTF database, it is the best marker for distinguishing duplicated gene_names, for example MATR3_ENSG00000015479 and MATR3_ENSG00000280987. Click here to learn more about it. This post now serves only to explain the existence of repeated (duplicated) gene_names and transcript_names. (Added by Jia on 2016-01-19)

As we all know, Ensembl gives a unique ID to each gene (ENSG-id) and to each transcript (ENST-id), so we will never encounter a repeated ENSG-id or ENST-id.

But if you check the GTF file carefully, you may find that some gene_names or transcript_names are repeated: for a given gene_name or transcript_name, there is more than one 'sub-clone' in the GTF file. Although each sub-clone has its own unique ID, the sub-clones have entirely different locations (regions) on the chromosome, or even sit on different chromosomes.

For example, you can download the Ensembl human GTF file, release 72, and check "MOB4-001".
You will find that two distinct transcripts with different ENST-ids (ENST00000323303 and ENST00000604458) are both named "MOB4-001".

Indeed, there are many repeated gene_names or transcript_names in the GTF file.
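
A check like the one described above can be sketched in a few lines of Python. This is only an illustration, not part of SOAPfuse: the function name `find_repeated_names` and the helper `make_line` are hypothetical. The sample lines reuse the two ENST-ids from the MOB4-001 example, but their chromosomes, coordinates, and source column are placeholders, not real annotation. The sketch assumes the usual GTF layout of nine tab-separated columns with `key "value";` attributes in the last one.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def find_repeated_names(gtf_lines, name_key="transcript_name", id_key="transcript_id"):
    """Return {name: sorted list of ids} for names carried by more than one id."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if name_key in attrs and id_key in attrs:
            ids_by_name[attrs[name_key]].add(attrs[id_key])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

def make_line(chrom, start, end, attrs):
    """Build one tab-separated GTF-like line (placeholder source/feature columns)."""
    return "\t".join([chrom, "placeholder_source", "exon", str(start), str(end),
                      ".", "+", ".", attrs])

# Placeholder records: only the ids and names come from the post.
SAMPLE_GTF = [
    make_line("chrA", 100, 200,
              'transcript_id "ENST00000323303"; gene_name "MOB4"; transcript_name "MOB4-001";'),
    make_line("chrB", 500, 600,
              'transcript_id "ENST00000604458"; gene_name "MOB4"; transcript_name "MOB4-001";'),
    make_line("chrA", 900, 950,
              'transcript_id "ENST_PLACEHOLDER"; gene_name "GENEX"; transcript_name "GENEX-001";'),
]

repeated = find_repeated_names(SAMPLE_GTF)
print(repeated)
```

To look for repeated gene_names instead, the same function could be called with `name_key="gene_name"` and `id_key="gene_id"`, and with `open("your_file.gtf")` in place of the sample list.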

This is easy to understand: one reason (among others) is that copy-translocations, trans-transposons, and other types of elements can copy a gene sequence to another location in the genome.

SOAPfuse reports fusion cases by gene_name and transcript_name, so when it encounters a repeated gene_name or transcript_name, it reports an error, because it is hard to tell which exact "sub-clone" of the repeated name is meant.

An obvious solution is to replace the names with the ENSG-id or ENST-id, but this is not a good choice. Fusion cases always need further analysis, and gene_names or transcript_names can reflect function or other basic information, whereas the IDs cannot.

The only way to let SOAPfuse get past this problem is to give each "sub-clone" of a repeated gene_name or transcript_name a new name. This is what we have done since database v1.25.

For example, gene_name MOB4 corresponds to two distinct "sub-clones" with different ENSG-ids:
ENSG00000115540 and ENSG00000270757
So we keep the original name "MOB4" for the one that appears first in the GTF file, and give the one that appears later a numbered name, "MOB4SOAPfuse2SOAPfuse". If there is a third sub-clone, we use the string "SOAPfuse3SOAPfuse" as the postfix of its new name.
You may notice that there is no "SOAPfuse1SOAPfuse"; that is because the first sub-clone keeps the original gene_name.
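
The numbering rule above, together with the ENS_id postfix mentioned in the note at the top of this post, can be sketched as follows. This is an illustrative reconstruction, not SOAPfuse's actual code. Both function names are hypothetical. The order in which the two MOB4 IDs appear in the GTF file is assumed here to be the order in which they are listed above. For the ENS_id scheme, it is also an assumption that only duplicated names receive the postfix.

```python
from collections import defaultdict

def soapfuse_numbered_names(records):
    """Old scheme (database v1.25 onward): records are (ens_id, name) pairs in GTF order.

    The first id seen for a name keeps the original name; the n-th id (n >= 2)
    gets the postfix 'SOAPfuse<n>SOAPfuse'.
    """
    seen_count = defaultdict(int)
    new_names = {}
    for ens_id, name in records:
        if ens_id in new_names:
            continue  # the same id occurs on many GTF lines; name it only once
        seen_count[name] += 1
        n = seen_count[name]
        new_names[ens_id] = name if n == 1 else f"{name}SOAPfuse{n}SOAPfuse"
    return new_names

def ens_id_postfix_names(records):
    """Newer scheme (SOAPfuse_v1.27 onward): append '_<ens_id>' to duplicated names.

    Assumption: names carried by a single id are left unchanged.
    """
    ids_by_name = defaultdict(list)
    for ens_id, name in records:
        if ens_id not in ids_by_name[name]:
            ids_by_name[name].append(ens_id)
    return {
        ens_id: (f"{name}_{ens_id}" if len(ids) > 1 else name)
        for name, ids in ids_by_name.items()
        for ens_id in ids
    }

# Assumed GTF order for MOB4; the repeated first pair mimics multiple lines per gene.
mob4 = [("ENSG00000115540", "MOB4"),
        ("ENSG00000115540", "MOB4"),
        ("ENSG00000270757", "MOB4")]
matr3 = [("ENSG00000015479", "MATR3"),
         ("ENSG00000280987", "MATR3")]

old_style = soapfuse_numbered_names(mob4)
new_style = ens_id_postfix_names(matr3)
print(old_style)
print(new_style)
```

The old scheme depends on the order of records in the GTF file, while the ENS_id postfix does not, which is one way to see why the newer postfix is described as more meaningful.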

Note:
The changed names, as in the case above, are found only in gene.psl/fa (for gene_name) and transcript.psl/fa (for transcript_name). We never change any information in the GTF file, because it is the basic database file and must not be modified. So if you find genes or transcripts numbered with "SOAPfuse" strings, look in the PSL file first, not the GTF file. You can then check the genes or transcripts concerned in the GTF file according to the PSL file.

Database v1.26 includes some updates, such as giving higher priority to transcripts of the protein_coding type.

Wenlong Jia
2013-07-28

Posted by NOBEL89 2016-01-19

