A user approached me asking why the CDS of a HaMStR hit cannot be translated directly into the corresponding amino acid sequence. What puzzled the user in the first place was how the CDS could be more than three times as long as the amino acid sequence.
This behavior is actually a feature and not a bug. When I programmed hamstr, I deliberately kept internal transcript information that genewise could not align to the protein sequence, either because of a frameshift mutation or because of an intron, and marked it with lower-case letters to facilitate post-hoc processing. That way I only interpret the available transcript information and do not alter it. However, this may not always be desired, especially when both the amino acid sequence and the transcript information are to be used for phylogeny reconstruction. I have therefore added an option to hamstr that lets you choose how the program deals with this issue:
-intron=keep - invokes the traditional behavior of hamstr: unaligned positions stay in the CDS and are marked by lower-case letters
-intron=mask - masks every unaligned nucleotide position with an 'N'
-intron=remove - removes all unaligned positions from the coding sequence. The CDS and the translated amino acid sequence output by HaMStR will then be congruent. Note, however, that this option removes not only introns but also partial codons caused by an indel mutation, so what you get out of a transcript is not necessarily exactly what is encoded in the gene.
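To make the difference between the three settings concrete, here is a minimal Python sketch that mimics them on a CDS string in which unaligned positions are lower case. It is not HaMStR's own code; the function name `process_unaligned`, its `mode` argument, and the example sequence are made up for illustration only.

```python
def process_unaligned(cds: str, mode: str = "keep") -> str:
    """Handle lower-case (unaligned) positions in a CDS string.

    Hypothetical helper that imitates the three -intron settings;
    it is an illustration, not part of HaMStR.
    """
    if mode == "keep":
        # Leave the sequence untouched, lower-case positions included.
        return cds
    if mode == "mask":
        # Replace every lower-case (unaligned) nucleotide with 'N'.
        return "".join("N" if base.islower() else base for base in cds)
    if mode == "remove":
        # Drop every lower-case (unaligned) nucleotide.
        return "".join(base for base in cds if not base.islower())
    raise ValueError(f"unknown mode: {mode!r}")


# Invented example: a 6-nt unaligned stretch (intron-like) and a single
# unaligned nucleotide (frameshift-like) inside an aligned CDS.
example_cds = "ATGGCCgtaagtTTTaGGA"
masked = process_unaligned(example_cds, "mask")     # "ATGGCCNNNNNNTTTNGGA"
removed = process_unaligned(example_cds, "remove")  # "ATGGCCTTTGGA"
```

In this toy example only the `remove` result has a length divisible by three and can be translated codon by codon, which is why the CDS and the amino acid sequence become congruent with that setting.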
If you have any further questions concerning this issue, drop me a line.