I find your paper really interesting because I am getting some odd statistics after the RPKM calculation (mostly due to low coverage on some genes, which makes the RPKM values skyrocket).
Here is the problem: two transcripts have the same gene annotation, but because of differences in mapped reads, they have two different fold changes (34 vs 5).
In one library I got around 1703 genes that were differentially expressed, and many of them showed the above pattern. In your experience, is it possible to define a "set of rules" to lower the number of genes (i.e. remove the duplicate but retain the one with the lower fold change, or discard the one that has fewer mapped reads, or some other scenario)?
Last edit: Charles Warden 2014-03-18
Here are the suggestions that I can think of:
1) What was the total number of reads per sample? For human gene expression analysis, you should have at least 10 million reads. If you want to work with splicing (using MATS, MISO, etc.), you'll want more reads (and you'll probably want them to be paired-end, although I think that may not be possible with Proton data). It looks like you aren't working with a standard organism (like human or mouse), but the RPKM values seem higher than I would expect given the raw number of aligned reads (usually, I'd say RPKM > 1 is pretty solid, and RPKM > 0.1 is the default setting in sRAP).
2) It sounds to me like these are two transcripts for the same gene. In my experience, this sort of transcript-based analysis has been less reliable than gene-based analysis. In other words, every time that I checked the alignments for a gene that supposedly had very different expression patterns for two transcripts, it appeared to be an artifact that I think came from low coverage at the informative splice junctions. You can sort of tell this from Figure 5 in the sRAP paper, but you have to remember that many genes will have similar expression levels predicted for each transcript (which will make the concordance seem higher). For human analysis, there is usually a "gene" option for mRNA quantification that uses a canonical set of coordinates (so, each gene has only one RPKM value). How are you calculating these RPKM values?
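To make the two points above concrete, here is a rough Python sketch (not sRAP code). It uses the standard RPKM formula, applies a floor of 0.1 before taking the ratio (the value mentioned above as the sRAP default; whether sRAP handles low values exactly this way is an assumption here), and then applies one possible rule from the question: keep the transcript with the most mapped reads per gene. The read counts, lengths, library sizes, and all function and field names are hypothetical, chosen so that the two fold changes come out near 34 and 5.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def floored_fold_change(rpkm_treated, rpkm_control, floor=0.1):
    """Ratio after raising any value below `floor` up to `floor`.

    The floor keeps near-zero denominators from inflating the ratio.
    """
    return max(rpkm_treated, floor) / max(rpkm_control, floor)


def most_covered_transcript(transcripts):
    """For each gene, keep the transcript with the most mapped reads.

    This is one hypothetical filtering rule, not a recommendation.
    """
    best = {}
    for tx in transcripts:
        reads = tx["reads_control"] + tx["reads_treated"]
        current = best.get(tx["gene"])
        if current is None or reads > (
            current["reads_control"] + current["reads_treated"]
        ):
            best[tx["gene"]] = tx
    return best


# Hypothetical example: two transcripts annotated to the same gene.
TOTAL_CONTROL = 10_000_000  # mapped reads in the control library
TOTAL_TREATED = 10_000_000  # mapped reads in the treated library
TRANSCRIPTS = [
    {"gene": "geneA", "transcript": "tx1", "length": 2000,
     "reads_control": 2, "reads_treated": 68},
    {"gene": "geneA", "transcript": "tx2", "length": 1500,
     "reads_control": 100, "reads_treated": 500},
]

for tx in TRANSCRIPTS:
    control = rpkm(tx["reads_control"], tx["length"], TOTAL_CONTROL)
    treated = rpkm(tx["reads_treated"], tx["length"], TOTAL_TREATED)
    fold = floored_fold_change(treated, control)
    print(f'{tx["transcript"]}: control RPKM {control:.2f}, '
          f'treated RPKM {treated:.2f}, fold change {fold:.1f}')

kept = most_covered_transcript(TRANSCRIPTS)
print("kept for geneA:", kept["geneA"]["transcript"])
```

In this made-up case the large fold change rests on only 2 reads in the control sample, which is the kind of low-coverage situation I would be suspicious of. A gene-level quantification avoids having to pick a transcript at all.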
Last edit: Charles Warden 2014-03-18