Re: [Bio-bwa-help] BWA vs Tophat
From: Fabrice T. <fab...@gm...> - 2011-07-07 13:45:19
Alec,

I am now analyzing Illumina paired-end human RNA-seq data. I want to get
expression at the gene/exon/isoform level, and I also want to investigate
alternative splicing. On read1/2, there are about 4,000,000 reads. Is it
possible that some of the unmapped reads involve splicing? If so, I think it
is better to keep them, as Michael suggested in this thread. I think TopHat
can do some estimation of splicing, but I am not very sure. The reference I
used here is genomic DNA, not cDNA.

Thanks.

On Thu, Jul 7, 2011 at 3:23 PM, Alec Wysoker <al...@br...> wrote:
> Hi Fabrice,
>
> You'd have to describe to me in more detail what you're doing in order for
> me to have a chance of answering this question intelligently.
>
> Note that TopHat uses Bowtie to do its alignments, and I believe it prefers
> a genomic alignment to a transcriptomic alignment, so I am not sure that
> using BWA to try to align the reads that TopHat doesn't align will produce
> very many alignments. I'm not an expert, though.
>
> -Alec
>
> On 7/7/11 5:46 AM, Fabrice Tourre wrote:
>>
>> Alec,
>>
>> If I use TopHat to do the mapping for downstream analysis, and then map
>> again with BWA to get the unmapped reads, will this cause any problems?
>>
>> Thanks.
>>
>> On Wed, Jul 6, 2011 at 4:22 PM, Alec Wysoker <al...@br...> wrote:
>>>
>>> The Picard team at Broad definitely advocates retaining unmapped reads.
>>> However, we don't trust aligners to output unmapped reads in proper format
>>> (or mapped reads, for that matter). We use Picard MergeBamAlignment, which
>>> takes an unmapped BAM containing the reads before alignment, and a BAM
>>> produced by the aligner, and merges them, producing a BAM that extracts
>>> the alignment information from the aligned BAM, and almost everything
>>> else from the unmapped BAM. This way you don't have to rely on the aligner
>>> to output the unmapped reads.
>>>
>>> -Alec
>>>
>>> On 7/6/11 10:12 AM, Rusch, Michael wrote:
>>>>
>>>> QC isn't actually the issue.
>>>> You'll see that both report that you have no QC-failed reads, and all
>>>> reads are QC-pass. So, the issue isn't the number of QC-passed reads per
>>>> se, but rather the total number of reads.
>>>>
>>>> BWA includes all reads in its output, not just the ones that are mapped.
>>>> Unmapped reads are still included (but they are indicated as being
>>>> unmapped). I don't know much about TopHat, but one thing is clear from
>>>> the flagstat output: it is dropping the unmapped reads. You'll see that
>>>> the number of mapped reads = number of total reads.
>>>>
>>>> Based on my experience, I would do everything you can to keep the
>>>> unmapped reads in the BAM, even though they are labeled as unmapped.
>>>> There are probably a dozen reasons to do this, but I will sum it up to
>>>> say that time and time again I have seen that not having all of the reads
>>>> in the BAM file can cause you serious headaches downstream.
>>>>
>>>> As an aside: as you move forward in putting your pipeline together,
>>>> you'll probably wind up needing to do the standard
>>>> sort, merge, mark dups, index set of steps, and I recommend marking,
>>>> rather than removing, your duplicates when you get to that point. You can
>>>> use Picard MarkDuplicates to mark, and newer versions of samtools may
>>>> allow you to mark instead of removing when you use rmdup, but I'm not
>>>> sure.
>>>>
>>>> If you're going to use TopHat, I'd see if there's a way to tell it to
>>>> keep the unmapped reads. If not, you might find that there's a commonly
>>>> accepted process that people use to re-introduce the unmapped reads back
>>>> into the BAM.
>>>>
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Fabrice Tourre [mailto:fab...@gm...]
>>>> Sent: Wednesday, July 06, 2011 4:14 AM
>>>> To: bio...@li...
>>>> Subject: [Bio-bwa-help] BWA vs Tophat. .
>>>>
>>>> Dear expert,
>>>>
>>>> I am now analyzing Illumina paired-end human RNA-seq data.
>>>>
>>>> As a first step, I want to choose mapping software. I have tried BWA and
>>>> TopHat against the Ensembl human DNA reference.
>>>> I used samtools flagstat to summarize the results. I cannot understand
>>>> why the numbers of QC-passed reads are different here. The input for BWA
>>>> and TopHat is exactly the same.
>>>>
>>>> Can anyone give me some suggestions? Thanks.
>>>>
>>>> -------------------------------------------BWA----------------------------------------------------
>>>> samtools flagstat accepted_hits.bam
>>>> 66037724 + 0 in total (QC-passed reads + QC-failed reads)
>>>> 0 + 0 duplicates
>>>> 53622250 + 0 mapped (81.20%:nan%)
>>>> 66037724 + 0 paired in sequencing
>>>> 33018862 + 0 read1
>>>> 33018862 + 0 read2
>>>> 41577968 + 0 properly paired (62.96%:nan%)
>>>> 48987028 + 0 with itself and mate mapped
>>>> 4635222 + 0 singletons (7.02%:nan%)
>>>> 1444118 + 0 with mate mapped to a different chr
>>>> 1214890 + 0 with mate mapped to a different chr (mapQ>=5)
>>>>
>>>> -------------------------------------------Tophat----------------------------------------------------
>>>> samtools flagstat accepted_hits.bam
>>>> 60223041 + 0 in total (QC-passed reads + QC-failed reads)
>>>> 0 + 0 duplicates
>>>> 60223041 + 0 mapped (100.00%:nan%)
>>>> 60223041 + 0 paired in sequencing
>>>> 30279902 + 0 read1
>>>> 29943139 + 0 read2
>>>> 40219740 + 0 properly paired (66.78%:nan%)
>>>> 56286158 + 0 with itself and mate mapped
>>>> 3936883 + 0 singletons (6.54%:nan%)
>>>> 0 + 0 with mate mapped to a different chr
>>>> 0 + 0 with mate mapped to a different chr (mapQ>=5)
>>>>
>>>> ------------------------------------------------------------------------------
>>>> All of the data generated in your IT infrastructure is seriously valuable.
>>>> Why? It contains a definitive record of application performance, security
>>>> threats, fraudulent activity, and more. Splunk takes this data and makes
>>>> sense of it. IT sense. And common sense.
>>>> http://p.sf.net/sfu/splunk-d2d-c2
>>>> _______________________________________________
>>>> Bio-bwa-help mailing list
>>>> Bio...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help
>>>>
>>>> Email Disclaimer: www.stjude.org/emaildisclaimer
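
Michael's reading of the two flagstat reports in this thread can be checked with simple arithmetic: unmapped = total - mapped. Below is a minimal Python sketch of that check. It assumes the older flagstat text layout quoted in the thread ("N + M label"), and the helper names `parse_flagstat` and `unmapped` are hypothetical, not part of samtools. Only the "in total" and "mapped" lines are read; a few other lines are included just to show that they are ignored.

```python
import re

# Excerpts of the two flagstat reports quoted in the thread.
BWA_FLAGSTAT = """\
66037724 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53622250 + 0 mapped (81.20%:nan%)
48987028 + 0 with itself and mate mapped
"""

TOPHAT_FLAGSTAT = """\
60223041 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
60223041 + 0 mapped (100.00%:nan%)
56286158 + 0 with itself and mate mapped
"""

# Matches only lines whose label starts with "in total" or "mapped".
LINE_RE = re.compile(r"(\d+) \+ (\d+) (in total|mapped) ")


def parse_flagstat(text):
    """Return {'in total': n, 'mapped': n}, summing QC-passed and QC-failed."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group(3)] = int(m.group(1)) + int(m.group(2))
    return counts


def unmapped(counts):
    """Records present in the file but not flagged as mapped."""
    return counts["in total"] - counts["mapped"]


bwa = parse_flagstat(BWA_FLAGSTAT)
tophat = parse_flagstat(TOPHAT_FLAGSTAT)

print("BWA unmapped:   ", unmapped(bwa))
print("TopHat unmapped:", unmapped(tophat))
print("Difference in totals:", bwa["in total"] - tophat["in total"])
```

With the numbers from the thread, the BWA file carries 12415474 unmapped records, while the TopHat file carries none, which is consistent with TopHat's accepted_hits.bam simply not containing the unmapped reads.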