HTSeq / Support Requests / #26 Different Read Counts‏: Samtools flagstst VS HTseq

Paul Theodor Pyl - 2013-11-28

Dear Miguel,

your flagstat shows that of the 24 million reads about 76% are mapped which would yield ca. 18 million mapped reads.
Since you have a paired end library htseq-count actually counts the number of pairs (or fragments) overlapping genes, i.e. roughly half the number of reads.

So your numbers seem to be in order.

Kind Regards,
Paul Pyl

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Miguel Machado - 2013-11-28

Dear Paul

Sorry for my ignorance, but I never read that HTseq counted fragments in a pair-end data.
If so, there’s still some numbers that don’t match.
Samtools flagstat reported 8 098 139 pairs mapped (with itself and mate mapped). If HTseq counts fragments, HTseq counted 8 493 585 fragments mapped into genes, which means that it counted 395 446 more fragments than those that were mapped.
I hope I’m interpreting HTseq and Samtools flagstat outputs properly.
I used HTSeq version 0.5.3p9.

Thank you for your help

Best regards

Miguel Machado

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Paul Theodor Pyl - 2013-11-28

Dear Miguel,

could you provide a small example file where this effect appears (bam file with 10 reads or so).

I think it might just be some unclarity in the interpretation of SAM flags / their fuzzy definition in the SAM standard. If you are sure your file is not corrupted please go ahead and send an appropriately sized subset so that I can check for reproducibility of the error.

Thanks,
Paul

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Miguel Machado - 2013-12-04

Dear Paul

Sorry for the delay in answering you.
I prepared an example of the Stampy sam file used in HTseq.
However, since the read/fragments counts of HTseq and Samtools flagstat don’t matched also for Tophat, I also prepared an example of a sam file (derived from Tophat output) and a gtf file used in HTseq (for a de novo assembly strategy).
Let me also use this opportunity to ask you, in a pair-end data, how are the single mapped reads counted by HTseq?
Bellow you can find three links for the mention files (I couldn’t upload those files) and the commands used.
Thank you a lot for your attention

Bets regards

Miguel Machado

https://app.box.com/s/3jymcxnb65evj1autyr6
https://app.box.com/s/d1sxey2l0op887qf2pfl
https://app.box.com/s/9fcqpae2osh59jk1axng

$ python2.6 ~/bin/stampy-1.0.21/stampy -g ~/ensembl/stampyFiles/Zv9_69 -h ~/ensembl/stampyFiles/Zv9_69 --substitutionrate=0.11 --output=~/stampy/female_brain.sam -M ~/reads/pp_f_brain_1_trimmed.fastq ~/reads/pp_f_brain_2_trimmed.fastq
$ head -n 70000 ~/stampy/female_brain.sam 1> ~/stampy/stampy_female_brain_70000lines.sam
$ ~/virtualenv/env/bin/htseq-count --stranded=no --type=exon ~/stampy/stampy_female_brain_70000lines.sam ~/ensembl/Danio_rerio.Zv9.69.gtf

$ tophat -p 4 --phred64-quals --mate-inner-dist 20 --mate-std-dev 50 --output-dir ~/tophat/female_brain/ --max-multihits 1 ~/fasta/squaliusPyrenaicus_RNAseq_ssh ~/reads/pp_f_brain_1_trimmed.fastq ~/reads/pp_f_brain_2_trimmed.fastq
$ samtools sort -n ~/tophat/female_brain/accepted_hits.bam ~/tophat/female_brain/accepted_hits_sortedName
$ samtools view -h -o ~/tophat/female_brain/accepted_hits_sortedName.sam ~/tophat/female_brain/accepted_hits_sortedName.bam
$ head -n 300001 ~/tophat/female_brain/accepted_hits_sortedName.sam 1> ~/tophat/female_brain/tophat_accepted_hits_sortedName_300001lines.sam
$ ~/virtualenv/env/bin/htseq-count --stranded=no --type=transcript ~/tophat/female_brain/tophat_accepted_hits_sortedName_300001lines.sam ~/transcriptome_pp_RNAseq_ssh.gtf

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Simon Anders - 2013-12-05

Sorry, where does the number "8098139" mentioned in your post above come from, again?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Miguel Machado - 2013-12-05
  
  It’s the number of pairs/fragments that mapped determined using Samtools flagstat (“with itself and mate mapped”). I reached that value by dividing the number of reads with itself and mate mapped by 2 (16196278 + 0 with itself and mate mapped / 2).
  I hope I explained myself properly.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Simon Anders - 2013-12-05

htseq-count does not discard a read pair if one one of the mates is mapped. Rather it counts the pair for the gene that the aligned mate overlaps, if this is a single unambiguous overlap. So, the 395,446 extra reads are properly those of the 2,499,466 singletons wherev the aligned mate fell onto a gene.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Miguel Machado - 2013-12-06
  
  If I understood it right, for pair-end data, htseq-count does not count only reads neither only fragments, since it counts fragments and single mapped reads (unambiguous overlap singletons). It’s a mixture of both.
  Therefore, one cannot use htseq-count to calculate RPKM neither FPKM.
  Am I right?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Simon Anders - 2013-12-06

Well, htseq-count is meant as a tool to prepare a count table for subsequent testing for differential expression. Therefore, it aims to count how many cDNA fragments can be mapped unambiguously to a given gene. This is why a we count read pairs, not reads, and discard a read pair if its mates' mapiings contradict each other but not if one mate has not been mapped.

If you understand RPKM as reads per kilobase length and per million mapped reads, then htseq-count is unsuitable. But why would you want to count a fragment twice if both its mates are mapped and only once if one read is unmapped? As for FPKM: The difference between RPKM and FPKM, in the parlance of the cufflinks paper, is that the latter tries to get the transcript length correct, while the former does something arbitrary. And, no, htseq-count does not dive into this whole transcript-length issue.

See also http://seqanswers.com/forums/showthread.php?t=9129, especially my post #4 there.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Different Read Counts‏: Samtools flagstst VS HTseq

Analysing high-throughput sequencing data with Python

Group

Searches

Help

#26 Different Read Counts‏: Samtools flagstst VS HTseq

Discussion