Re: [Bio-bwa-help] Repetitive reads are not randomly placed
Status: Beta
Brought to you by:
lh3lh3
From: NAGAO K. <na...@sc...> - 2010-11-24 11:26:08
|
Hi Heng, Thank you for your response. My issue of BWA samse was resolved. In case of paired-end analysis I am interested in the behavior of sampe randomness as Ivette pointed out. So I created the following test query which has 100 identical pairs of reads. $ for i in {1..100}; do echo @${i}; echo TTGTATGTTACTGTGATCAGACCCTGAGACTGCGTTAG; echo '+'; echo IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII; done > test_read1.fastq $ for i in {1..100}; do echo @${i}; echo CCTGAGTGTTGCATTCAGCAGAGGTCAGCTTCCTAGGG; echo '+'; echo IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII; done > test_read2.fastq $ bwa aln hg18.fa test_read1.fastq > test_read1.sai $ bwa aln hg18.fa test_read2.fastq > test_read2.sai $ bwa sampe hg18.fa test_read1.sai test_read2.sai test_read1.fastq test_read2.fastq | samtools view -S - | cut -f2-9,16,17 | sort | uniq -c 100 147 chr1 394276 0 38M = 394134 -180 X0:i:3 X1:i:11 100 99 chr1 394134 0 38M = 394276 180 X0:i:7 X1:i:7 Alternatively I run BLAT using a concatenated sequence of the read 1 and the reverse-complemented read 2. SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN ------------------------------------------------------------------- 75 1 76 76 100.0% 8 - 68609 68788 180 75 1 76 76 100.0% 5 + 180763425 180763604 180 75 1 76 76 100.0% 1 + 394134 394313 180 ..... the rest is omitted. These indicate BWA sampe reports one of the three best position which have the same identity and span. Is it expected or not? Also I'd like to know how BWA sampe would choose a placed position for a paired-end read among several candidates if their edit distances were the same but their lengths of a span were slightly different. Thank you in advance. Koji Nagao On 2010/11/24, at 1:07, Heng Li wrote: > I think sampe is less affected by the randomness issue (but need > testing). For samse, you can fix the issue by yourself, following > Koji's email. I am also attaching a new version, not released yet. > > Heng > > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > > Limited, a charity registered in England with number 1021457 and a > compa > ny registered in England with number 2742969, whose registered > office is 2 > 15 Euston Road, London, NW1 2BE. > > > <bwa-r1544.tar.bz2> > > > On Nov 23, 2010, at 10:45 AM, i sc wrote: > >> Heng & bwa-helpers, >> >> Would the change be incorporated to sampe too? Is there an estimate >> time when the new BWA release will be coming? >> I have produced bwa 0.5.8c alignments of illumina PE reads 50bp >> with 40x genone coverage. I expected mappings around several copies >> of a transposon element in the reference genome but only one of >> them has reads mapping to it, although there may be a biological >> reason for this I also suspect the same problem with bwa sampe >> because I see mapping qualities of 0 for the reads that map to the >> one transposon. >> >> With sufficient coverage, if reads were to be mapped truly at >> random, would it make sense to use mapping quality alone to obtain >> all the coordinates of potentially repetitive regions in a >> reference? There may be a better way to do this, any pointers would >> be appreciated. >> >> Lastly, any possibility of adding an option to bwa to specify input >> fastq files qualities are in the illumina 1.3+ format, I find a lot >> of people who are new to illumina analysis miss that point like it >> happened to me. I understand this is not a bwa problem, but bwa is >> used a great deal for illumina data analysis. >> >> Thanks, >> Ivette Santana-Cruz >> Bioinformatics Analyst >> Institute for Genome Sciences >> School of Medicine >> U. Maryland Baltimore >> >> >> >> >> >> >> >> >> On Mon, Nov 22, 2010 at 11:49 PM, Heng Li <lh...@sa...> wrote: >> Hi Koji, >> >> I think your modification is correct, and much simpler than what I >> would modify. Your change will be added to 0.5.9. >> >> Many thanks, >> >> Heng >> >> On Nov 11, 2010, at 10:25 PM, NAGAO Koji wrote: >> >>> Dear Heng, >>> >>> I expect BWA samse randomly places a repetitive read across the >>> multiple equally best positions. >>> But I notice current BWA does not behave like this. >>> >>> For example, test.fastq contains 100 identical reads. >>> In the case of this query, BWA samse reports only 2 of 4 best >>> matches >>> as placed positions. >>> >>> $ for i in {1..100}; do echo @${i}; echo >>> aacgggatttcttcatataatgctagacagaagaattctca; echo '+'; echo >>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII; done > test.fastq >>> $ bwa aln hg18.fa test.fastq > test.sai >>> $ ~/bwa-0.5.8c/bwa samse hg18.fa test.sai test.fastq | samtools >>> view - >>> S - | cut -f2-4,14 | sort | uniq -c >>> 55 16 chr10 41848006 X0:i:4 >>> 45 16 chr10 41849367 X0:i:4 >>> >>> To fix this bias, I'd like to propose the following change. >>> >>> --- bwa-0.5.8c/bwase.c 2010-10-17 12:41:55.000000000 +0900 >>> +++ bwa-0.5.8c.work/bwase.c 2010-11-11 23:10:31.000000000 +0900 >>> @@ -27,7 +27,7 @@ >>> for (i = cnt = 0; i < n_aln; ++i) { >>> const bwt_aln1_t *p = aln + i; >>> if (p->score > best) break; >>> - if (drand48() * (p->l - p->k + 1) > >>> (double)cnt) { >>> + if (drand48() * (p->l - p->k + 1 + cnt) >= >>> (double)cnt) { >>> s->n_mm = p->n_mm; s->n_gapo = p- >>>> n_gapo; s->n_gape = p->n_gape; s->strand = p->a; >>> s->score = p->score; >>> s->sa = p->k + (bwtint_t)((p->l - p- >>>> k + 1) * drand48()); >>> >>> Using above code, all the best positions are randomly chosen for >>> repetitive reads. >>> >>> $ ~/bwa-0.5.8c.work/bwa samse hg18.fa test.sai test.fastq | samtools >>> view -S - | cut -f2-4,14 | sort | uniq -c >>> 24 0 chr10 41722270 X0:i:4 >>> 29 0 chr19 32425907 X0:i:4 >>> 24 16 chr10 41848006 X0:i:4 >>> 23 16 chr10 41849367 X0:i:4 >>> >>> Is my modification correct? >>> >>> Best regards, >>> >>> Koji Nagao >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Centralized Desktop Delivery: Dell and VMware Reference Architecture >>> Simplifying enterprise desktop deployment and management using >>> Dell EqualLogic storage and VMware View: A highly scalable, end-to- >>> end >>> client virtualization framework. Read more! >>> http://p.sf.net/sfu/dell-eql-dev2dev >>> _______________________________________________ >>> Bio-bwa-help mailing list >>> Bio...@li... >>> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help >> >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by Genome Research >> Limited, a charity registered in England with number 1021457 and a >> company registered in England with number 2742969, whose registered >> office is 215 Euston Road, London, NW1 2BE. >> >> ------------------------------------------------------------------------------ >> Increase Visibility of Your 3D Game App & Earn a Chance To Win $500! >> Tap into the largest installed PC base & get more eyes on your game >> by >> optimizing for Intel(R) Graphics Technology. Get started today with >> the >> Intel(R) Software Partner Program. Five $500 cash prizes are up for >> grabs. >> http://p.sf.net/sfu/intelisp-dev2dev >> _______________________________________________ >> Bio-bwa-help mailing list >> Bio...@li... >> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help >> >> ------------------------------------------------------------------------------ >> Increase Visibility of Your 3D Game App & Earn a Chance To Win $500! >> Tap into the largest installed PC base & get more eyes on your game >> by >> optimizing for Intel(R) Graphics Technology. Get started today with >> the >> Intel(R) Software Partner Program. Five $500 cash prizes are up for >> grabs. >> http://p.sf.net/sfu/intelisp-dev2dev_______________________________________________ >> Bio-bwa-help mailing list >> Bio...@li... >> https://lists.sourceforge.net/lists/listinfo/bio-bwa-help > > ------------------------------------------------------------------------------ > Increase Visibility of Your 3D Game App & Earn a Chance To Win $500! > Tap into the largest installed PC base & get more eyes on your game by > optimizing for Intel(R) Graphics Technology. Get started today with > the > Intel(R) Software Partner Program. Five $500 cash prizes are up for > grabs. > http://p.sf.net/sfu/intelisp-dev2dev_______________________________________________ > Bio-bwa-help mailing list > Bio...@li... > https://lists.sourceforge.net/lists/listinfo/bio-bwa-help |