Re: [Bio-bwa-help] bwa properly paired problem
From: Lauren V. <lau...@ha...> - 2015-11-09 21:30:59
Hi all,

I know I sent my last email out kind of late on Friday, but if anyone has any thoughts they would be much appreciated.

Lauren

On Fri, Nov 6, 2015 at 2:53 PM, Lauren Vanwoudenberg <lau...@ha...> wrote:

> Hi all, I'm still working on bwa aln and sampe, the aln worked, but I'm
> having a problem putting in the @RG with sampe. My code is
> bwa sampe -a 1200 -r "@RG\tID:site09_01\tSM:site09_01\tPL:Illumina"
> reference.fasta site09_01.R1.sai site09_01.R2.sai site09_01.R1.fq
> site09_01.R2.fq site09_01.sam
> When I use this I get the error: [E::bwa_set_rg] the read group line is
> not started with @RG
> However, when I remove the quotes, I get the error [E::bwa_set_rg] no ID
> at the read group line
> I've tried any variation I can think of including single ' and changing ID
> and SM to two unique words... am I missing something?
>
> Thanks again!
> Lauren
>
> On Wed, Nov 4, 2015 at 11:00 AM, Ross Whetten <ros...@nc...> wrote:
>
>> Hi Lauren,
>> OK, that helps. The Rainbow assembler was originally designed to work
>> with data produced by the first generation RAD-seq protocol, where one end
>> is anchored to a restriction site and the other end is created by
>> shearing. This procedure results in a family of overlapping reads from the
>> read2 sequencing primer that can be assembled into a short contig.
>> If you used restriction enzymes to create both ends of the fragments, then both
>> read1 and read2 sequences are anchored to restriction sites, and the only
>> "contig" you can build is a stack of reads that all start at the same
>> position.
>> If your insert size is large enough that the reads don't overlap, then
>> the "contigs" from each end of the restriction fragments won't overlap
>> either, which means that each end of a paired-end read will align to a
>> different "contig" - this leads to the low fraction of reads mapped in
>> proper pairs.
>>
>>
>> Regards,
>> Ross
>>
>>
>> Ross Whetten, Professor
>> Department of Forestry & Environmental Resources
>> North Carolina State University
>> Raleigh, North Carolina, 27695-8008 USA
>> tel: 919-515-7578
>>
>> NOTE: email sent to or from this account is subject to the North Carolina
>> Public Records Act.
>>
>> On Wed, Nov 4, 2015 at 3:23 PM, Lauren Vanwoudenberg <lau...@ha...> wrote:
>>
>>> Thomas- thanks for the suggestions, I will try using 'bwa sampe' and 'bwa
>>> aln' with the parameter for the unusual library prep. I'll let you know how
>>> it goes.
>>>
>>> Ross- Looking at the read name, I believe there is sufficient
>>> information in the first space-delimited field to distinguish among reads
>>> (each has a unique number as a read name). I did filter for quality and
>>> adapter trimming, but each set of files was done as a pair to maintain
>>> their pairing and I double checked with the FastqPairedEndValidator.pl
>>> script afterwards again to be sure. Do you have another validator that you
>>> would suggest? If the issue really is read pairing between the files, I'm
>>> not sure where it could have arisen as I am following the same example
>>> script/pipeline as a colleague working with very similar data and it has
>>> worked for him.
>>> To do my assembly I used Rainbow, it's part of a dDocent
>>> pipeline created specifically for ezRAD, restriction site–associated
>>> DNA (RAD), for highly polymorphic marine species. Rainbow uses a clustering
>>> approach... maybe working with contigs produced by the assembly of
>>> paired-end reads. I created my reference with 50 randomly selected animals
>>> from my larger data set of ~300, which was a total of about 32 million
>>> reads (after dereplication) being fed to the program. I hope this
>>> information helps.
>>>
>>> Lauren
>>>
>>> On Wed, Nov 4, 2015 at 3:54 AM, Ross Whetten <ros...@nc...> wrote:
>>>
>>>> Hi Lauren,
>>>> The size of the inserts in your HiSeq library should have no effect on
>>>> whether BWA detects the paired-end sequences as coming from the same
>>>> cluster, nor should it matter if the reads overlap or not.
>>>>
>>>> I have not used the FastqPairedEndValidator.pl script, but a quick look
>>>> at the Perl code suggests it compares the first white-space-delimited
>>>> character string from the header lines of the read1 and read2 fastq files.
>>>> Depending on how the sequencing center formatted your fastq files, this may
>>>> or may not work - some sequencing centers alter the original output from
>>>> the Illumina software in ways that such scripts do not expect. You can look
>>>> at the headers of the first few sequences in your read1 and read2 files to
>>>> see if there is sufficient information in the first space-delimited field
>>>> to distinguish among reads within a file.
>>>> If the fastq files were filtered for quality or for adapter trimming,
>>>> removal of a read from one file can shift the relative pairing of all
>>>> subsequent reads in the files and destroy the relationship that BWA
>>>> expects.
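The header comparison described just above (first whitespace-delimited field of each header line, compared record by record between the read1 and read2 files) can be sketched in shell with awk. This is only an illustration of the idea, not the logic of FastqPairedEndValidator.pl itself. The function name `check_pairs` is made up for this example, the sketch assumes plain four-line FASTQ records, and stripping a trailing `/1` or `/2` from the name is an assumption about older Illumina read names.

```shell
# check_pairs R1.fq R2.fq
# Hypothetical helper: compares read names record by record and
# returns non-zero if any pair disagrees or the record counts differ.
check_pairs() {
    awk '
        # Return the first whitespace-delimited field of a header line,
        # without the leading "@" and without a trailing /1 or /2.
        function read_id(header,    parts, id) {
            split(header, parts, /[ \t]+/)
            id = parts[1]
            sub(/^@/, "", id)
            sub(/\/[12]$/, "", id)
            return id
        }
        # Only header lines (line 1 of every 4-line record) matter.
        FNR % 4 != 1 { next }
        # First file: remember each read name by record number.
        FILENAME == ARGV[1] { r1[++n1] = read_id($0); next }
        # Second file: compare against the name at the same position.
        {
            n2++
            id2 = read_id($0)
            if (id2 != r1[n2]) {
                if (bad == 0)
                    printf "first mismatch at record %d: %s vs %s\n", n2, r1[n2], id2
                bad++
            }
        }
        END {
            if (n1 != n2)
                printf "record counts differ: %d vs %d\n", n1, n2
            exit (bad > 0 || n1 != n2)
        }
    ' "$1" "$2"
}
```

Called as `check_pairs site09_01.R1.fq site09_01.R2.fq`, it prints nothing and returns 0 when every record lines up. A single read dropped from one file shows up as a mismatch at that record and a differing record count. The sketch keeps every read1 name in memory, so for files with tens of millions of reads a streaming comparison would be a better fit.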
>>>>
>>>> The flagstats output in your original post included the lines
>>>> > 19462094 + 0 mapped (99.38%:nan%)
>>>> > 2748711 + 0 paired in sequencing
>>>> In English - over 99% of the reads mapped, but less than 3% were
>>>> detected as "paired in sequencing" - this detection is dependent only on
>>>> read position within the read1 and read2 files, so the low value suggests
>>>> that read pairing in your files is a problem.
>>>>
>>>> Regardless of issues with read pairing, some of the difficulty could
>>>> also originate in the de novo assembly. Your original post said you
>>>> "created a de novo reference genome ... using Illumina HiSeq data" - can
>>>> you expand on that a little? What assembly software did you use? Was the
>>>> assembly scaffolded using jumping or linking libraries, or are you working
>>>> with the contigs produced by assembly of paired-end or single-end reads?
>>>>
>>>> Regards,
>>>> Ross
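On the `-r` read group question in the newest message of this thread, it may help to look at what the shell actually hands to bwa in each quoting variant. The sketch below uses a made-up helper, `show_arg`, that prints its argument unchanged. It only demonstrates standard POSIX shell quoting and makes no claim about bwa's internals beyond what the quoted error messages say.

```shell
# show_arg is a hypothetical helper: it prints its first argument
# exactly as a program such as bwa would receive it.
show_arg() { printf '%s\n' "$1"; }

# Double quotes: the two-character sequence \t is passed through as is.
show_arg "@RG\tID:site09_01\tSM:site09_01\tPL:Illumina"

# Single quotes: also passed through as is.
show_arg '@RG\tID:site09_01\tSM:site09_01\tPL:Illumina'

# No quotes: the shell removes each backslash, leaving a plain "t",
# so the argument no longer contains any \t before ID:.
show_arg @RG\tID:site09_01\tSM:site09_01\tPL:Illumina
```

The unquoted form reaches the program as `@RGtID:site09_01tSM:site09_01tPL:Illumina`, which would be consistent with the "no ID at the read group line" error. The sketch does not explain the "not started with @RG" error seen with quotes. One unverified guess is that the quotes in the script were typographic quotes pasted from a document rather than plain ASCII `"`, in which case the argument would begin with the quote character instead of `@RG`.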