sticking to the bwa part of the question, you're mapping different tiny fastq files to the same reference and outputting to different files. as these are tiny fastq files, why not map all the reads to the reference, output to once file, then do your sorting and splitting of results.  that will save you a ton of overhead.

On Fri, Jul 25, 2014 at 8:05 AM, Fred Shao <> wrote:
I have a large number of independent MiSeq sequencing reads to be processed with bwa and samtools in order to get the consensus sequences for each set. To get the consensus sequence, I took the vcf2fq  subroutine from by Heng Li and modified it into a stand alone perl script. I use a foreach loop to process all the datasets:

foreach $Tag (keys %DogTag_ID_hash_array ) {      
    $read_number = scalar@{$DogTag_ID_hash_array{$Tag} };

   next if ($read_number <5); 

     $seqgroup_ref = SequenceGrouper (\@{$DogTag_ID_hash_array{$Tag} }, \%aligned_seq_sample, $Tag); 
     $fastqout = Fasta2Fastq ($seqgroup_ref, $Tag);

    opendir (DIR, $input_dir) || die "cannot open $input_dir\n";   
    my  @fastqfile = grep (/_group\.fastq$/, readdir DIR );

    $a=qx+$bwa mem $ref_seq_file[0] $fastqfile[0] | $samtools view -bS -> $fastqfile[0].bam+;   ### create bam file from fastq


    opendir (DIR, $input_dir) || die "cannot open $input_dir\n"; 
    my @bam_files = grep (/_group\.fastq.bam$/, readdir DIR );

    $a=qx+ $samtools sort $bam_files[0] $bam_files[0].sorted+;  ### sort bam file


    opendir (DIR, $input_dir) || die "cannot open $input_dir\n";   
    my @sorted_bam = grep (/_group\.fastq\.bam\.sorted.bam$/, readdir DIR );

   $a=qx+ $samtools index $sorted_bam[0]+;   ### index bam file


    opendir (DIR, $input_dir) || die "cannot open $input_dir\n"; 
    my @indexed_bam = grep (/_group\.fastq\.bam\.sorted\.bam$/, readdir DIR );

    $a=qx+ $samtools mpileup -uf $ref_seq_file[0] $indexed_bam[0] | bcftools view -cg - | $vcf2fq  $Tag >>$output_file+;  ## create consensus sequence

   $a=qx+ rm -f *_group.*+; 


}   ### end of foreach loop

The above script section works fine. But it is kind slow.
As you can see, in each loop, there are several steps: (1) create a bam file with bwa and samtools. (2) sort it with samtools sort function. (3) create indexed bam file with samtools index function, (4) use samtools mpileup, bcftools view, and then to get the consensus sequence for thet sequence set. 
I need to open the directory, grep the correct input file, and then close the directory in each loop. This takes a lot of time considering I have a large loop. Is there a easy way to do this? Or is there another totally different way to get consensus sequences?

Also, bwa and samtools produces messages to the screen like:

[main] Version: 0.7.7-r441
[main] CMD: /opt/nasapps/production/bwa/0.7.7/bin/bwa mem ref_seq.fas TTCTAGTTGA_group.fastq
[main] Real time: 0.014 sec; CPU: 0.008 sec
[samopen] SAM header is present: 1 sequences.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
[afs] 0:445.710 1:5.402 2:0.888
[M::main_mem] read 5 sequences (2405 bp)...
[M::mem_process_seqs] Processed 5 reads in 0.008 CPU sec, 0.010 real sec
[main] Version: 0.7.7-r441
[main] CMD: /opt/nasapps/production/bwa/0.7.7/bin/bwa mem ref_seq.fas CCGGCAGGAT_group.fastq[samopen] SAM header is present: 1 sequences.

Is there an option to not have those messages printed on the screen?



Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
Bio-bwa-help mailing list