Re: [wgs-assembler-users] PBcR question

Hi,

Thanks. Yes, this looks like a bug: the code recognized that your genome is too big for the precompute but didn't properly turn it off. Adding localStaging="<path to local disk on node>" to your PBcR command should let you work around the issue. We will make a new release candidate that fixes this bug and the other one you encountered. I will say that with 16X you are probably not going to get a very good assembly, because you'll likely have less than 10X after correction. I'd suggest trying ECTools as well (https://github.com/jgurtowski/ectools), as it is designed to work best with coverage in the 10-20X range in combination with short-read sequencing data.
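On the localStaging workaround: judging only from the quoted command and the "mhap =" line of the configuration summary below, the localStaging setting appears to have landed inside the quoted mhap string rather than being passed as its own option. This is an observation from the log, not something confirmed in this thread. A minimal Python sketch, assuming a POSIX-style shell and using a hypothetical helper named option_names, shows how the two quoting variants split into arguments:

```python
import shlex

# Pieces of the PBcR command quoted in this thread (binary path shortened).
mhap_opts = "-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04"
staging = "localStaging=/path_to_working_dir/temp_staging"
rest = (
    "merSize=16 -length 500 -partitions 200 -threads 27 -libraryname PBcR "
    "-s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq "
    "-genomeSize 1000000000"
)

# As run: localStaging sits inside the double quotes of the mhap option.
as_run = f'PBcR "mhap={mhap_opts} {staging}" {rest}'
# Assumed alternative: the quote is closed first, so localStaging is its own argument.
separate = f'PBcR "mhap={mhap_opts}" {staging} {rest}'


def option_names(command):
    """Hypothetical helper: names of key=value arguments after shell word splitting."""
    argv = shlex.split(command)
    return [a.split("=", 1)[0] for a in argv if "=" in a and not a.startswith("-")]


print(option_names(as_run))    # localStaging is absent: it is part of the mhap value
print(option_names(separate))  # localStaging appears as a separate option
```

Whether PBcR then picks up localStaging as expected in the second form is an assumption based on how merSize= and fastqFile= are passed in the same command.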

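On the coverage point, the figures in the quoted log can be reproduced with simple arithmetic: coverage is total bases divided by genome size. The sketch below uses only the base counts and the genome size printed in the log. The 10X threshold is the estimate given above; the "retention" fraction is an illustrative quantity introduced here, not something PBcR reports.

```python
# Values copied from the quoted PBcR log.
GENOME_SIZE = 1_000_000_000   # -genomeSize
TOTAL_BP = 16_536_658_304     # "Correcting with 16X sequences (16536658304 bp)"
SEED_BP = 8_268_329_256       # "Running with 8.268329256X ... (8268329256 bp)"


def coverage(total_bp, genome_size):
    """Fold coverage: sequenced bases per base of genome."""
    return total_bp / genome_size


raw_cov = coverage(TOTAL_BP, GENOME_SIZE)
seed_cov = coverage(SEED_BP, GENOME_SIZE)

# Illustrative only: the fraction of raw bases that would have to survive
# correction for the corrected coverage to still reach 10X.
max_retention = 10 / raw_cov

print(f"raw coverage:  {raw_cov:.2f}X")
print(f"seed coverage: {seed_cov:.2f}X")
print(f"retention needed for 10X: {max_retention:.0%}")
```

In other words, ending up under 10X corresponds to keeping less than roughly 60% of the raw bases through correction.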
Sergey

On Jun 25, 2014, at 2:33 PM, Matthew Conte <co...@gm...> wrote:

> Hi Serge,
> 
> On Wed, Jun 25, 2014 at 11:36 AM, Serge Koren <ser...@gm...> wrote:
> Hi,
> 
> On Jun 24, 2014, at 5:40 PM, Matthew Conte <co...@gm...> wrote:
> 
>> Hi all,
>> 
>> I'm trying out PBcR to make use of the new MHAP overlapper for self correcting a set of PacBio reads and I'm running into an issue.
>> 
>> I'm getting the following errors in the temp_dir/1-overlapper/1.err:
>> Exception in thread "main" java.io.FileNotFoundException: /raid3/PBcR_CA_8.2_alpha/tempLibrary/1-overlapper/stream_1/correct_reads_part000002.dat (No such file or directory)
> The dat file is a pre-computed index that is used to speed up the computation for smaller genomes. For larger genomes or if you are using local disk, it should not get created. Do you have the output of the pipeline up to this step along with the command-line you used to start the run? That will help diagnose why it is not properly recognizing that the index is not built. As a workaround, you can add "localStaging=</path to local disk>" to your PBcR command which will force the pipeline to never pre-compute the index.
> 
> The command that I ran was: 
> /sw/wgs-8.2alpha/Linux-amd64/bin/PBcR "mhap=-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging" merSize=16 -length 500 -partitions 200 -threads 27 -libraryname PBcR -s pacbio.spec fastqFile=filtered_subreads.bbmap.rm_adapters.split.fastq -genomeSize 1000000000
> 
> I changed the MHAP settings according to the PBcR wiki since I only have about 16x coverage of PacBio data.  
> 
> I should mention that runCA continues to run until the '5-consensus' step, and errors out there.  But I think the start of the problem is at this overlap step.
> 
> The relevant output was:
> ###  Reading options from 'pacbio.spec'
> ###  Reading options from the command line.
> 
> Warning: no frag files specified, assuming self-correction of pacbio sequences.
> Running with 27 threads and 200 partitions
> ********* Starting correction...
> ...
> ******** Configuration Summary ********
> bankPath		=	
> maxCoverage		=	40
> ...
> mhap			=	-k 16 --num-hashes 1256 --num-min-matches 3 --threshold 0.04 localStaging=/path_to_working_dir/temp_staging
> ovlRefBlockLength	=	100000000000
> cnsErrorRate		=	0.25
> ...
> ----------------------------------------START Wed Jun 25 11:24:30 2014
> mkdir tempPBcR
> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)
> ----------------------------------------START Wed Jun 25 11:24:30 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/fastqToCA -libraryname PBcR -type sanger -technology none -feature doConsensusCorrection 1 -reads /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq > /path_to_working_dir//tempPBcR/PBcR.frg
> ----------------------------------------END Wed Jun 25 11:24:30 2014 (0 seconds)
> ----------------------------------------START Wed Jun 25 11:24:30 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/runCA  -s /path_to_working_dir//tempPBcR/PBcR.spec -p asm -d tempPBcR  stopAfter=initialStoreBuilding   /path_to_working_dir//tempPBcR/PBcR.frg
> ----------------------------------------START Wed Jun 25 11:24:30 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper  -o /path_to_working_dir/tempPBcR/asm.gkpStore.BUILDING  -F  /path_to_working_dir//tempPBcR/PBcR.frg > /path_to_working_dir/tempPBcR/asm.gkpStore.err 2>&1
> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds)
> numFrags = 2995674
> Stop requested after 'initialstorebuilding'.
> ----------------------------------------END Wed Jun 25 11:35:27 2014 (657 seconds)
> Will be correcting PacBio library 1 with librarie[s] 1 - 1
> ----------------------------------------START Wed Jun 25 11:35:29 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -invert -tabular -longestovermin 1 500 -longestlength 1 8268329152 /path_to_working_dir//tempPBcR/asm.gkpStore 2> /path_to_working_dir//tempPBcR/asm.seedlength |awk '{if (!(match($1, "UID") != 0 && length($1) == 3)) { print "frg uid "$1" isdeleted 1"; } }' > /path_to_working_dir//tempPBcR/asm.toerase.uid
> ----------------------------------------END Wed Jun 25 11:35:38 2014 (9 seconds)
> ----------------------------------------START Wed Jun 25 11:35:38 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper --edit /path_to_working_dir//tempPBcR/asm.toerase.uid /path_to_working_dir//tempPBcR/asm.gkpStore > /path_to_working_dir//tempPBcR/asm.toerase.out 2> /path_to_working_dir//tempPBcR/asm.toerase.err
> ----------------------------------------END Wed Jun 25 11:35:44 2014 (6 seconds)
> Running with 8.268329256X (for genome size 1000000000) of PBcR sequences (8268329256 bp).
> Correcting with 16X sequences (16536658304 bp).
> Warning: performing self-correction with a total of 16. For best performance, at least 50 is recommended.
> ----------------------------------------START Wed Jun 25 11:35:44 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish count  -m 16 -s 120000000 -t 32 -o /path_to_working_dir//tempPBcR/asm.mers /path_to_working_dir/filtered_subreads.bbmap.rm_adapters.split.fastq
> ----------------------------------------END Wed Jun 25 12:05:11 2014 (1767 seconds)
> ----------------------------------------START Wed Jun 25 12:05:11 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish histo -t 32 -f /path_to_working_dir//tempPBcR/asm.mers > /path_to_working_dir//tempPBcR/asm.hist
> ----------------------------------------END Wed Jun 25 12:09:10 2014 (239 seconds)
> ----------------------------------------START Wed Jun 25 12:09:10 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/jellyfish dump -c -t -L 34 /path_to_working_dir//tempPBcR/asm.mers |awk -v TOTAL=3328265613 '{printf("%s\t%0.10f\t%d\t%d\n", $1, $2/TOTAL, $2, TOTAL)}' |sort -T . -rnk2> /path_to_working_dir//tempPBcR/asm.ignore
> ----------------------------------------END Wed Jun 25 12:21:17 2014 (727 seconds)
> ----------------------------------------START Wed Jun 25 12:21:17 2014
> rm /path_to_working_dir//tempPBcR/asm.mers*
> ----------------------------------------END Wed Jun 25 12:21:23 2014 (6 seconds)
> ----------------------------------------START Wed Jun 25 12:21:23 2014
> mkdir /path_to_working_dir//tempPBcR/1-overlapper
> ----------------------------------------END Wed Jun 25 12:21:23 2014 (0 seconds)
> ----------------------------------------START Wed Jun 25 12:21:23 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $1"\t"$2}' > asm.eidToIID
> ----------------------------------------END Wed Jun 25 12:21:28 2014 (5 seconds)
> ----------------------------------------START Wed Jun 25 12:21:28 2014
> /sw/wgs-8.2alpha/Linux-amd64/bin/gatekeeper -dumpfragments -tabular asm.gkpStore |awk '{print $2"\t"$10}' > asm.iidToLen
> ----------------------------------------END Wed Jun 25 12:21:33 2014 (5 seconds)
> ----------------------------------------START CONCURRENT Wed Jun 25 12:21:33 2014
> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 1
> Scanning store to find libraries used and reads to dump.
> Added 0 reads to maintain mate relationships.
> Dumping 0 fragments from unknown library (version 1 has these)
> Dumping 133125 fragments from library IID 1
> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 2
> Scanning store to find libraries used and reads to dump.
> Added 0 reads to maintain mate relationships.
> ...
> /path_to_working_dir//tempPBcR/1-overlapper/ovlprep.sh 23
> Scanning store to find libraries used and reads to dump.
> Added 0 reads to maintain mate relationships.
> Dumping 0 fragments from unknown library (version 1 has these)
> Dumping 66924 fragments from library IID 1
> ----------------------------------------END CONCURRENT Wed Jun 25 12:27:16 2014 (343 seconds)
> ----------------------------------------START CONCURRENT Wed Jun 25 12:27:16 2014
> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 1
> Running partition 000001 with options -h 1-133125 -r 133126-1597500 start 133125 end 1597500 total 1464375 zero job 0 and stride 1
> /path_to_working_dir//tempPBcR/1-overlapper/overlap.sh 2
> Running partition 000002 with options -h 1-133125 -r 1597501-2995674 start 1597500 end 2995674 total 1398174 zero job 0 and stride 1
> ...
> 
> 
> Thanks,
> Matt
>  
> 
>> 
>> There is no 'correct_reads_part000002.dat' file there, but there is a 'correct_reads_part000002.fasta' file where the 'stream_1/correct_reads_part000002.dat' points to. I'm not sure if it is just an extension naming issue or if the .dat files weren't created properly. 
>> 
>> Also, I've found another minor issue with the '-threads' option supplied to PBcR on the command line. It doesn't seem to use the number of threads supplied and simply uses the maximum number of CPUs available on the machine.
> Thanks, I'll check this and update the code.
>> 
>> Thanks,
>> Matt
> 
>