From: Serge K. <ser...@gm...> - 2014-06-19 03:43:53
Hi,

I don't believe the way the overlaps are created is the problem; the way the overlap store does the partitioning is. It looks like you have about 4X of PacBio data and about 150X of Illumina data. This is a larger difference than we normally use (we usually recommend no more than 50X of Illumina data and 10X+ of PacBio), which is likely why this error has not been encountered before.

The overlaps are only computed between the PacBio and Illumina reads. The PacBio reads are evenly distributed among the overlap-job partitions, so the partitions should all have approximately the same number of overlaps. This is easy to confirm: all your overlap ovb files should be approximately the same size, and your output log seems to confirm this. The overlap store bucketizing, however, assumes an equal number of overlaps for every read in your dataset, and since your Illumina-Illumina overlaps do not exist, all the IIDs with overlaps end up in the last bucket.

You've got 505,893 PacBio fragments and 1,120,240,607 Illumina reads. To split the PacBio reads among multiple partitions, you'd need to be able to open 10,000-20,000 files (partitions), which is above your current limit. If you can raise the limit with ulimit -n 50000, then run the store creation specifying -f 20480 (or some other large number); that should make your last partition significantly smaller (see the command sketch after the quoted thread below). If you cannot increase the limit, then modifying the code is the only option.

The good news is that once you are able to build the store, you can re-launch the PBcR pipeline and it will resume the correction after the overlapping step.

Sergey

The hash is only composed of the last set of reads (PacBio), and the reference sequences streamed against the hash are the Illumina data.

On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote:

> Unfortunately, I'm on vacation at the moment, and finding little time to spend helping you.
>
> "Too many open files" is a limit imposed by the OS. Can you increase this? We've set our large-memory machines to allow 100,000 open files.
>
> The output file sizes -- and the problem you're suffering from -- are all caused by the way overlaps are created. Correction asked for only overlaps between Illumina and PacBio reads. All the Illumina reads are 'first' in the store, and all the PacBio reads are at the end. Overlap jobs will find overlaps between 'other' reads and some subset of the store -- e.g., the first overlap job will process the first 10% of the reads, the second will do the second 10%, etc. Since the PacBio reads are last, the last job found all the overlaps, so only the last file is of significant size. This also breaks the partitioning scheme used when sorting overlaps: it assumes overlaps are distributed randomly, but yours are all piled up at the end.
>
> I don't see an easy fix here, but I think I can come up with a one-off hack to get your store built. Are you comfortable working with C code and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I can see the number of reads per library.
>
> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote:
> Hi Brian,
>
> When using 1024, it said the OS wasn't able to handle it and recommended using 1008. When using 1008, CA failed with "Failed to open output file... Too many open files".
>
> Now I'm trying with fewer parts, but I don't think this would solve the problem.
>
> Do you have any more ideas?
>
> Thanks again in advance.
>
> Regards,
> Santiago
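A minimal sketch of the workaround Sergey describes above, assuming the store is rebuilt by hand and then handed back to PBcR. The gkpStore name, the ovb file pattern, and the overlapStoreBuild options other than -f are assumptions based on this thread; the exact command for a given run is recorded in the pacBioToCA log.

    # Check the hard limit, then raise the per-process open-file limit for this shell.
    ulimit -Hn
    ulimit -n 50000

    # Rebuild the store with many more buckets (-f 20480, per Sergey's suggestion) so the
    # PacBio reads are spread across partitions instead of piling into the last one.
    # The store, gkpStore, and ovb names below are placeholders taken from this thread.
    overlapStoreBuild -o asm.ovlStore -g asm.gkpStore -f 20480 *.ovb

    # With asm.ovlStore in place, re-launch the PBcR pipeline; it resumes the correction
    # after the overlapping step.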
> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote:
> Hi Brian,
>
> Thanks for your reply. Regarding your suggestions:
>
> 1) The PBcR process generates OVB files without zipping them; just to be sure, I tried to unzip some of them in case the extension was missing.
>
> 2) I re-launched the process with the suggested parameters, but using 512 instead of 1024; the result was exactly the same: the same error in the same step. Also, again 511 out of the 512 files were 2.3 GB each while the last file was 1.2 TB. Do you know why this happens?
>
> I'm trying one last time with 1024 instead.
>
> Thanks again for your reply. I'm open to more suggestions.
>
> Regards,
> Santiago
>
> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote:
> Hi-
>
> This is a flaw in gzip, where it doesn't report the uncompressed size correctly for files larger than 2 GB. I'm not intimately familiar with this pipeline, so I don't know exactly how to implement the fixes below.
>
> Fix with either:
>
> 1) gzip -d the *.gz files before building the overlap store. The 'find' command in the log indicates the pipeline will pick up the uncompressed files. You might need to remove the 'asm.ovlStore.list' file before restarting (this has the list of inputs to overlapStoreBuild).
>
> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it to use 0 MB of memory, and instead use 1024 files regardless of the size. 512 files will also work, and is a little safer (not near some Linux 'number of open files' limits).
>
> 3) Build the overlap store by hand (with either the uncompressed input, or the -f option instead of -M), outside the script, and then restart the script. The script will notice there is an overlap store already present and skip the build. The command is in the log file -- make sure the final store is called 'asm.ovlStore', and not 'asm.ovlStore.BUILDING'.
>
> Option 1 should work, but option 2 is the easiest to try. I wouldn't try option 3 until Sergey speaks up.
>
> b
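A minimal sketch of the three fixes Brian lists above. The file names (*.ovb.gz, asm.ovlStore.list, asm.gkpStore) and the overlapStoreBuild options other than -f/-M are assumptions based on this thread; the real command, with the correct paths for a given run, is in the log file.

    # Option 1: decompress the overlap outputs so gzip's incorrect >2 GB size report is
    # never consulted, and remove the cached input list so the pipeline rediscovers the
    # uncompressed files on restart.
    gzip -d *.ovb.gz
    rm -f asm.ovlStore.list

    # Option 2: in the spec file passed to the pipeline, force a fixed file count instead
    # of a memory target -- something like:
    #   ovlStoreMemory = 0 -f 1024

    # Option 3: build the store by hand with -f instead of -M, then restart the script.
    overlapStoreBuild -o asm.ovlStore -g asm.gkpStore -f 512 *.ovb
    # If the finished store is left as asm.ovlStore.BUILDING, rename it so the script
    # sees 'asm.ovlStore' and skips the build:
    [ -d asm.ovlStore.BUILDING ] && mv asm.ovlStore.BUILDING asm.ovlStore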
> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote:
> Dear CA community,
>
> I'm running the correction of some PacBio reads with high-identity Illumina reads, on a high-memory server, for a 750 Mbp genome. I took into account the known issues described on the website when starting the correction.
>
> When executing the pipeline, I reached the overlapStoreBuild step with 48 ovb files of 26 GB each (totaling 1.2 TB). The ovls files had already been deleted by the script. The error happened while executing overlapStoreBuild:
>
> ...
> bucketizing DONE!
> overlaps skipped:
> 0 OBT - low quality
> 0 DUP - non-duplicate overlap
> 0 DUP - different library
> 0 DUP - dedup not requested
> terminate called after throwing an instance of 'std::bad_alloc'
> what(): std::bad_alloc
>
> Failed with 'Aborted'
> ...
>
> I ran this step twice: the first time with ovlStoreMemory set to 8192 MB, and the second time with it set to 160000 (160 GB). The "Overlap store failure" FAQ mentions "Out of disk space" (which is not my case) and "Corrupt gzip files / too many fragments" as possible causes. I don't have gzip files and I have only 15 fragments. Also, the bucketizing step finishes OK.
>
> Another odd thing I've noticed (at least odd to me) is that 14 of the 15 temp files (tmp.sort.XXX) in the asm.ovlStore.BUILDING folder are 79 GB each, while the last one is 1.2 TB.
>
> Could anybody tell me what could be the cause of this error and how to solve it?
>
> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for complete descriptions of the error and the executed commands.
>
> Thank you very much in advance.
>
> Regards,
> Santiago