From: Santiago R. <san...@gm...> - 2014-06-24 14:57:28
Hi guys,

1. The ovb files were using 1.2 Tb (they were not compressed), and the .fasta, .qual and .qv files another 850 Gb. All gone now.

2. Regarding the -pbCNS option: no, I hadn't seen it by the time I started.

My problem now is that the process has been running for 3 days and is currently using about 97.2% of available memory (and growing). It is a 256 Gb standalone server where I'm just a guest. Should I wait a little longer for it to finish? Why is it using all available memory? It is running the layout step (the runCorrection.sh script).

I'm attaching the pacBioToCA log, the runCorrection.sh script and the asm.layout.err file as a reference for the options, specs and status.

Any help would be really appreciated. Thank you very much in advance again.

Regards,
Santiago

On Mon, Jun 23, 2014 at 2:27 PM, Serge Koren <ser...@gm...> wrote:

> Hi,
>
> 1. Yes, as long as you have the asm.ovlStore constructed you can delete
> the contents of the 1-overlapper directory. I'm guessing it is fasta/qual
> files that are taking all the space.
>
> 2. The overlapping is the most expensive part of the computation, so the
> remaining steps should be relatively quick. The consensus can be another
> expensive step. I'm not sure if you specified -pbCNS when you ran
> pacBioToCA, but if you haven't relaunched the run yet, you can add that
> option and it will use a faster consensus module (which is actually on by
> default in the next CA release).
>
> Sergey
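(For reference, a relaunch with the faster consensus module Sergey mentions might look like the sketch below. Only the -pbCNS flag comes from this thread; every other value and file name is an illustrative placeholder, not taken from this run.)

    # -pbCNS per Sergey's note above; all other values are placeholders
    pacBioToCA -pbCNS -length 500 -partitions 200 \
      -l asm -s pacbio.spec \
      -fastq pacbio.fastq illumina.frg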
> On Jun 21, 2014, at 11:58 AM, Santiago Revale <san...@gm...> wrote:
>
> Hi Brian/Serge,
>
> Brian's patch worked like a charm. I'll continue executing the
> pacBioToCA script.
>
> A couple of quick questions first:
>
> 1) Can I delete the "1-overlapper/" directory before the pacBioToCA script
> has ended? It is 2 Tb, and "asm.ovlStore" is nearly that size too (1.8 Tb).
>
> 2) Could you give an estimate of the time the remaining portion of the
> script will take? And also an estimate of core and memory usage?
>
> Thank you very much for your help and assistance.
>
> Regards,
> Santiago
>
> On Thu, Jun 19, 2014 at 12:53 PM, Santiago Revale <san...@gm...> wrote:
>
>> Thank you very much, guys.
>>
>> I'll be trying your suggestions these days, starting with Brian's, and
>> I'll get back to you with the outcome.
>>
>> Regards,
>> Santiago
>>
>> On Thu, Jun 19, 2014 at 8:34 AM, Brian Walenz <th...@gm...> wrote:
>>
>>> Sergey is right; the vacation must be getting to me...
>>>
>>> Here is a simple patch to AS_OVS/overlapStoreBuild.C. This will change
>>> the way the data is partitioned, so that the first partitions are merged
>>> into a few and the last one is split into many. This should result in
>>> partitions of around 10 Gb in size -- the 1 Tb partition should be split
>>> into 128 pieces.
>>>
>>> The change is only an addition of ~15 lines, to function
>>> writeToDumpFile(). The new lines are enclosed in a #if/#endif block,
>>> currently enabled. You can just drop this file into an svn checkout and
>>> recompile. DO NOT USE FOR PRODUCTION! There are hardcoded values
>>> specific to your assembly. Please do check these values against
>>> gatekeeper dumpinfo. I don't think they're critical to be exact, but if
>>> I'm off by an order of magnitude, it probably won't work well.
>>>
>>> b
>>>
>>> On Wed, Jun 18, 2014 at 11:43 PM, Serge Koren <ser...@gm...> wrote:
>>>
>>>> Hi,
>>>>
>>>> I don't believe the way the overlaps are created is the problem, but
>>>> the way the overlap store is doing the partitioning is. It looks like
>>>> you have about 4X of PacBio data and about 150X of Illumina data. This
>>>> is a larger difference than we normally use (usually we recommend no
>>>> more than 50X of Illumina data and 10X+ PacBio), which is likely why
>>>> this error has not been encountered before. The overlaps are only
>>>> computed between the PacBio and Illumina reads, which are evenly
>>>> distributed among the partitions, so they should all have approximately
>>>> the same number of overlaps. This is easy to confirm: all your overlap
>>>> ovb files should be approximately the same size, and your output log
>>>> seems to confirm this.
>>>>
>>>> The overlap store bucketizing assumes an equal number of overlaps for
>>>> each read in your dataset, and your Illumina-Illumina overlaps do not
>>>> exist, so as a result all the IIDs with overlaps end up in the last
>>>> bucket. You've got 505,893 PacBio fragments and 1,120,240,607 Illumina
>>>> reads. To split the PacBio reads among multiple partitions, you'd want
>>>> to be able to open 10,000-20,000 files (partitions), which is above the
>>>> current limit you have. If you can, modify it using ulimit -n 50000 and
>>>> then run the store creation specifying -f 20480 (or some other large
>>>> number). That should make your last partition significantly smaller. If
>>>> you cannot increase the limit, then modifying the code is the only
>>>> option. The good news is that if you are able to build the store, you
>>>> can re-launch the PBcR pipeline and it will resume the correction after
>>>> the overlapping step.
>>>>
>>>> Sergey
>>>>
>>>> (The hash is only composed of the last set of reads (PacBio), and the
>>>> reference sequences streamed against the hash are the Illumina data.)
>>>>
>>>> On Jun 18, 2014, at 8:16 PM, Brian Walenz <th...@gm...> wrote:
>>>>
>>>> Unfortunately, I'm on vacation at the moment, and finding little time
>>>> to spend helping you.
>>>>
>>>> "Too many open files" is a limit imposed by the OS. Can you increase
>>>> this? We've set our large memory machines to allow 100,000 open files.
>>>>
>>>> The output file sizes -- and the problem you're suffering from -- are
>>>> all caused by the way overlaps are created. Correction asked for only
>>>> overlaps between Illumina and PacBio reads. All the Illumina reads are
>>>> 'first' in the store, and all the PacBio reads are at the end. Overlap
>>>> jobs will find overlaps between 'other' reads and some subset of the
>>>> store -- e.g., the first overlap job will process the first 10% of the
>>>> reads, the second will do the second 10% of the reads, etc. Since the
>>>> PacBio reads are last, the last job found all the overlaps, so only the
>>>> last file is of significant size. This also breaks the partitioning
>>>> scheme used when sorting overlaps. It assumes overlaps are distributed
>>>> randomly, but yours are all piled up at the end.
>>>>
>>>> I don't see an easy fix here, but I think I can come up with a one-off
>>>> hack to get your store built. Are you comfortable working with C code
>>>> and compiling? Send the output of 'gatekeeper -dumpinfo *gkpStore' so I
>>>> can see the number of reads per library.
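(A minimal sketch of the store rebuild Sergey describes, under these assumptions: the stores are named asm.ovlStore and asm.gkpStore as elsewhere in this thread, and overlapStoreBuild takes its input list via -L. Only 'ulimit -n 50000', '-M 0 -f 20480' and the gatekeeper command come from the advice above; the authoritative invocation is the one echoed in the pacBioToCA log, with the new options swapped in.)

    ulimit -n 50000                        # raise the per-process open-file limit first
    gatekeeper -dumpinfo asm.gkpStore      # reads-per-library summary Brian asked for
    overlapStoreBuild -o asm.ovlStore -g asm.gkpStore \
      -M 0 -f 20480 -L asm.ovlStore.list   # 0 MB memory; 20480 partition files (-L assumed)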
>>>> On Tue, Jun 17, 2014 at 6:45 PM, Santiago Revale <san...@gm...> wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> When using 1024, it said the OS wasn't able to handle it, and it
>>>>> recommended using 1008. When using 1008, CA ended complaining "Failed
>>>>> to open output file... Too many open files".
>>>>>
>>>>> Now I'm trying with fewer parts, but I don't think this will solve
>>>>> the problem.
>>>>>
>>>>> Do you have any more ideas?
>>>>>
>>>>> Thanks again in advance.
>>>>>
>>>>> Regards,
>>>>> Santiago
>>>>>
>>>>> On Sun, Jun 15, 2014 at 10:10 PM, Santiago Revale <san...@gm...> wrote:
>>>>>
>>>>>> Hi Brian,
>>>>>>
>>>>>> Thanks for your reply. Regarding your suggestions:
>>>>>>
>>>>>> 1) The PBcR process generates OVB files without zipping them; just to
>>>>>> be sure, I tried to unzip some of them in case the extension were
>>>>>> missing.
>>>>>>
>>>>>> 2) I've re-launched the process with the suggested parameters, but
>>>>>> using 512 instead of 1024; the result was exactly the same: the same
>>>>>> error in the same step. Again, 511 of the 512 files had a size of
>>>>>> 2.3 Gb while the last file was 1.2 Tb. Do you know why this happens?
>>>>>>
>>>>>> I'm trying one last time using 1024 instead.
>>>>>>
>>>>>> Thanks again for your reply. I'm open to more suggestions.
>>>>>>
>>>>>> Regards,
>>>>>> Santiago
>>>>>>
>>>>>> On Fri, Jun 13, 2014 at 4:25 PM, Brian Walenz <th...@gm...> wrote:
>>>>>>
>>>>>>> Hi-
>>>>>>>
>>>>>>> This is a flaw in gzip, where it doesn't report the uncompressed
>>>>>>> size correctly for files larger than 2 Gb. I'm not intimately
>>>>>>> familiar with this pipeline, so I don't know exactly how to implement
>>>>>>> the fixes below.
>>>>>>>
>>>>>>> Fix with either:
>>>>>>>
>>>>>>> 1) gzip -d the *gz files before building the overlap store. The
>>>>>>> 'find' command in the log indicates the pipeline will pick up the
>>>>>>> uncompressed files. You might need to remove the 'asm.ovlStore.list'
>>>>>>> file before restarting (this has the list of inputs to
>>>>>>> overlapStoreBuild).
>>>>>>>
>>>>>>> 2) Set ovlStoreMemory to (exactly) "0 -f 1024". This will tell it
>>>>>>> to use 0 MB memory, and instead use 1024 files regardless of the
>>>>>>> size. 512 files will also work, and is a little safer (not near some
>>>>>>> Linux 'number of open files' limits).
>>>>>>>
>>>>>>> 3) Build the overlap store by hand (with either the uncompressed
>>>>>>> input, or the -f instead of the -M option), outside the script, and
>>>>>>> then restart the script. The script will notice there is an overlap
>>>>>>> store already present, and skip the build. The command is in the log
>>>>>>> file -- make sure the final store is called 'asm.ovlStore', and not
>>>>>>> 'asm.ovlStore.BUILDING'.
>>>>>>>
>>>>>>> Option 1 should work, but option 2 is the easiest to try. I
>>>>>>> wouldn't try option 3 until Sergey speaks up.
>>>>>>>
>>>>>>> b
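(Options 1 and 2 above as concrete commands, a minimal sketch: gzip -d, the asm.ovlStore.list name and the ovlStoreMemory value are Brian's; the directory layout, with the list file one level above 1-overlapper inside the pipeline's working directory, is an assumption, with <tempDir> as a placeholder.)

    # Option 1: decompress the overlap outputs, then clear the cached input list
    cd <tempDir>/1-overlapper
    gzip -d *.gz
    rm ../asm.ovlStore.list    # assumed location; lets the pipeline re-find inputs

    # Option 2: in the spec file instead, force file-count-based partitioning
    ovlStoreMemory = 0 -f 1024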
>>>>>>> On Fri, Jun 13, 2014 at 12:33 PM, Santiago Revale <san...@gm...> wrote:
>>>>>>>
>>>>>>>> Dear CA community,
>>>>>>>>
>>>>>>>> I'm running the correction of some PacBio reads with high-identity
>>>>>>>> Illumina reads, on a high-memory server, for a 750 Mbp genome. I've
>>>>>>>> taken into account the known issues addressed on the website when
>>>>>>>> starting the correction.
>>>>>>>>
>>>>>>>> When executing the pipeline, I reached the overlapStoreBuild step
>>>>>>>> with 48 ovb files of 26 Gb each (totaling 1.2 Tb). The ovl files
>>>>>>>> have already been deleted by the script. The error happened while
>>>>>>>> executing overlapStoreBuild:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> bucketizing DONE!
>>>>>>>> overlaps skipped:
>>>>>>>> 0 OBT - low quality
>>>>>>>> 0 DUP - non-duplicate overlap
>>>>>>>> 0 DUP - different library
>>>>>>>> 0 DUP - dedup not requested
>>>>>>>> terminate called after throwing an instance of 'std::bad_alloc'
>>>>>>>>   what(): std::bad_alloc
>>>>>>>>
>>>>>>>> Failed with 'Aborted'
>>>>>>>> ...
>>>>>>>>
>>>>>>>> I ran this step twice: the first time with ovlStoreMemory set to
>>>>>>>> 8192 Mb, the second time with it set to 160000 (160 Gb). The
>>>>>>>> "Overlap store failure" FAQ mentions as possible causes "Out of disk
>>>>>>>> space" (which is not my case) and "Corrupt gzip files / too many
>>>>>>>> fragments". I don't have gzip files and I have only 15 fragments.
>>>>>>>> Also, the bucketizing step finishes OK.
>>>>>>>>
>>>>>>>> Another odd thing I've noticed (at least odd to me) is that 14 of
>>>>>>>> the 15 temp files (tmp.sort.XXX) in the asm.ovlStore.BUILDING folder
>>>>>>>> have a size of 79 Gb while the last one is 1.2 Tb.
>>>>>>>>
>>>>>>>> Could anybody tell me what could be the cause of this error and how
>>>>>>>> to solve it?
>>>>>>>>
>>>>>>>> I'm attaching the asm.ovlStore.err and the pacBioToCA log files for
>>>>>>>> complete descriptions of the error and the executed commands.
>>>>>>>>
>>>>>>>> Thank you very much in advance.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Santiago
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> wgs-assembler-users mailing list
>>>>>>>> wgs...@li...
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/wgs-assembler-users